SAS COMPRESS FUNCTION
Introduction
In the field of Clinical SAS programming, data cleaning and transformation play a crucial role in ensuring accurate and meaningful analyses. One of the most frequently used functions for text manipulation is the SAS COMPRESS function. This function is particularly useful when working with raw clinical data, where extraneous characters need to be removed for better consistency and accuracy.
This blog explores the COMPRESS function in SAS in detail, its syntax, real-world applications in clinical data processing, and best practices to optimize its usage.
What is the SAS COMPRESS Function?
The COMPRESS function removes specified characters from a character string. It is especially helpful in clinical data management for cleaning datasets by eliminating unwanted spaces, special characters, or digits.
Syntax of the COMPRESS Function
COMPRESS(source, characters-to-remove, modifier)
- source: The input string or variable from which characters need to be removed.
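- characters-to-remove (optional): Specific characters that should be removed from the source. If this argument and the modifier are both omitted, COMPRESS removes blanks. With the k modifier, the listed characters are kept instead of removed.
- modifier (optional): One or more letters that add whole character classes (digits, letters, punctuation, and so on) to the list or change how the function behaves.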
Commonly Used Modifiers
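Modifier | Description |
---|---|
a | Removes alphabetic characters |
c | Removes control characters |
d | Removes numeric digits |
i | Ignores case when matching the listed characters |
k | Inverts the behavior: keeps the listed characters and classes instead of removing them |
l | Removes lowercase letters |
n | Removes digits, the underscore, and English letters (not equivalent to d) |
p | Removes punctuation marks |
s | Removes space characters (blanks, tabs, carriage returns, line feeds, form feeds) |
u | Removes uppercase letters |
w | Removes printable characters (with k, keeps only printable characters) |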
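To make the table concrete, here is a small sketch that applies a few modifiers to one literal string. The sample value and the variable names are made up for illustration. The second argument is left empty so that only the modifier decides what is removed or kept.

```sas
data _null_;
  raw = "Ab-12 cd 34!";                    /* illustrative sample value             */
  only_digits = compress(raw, , 'kd');     /* k + d: keep digits only -> 1234       */
  only_alpha  = compress(raw, , 'ka');     /* k + a: keep letters only -> Abcd      */
  no_punct    = compress(raw, , 'p');      /* p: drop punctuation -> Ab12 cd 34     */
  no_ab       = compress(raw, 'ab', 'i');  /* i: drop a, A, b, B -> -12 cd 34!      */
  put only_digits= only_alpha= no_punct= no_ab=;   /* write the results to the log */
run;
```

Combining k with a class modifier is usually the shortest way to say "keep only this kind of character".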
Applications of the COMPRESS Function
1. Removing Unwanted Spaces
Clinical datasets often contain unwanted spaces in variables. COMPRESS removes every blank, including those between words, so it suits codes and identifiers better than free text.
data clean_data;
  set raw_data;
  variable_cleaned = compress(variable, ' ');
run;
2. Eliminating Special Characters
In clinical trial data, patient IDs or medical codes may contain special characters that need removal.
data clean_data;
  set raw_data;
  patient_id_cleaned = compress(patient_id, '-./()');
run;
3. Extracting Numeric Values
Some variables may have a mix of letters and digits. To keep only the digits, combine the k (keep) and d (digits) modifiers:
data clean_data;
  set raw_data;
  numeric_values = compress(mixed_variable, , 'kd');
run;
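The digits that COMPRESS returns are still a character value, and keeping digits alone drops decimal points. The sketch below, using a made-up value and hypothetical variable names, keeps the period as well and then converts the result with INPUT.

```sas
data _null_;
  length mixed_variable $20;
  mixed_variable = "Dose: 12.5 mg";                    /* illustrative value                    */
  digits_only  = compress(mixed_variable, , 'kd');     /* "125": the decimal point is lost      */
  digits_point = compress(mixed_variable, '.', 'kd');  /* "12.5": digits plus the listed period */
  dose_num = input(digits_point, ?? 12.);              /* numeric 12.5; ?? suppresses error     */
                                                       /* messages for unreadable values        */
  put digits_only= digits_point= dose_num=;
run;
```

Values with several periods or a minus sign would need extra handling, so check the raw data before relying on this pattern.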
4. Keeping Only Alphabets
To extract only alphabetic characters from a string, combine the k (keep) and a (alphabetic) modifiers:
data clean_data;
  set raw_data;
  alpha_only = compress(variable, , 'ka');
run;
5. Standardizing Clinical Terms
Pharmaceutical and clinical research data often contain inconsistent text formatting. Removing punctuation and space characters with the p and s modifiers gives coded terms a uniform form.
data clean_data;
  set raw_data;
  standardized_variable = compress(original_variable, , 'ps');
run;
Practical Use Cases of SAS COMPRESS in Clinical SAS
1. Cleaning Adverse Event Data
Adverse event reports may contain unnecessary punctuation. Using COMPRESS ensures a clean dataset.
data cleaned_ae;
  set adverse_events;
  event_description = compress(event_description, , 'p');
run;
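One thing to watch for: removing punctuation can fuse words that were separated only by a slash or a comma. The sketch below uses a made-up adverse event term and hypothetical variable names. It turns those separators into blanks with TRANSLATE first, then removes the remaining punctuation and collapses doubled blanks with COMPBL.

```sas
data _null_;
  length event_description $40;
  event_description = "Nausea/vomiting, mild.";   /* illustrative AE text */

  /* Punctuation removed directly: the slash disappears and two words fuse */
  fused = compress(event_description, , 'p');      /* Nauseavomiting mild */

  /* Replace '/' and ',' with blanks, drop the rest, collapse repeated blanks */
  spaced = compbl(compress(translate(event_description, '  ', '/,'), , 'p'));
                                                   /* Nausea vomiting mild */
  put fused= / spaced=;
run;
```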
2. Preparing Lab Data for Analysis
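Lab test results may include unit text or special characters that need removal before further statistical analysis. The second argument is a list of individual characters, so each listed character is removed wherever it occurs, not only where it appears as the substring "mg/dL".
data lab_clean;
  set lab_data;
  result_cleaned = compress(result, 'mg/dL()');
run;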
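The difference between a character list and a substring is easy to see on a value that contains some of the same letters elsewhere. This is a sketch with a made-up value. TRANWRD is used here as an alternative for removing the unit as a whole word, and the variable names are hypothetical.

```sas
data _null_;
  length result $20;
  result = "Lead 12 mg/dL";                         /* illustrative value */
  by_char = compress(result, "mg/dL");              /* drops every m, g, /, d, L -> ea 12  */
  by_word = strip(tranwrd(result, "mg/dL", " "));   /* drops only the unit text -> Lead 12 */
  put by_char= / by_word=;
run;
```

When the value is purely a number followed by a unit, the character-list form is fine. When other text shares letters with the unit, a substring replacement is the safer choice.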
3. Removing Control Characters from Raw Data
Raw clinical data collected from multiple sources often contain control characters that need removal.
data cleaned_raw;
  set raw_clinical_data;
  formatted_variable = compress(variable, , 'c');
run;
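Control characters are invisible in most viewers, so comparing lengths before and after is a quick way to confirm they were removed. The sketch below builds a made-up value with an embedded tab and a trailing carriage return and line feed. The hex codes assume an ASCII-based session.

```sas
data _null_;
  /* Illustrative value: tab ('09'x) inside, CR/LF ('0D0A'x) at the end (ASCII codes assumed) */
  raw   = "AE" || '09'x || "term" || '0D0A'x;
  clean = compress(raw, , 'c');   /* c: remove control characters   */
  len_raw   = lengthn(raw);       /* 9: six letters plus three controls */
  len_clean = lengthn(clean);     /* 6: only "AEterm" remains           */
  put clean= len_raw= len_clean=;
run;
```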
Advanced Uses Of SAS COMPRESS
1. Enhancing Data Consistency
Data consistency is a critical requirement in clinical trials to ensure accurate analysis and reporting. The COMPRESS function plays a key role in standardizing textual data by removing inconsistencies such as unwanted spaces, special characters, and case-sensitive variations. By combining COMPRESS with functions like TRANWRD, STRIP, and TRIM, programmers can improve data uniformity across multiple datasets.
For example, consider a dataset where subject IDs contain inconsistent spacing:
data cleaned_data;
  set raw_data;
  subject_id_cleaned = compress(subject_id, ' ');
run;
This removes all spaces from subject IDs, ensuring uniformity and preventing mismatches during data merging or validation.
2. Optimizing Dataset Size
Large clinical datasets often contain unnecessary spaces or extraneous characters. COMPRESS can remove them, but SAS character variables have a fixed length, so shortening the values does not by itself reduce storage. The saving comes from assigning the cleaned value to a variable declared with a shorter LENGTH.
For example, the following removes all spaces from a variable. Unless variable_optimized is given its own LENGTH, it inherits the length of the source variable:
data optimized_data;
  set large_dataset;
  variable_optimized = compress(variable, ' ');
run;
Shorter, cleaner values can also make comparisons and lookups simpler, particularly in large-scale clinical trials.
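A sketch of how the cleaned value can actually take less space: declare the new variable with an explicit, shorter LENGTH and drop the original. The toy input data and the widths ($200 and $8) are assumptions for illustration only. PROC CONTENTS shows the resulting variable length.

```sas
/* Toy stand-in for large_dataset; the $200 width is an assumption */
data large_dataset;
  length variable $200;
  variable = "A 1 0 0 1"; output;
  variable = "B 2 0 0 2"; output;
run;

data optimized_data;
  length variable_optimized $8;   /* hypothetical width: must fit the longest cleaned value */
  set large_dataset;
  variable_optimized = compress(variable, ' ');
  drop variable;                  /* keep only the shorter variable */
run;

/* Confirm the stored length of variable_optimized */
proc contents data=optimized_data;
run;
```

SAS also has a separate COMPRESS= data set option, unrelated to the function, that compresses observations on disk.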
3. Data Anonymization for Privacy Compliance
Clinical trial data often contain sensitive information such as patient names, contact details, or unique identifiers. To comply with privacy regulations like GDPR and HIPAA, data must be anonymized before sharing or reporting. The COMPRESS function can assist with one step of this process by stripping identifying characters from a value.
For example, if a dataset contains patient IDs with identifiable letters (such as initials), you can remove the letters and retain the rest:
data anonymized_data;
  set patient_data;
  anon_id = compress(patient_id, , 'a');
run;
This removes every alphabetic character and leaves digits (and any punctuation) in place. Use the kd modifiers instead if only the digits should remain.
4. Extracting Specific Data Elements
The COMPRESS function is highly useful when extracting specific types of information from mixed-character variables. For instance, if lab test results contain both numeric values and units (e.g., “120mg/dL”), you can extract only the numeric portion for statistical analysis.
data extracted_data;
  set lab_results;
  numeric_value = compress(test_result, , 'kd');
run;
Similarly, if you want to retain only alphabetical characters from a mixed variable, use:
data text_only;
  set clinical_notes;
  alphabetic_text = compress(note_text, , 'ka');
run;
These techniques allow for better data structuring and improved analytical accuracy.
5. Handling Multi-Language Data in Global Clinical Trials
In international clinical trials, datasets may contain multiple languages and mixed character encodings. The COMPRESS function can remove unwanted symbols, punctuation, or non-printable characters from text variables. It operates on single-byte characters, so accented or multi-byte text (for example, UTF-8) should be handled with care; SAS provides KCOMPRESS for multi-byte data.
For example, to remove punctuation from multi-language patient comments:
data cleaned_comments;
  set multilingual_data;
  cleaned_text = compress(comment, , 'p');
run;
This ensures text uniformity across different regions, facilitating better interpretation and reporting.
6. Improving Data Merging and Deduplication
Data merging is a common process in clinical trials where datasets from multiple sources are combined. However, discrepancies such as extra spaces, special characters, or inconsistent formatting can cause mismatches. Using COMPRESS, programmers can clean identifiers before merging, reducing errors and improving efficiency.
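data d1; set dataset1; clean_id = compress(id, , 'kad'); run;
data d2; set dataset2; clean_id = compress(id, , 'kad'); run;
proc sort data=d1; by clean_id; run;
proc sort data=d2; by clean_id; run;
data merged_data; merge d1 d2; by clean_id; run;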
By ensuring that key identifiers are cleaned before merging, data integrity is maintained, reducing errors in the final dataset.
Best Practices for Using SAS COMPRESS Function in Clinical SAS
Understand Your Data: Before using COMPRESS, analyze the dataset to determine which characters need removal.
Use Modifiers Wisely: Instead of specifying multiple characters manually, use appropriate modifiers for efficiency.
Combine with Other Functions: Pair COMPRESS with TRANWRD, STRIP, or TRIM for better data cleaning.
Test on Sample Data: Always validate results on a small subset before applying to an entire dataset.
Optimize for Performance: Avoid unnecessary function calls inside loops to enhance processing speed.
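The "test on sample data" step can be sketched as a quick before-and-after comparison. The toy dataset, the 100-row subset, and the variable names cleaned and changed are assumptions for illustration. The idea is to flag which values the cleaning rule alters and review them before running it on everything.

```sas
/* Toy stand-in for raw_data */
data raw_data;
  length variable $20;
  variable = "AB-001"; output;
  variable = "AB002";  output;
  variable = "ab 003"; output;
run;

data check;
  set raw_data(obs=100);                    /* work on a small subset first       */
  cleaned = compress(variable, , 'kad');    /* keep letters and digits only       */
  changed = (cleaned ne variable);          /* 1 when cleaning altered the value  */
run;

/* How many values were altered? */
proc freq data=check;
  tables changed;
run;

/* Inspect the altered values side by side */
proc print data=check(where=(changed = 1));
run;
```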
1. Use Explicit Character Removal for Clinical Data
- When working with clinical trial data, avoid removing all blanks (COMPRESS(var)) unless necessary, as spaces may be important in patient names or drug descriptions.
Instead, specify exactly what to remove:
new_var = COMPRESS(var, '-'); /* Removes only dashes; letters, digits, and spaces are kept */
2. Preserve Spaces When Required
If spaces are essential, avoid removing them with COMPRESS(var), as it removes all spaces by default.
Use:
new_var = COMPRESS(var, , 'kads'); /* Keeps letters, digits, and space characters; removes everything else */
3. Handling Special Characters in Clinical Data
- In Adverse Event (AE) or Medical History (MH) datasets, some fields may contain special characters (e.g., “Hypertension – Chronic”).
If only specific symbols need to be removed, specify them:
new_var = COMPRESS(var, '.,()[]'); /* Removes only the listed punctuation characters; spaces are kept */
4. Use Modifiers for Specific Character Types
Keep Only Digits (Useful for Patient IDs, Visit Numbers, etc.):
numeric_var = COMPRESS(var, , 'kd'); /* Keeps only digits */
Keep Only Letters (Useful for Drug Names, Diagnoses, etc.):
alpha_var = COMPRESS(var, , 'ka'); /* Keeps only letters */
5. Be Cautious with Default COMPRESS in Large Datasets
- Using COMPRESS(var) on large clinical datasets can remove unintended spaces and change the meaning of text fields.
Example:
data test;
  var = "Drug A - High Dose";
  compressed_var = COMPRESS(var);
run;
Output: "DrugA-HighDose" → Not ideal for clarity in clinical reports.
Instead, specify exactly what to remove:
compressed_var = COMPRESS(var, '-'); /* Removes only the dash and keeps spaces, giving "Drug A  High Dose"; COMPBL can collapse the doubled blank */
6. Use COMPRESS for Data Cleaning Before Merging
Clinical trial data often comes from multiple sources with extra spaces or special characters. Use COMPRESS before merging:
clean_id = COMPRESS(patient_id, ' ');
7. Avoid Overuse in Patient Data
- Patient names and medical descriptions often contain spaces or special characters. Overuse of COMPRESS can cause loss of information.
Example:
name = "John D. O'Brien";
clean_name = COMPRESS(name, , 'ka'); /* Removes everything except letters */
Output: "JohnDOBrien" (the spaces, the period, and the apostrophe are lost).
Common Use Cases of COMPRESS in Clinical SAS
Scenario | Code |
---|---|
Remove all spaces (not recommended for names) | COMPRESS(var) |
Keep only digits (e.g., Patient ID, Visit No.) | COMPRESS(var, , 'kd') |
Keep only letters (e.g., Drug Names) | COMPRESS(var, , 'ka') |
Remove specific characters (e.g., dashes, parentheses) | COMPRESS(var, '-()') |
Keep spaces while removing unwanted symbols | COMPRESS(var, '-()[]') |
Key Takeaways
✔ Always specify the characters to remove instead of using COMPRESS(var).
✔ Use modifiers (kd, ka, kw) for better control over text cleansing.
✔ Be cautious when removing spaces—especially in names, drug descriptions, and adverse events.
✔ Before using COMPRESS, check the original data to avoid accidental data loss.
1. SAS Documentation – COMPRESS Function: The COMPRESS function in SAS is a character function used to remove specific characters from a string. It takes three arguments: the source string, optional characters to remove, and an optional modifier. The function is commonly used for data cleaning, eliminating spaces, punctuation, or numeric values from text fields in datasets. The use of modifiers like 'd' for digits, 'p' for punctuation, and 's' for spaces makes it a flexible tool for text manipulation in clinical and other SAS applications.
2. Best Practices in Clinical Data Management: Effective clinical data management ensures the accuracy, consistency, and reliability of data used in clinical trials. Best practices include thorough data cleaning, validation checks, and compliance with regulatory standards such as CDISC and FDA guidelines. Implementing automated quality control procedures, maintaining proper documentation, and using standardized data formats can enhance data integrity. Leveraging SAS functions like COMPRESS, TRANWRD, and STRIP helps eliminate inconsistencies and improve dataset usability, ultimately leading to more reliable clinical trial outcomes.
3. Regulatory Guidelines for Clinical SAS Programming: Clinical SAS programming must adhere to strict regulatory guidelines to ensure data integrity, patient safety, and compliance with industry standards. Key regulatory bodies include the FDA (Food and Drug Administration), EMA (European Medicines Agency), and ICH (International Council for Harmonisation). Standards from CDISC (Clinical Data Interchange Standards Consortium), such as SDTM (Study Data Tabulation Model), are widely used to structure clinical trial data. Compliance with Good Clinical Practice (GCP) and data submission requirements ensures that clinical trial results are valid, reproducible, and acceptable for regulatory review.
4. Case Studies on Data Cleaning in Clinical Trials: Effective data cleaning is critical in clinical trials to ensure accurate results and regulatory compliance. Case studies highlight common challenges such as missing data, duplicate records, and inconsistencies across sites. By leveraging SAS functions like COMPRESS, SCAN, and TRANWRD, clinical programmers can standardize data formats, remove anomalies, and enhance dataset quality. Successful case studies demonstrate how robust data cleaning strategies lead to improved data integrity, reduced processing time, and more reliable clinical trial outcomes, ultimately supporting regulatory approvals and patient safety.
Conclusion
The COMPRESS function is an essential tool in Clinical SAS programming for cleaning and standardizing textual data. Whether removing extraneous spaces, eliminating special characters, or extracting specific types of information, COMPRESS enhances data quality and improves analysis accuracy.
By incorporating best practices and leveraging its full potential, Clinical SAS programmers can streamline data processing, ensure regulatory compliance, and generate high-quality reports essential for clinical trials and research studies.
Would you like to explore more SAS functions? Stay tuned for our next post on SAS TRANWRD Function in Clinical SAS.
FAQ
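Q. What does the COMPRESS function do?
A. The COMPRESS function removes specified characters (including spaces) from a string.
Q. What does COMPRESS remove by default?
A. By default, COMPRESS removes all spaces from a string.
Q. How do I remove specific characters?
A. Use COMPRESS(string, 'chars_to_remove').
Q. Can I remove several different characters at once?
A. Yes! Just list them inside quotes.
Example:
compress("Data Science!", "a!")
/* Output: "Dt Science" */
Q. How do I remove or keep whole types of characters?
A. Use the modifiers:
- 'd' → Removes digits
- 'k' → Keeps specified characters
- 'a' → Removes letters
- 'p' → Removes punctuation
Example:
compress("SAS123!!", , 'd') /* Output: "SAS!!" */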