A Practitioner's Deep Dive with Worked Examples in SAS
GDPR has been in force since 25 May 2018. Most statistical programmers I work with still treat "anonymized" as a synonym for "I dropped the USUBJID column." That is not what the regulation says, and the gap matters: a dataset that is pseudonymized under Article 4(5) is still personal data, still inside the scope of GDPR, and still subject to breach notification, DPIA, and data-transfer rules. A dataset that is genuinely anonymous under Recital 26 falls outside the regulation entirely.
This article is for programmers who work on SDTM, ADaM, raw EDC extracts, and CSR appendices that include EU/EEA subjects. Each technique below gets a worked example with an input table, an output table, and a SAS snippet you can drop into a validation macro. The structure follows the three categories that cover almost everything pharma teams use in practice: core statistical methods, pseudonymization techniques, and the clinical-trial specific rules that EMA and Health Canada have layered on top of GDPR.
Four provisions do almost all of the work for statistical programmers.
Article 4(1) defines personal data as any information relating to an identified or identifiable natural person. The test is whether a person can be identified, directly or indirectly, by reference to an identifier such as a name, an ID number, location data, an online identifier, or factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that person.
Article 4(5) defines pseudonymization as processing personal data so it can no longer be attributed to a specific subject without additional information, provided that additional information is kept separately and subject to technical and organizational measures. Pseudonymized data is explicitly still personal data under Article 11 and Recital 26.
Recital 26 draws the anonymization line. Data is anonymous when the subject is no longer identifiable, accounting for "all the means reasonably likely to be used" by the controller or any other person. Only truly anonymous data is outside GDPR.
Article 89 permits processing for scientific research, statistical purposes, and public-interest archiving, provided appropriate safeguards are in place. The article explicitly names pseudonymization as a safeguard and allows further processing beyond the original purpose under that framework. Most clinical research activities rely on this article.
The practical takeaway: hashing USUBJID does not move your dataset out of GDPR. Only the full de-identification pipeline, tested against a motivated intruder, can do that.
| Attribute | Pseudonymization | Anonymization |
| GDPR scope | Still personal data | Out of scope (Recital 26) |
| Re-linkage possible | Yes, with the additional information | No, by construction |
| Key/mapping storage | Separate, controlled, documented | Destroyed or never created |
| Typical use | Internal analytics, sharing with processors, cross-dataset linking | External publication, EMA Policy 0070 packages, academic data requests |
| Technical safeguards | Access control, encryption in transit and at rest | Risk model against a motivated intruder, residual risk below threshold |
| Article basis | Art. 4(5), 25, 32, 89 | Recital 26 |
| Breach-notification rules apply? | Yes | No |
Source: GDPR text; ICO guidance on anonymisation, 2021.
A hash of USUBJID is not anonymization. It is pseudonymization at best, and for small studies it is weak pseudonymization. Consider a phase 2 trial with 280 subjects and a well-known protocol number. An attacker who knows the protocol can guess that USUBJIDs follow the pattern STUDYX-<site>-<seq>, enumerate every plausible value, hash each one, and match against your "anonymized" file. Hashing without a long secret salt collapses in minutes.
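That enumeration attack is short enough to sketch. The site and sequence ranges below are hypothetical, and `released.dm` stands in for the file under review; the point is how little the attacker needs when the hash is unsalted.

```sas
/* Hypothetical attacker sketch: enumerate plausible USUBJIDs for a
   known protocol pattern, hash each candidate, and match against the
   released file. Assumes the release used an unsalted SHA-256. */
data candidates;
  length usubjid $16 usubjid_h $64;
  do site = 1 to 50;                       /* plausible site range   */
    do seq = 1 to 999;                     /* plausible subject seq  */
      usubjid   = cats('STUDYX-', put(site, z3.), '-', put(seq, z5.));
      usubjid_h = sha256hex(usubjid);
      output;
    end;
  end;
  keep usubjid usubjid_h;
run;

proc sql;
  /* Every row returned here is a re-identified subject */
  create table reidentified as
  select c.usubjid, r.*
  from candidates c
  inner join released.dm r
    on c.usubjid_h = r.usubjid_h;
quit;
```

Fifty thousand candidate hashes is seconds of compute; a long secret salt makes the same enumeration infeasible because the attacker cannot reproduce the digests.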
Anonymization is a property of the released dataset as a whole, not of any one column. You test it by asking: given the quasi-identifiers in the file, the auxiliary data an attacker can plausibly get (ClinicalTrials.gov enrollment counts, social media posts, local press releases about trial sites), and the rare events in the sensitive columns, can a motivated intruder re-identify anyone with probability above an accepted threshold?
EMA Policy 0070 and Health Canada PRCI both settle on a residual re-identification risk ceiling of 0.09 (9%) for clinical study reports and patient-level datasets released under their transparency programs. The number traces back to cell-suppression literature and HIPAA Expert Determination, not to GDPR itself, but European data protection authorities have accepted it as the working benchmark for research disclosures.
Generalization and suppression are the building blocks. k-anonymity, l-diversity, and t-closeness are metrics that tell you when you have generalized and suppressed enough.
Replace a specific value with a broader category. Define the generalization hierarchy before you run the de-identification pass. For AGE, 5-year or 10-year bins are the pharma default. For dates, year-only works for demographic variables, and day-offset from a subject reference start works for event variables.
| SDTM variable | Input value | Output value | Hierarchy rule |
| AGE | 53 | 50 to 54 | 5-year bins; 90+ collapsed |
| BRTHDTC | 1971-04-18 | 1971 | Year only |
| COUNTRY | Malta | Southern Europe | EU subregion mapping |
| SITEID | 7421 | EU-West | Regional recode; sites with less than 3 subjects collapsed |
| RACE | Native Hawaiian | Other | Categories with cell count less than 5 collapsed |
| ETHNIC | Hispanic or Latino | Hispanic or Latino | Kept |
Table 2. Example generalization hierarchy for the DM domain.
SAS implementation:
data dm_gen;
  set raw.dm;
  length age_grp $7;
  brthyr = year(input(brthdtc, yymmdd10.));
  select;
    when (missing(age)) age_grp = ' ';
    when (age < 18) age_grp = '< 18';
    when (age < 25) age_grp = '18-24';
    when (age < 35) age_grp = '25-34';
    when (age < 45) age_grp = '35-44';
    when (age < 55) age_grp = '45-54';
    when (age < 65) age_grp = '55-64';
    when (age < 75) age_grp = '65-74';
    otherwise age_grp = '75+';
  end;
  /* AGE is retained here so the suppression pass can inspect it;
     drop it before release */
  drop brthdtc;
run;
Remove values or records that stand out. Two flavors: cell suppression blanks an individual value, record suppression drops an entire row. Cell suppression targets outliers (a 97-year-old in a trial where the next oldest is 78 can be identified from AGE alone even after generalization). Record suppression handles cases that cannot be protected by further generalization, typically rare-disease cohorts or single-subject sites.
| Variable | Trigger | Action |
| AGE | Value at or above 90 | Collapse to "90+" category |
| HEIGHT | Outside p1 to p99 of trial distribution | Cell-suppress |
| WEIGHT | Outside p1 to p99 of trial distribution | Cell-suppress |
| RACE | Category with less than 5 records | Recode to "Other" |
| SITEID | Less than 3 subjects at site | Record-suppress or recode to region |
| AEDECOD | MedDRA Preferred Term with less than 3 subjects | Roll up to SOC |
| MHDECOD | MedDRA PT with less than 3 subjects | Roll up to SOC |
| COUNTRY | Less than 5 subjects | Recode to region |
SAS implementation:
/* Cell-suppress extreme ages; assumes AGE was carried through dm_gen */
data dm_sup;
  set dm_gen;
  if age_grp = '75+' and age >= 90 then age_grp = '90+';
  drop age;   /* AGE no longer needed after suppression */
run;
/* Record-suppress sites with fewer than 3 subjects */
proc sql;
create table site_counts as
select siteid, count(distinct usubjid) as n_subj
from dm_gen
group by siteid;
quit;
proc sql;
create table dm_sup2 as
select d.*
from dm_sup d
inner join site_counts s
on d.siteid = s.siteid
where s.n_subj >= 3;
quit;
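The MedDRA roll-up rows in the trigger table can be handled the same way. A sketch, assuming the working AE dataset is `raw.ae` and that it carries the SOC in AEBODSYS per standard SDTM:

```sas
/* Roll rare MedDRA Preferred Terms up to System Organ Class.
   PTs carried by fewer than 3 subjects are replaced by their SOC. */
proc sql;
  create table pt_counts as
  select aedecod, count(distinct usubjid) as n_subj
  from raw.ae
  group by aedecod;

  create table ae_rolled as
  select a.usubjid,
         case when p.n_subj < 3 then a.aebodsys
              else a.aedecod end as aedecod_anon,
         a.aebodsys
  from raw.ae a
  inner join pt_counts p
    on a.aedecod = p.aedecod;
quit;
```

Note the original AEDECOD is deliberately left out of the select list so the rare term cannot leak alongside its roll-up.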
A dataset is k-anonymous when every combination of quasi-identifier values appears at least k times. If an attacker knows the quasi-identifiers for one target (age, sex, country), that target cannot be distinguished from at least k minus 1 other subjects.
Input (quasi-identifiers: AGE, SEX, COUNTRY; sensitive attribute: MHTERM)
| USUBJID | AGE | SEX | COUNTRY | MHTERM |
| 101 | 28 | M | Ireland | HTN |
| 102 | 29 | M | Ireland | DM |
| 103 | 52 | F | Germany | HTN |
| 104 | 53 | F | Germany | DM |
| 105 | 71 | M | Portugal | HTN |
| 106 | 72 | M | Portugal | DM |
After generalization (AGE to 10-year bin, COUNTRY to EU subregion):
| USUBJID | AGE | SEX | REGION | MHTERM |
| 101 | 20-29 | M | Western EU | HTN |
| 102 | 20-29 | M | Western EU | DM |
| 103 | 50-59 | F | Western EU | HTN |
| 104 | 50-59 | F | Western EU | DM |
| 105 | 70-79 | M | Southern EU | HTN |
| 106 | 70-79 | M | Southern EU | DM |
Three equivalence classes of size 2 each. The release is 2-anonymous. Typical thresholds: k = 5 for internal sharing, k = 11 for external publication (the minimum-cell-size convention from HIPAA Expert Determination practice), and k = 3 as a floor for EMA Policy 0070 packages when combined with other safeguards.
SAS check macro:
proc sql;
  /* Assumes COUNTRY has already been recoded to REGION per the
     generalization hierarchy in Table 2 */
  create table qid_counts as
  select age_grp, sex, region, count(*) as k_val
  from dm_gen
  group by age_grp, sex, region;
quit;
proc sql;
create table k_report as
select min(k_val) as k_min,
mean(k_val) as k_mean,
sum(case when k_val < 3 then 1 else 0 end) as n_unsafe_classes
from qid_counts;
quit;
title 'k-anonymity report';
proc print data=k_report noobs; run;
k-anonymity alone does not stop attribute disclosure. If every subject in a 2-anonymous equivalence class has the same sensitive value, an attacker who knows the quasi-identifiers learns the sensitive value without re-identifying anyone by name. l-diversity requires each class to contain at least l distinct sensitive values.
Distinct l-diversity counts unique values. Entropy l-diversity requires the Shannon entropy of the group's sensitive-value distribution to exceed log(l). Entropy is stronger because it rejects heavily skewed groups.
| Equivalence class | Sensitive values present | distinct l |
| 20-29 / M / Western EU | HTN, DM | 2 |
| 50-59 / F / Western EU | HTN, DM | 2 |
| 70-79 / M / Southern EU | HTN, DM | 2 |
Each class has l = 2, so the output is 2-anonymous and 2-diverse.
SAS check:
proc sql;
create table l_check as
select age_grp, sex, region,
count(*) as k_val,
count(distinct mhterm) as l_val
from dm_anon
group by age_grp, sex, region
having calculated l_val < 2;
quit;
/* Any rows in l_check are classes that need further treatment */
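The entropy variant can be checked with the same equivalence-class counts. A sketch for l = 2, using the identity H = log(K) - (1/K)*sum(n*log n) so no summary function has to be nested inside another in PROC SQL:

```sas
/* Entropy l-diversity: per-class Shannon entropy of the sensitive
   value must exceed log(l). Classes returned here fail for l = 2. */
proc sql;
  create table cell_freq as
  select age_grp, sex, region, mhterm,
         count(*) as n_cell
  from dm_anon
  group by age_grp, sex, region, mhterm;

  create table entropy_fail as
  select age_grp, sex, region,
         sum(n_cell) as k_val,
         /* H = log(K) - (1/K) * sum(n * log n) */
         log(calculated k_val)
           - sum(n_cell * log(n_cell)) / calculated k_val as entropy
  from cell_freq
  group by age_grp, sex, region
  having calculated entropy < log(2);
quit;
```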
l-diverse groups can still leak information if the distribution of sensitive values inside the group differs sharply from the overall population distribution. If 10% of trial subjects have a positive tumor biomarker but a specific 60-69/F/rural equivalence class shows 50% positive, membership in that class tells the attacker the subject has elevated risk. t-closeness bounds the distance between each class distribution and the overall distribution below a threshold t.
The standard distance metric is Earth Mover's Distance (EMD) for numeric and ordinal sensitive variables, and a variation distance for nominal variables. SAS does not ship a t-closeness proc. You implement it as a macro that computes per-class distributions, compares each to the overall distribution, and flags classes above t. El Emam's book (listed in References) gives a worked implementation.
| | Positive | Negative | Distance to overall |
| Overall population | 10% | 90% | 0.00 (baseline) |
| Class 60-69/F/Rural | 50% | 50% | 0.40 (fails t = 0.20) |
| Class 40-49/M/Urban | 12% | 88% | 0.02 (passes) |
A group that deviates from the overall distribution by more than t fails t-closeness and needs additional generalization or suppression.
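A minimal variation-distance check can be written directly in PROC SQL. This sketch assumes a character biomarker flag named BMFLAG in dm_anon (a hypothetical variable, standing in for the example above) and omits one refinement a production macro needs: sensitive categories entirely absent from a class still contribute |0 - p_overall| to the distance, and the join below does not count them.

```sas
/* t-closeness via total variation distance for a nominal sensitive
   variable: 0.5 * sum over categories of |p_class - p_overall|. */
proc sql;
  create table n_all as
  select count(*) as n_tot from dm_anon;

  /* Overall distribution of the sensitive variable */
  create table overall as
  select a.bmflag, count(*) / min(t.n_tot) as p_all
  from dm_anon a, n_all t
  group by a.bmflag;

  /* Per-class cell counts and class totals */
  create table byclass as
  select age_grp, sex, region, bmflag, count(*) as n_cell
  from dm_anon
  group by age_grp, sex, region, bmflag;

  create table class_tot as
  select age_grp, sex, region, sum(n_cell) as k_val
  from byclass
  group by age_grp, sex, region;

  /* Classes returned here exceed t = 0.20 and need more treatment */
  create table t_fail as
  select b.age_grp, b.sex, b.region,
         0.5 * sum(abs(b.n_cell / c.k_val - o.p_all)) as tvd
  from byclass b, class_tot c, overall o
  where b.age_grp = c.age_grp and b.sex = c.sex
    and b.region = c.region and b.bmflag = o.bmflag
  group by b.age_grp, b.sex, b.region
  having calculated tvd > 0.20;
quit;
```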
| Metric | Typical internal floor | External publication floor | Reference |
| k | 5 | 11 | HIPAA Expert Determination (accepted by EU DPAs) |
| l | 2 | 3 | Machanavajjhala et al., 2007 |
| t | 0.35 | 0.20 | Li et al., 2007 |
| Residual re-id risk | 0.20 | 0.09 | EMA Policy 0070 and Health Canada PRCI |
Pseudonymization does not take your data out of GDPR scope, but Article 32 explicitly lists it as an appropriate technical measure, and Article 89 treats it as a valid safeguard for research processing. The three primitives below cover almost every pharma use case: hashing for non-reversible linking, tokenization for reversible linking via a vault, and encryption for protection at rest or in transit.
A one-way function that maps an identifier to a fixed-length digest. Three rules apply: use a cryptographic hash (SHA-256, not MD5 or a checksum); always mix in a long secret salt, because an enumerable ID space collapses an unsalted hash, as in the phase 2 example above; and keep the salt in a vault, separate from the data and never shipped with a release.
| USUBJID input | Salt (stored in vault, never shipped) | Salted SHA-256 digest (truncated) |
| STUDYX-001-00145 | 9f3a2c4d6e8b1f0a | 8b4c2e9f...a703d218 |
| STUDYX-001-00146 | 9f3a2c4d6e8b1f0a | e1d8b3f4...2119c6a8 |
| STUDYX-001-00147 | 9f3a2c4d6e8b1f0a | 4a76c0e2...b8955d31 |
SAS implementation using the sha256hex function. Prefixing a secret salt approximates a keyed hash but is not a true HMAC; on SAS 9.4M6 or later, hashing_hmac('SHA256', usubjid, "&SALT") gives the real construction.
/* Salt is read from an environment variable set outside SAS,
   never hard-coded in the program */
%let SALT = %sysget(GDPR_PSEUDO_SALT);
data dm_ps;
  set raw.dm;
  length usubjid_h $64;
  usubjid_h = sha256hex(cats("&SALT", usubjid));
  drop usubjid;
run;
/* Verify that the mapping is deterministic and distinct */
proc sql;
select count(distinct usubjid_h) as n_unique
from dm_ps;
quit;
Replace an identifier with an unrelated random token and keep the mapping in a separate relational table under tighter access control than the data warehouse. Tokenization is usually better than hashing when you need to re-identify subjects later (pharmacovigilance follow-up, long-term safety extensions). With hashing, if you lose the salt you lose the mapping; with tokenization, the vault is the source of truth and the data can be reissued.
| Original USUBJID | Token |
| STUDYX-001-00145 | SX-TKN-A81F92 |
| STUDYX-001-00146 | SX-TKN-7C30D4 |
| STUDYX-001-00147 | SX-TKN-B19E55 |
| STUDYX-001-00148 | SX-TKN-F4A071 |
SAS implementation:
/* One-time token generation; fixed seed so QC can regenerate */
data token_map;
  set raw.dm(keep=usubjid);
  length token $13;
  if _n_ = 1 then call streaminit(20260411);
  token = cats("SX-TKN-", put(rand("integer", 1, 16777215), hex6.));
run;
/* Random tokens can collide: verify uniqueness before use */
proc sql;
  select count(*) as n_dup
  from (select token from token_map
        group by token
        having count(*) > 1);
quit;
/* Apply tokens and drop original identifier */
proc sql;
create table dm_tok as
select t.token as usubjid_tok,
d.age, d.sex, d.country, d.brthyr
from raw.dm d
inner join token_map t
on d.usubjid = t.usubjid;
quit;
/* The token_map dataset is moved to the vault and removed from project lib */
Reversible, keyed, and auditable. Use this when you need to pseudonymize at rest but recover the original value in a controlled environment. AES-256 in GCM mode is the current default. ECB mode is never appropriate for structured data because identical plaintexts give identical ciphertexts, which defeats the purpose. Format-preserving encryption (FF3-1) is useful when a downstream system requires the pseudonymized value to keep the same format as the original, for example when a legacy database column is fixed-width.
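SAS has no field-level AES-GCM primitive in the DATA step, but it does support AES encryption of whole datasets at rest (SAS 9.4+), which is the usual way to protect the token vault and key files inside a SAS environment; true field-level AES-GCM is done outside SAS. The `vault` libref and environment variable below are assumptions.

```sas
/* AES dataset encryption at rest. The key comes from the
   environment, never from the program source. */
%let AESKEY = %sysget(GDPR_AES_KEY);

data vault.token_map(encrypt=aes encryptkey="&AESKEY");
  set token_map;   /* the USUBJID-to-token mapping from the vault step */
run;

/* Reading it back requires the same key */
data work.check;
  set vault.token_map(encryptkey="&AESKEY");
run;
```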
| Technique | Reversible? | Secret needed | Best fit |
| Salted HMAC (SHA-256) | No | Long secret salt, vault-stored | Cross-dataset linking without re-identification; stable IDs across submissions |
| Tokenization | Yes, via vault | Lookup table | Subject IDs that need to round-trip (PV, LTE) |
| AES-256-GCM | Yes, with key | Symmetric key | Protecting identifiers at rest and in transit |
| FF3-1 (format-preserving) | Yes, with key | Symmetric key and tweak | Legacy systems needing the original field layout |
GDPR does not mention clinical trials directly. EMA Policy 0070 and Health Canada PRCI turn the anonymization provisions into concrete requirements for regulatory submissions, and most EU sponsors treat those two policies as the operational ceiling for GDPR compliance on shared clinical data.
Applies to clinical study reports, protocols, and (originally, for Phase 1 of the policy) patient-level datasets submitted to EMA. The external guidance document (EMA/90915/2016) asks sponsors to submit an anonymisation report that demonstrates residual re-identification risk is below a defensible threshold. 0.09 is the accepted ceiling. The sponsor must name the quasi-identifiers, the generalization hierarchies, the k values achieved, the distributional controls applied, and the motivated-intruder scenario used for risk estimation.
| Step | Artifact | Responsible |
| 1. Classify variables | Variable inventory (direct / quasi / sensitive / low) | Stat programming + biostatistics |
| 2. Pick attacker model | Risk model document (prosecutor, journalist, marketer) | Biostatistics + DPO |
| 3. Build hierarchies | Generalization tables per variable | Stat programming |
| 4. Apply transformations | Anonymized ADaM/SDTM | Stat programming |
| 5. Measure risk | k/l/t report, residual risk calculation | Stat programming + QC |
| 6. Iterate | Updated dataset, diff report | Stat programming |
| 7. Document | Anonymisation report (EMA/90915/2016 format) | Regulatory + DPO |
| 8. Submit | Anonymized package to EMA transparency portal | Regulatory |
Public Release of Clinical Information. Similar to Policy 0070 but applies to drug and device submissions filed with Health Canada under the Food and Drug Regulations. Same 0.09 ceiling. Health Canada also asks for an explicit statement of the motivated-intruder scenario and for the sponsor to name the public auxiliary sources considered in the risk model (ClinicalTrials.gov, EUCTR, peer-reviewed literature, press releases).
| Domain | Variable | Classification | Action |
| DM | USUBJID | Direct | Replace with study-specific random token |
| DM | SUBJID | Direct | Drop or replace with token |
| DM | BRTHDTC | Quasi | Year only; if age at baseline is 90+ truncate year |
| DM | AGE | Quasi | 5-year bin; collapse to "90+" at or above 90 |
| DM | SEX | Quasi | Keep |
| DM | RACE | Quasi | Keep if cell count is at least 5; else recode to Other |
| DM | ETHNIC | Quasi | Keep or collapse by region |
| DM | COUNTRY | Quasi | Keep if cell counts support k; else EU region |
| DM | SITEID | Quasi | Recode to region; drop sites with less than 3 subjects |
| EX | EXSTDTC | Quasi | Day offset from RFSTDTC |
| EX | EXDOSE | Low risk | Keep |
| AE | AEDECOD | Sensitive | Keep at PT; collapse rare PTs to SOC |
| AE | AESTDTC | Quasi | Day offset from RFSTDTC |
| AE | AETERM | Direct (free text) | Drop; keep AEDECOD only |
| MH | MHDECOD | Sensitive | Keep at PT; collapse rare to SOC |
| LB | LBORRES | Low risk | Keep |
| LB | LBDTC | Quasi | Day offset from RFSTDTC |
| VS | HEIGHT | Quasi | Round to nearest cm; cell-suppress outliers |
| VS | WEIGHT | Quasi | Round to nearest kg; cell-suppress outliers |
| CO | COVAL | Direct (free text) | Drop |
Sponsors usually prefer day offsets from a subject reference start instead of year-only dates, because downstream analyses (time-to-event, exposure-adjusted rates, Kaplan-Meier) need relative timing. The per-subject reference date is destroyed in the released dataset. For small populations, the per-subject offset can itself be a quasi-identifier (if only one subject enrolled on a given Monday at a given site), so a small random jitter is common.
SAS implementation of deterministic date shifting:
/* Assumes EX was pseudonymized with the same salted hash as DM, so
   both inputs carry usubjid_h and are sorted by it; ex_ps is that
   pseudonymized EX dataset */
data ex_anon;
  merge dm_ps(in=a keep=usubjid_h rfstdtc)
        ex_ps(in=b keep=usubjid_h exstdtc exdose);
  by usubjid_h;
  if a and b;
  /* Day offset from subject reference start */
  exstday = input(exstdtc, yymmdd10.) - input(rfstdtc, yymmdd10.);
  /* Optional jitter in -3..+3, derived deterministically from the
     subject hash so it is stable across releases. (Re-seeding with
     CALL STREAMINIT per subject does not work: STREAMINIT only takes
     effect once per DATA step.) */
  exstday = exstday + mod(input(substr(usubjid_h, 1, 8), hex8.), 7) - 3;
  drop exstdtc rfstdtc;
run;
Policy 0070 and PRCI want a number in the anonymisation report. The standard calculation treats each equivalence class as an attacker's best guess and computes the expected probability of correct re-identification across the file.
For a prosecutor attacker (knows the target is in the file), the risk per class is 1 / F_i, where F_i is the class size in the released dataset. The file-level figure in the worked example below sums those per-class risks and divides by the number of subjects; for public releases the maximum per-class risk, 1 over the smallest F_i, is usually the binding number.
For a journalist attacker (does not know whether the target is in the file), multiply the prosecutor risk by the sampling fraction from the trial population to the plausible external population.
| Class | Quasi-identifier combo | Size F_i | 1 / F_i |
| C1 | 20-29 / M / Western EU | 4 | 0.250 |
| C2 | 30-39 / M / Western EU | 6 | 0.167 |
| C3 | 40-49 / M / Western EU | 11 | 0.091 |
| C4 | 50-59 / F / Western EU | 8 | 0.125 |
| C5 | 60-69 / F / Western EU | 5 | 0.200 |
| Total | N = 34; sum(1/F_i) = 0.833 | | |
| Prosecutor risk | 0.833 / 34 = 0.025 (2.5%) | | |
Under Policy 0070's 0.09 ceiling this release passes on the average measure. Add one new class of size 2, though, and its per-class risk of 1/2 = 0.50 sits far above the ceiling even if the file average barely moves; that class needs more generalization before submission.
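The worked example translates directly into a report step. This sketch reuses the qid_counts table built in the k-anonymity check; the sampling fraction for the journalist adjustment is a placeholder you must justify in the anonymisation report.

```sas
/* Residual-risk summary from equivalence-class counts. */
%let SAMP_FRAC = 0.4;   /* hypothetical sampling fraction */

proc sql;
  create table risk_report as
  select sum(k_val)                        as n_total,
         count(*)                          as n_classes,
         max(1 / k_val)                    as max_risk,
         /* average measure used in the worked example:
            sum of per-class risks over file size */
         sum(1 / k_val) / sum(k_val)       as avg_risk,
         calculated avg_risk * &SAMP_FRAC  as journalist_risk
  from qid_counts;
quit;

title 'Residual re-identification risk vs the 0.09 ceiling';
proc print data=risk_report noobs; run;
```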
Hashing alone is not anonymization. Worth saying twice, because most teams still make this mistake on the first review pass.
Free-text fields destroy your k. AETERM, CMTRT verbatim, COVAL, and any "Other, specify" field all carry identifying information that collapses equivalence classes. Drop them or replace with the coded term before you compute metrics.
Rare events. One subject with a rare MedDRA PT cannot be protected by generalization on quasi-identifiers alone. Roll the PT up to SOC or suppress the record.
Open-source auxiliary data. Clinical trial registries publish per-site enrollment counts. A dataset with SITEID and COUNTRY is linkable to the registry, then to local press releases about the site. Model the full auxiliary data pipeline, not just the dataset in your hand.
Token vault leakage. The vault is the weakest link in any tokenization scheme. Treat it as critical infrastructure with the same controls as a production database, including audit logging and regular access review.
Releasing date shifts without jitter. Deterministic day offsets are themselves quasi-identifiers for small populations; add a small random jitter seeded per subject so the offset stays stable across releases but is not a unique fingerprint.
Published by clinstandards.org. Written for statistical programmers; not legal advice. Consult your sponsor DPO for regulated submissions.
