A Practitioner's Deep Dive with Worked Examples in SAS
GDPR has been in force since 25 May 2018. Most statistical programmers I work with still treat "anonymized" as a synonym for "I dropped the USUBJID column." That is not what the regulation says, and the gap matters: a dataset that is pseudonymized under Article 4(5) is still personal data, still inside the scope of GDPR, and still subject to breach notification, DPIA, and data-transfer rules. A dataset that is genuinely anonymous under Recital 26 falls outside the regulation entirely.
This article is for programmers who work on SDTM, ADaM, raw EDC extracts, and CSR appendices that include EU/EEA subjects. Each technique below gets a worked example with an input table, an output table, and a SAS snippet you can drop into a validation macro. The structure follows the three categories that cover almost everything pharma teams use in practice: core statistical methods, pseudonymization techniques, and the clinical-trial specific rules that EMA and Health Canada have layered on top of GDPR.
Four provisions do almost all of the work for statistical programmers.
Article 4(1) defines personal data as any information relating to an identified or identifiable natural person. The test is whether a person can be identified, directly or indirectly, by reference to an identifier such as a name, an ID number, location data, an online identifier, or factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that person.
Article 4(5) defines pseudonymization as processing personal data so it can no longer be attributed to a specific subject without additional information, provided that additional information is kept separately and subject to technical and organizational measures. Pseudonymized data is explicitly still personal data under Article 11 and Recital 26.
Recital 26 draws the anonymization line. Data is anonymous when the subject is no longer identifiable, accounting for "all the means reasonably likely to be used" by the controller or any other person. Only truly anonymous data is outside GDPR.
Article 89 permits processing for scientific research, statistical purposes, and public-interest archiving, provided appropriate safeguards are in place. The article explicitly names pseudonymization as a safeguard and allows further processing beyond the original purpose under that framework. Most clinical research activities rely on this article.
The practical takeaway: hashing USUBJID does not move your dataset out of GDPR. Only the full de-identification pipeline, tested against a motivated intruder, can do that.
| Attribute | Pseudonymization | Anonymization |
| GDPR scope | Still personal data | Out of scope (Recital 26) |
| Re-linkage possible | Yes, with the additional information | No, by construction |
| Key/mapping storage | Separate, controlled, documented | Destroyed or never created |
| Typical use | Internal analytics, sharing with processors, cross-dataset linking | External publication, EMA Policy 0070 packages, academic data requests |
| Technical safeguards | Access control, encryption in transit and at rest | Risk model against a motivated intruder, residual risk below threshold |
| Article basis | Art. 4(5), 25, 32, 89 | Recital 26 |
| Breach-notification rules apply? | Yes | No |
Source: GDPR text; ICO guidance on anonymisation, 2021.
A hash of USUBJID is not anonymization. It is pseudonymization at best, and for small studies it is weak pseudonymization. Consider a phase 2 trial with 280 subjects and a well-known protocol number. An attacker who knows the protocol can guess that USUBJIDs follow the pattern STUDYX-<site>-<seq>, enumerate every plausible value, hash each one, and match against your "anonymized" file. Hashing without a long secret salt collapses in minutes.
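That enumeration attack is short enough to sketch. The site and sequence ranges below are hypothetical, and `released.dm` stands in for the file under review; the point is how little the attacker needs when the hash is unsalted.

```sas
/* Hypothetical attacker sketch: enumerate plausible USUBJIDs for a
   known protocol pattern, hash each candidate, and match against the
   released file. Assumes the release used an unsalted SHA-256. */
data candidates;
  length usubjid $16 usubjid_h $64;
  do site = 1 to 50;                       /* plausible site range   */
    do seq = 1 to 999;                     /* plausible subject seq  */
      usubjid   = cats('STUDYX-', put(site, z3.), '-', put(seq, z5.));
      usubjid_h = sha256hex(usubjid);
      output;
    end;
  end;
  keep usubjid usubjid_h;
run;

proc sql;
  /* Every row returned here is a re-identified subject */
  create table reidentified as
  select c.usubjid, r.*
  from candidates c
  inner join released.dm r
    on c.usubjid_h = r.usubjid_h;
quit;
```

Fifty thousand candidate hashes is seconds of compute; a long secret salt makes the same enumeration infeasible because the attacker cannot reproduce the digests.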
Anonymization is a property of the released dataset as a whole, not of any one column. You test it by asking: given the quasi-identifiers in the file, the auxiliary data an attacker can plausibly get (ClinicalTrials.gov enrollment counts, social media posts, local press releases about trial sites), and the rare events in the sensitive columns, can a motivated intruder re-identify anyone with probability above an accepted threshold?
EMA Policy 0070 and Health Canada PRCI both settle on a residual re-identification risk ceiling of 0.09 (9%) for clinical study reports and patient-level datasets released under their transparency programs. The number traces back to cell-suppression literature and HIPAA Expert Determination, not to GDPR itself, but European data protection authorities have accepted it as the working benchmark for research disclosures.
Generalization and suppression are the building blocks. k-anonymity, l-diversity, and t-closeness are metrics that tell you when you have generalized and suppressed enough.
Replace a specific value with a broader category. Define the generalization hierarchy before you run the de-identification pass. For AGE, 5-year or 10-year bins are the pharma default. For dates, year-only works for demographic variables, and day-offset from a subject reference start works for event variables.
| SDTM variable | Input value | Output value | Hierarchy rule |
| AGE | 53 | 50 to 54 | 5-year bins; 90+ collapsed |
| BRTHDTC | 1971-04-18 | 1971 | Year only |
| COUNTRY | Malta | Southern Europe | EU subregion mapping |
| SITEID | 7421 | EU-West | Regional recode; sites with less than 3 subjects collapsed |
| RACE | Native Hawaiian | Other | Categories with cell count less than 5 collapsed |
| ETHNIC | Hispanic or Latino | Hispanic or Latino | Kept |
Table 2. Example generalization hierarchy for the DM domain.
SAS implementation:
data dm_gen;
  set raw.dm;
  length age_grp $7;
  brthyr = year(input(brthdtc, yymmdd10.));
  select;
    when (missing(age)) age_grp = ' ';
    when (age < 18) age_grp = '< 18';
    when (age < 25) age_grp = '18-24';
    when (age < 35) age_grp = '25-34';
    when (age < 45) age_grp = '35-44';
    when (age < 55) age_grp = '45-54';
    when (age < 65) age_grp = '55-64';
    when (age < 75) age_grp = '65-74';
    otherwise age_grp = '75+';
  end;
  /* AGE is retained here so the suppression pass can inspect it;
     drop it before release */
  drop brthdtc;
run;
Remove values or records that stand out. Two flavors: cell suppression blanks an individual value, record suppression drops an entire row. Cell suppression targets outliers (a 97-year-old in a trial where the next oldest is 78 can be identified from AGE alone even after generalization). Record suppression handles cases that cannot be protected by further generalization, typically rare-disease cohorts or single-subject sites.
| Variable | Trigger | Action |
| AGE | Value at or above 90 | Collapse to "90+" category |
| HEIGHT | Outside p1 to p99 of trial distribution | Cell-suppress |
| WEIGHT | Outside p1 to p99 of trial distribution | Cell-suppress |
| RACE | Category with less than 5 records | Recode to "Other" |
| SITEID | Less than 3 subjects at site | Record-suppress or recode to region |
| AEDECOD | MedDRA Preferred Term with less than 3 subjects | Roll up to SOC |
| MHDECOD | MedDRA PT with less than 3 subjects | Roll up to SOC |
| COUNTRY | Less than 5 subjects | Recode to region |
SAS implementation:
/* Cell-suppress extreme ages; assumes AGE was carried through dm_gen */
data dm_sup;
  set dm_gen;
  if age_grp = '75+' and age >= 90 then age_grp = '90+';
  drop age;   /* AGE no longer needed after suppression */
run;
/* Record-suppress sites with fewer than 3 subjects */
proc sql;
create table site_counts as
select siteid, count(distinct usubjid) as n_subj
from dm_gen
group by siteid;
quit;
proc sql;
create table dm_sup2 as
select d.*
from dm_sup d
inner join site_counts s
on d.siteid = s.siteid
where s.n_subj >= 3;
quit;
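The MedDRA roll-up rows in the trigger table can be handled the same way. A sketch, assuming the working AE dataset is `raw.ae` and that it carries the SOC in AEBODSYS per standard SDTM:

```sas
/* Roll rare MedDRA Preferred Terms up to System Organ Class.
   PTs carried by fewer than 3 subjects are replaced by their SOC. */
proc sql;
  create table pt_counts as
  select aedecod, count(distinct usubjid) as n_subj
  from raw.ae
  group by aedecod;

  create table ae_rolled as
  select a.usubjid,
         case when p.n_subj < 3 then a.aebodsys
              else a.aedecod end as aedecod_anon,
         a.aebodsys
  from raw.ae a
  inner join pt_counts p
    on a.aedecod = p.aedecod;
quit;
```

Note the original AEDECOD is deliberately left out of the select list so the rare term cannot leak alongside its roll-up.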
A dataset is k-anonymous when every combination of quasi-identifier values appears at least k times. If an attacker knows the quasi-identifiers for one target (age, sex, country), that target cannot be distinguished from at least k minus 1 other subjects.
Input (quasi-identifiers: AGE, SEX, COUNTRY; sensitive attribute: MHTERM)
| USUBJID | AGE | SEX | COUNTRY | MHTERM |
| 101 | 28 | M | Ireland | HTN |
| 102 | 29 | M | Ireland | DM |
| 103 | 52 | F | Germany | HTN |
| 104 | 53 | F | Germany | DM |
| 105 | 71 | M | Portugal | HTN |
| 106 | 72 | M | Portugal | DM |
After generalization (AGE to 10-year bin, COUNTRY to EU subregion):
| USUBJID | AGE | SEX | REGION | MHTERM |
| 101 | 20-29 | M | Western EU | HTN |
| 102 | 20-29 | M | Western EU | DM |
| 103 | 50-59 | F | Western EU | HTN |
| 104 | 50-59 | F | Western EU | DM |
| 105 | 70-79 | M | Southern EU | HTN |
| 106 | 70-79 | M | Southern EU | DM |
Three equivalence classes of size 2 each. The release is 2-anonymous. Typical thresholds: k = 5 for internal sharing, k = 11 for external publication (the minimum-cell-size convention from HIPAA Expert Determination practice), and k = 3 as a floor for EMA Policy 0070 packages when combined with other safeguards.
SAS check macro:
proc sql;
  /* Assumes COUNTRY has already been recoded to REGION per the
     generalization hierarchy in Table 2 */
  create table qid_counts as
  select age_grp, sex, region, count(*) as k_val
  from dm_gen
  group by age_grp, sex, region;
quit;
proc sql;
create table k_report as
select min(k_val) as k_min,
mean(k_val) as k_mean,
sum(case when k_val < 3 then 1 else 0 end) as n_unsafe_classes
from qid_counts;
quit;
title 'k-anonymity report';
proc print data=k_report noobs; run;
k-anonymity alone does not stop attribute disclosure. If every subject in a 2-anonymous equivalence class has the same sensitive value, an attacker who knows the quasi-identifiers learns the sensitive value without re-identifying anyone by name. l-diversity requires each class to contain at least l distinct sensitive values.
Distinct l-diversity counts unique values. Entropy l-diversity requires the Shannon entropy of the group's sensitive-value distribution to exceed log(l). Entropy is stronger because it rejects heavily skewed groups.
| Equivalence class | Sensitive values present | distinct l |
| 20-29 / M / Western EU | HTN, DM | 2 |
| 50-59 / F / Western EU | HTN, DM | 2 |
| 70-79 / M / Southern EU | HTN, DM | 2 |
Each class has l = 2, so the output is 2-anonymous and 2-diverse.
SAS check:
proc sql;
create table l_check as
select age_grp, sex, region,
count(*) as k_val,
count(distinct mhterm) as l_val
from dm_anon
group by age_grp, sex, region
having calculated l_val < 2;
quit;
/* Any rows in l_check are classes that need further treatment */
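The entropy variant can be checked with the same equivalence-class counts. A sketch for l = 2, using the identity H = log(K) - (1/K)*sum(n*log n) so no summary function has to be nested inside another in PROC SQL:

```sas
/* Entropy l-diversity: per-class Shannon entropy of the sensitive
   value must exceed log(l). Classes returned here fail for l = 2. */
proc sql;
  create table cell_freq as
  select age_grp, sex, region, mhterm,
         count(*) as n_cell
  from dm_anon
  group by age_grp, sex, region, mhterm;

  create table entropy_fail as
  select age_grp, sex, region,
         sum(n_cell) as k_val,
         /* H = log(K) - (1/K) * sum(n * log n) */
         log(calculated k_val)
           - sum(n_cell * log(n_cell)) / calculated k_val as entropy
  from cell_freq
  group by age_grp, sex, region
  having calculated entropy < log(2);
quit;
```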
l-diverse groups can still leak information if the distribution of sensitive values inside the group differs sharply from the overall population distribution. If 10% of trial subjects have a positive tumor biomarker but a specific 60-69/F/rural equivalence class shows 50% positive, membership in that class tells the attacker the subject has elevated risk. t-closeness bounds the distance between each class distribution and the overall distribution below a threshold t.
The standard distance metric is Earth Mover's Distance (EMD) for numeric and ordinal sensitive variables, and a variation distance for nominal variables. SAS does not ship a t-closeness proc. You implement it as a macro that computes per-class distributions, compares each to the overall distribution, and flags classes above t. El Emam's book (listed in References) gives a worked implementation.
| | Positive | Negative | Distance to overall |
| Overall population | 10% | 90% | 0.00 (baseline) |
| Class 60-69/F/Rural | 50% | 50% | 0.40 (fails t = 0.20) |
| Class 40-49/M/Urban | 12% | 88% | 0.02 (passes) |
A group that deviates from the overall distribution by more than t fails t-closeness and needs additional generalization or suppression.
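A minimal variation-distance check can be written directly in PROC SQL. This sketch assumes a character biomarker flag named BMFLAG in dm_anon (a hypothetical variable, standing in for the example above) and omits one refinement a production macro needs: sensitive categories entirely absent from a class still contribute |0 - p_overall| to the distance, and the join below does not count them.

```sas
/* t-closeness via total variation distance for a nominal sensitive
   variable: 0.5 * sum over categories of |p_class - p_overall|. */
proc sql;
  create table n_all as
  select count(*) as n_tot from dm_anon;

  /* Overall distribution of the sensitive variable */
  create table overall as
  select a.bmflag, count(*) / min(t.n_tot) as p_all
  from dm_anon a, n_all t
  group by a.bmflag;

  /* Per-class cell counts and class totals */
  create table byclass as
  select age_grp, sex, region, bmflag, count(*) as n_cell
  from dm_anon
  group by age_grp, sex, region, bmflag;

  create table class_tot as
  select age_grp, sex, region, sum(n_cell) as k_val
  from byclass
  group by age_grp, sex, region;

  /* Classes returned here exceed t = 0.20 and need more treatment */
  create table t_fail as
  select b.age_grp, b.sex, b.region,
         0.5 * sum(abs(b.n_cell / c.k_val - o.p_all)) as tvd
  from byclass b, class_tot c, overall o
  where b.age_grp = c.age_grp and b.sex = c.sex
    and b.region = c.region and b.bmflag = o.bmflag
  group by b.age_grp, b.sex, b.region
  having calculated tvd > 0.20;
quit;
```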
| Metric | Typical internal floor | External publication floor | Reference |
| k | 5 | 11 | HIPAA Expert Determination (accepted by EU DPAs) |
| l | 2 | 3 | Machanavajjhala et al., 2007 |
| t | 0.35 | 0.20 | Li et al., 2007 |
| Residual re-id risk | 0.20 | 0.09 | EMA Policy 0070 and Health Canada PRCI |
Pseudonymization does not take your data out of GDPR scope, but Article 32 explicitly lists it as an appropriate technical measure, and Article 89 treats it as a valid safeguard for research processing. The three primitives below cover almost every pharma use case: hashing for non-reversible linking, tokenization for reversible linking via a vault, and encryption for protection at rest or in transit.
A one-way function that maps an identifier to a fixed-length digest. Three rules apply: use a cryptographic hash (SHA-256, not MD5 or a checksum); always mix in a long secret salt, because an enumerable ID space collapses an unsalted hash, as in the phase 2 example above; and keep the salt in a vault, separate from the data and never shipped with a release.
| USUBJID input | Salt (stored in vault, never shipped) | Salted SHA-256 digest (truncated) |
| STUDYX-001-00145 | 9f3a2c4d6e8b1f0a | 8b4c2e9f...a703d218 |
| STUDYX-001-00146 | 9f3a2c4d6e8b1f0a | e1d8b3f4...2119c6a8 |
| STUDYX-001-00147 | 9f3a2c4d6e8b1f0a | 4a76c0e2...b8955d31 |
SAS implementation using the sha256hex function. Prefixing a secret salt approximates a keyed hash but is not a true HMAC; on SAS 9.4M6 or later, hashing_hmac('SHA256', usubjid, "&SALT") gives the real construction.
/* Salt is read from an environment variable set outside SAS,
   never hard-coded in the program */
%let SALT = %sysget(GDPR_PSEUDO_SALT);
data dm_ps;
  set raw.dm;
  length usubjid_h $64;
  usubjid_h = sha256hex(cats("&SALT", usubjid));
  drop usubjid;
run;
/* Verify that the mapping is deterministic and distinct */
proc sql;
select count(distinct usubjid_h) as n_unique
from dm_ps;
quit;
Replace an identifier with an unrelated random token and keep the mapping in a separate relational table under tighter access control than the data warehouse. Tokenization is usually better than hashing when you need to re-identify subjects later (pharmacovigilance follow-up, long-term safety extensions). With hashing, if you lose the salt you lose the mapping; with tokenization, the vault is the source of truth and the data can be reissued.
| Original USUBJID | Token |
| STUDYX-001-00145 | SX-TKN-A81F92 |
| STUDYX-001-00146 | SX-TKN-7C30D4 |
| STUDYX-001-00147 | SX-TKN-B19E55 |
| STUDYX-001-00148 | SX-TKN-F4A071 |
SAS implementation:
/* One-time token generation; fixed seed so QC can regenerate */
data token_map;
  set raw.dm(keep=usubjid);
  length token $13;
  if _n_ = 1 then call streaminit(20260411);
  token = cats("SX-TKN-", put(rand("integer", 1, 16777215), hex6.));
run;
/* Random tokens can collide: verify uniqueness before use */
proc sql;
  select count(*) as n_dup
  from (select token from token_map
        group by token
        having count(*) > 1);
quit;
/* Apply tokens and drop original identifier */
proc sql;
create table dm_tok as
select t.token as usubjid_tok,
d.age, d.sex, d.country, d.brthyr
from raw.dm d
inner join token_map t
on d.usubjid = t.usubjid;
quit;
/* The token_map dataset is moved to the vault and removed from project lib */
Reversible, keyed, and auditable. Use this when you need to pseudonymize at rest but recover the original value in a controlled environment. AES-256 in GCM mode is the current default. ECB mode is never appropriate for structured data because identical plaintexts give identical ciphertexts, which defeats the purpose. Format-preserving encryption (FF3-1) is useful when a downstream system requires the pseudonymized value to keep the same format as the original, for example when a legacy database column is fixed-width.
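SAS has no field-level AES-GCM primitive in the DATA step, but it does support AES encryption of whole datasets at rest (SAS 9.4+), which is the usual way to protect the token vault and key files inside a SAS environment; true field-level AES-GCM is done outside SAS. The `vault` libref and environment variable below are assumptions.

```sas
/* AES dataset encryption at rest. The key comes from the
   environment, never from the program source. */
%let AESKEY = %sysget(GDPR_AES_KEY);

data vault.token_map(encrypt=aes encryptkey="&AESKEY");
  set token_map;   /* the USUBJID-to-token mapping from the vault step */
run;

/* Reading it back requires the same key */
data work.check;
  set vault.token_map(encryptkey="&AESKEY");
run;
```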
| Technique | Reversible? | Secret needed | Best fit |
| Salted HMAC (SHA-256) | No | Long secret salt, vault-stored | Cross-dataset linking without re-identification; stable IDs across submissions |
| Tokenization | Yes, via vault | Lookup table | Subject IDs that need to round-trip (PV, LTE) |
| AES-256-GCM | Yes, with key | Symmetric key | Protecting identifiers at rest and in transit |
| FF3-1 (format-preserving) | Yes, with key | Symmetric key and tweak | Legacy systems needing the original field layout |
GDPR does not mention clinical trials directly. EMA Policy 0070 and Health Canada PRCI turn the anonymization provisions into concrete requirements for regulatory submissions, and most EU sponsors treat those two policies as the operational ceiling for GDPR compliance on shared clinical data.
Applies to clinical study reports, protocols, and (originally, for Phase 1 of the policy) patient-level datasets submitted to EMA. The external guidance document (EMA/90915/2016) asks sponsors to submit an anonymisation report that demonstrates residual re-identification risk is below a defensible threshold. 0.09 is the accepted ceiling. The sponsor must name the quasi-identifiers, the generalization hierarchies, the k values achieved, the distributional controls applied, and the motivated-intruder scenario used for risk estimation.
| Step | Artifact | Responsible |
| 1. Classify variables | Variable inventory (direct / quasi / sensitive / low) | Stat programming + biostatistics |
| 2. Pick attacker model | Risk model document (prosecutor, journalist, marketer) | Biostatistics + DPO |
| 3. Build hierarchies | Generalization tables per variable | Stat programming |
| 4. Apply transformations | Anonymized ADaM/SDTM | Stat programming |
| 5. Measure risk | k/l/t report, residual risk calculation | Stat programming + QC |
| 6. Iterate | Updated dataset, diff report | Stat programming |
| 7. Document | Anonymisation report (EMA/90915/2016 format) | Regulatory + DPO |
| 8. Submit | Anonymized package to EMA transparency portal | Regulatory |
Public Release of Clinical Information. Similar to Policy 0070 but applies to drug and device submissions filed with Health Canada under the Food and Drug Regulations. Same 0.09 ceiling. Health Canada also asks for an explicit statement of the motivated-intruder scenario and for the sponsor to name the public auxiliary sources considered in the risk model (ClinicalTrials.gov, EUCTR, peer-reviewed literature, press releases).
| Domain | Variable | Classification | Action |
| DM | USUBJID | Direct | Replace with study-specific random token |
| DM | SUBJID | Direct | Drop or replace with token |
| DM | BRTHDTC | Quasi | Year only; if age at baseline is 90+ truncate year |
| DM | AGE | Quasi | 5-year bin; collapse to "90+" at or above 90 |
| DM | SEX | Quasi | Keep |
| DM | RACE | Quasi | Keep if cell count is at least 5; else recode to Other |
| DM | ETHNIC | Quasi | Keep or collapse by region |
| DM | COUNTRY | Quasi | Keep if cell counts support k; else EU region |
| DM | SITEID | Quasi | Recode to region; drop sites with less than 3 subjects |
| EX | EXSTDTC | Quasi | Day offset from RFSTDTC |
| EX | EXDOSE | Low risk | Keep |
| AE | AEDECOD | Sensitive | Keep at PT; collapse rare PTs to SOC |
| AE | AESTDTC | Quasi | Day offset from RFSTDTC |
| AE | AETERM | Direct (free text) | Drop; keep AEDECOD only |
| MH | MHDECOD | Sensitive | Keep at PT; collapse rare to SOC |
| LB | LBORRES | Low risk | Keep |
| LB | LBDTC | Quasi | Day offset from RFSTDTC |
| VS | HEIGHT | Quasi | Round to nearest cm; cell-suppress outliers |
| VS | WEIGHT | Quasi | Round to nearest kg; cell-suppress outliers |
| CO | COVAL | Direct (free text) | Drop |
Sponsors usually prefer day offsets from a subject reference start instead of year-only dates, because downstream analyses (time-to-event, exposure-adjusted rates, Kaplan-Meier) need relative timing. The per-subject reference date is destroyed in the released dataset. For small populations, the per-subject offset can itself be a quasi-identifier (if only one subject enrolled on a given Monday at a given site), so a small random jitter is common.
SAS implementation of deterministic date shifting:
/* Assumes EX was pseudonymized with the same salted hash as DM, so
   both inputs carry usubjid_h and are sorted by it; ex_ps is that
   pseudonymized EX dataset */
data ex_anon;
  merge dm_ps(in=a keep=usubjid_h rfstdtc)
        ex_ps(in=b keep=usubjid_h exstdtc exdose);
  by usubjid_h;
  if a and b;
  /* Day offset from subject reference start */
  exstday = input(exstdtc, yymmdd10.) - input(rfstdtc, yymmdd10.);
  /* Optional jitter in -3..+3, derived deterministically from the
     subject hash so it is stable across releases. (Re-seeding with
     CALL STREAMINIT per subject does not work: STREAMINIT only takes
     effect once per DATA step.) */
  exstday = exstday + mod(input(substr(usubjid_h, 1, 8), hex8.), 7) - 3;
  drop exstdtc rfstdtc;
run;
Policy 0070 and PRCI want a number in the anonymisation report. The standard calculation treats each equivalence class as an attacker's best guess and computes the expected probability of correct re-identification across the file.
For a prosecutor attacker (knows the target is in the file), the risk per class is 1 / F_i, where F_i is the class size in the released dataset. The file-level figure in the worked example below sums those per-class risks and divides by the number of subjects; for public releases the maximum per-class risk, 1 over the smallest F_i, is usually the binding number.
For a journalist attacker (does not know whether the target is in the file), multiply the prosecutor risk by the sampling fraction from the trial population to the plausible external population.
| Class | Quasi-identifier combo | Size F_i | 1 / F_i |
| C1 | 20-29 / M / Western EU | 4 | 0.250 |
| C2 | 30-39 / M / Western EU | 6 | 0.167 |
| C3 | 40-49 / M / Western EU | 11 | 0.091 |
| C4 | 50-59 / F / Western EU | 8 | 0.125 |
| C5 | 60-69 / F / Western EU | 5 | 0.200 |
| Total | N = 34; sum(1/F_i) = 0.833 | | |
| Prosecutor risk | 0.833 / 34 = 0.025 (2.5%) | | |
Under Policy 0070's 0.09 ceiling this release passes on the average measure. Add one new class of size 2, though, and its per-class risk of 1/2 = 0.50 sits far above the ceiling even if the file average barely moves; that class needs more generalization before submission.
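The worked example translates directly into a report step. This sketch reuses the qid_counts table built in the k-anonymity check; the sampling fraction for the journalist adjustment is a placeholder you must justify in the anonymisation report.

```sas
/* Residual-risk summary from equivalence-class counts. */
%let SAMP_FRAC = 0.4;   /* hypothetical sampling fraction */

proc sql;
  create table risk_report as
  select sum(k_val)                        as n_total,
         count(*)                          as n_classes,
         max(1 / k_val)                    as max_risk,
         /* average measure used in the worked example:
            sum of per-class risks over file size */
         sum(1 / k_val) / sum(k_val)       as avg_risk,
         calculated avg_risk * &SAMP_FRAC  as journalist_risk
  from qid_counts;
quit;

title 'Residual re-identification risk vs the 0.09 ceiling';
proc print data=risk_report noobs; run;
```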
Hashing alone is not anonymization. Worth saying twice, because most teams still make this mistake on the first review pass.
Free-text fields destroy your k. AETERM, CMTRT verbatim, COVAL, and any "Other, specify" field all carry identifying information that collapses equivalence classes. Drop them or replace with the coded term before you compute metrics.
Rare events. One subject with a rare MedDRA PT cannot be protected by generalization on quasi-identifiers alone. Roll the PT up to SOC or suppress the record.
Open-source auxiliary data. Clinical trial registries publish per-site enrollment counts. A dataset with SITEID and COUNTRY is linkable to the registry, then to local press releases about the site. Model the full auxiliary data pipeline, not just the dataset in your hand.
Token vault leakage. The vault is the weakest link in any tokenization scheme. Treat it as critical infrastructure with the same controls as a production database, including audit logging and regular access review.
Releasing date shifts without jitter. Deterministic day offsets are themselves quasi-identifiers for small populations; add a small random jitter seeded per subject so the offset stays stable across releases but is not a unique fingerprint.
Published by clinstandards.org. Written for statistical programmers; not legal advice. Consult your sponsor DPO for regulated submissions.
