For years, statistical programmers in the pharmaceutical industry have grappled with one of the most labor-intensive tasks in the clinical submission workflow: creating Study Data Tabulation Model (SDTM) datasets from raw clinical data. While the ADaM side of the equation has benefited from the open-source {admiral} package since its introduction through pharmaverse, a critical gap persisted on the SDTM side — there was no equivalent open-source R solution for SDTM programming that could work across different Electronic Data Capture (EDC) systems and data collection standards.
Enter {sdtm.oak} — the SDTM Data Transformation Engine. Sponsored by CDISC COSA (Open Source Alliance) and developed through a collaborative effort led by Roche/Genentech with contributions from Pfizer, GSK, Vertex, Merck, Pattern Institute, Transition Technologies Science, and Atorus Research, {sdtm.oak} addresses this gap head-on. It is part of the pharmaverse ecosystem and is available on CRAN.
This article provides a comprehensive deep dive into {sdtm.oak} — its origins, architecture, core concepts, practical usage, companion packages, and its roadmap toward full SDTM automation. If you are a statistical programmer wondering how this package fits into your workflow — or if you are evaluating whether to adopt {sdtm.oak} for your organization — this article is for you.
The story of {sdtm.oak} begins inside Roche/Genentech. Roche developed an internal R package called {roak} as part of their broader SDTM automation platform known as OAK Garden. This internal system was metadata-driven, tightly integrated with Roche's Metadata Repository (MDR), and achieved impressive results: using 22 reusable algorithms, Roche automated approximately 13,000 SDTM mappings across 6 different Therapeutic Area standards. Across studies initiated after 2019, a median of 82% of standards-based mappings could be automated out of the box.
Recognizing the industry-wide value of this approach, Roche approached CDISC COSA with a proposal to open-source the core concepts and algorithms from {roak}. The proposal was accepted, and the open-source initiative was born under the pharmaverse GitHub organization. However, the open-source version required significant re-engineering — the source code had to be redeveloped from scratch to be both EDC-agnostic and data standards-agnostic, removing any dependencies on Roche's proprietary infrastructure.
The leadership team for the open-source initiative includes Rammprasad Ganapathy and Edgar Manukyan (Roche/Genentech), Yogesh Gupta and Lisa Hourteloot (Pfizer), Bhaskar Ponugoti (Merck), Aditya Parankusham (GSK), Susheel Arkala (Vertex), with CDISC/COSA representation from Charles Shadle, Sam Hume, and Omar Garcia.
Before diving into the technical details of {sdtm.oak}, it is important to understand why SDTM programming presents distinct challenges compared to ADaM programming — challenges that {sdtm.oak} is specifically designed to address.
Raw Data Structure Variability: ADaM programming has a well-defined starting point: SDTM datasets with a standardized structure. SDTM programming, by contrast, starts from raw datasets produced by EDC systems. These raw datasets vary significantly in structure — different variable names, different dataset names, different data types — depending on which EDC system (Rave, Veeva, InForm, REDCap, etc.) a sponsor uses. Even the exact same eCRF, when configured in different EDC systems, can produce raw data with entirely different structures.
Data Collection Standards Variability: Although CDISC has established CDASH as the data collection standard, CDASH adoption is not mandated by the FDA. Many pharmaceutical companies use proprietary data collection standards that differ from CDASH, sometimes significantly. This means the starting vocabulary, variable naming conventions, and data structures upstream of SDTM vary across companies.
Lack of a Common Framework: Due to the above factors, it historically seemed impossible to develop a common open-source approach for SDTM programming that could work across the industry. Each company built its own bespoke mapping processes, often heavily dependent on SAS macros tied to specific EDC configurations.
{sdtm.oak} addresses these challenges through two key design principles: it is EDC-agnostic and data standards-agnostic. The package achieves this through the concept of reusable algorithms that abstract the mapping logic away from any specific raw data format or data collection standard.
The intellectual foundation of {sdtm.oak} is the reusable algorithms concept. This is the single most important idea to understand about the package, and it is what distinguishes {sdtm.oak} from ad-hoc SDTM programming approaches.
In {sdtm.oak}, SDTM mappings are defined as algorithms that transform collected source data (from eCRF or external data transfers) into the target SDTM data model. These mapping algorithms form the backbone of the {sdtm.oak} transformation engine.
Key properties of these algorithms:
- Reusable across domains: the same algorithm (e.g., assign_no_ct) can be applied to map variables in CM, AE, VS, MH, or any other SDTM domain. The algorithm logic does not change — only the parameters (which raw variable, which target variable) change.

The current release of {sdtm.oak} implements six core algorithms as R functions:
1. assign_no_ct() — Direct Assignment Without Controlled Terminology
This is the simplest and most commonly used algorithm. It performs a one-to-one mapping of a raw variable value directly to a target SDTM variable, without any controlled terminology recoding. Think of it as the SDTM equivalent of a simple variable rename with appropriate record selection.
Use cases: Mapping raw medication text to CMTRT, raw adverse event term to AETERM, raw vital sign result to VSORRES.
# Map raw medication name to CMTRT
assign_no_ct(
raw_dat = cm_raw,
raw_var = "MDRAW",
tgt_var = "CMTRT"
)
2. assign_ct() — Assignment With Controlled Terminology
Similar to assign_no_ct(), but additionally applies controlled terminology recoding. The function checks whether the collected value exists in the study's controlled terminology specification and, if found, applies the standard submission value.
Use cases: Mapping dose units (where collected values like "mg", "Gram" must be recoded to CDISC controlled terminology values), route of administration, body position.
# Map dose units with CT recoding
assign_ct(
raw_dat = cm_raw,
raw_var = "DOSU",
tgt_var = "CMDOSU",
ct_spec = study_ct,
ct_clst = "C71620",
id_vars = oak_id_vars()
)
3. hardcode_no_ct() — Hardcoded Value Without Controlled Terminology
Assigns a fixed (hardcoded) value to the target SDTM variable based on the presence or state of a source variable, without controlled terminology recoding. This is commonly used with conditional logic.
Use cases: Setting CMSTTPT to "SCREENING" when a prior medication flag is set, setting AEOUT to a fixed value based on a condition.
# If MDPRIOR == 1, set CMSTTPT = 'SCREENING'
hardcode_no_ct(
raw_dat = condition_add(cm_raw, MDPRIOR == "1"),
raw_var = "MDPRIOR",
tgt_var = "CMSTTPT",
tgt_val = "SCREENING",
id_vars = oak_id_vars()
)
4. hardcode_ct() — Hardcoded Value With Controlled Terminology
Assigns a fixed value to the target SDTM variable with controlled terminology validation. This is heavily used when mapping topic and qualifier variables in Findings domains.
Use cases: Setting VSTESTCD = "SYSBP", VSTEST = "Systolic Blood Pressure", VSORRESU = "mmHg" for vital signs.
# Set VSTESTCD for Systolic Blood Pressure
hardcode_ct(
raw_dat = vs_raw,
raw_var = "SYS_BP",
tgt_var = "VSTESTCD",
tgt_val = "SYSBP",
ct_spec = study_ct,
ct_clst = "C66741"
)
5. assign_datetime() — Date/Time Mapping to ISO 8601
Maps one or more variables with date/time components in a raw dataset to a target SDTM variable following ISO 8601 format. This handles the complex parsing and formatting required for SDTM date-time variables.
Use cases: Deriving --DTC variables such as AESTDTC, VSDTC, CMDTC from raw date and time fields.
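A hedged sketch of how this might look, following the pattern of the other algorithms. The raw field names (MDBDR, MDBTM) are assumed example inputs, and the exact shape of the raw_fmt argument should be checked against the documentation of your installed release:

```r
library(sdtm.oak)

# Combine a raw start date and time into CMSTDTC as ISO 8601.
# raw_fmt describes the collected formats; partial or unknown dates are
# emitted as reduced-precision ISO 8601 values (e.g., "2023-06").
cm <- assign_datetime(
  raw_dat = cm_raw,
  raw_var = c("MDBDR", "MDBTM"),  # raw date and time fields
  tgt_var = "CMSTDTC",
  raw_fmt = c("d-m-y", "H:M")
)
```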
6. condition_add() — Conditional Processing
Creates a conditioned data frame — a special tibble class in {sdtm.oak} that carries a logical filtering vector. When a conditioned data frame is passed to any of the above algorithms, the mapping is applied only to records meeting the condition.
Use cases: Applying different mappings based on conditional logic (if-then-else scenarios), such as mapping different dose forms based on route of administration.
# Create a conditioned data frame
condition_add(cm_raw, MDPRIOR == "1")
{sdtm.oak} supports a two-level algorithm structure. For certain SDTM mappings, a condition must be evaluated first (primary algorithm), and the actual mapping executes only when that condition is met (sub-algorithm). The condition_add() function serves as the primary algorithm, while any of the other mapping functions act as sub-algorithms.
The programming workflow in {sdtm.oak} is deliberately designed to mirror the conceptual structure of SDTM itself. The key principle is: map the topic variable first, then map its qualifiers and identifiers. This workflow is generic across SDTM domain classes — Events, Interventions, and Findings.
Step 1: Read in Raw Datasets and Generate OAK ID Variables
Every {sdtm.oak} workflow begins by reading raw data and generating the OAK identifier variables (oak_id, raw_source, patient_number). These identifiers serve as the crucial link between raw datasets and mapped SDTM domains, enabling the merging of individually derived SDTM variables.
library(sdtm.oak)
library(dplyr)
# Read raw data
cm_raw <- read.csv("cm_raw.csv")
# Generate OAK ID variables
cm_raw <- cm_raw %>%
generate_oak_id_vars(
pat_var = "PATNUM",
raw_src = "cm_raw"
)
Step 2: Map the Topic Variable
The topic variable is the primary variable that defines the core observation in an SDTM domain (e.g., CMTRT for CM, AETERM for AE, VSTESTCD for VS).
cm <- assign_no_ct(
raw_dat = cm_raw,
raw_var = "MDRAW",
tgt_var = "CMTRT"
)
Step 3: Map Qualifier Variables
After the topic variable is derived, qualifier variables are mapped using the pipe operator. Each subsequent mapping is merged with the previous result by the OAK ID variables (supplied via id_vars = oak_id_vars()).
cm <- cm %>%
assign_ct(
raw_dat = cm_raw,
raw_var = "DOSU",
tgt_var = "CMDOSU",
ct_spec = study_ct,
ct_clst = "C71620",
id_vars = oak_id_vars()
) %>%
assign_no_ct(
raw_dat = cm_raw,
raw_var = "DOSE",
tgt_var = "CMDOSE",
id_vars = oak_id_vars()
)
Step 4: Repeat for Additional Raw Sources
If data for a single SDTM domain comes from multiple raw datasets (a common scenario), Steps 1-3 are repeated for each raw source, and the results are combined.
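For instance, if concomitant medications are collected on two different forms, the same steps can be repeated per source and the results stacked. A hedged sketch, where cm_raw2, MDNAME, and cm_source1 are assumed names for illustration:

```r
library(sdtm.oak)
library(dplyr)

# Prepare the second raw source with its own OAK ID variables.
cm_raw2 <- generate_oak_id_vars(
  cm_raw2,
  pat_var = "PATNUM",
  raw_src = "cm_raw2"
)

# Map the topic variable from the second source with the same algorithm.
cm_source2 <- assign_no_ct(
  raw_dat = cm_raw2,
  raw_var = "MDNAME",
  tgt_var = "CMTRT"
)

# Stack records from both sources before deriving CMSEQ and other
# standard variables across the combined domain.
cm <- dplyr::bind_rows(cm_source1, cm_source2)
```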
Step 5: Derive Standard Variables
After all mappings are complete, standard SDTM derivations are applied using built-in functions:
- derive_seq() — Derives the sequence number (--SEQ) variable.
- derive_study_day() — Calculates study day variables.
- derive_blfl() — Derives the baseline flag (--BLFL) or last observation before exposure flag (--LOBXFL).

Step 6: Add Labels and Attributes
Apply SDTM labels, data types, and variable lengths as required by the SDTM Implementation Guide.
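{sdtm.oak} leaves attribute management to the programmer. As one minimal approach, labels can be attached as plain R attributes (the pharmaverse {xportr} package is a common choice for applying labels, lengths, and types at scale from a specification):

```r
# Attach the SDTM label to a mapped variable using a base-R attribute.
attr(cm$CMTRT, "label") <- "Reported Name of Drug, Med, or Therapy"

# Inspect the label to confirm it was applied.
attr(cm$CMTRT, "label")
```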
To illustrate the full workflow, here is an end-to-end example of creating a Vital Signs (VS) domain using {sdtm.oak}. This example demonstrates how the reusable algorithms concept works in practice for a Findings-class domain.
library(sdtm.oak)
library(dplyr)
# Load raw data and controlled terminology
vs_raw <- read_domain_example("vs_raw")
study_ct <- read_ct_spec_example("ct_spec")
# Map Systolic Blood Pressure
vs_sysbp <- hardcode_ct(
raw_dat = vs_raw,
raw_var = "SYS_BP",
tgt_var = "VSTESTCD",
tgt_val = "SYSBP",
ct_spec = study_ct,
ct_clst = "C66741"
) %>%
dplyr::filter(!is.na(.data$VSTESTCD)) %>%
hardcode_ct(
raw_dat = vs_raw,
raw_var = "SYS_BP",
tgt_var = "VSTEST",
tgt_val = "Systolic Blood Pressure",
ct_spec = study_ct,
ct_clst = "C67153",
id_vars = oak_id_vars()
) %>%
assign_no_ct(
raw_dat = vs_raw,
raw_var = "SYS_BP",
tgt_var = "VSORRES",
id_vars = oak_id_vars()
) %>%
hardcode_ct(
raw_dat = vs_raw,
raw_var = "SYS_BP",
tgt_var = "VSORRESU",
tgt_val = "mmHg",
ct_spec = study_ct,
ct_clst = "C66770",
id_vars = oak_id_vars()
)
# Repeat for Diastolic BP, Pulse, etc.
# Then combine all test results
vs <- dplyr::bind_rows(vs_sysbp, vs_diabp, vs_pulse)
# Derive sequence number
vs <- vs %>%
derive_seq(tgt_var = "VSSEQ")
# Derive study day
vs <- vs %>%
derive_study_day(
sdtm_in = .,
dm_domain = dm,
tgt_var = "VSDY",
ref_var = "RFSTDTC"
)
Notice the pattern: for each vital sign test, the topic variable (VSTESTCD) is mapped first with hardcode_ct(), followed by its qualifiers (VSTEST, VSORRES, VSORRESU) — each chained with the pipe operator and each using the appropriate algorithm. The same functions are reused for every test parameter; only the arguments change.
{sdtm.oak} does not operate in isolation. It is part of a carefully designed ecosystem of companion packages, each serving a distinct role in the SDTM programming pipeline:
The {pharmaverseraw} package provides example raw datasets that serve as input for SDTM programming with {sdtm.oak}. These datasets were created through reverse engineering — the team started with finalized SDTM datasets in {pharmaversesdtm} and worked backward to construct plausible raw datasets that could reasonably produce those SDTM outputs.
Critically, the {pharmaverseraw} datasets are intentionally designed to be both EDC-agnostic and data standards-agnostic. Some variables follow CDASH naming conventions, while others deliberately do not, reflecting the real-world variability across companies. Annotated case report forms (aCRFs) corresponding to the raw datasets are included in the package, providing transparency into the mapping logic.
The {mint} package (Metadata INTerface) is responsible for preparing study SDTM mapping metadata in the format that {sdtm.oak} expects. It serves as the bridge between a sponsor's Metadata Repository (or the CDISC Library) and the {sdtm.oak} transformation engine. In the automation vision, {mint} reads standards and study-specific metadata and produces the standardized SDTM specification that drives automated code generation in {sdtm.oak}.
The existing {pharmaversesdtm} package provides finalized SDTM test datasets derived from the CDISC pilot project and other sources. These serve as reference outputs and test fixtures.
EDC / Raw Data --> {pharmaverseraw} (test data)
|
MDR / CDISC Library --> {mint} (metadata preparation)
|
{sdtm.oak} (SDTM transformation)
|
{pharmaversesdtm} (output / validation)
|
{admiral} (ADaM creation)
Controlled terminology management is a critical aspect of SDTM programming that {sdtm.oak} handles through dedicated functions:
- read_ct_spec() — Reads a controlled terminology specification file into a format usable by {sdtm.oak}.
- read_ct_spec_example() — Loads example controlled terminology specifications bundled with the package.
- ct_map() — Performs the actual controlled terminology recoding.
- ct_spec_example() — Returns the path to example CT specification files.

The CT specification is structured as a tibble with columns for codelist code, term code, term value, collected value, preferred term, and synonyms. When assign_ct() or hardcode_ct() is called with a ct_spec argument, the function validates the collected value against the specified codelist and applies the appropriate standard submission value. If a collected value cannot be mapped according to the controlled terms, the function alerts the user — a valuable quality control feature.
# Example controlled terminology specification structure
study_ct <- tibble::tribble(
~codelist_code, ~term_code, ~term_value, ~collected_value, ~term_preferred_term, ~term_synonyms,
"C71620", "C25613", "%", "%", "Percentage", "Percentage",
"C71620", "C28253", "mg", "mg", "Milligram", "Milligram",
"C71620", "C48155", "g", "g", "Gram", "Gram"
)
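The same specification can be exercised directly with ct_map() on a vector of collected values. A sketch, assuming the function's default from/to column mappings:

```r
library(sdtm.oak)

# Recode collected dose-unit values against codelist C71620 defined above.
# Values matching a collected value or a synonym map to the standard term;
# unmatched values trigger the package's quality-control alert.
ct_map(c("mg", "Gram"), ct_spec = study_ct, ct_clst = "C71620")
```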
Statistical programmers familiar with {admiral} will find the contribution model and philosophy of {sdtm.oak} familiar, but the technical challenges and design patterns differ significantly. Understanding these differences is important for organizations evaluating adoption.
Source Data Predictability: {admiral} benefits from a well-defined input — SDTM datasets with standardized structure, variable names, and controlled terminology. {sdtm.oak} must handle wildly varying raw data structures, which is why the algorithms accept raw dataset and variable names as explicit parameters.
Algorithm vs. Derivation Focus: {admiral} focuses on complex derivations (baseline flags, treatment-emergent flags, period/phase variables, BDS structures). {sdtm.oak}'s core algorithms are comparatively simpler in logic (assignment, hardcoding, date-time parsing) but must be applied hundreds of times across a study with different parameters — making reusability and automation the primary value proposition.
Path to Automation: {admiral} is primarily a manual programming framework — programmers write R scripts that call admiral functions. {sdtm.oak} has a clear vision toward metadata-driven automation where the specification itself generates the code or directly produces SDTM datasets. This automation path is what makes the {mint} companion package essential.
Contribution Model: Both packages follow an open-source contribution model with volunteer developers from multiple pharmaceutical companies. Both are hosted under the pharmaverse GitHub organization.
The V0.2.0 release of {sdtm.oak} (released on CRAN in May 2025) marked a significant milestone by adding support for the Demographics (DM) domain — the one domain that was explicitly excluded from the V0.1.0 release due to its unique derivation requirements.
Key new functions include:

- calc_min_max_date() — Calculates minimum and maximum dates across multiple date variables, essential for deriving reference dates.
- oak_calc_ref_dates() — Derives reference date variables (RFSTDTC, RFENDTC, RFXSTDTC, RFXENDTC, etc.) in ISO 8601 character format based on input dates and times.

V0.2.0 also introduced the generate_sdtm_supp() function for creating SUPP-- domains (supplemental qualifier datasets). This function takes an SDTM domain output and splits non-standard qualifier variables into the appropriate supplemental domain structure with QNAM, QLABEL, QVAL, QORIG, and IDVAR variables.
With V0.2.0, {sdtm.oak} supports mapping for the Events, Interventions, and Findings general observation classes, as well as the Demographics (DM) domain and supplemental qualifier (SUPP--) datasets.
The current release of {sdtm.oak} provides a framework for manual (but modular) programming of SDTM. However, the long-term vision is metadata-driven SDTM automation — and this is where {sdtm.oak}'s architecture truly differentiates itself.
The key insight is that if SDTM mappings can be expressed as a standardized specification (defining source dataset, source variable, target domain, target variable, algorithm, and algorithm parameters), then an automated process can read this specification and generate the appropriate {sdtm.oak} function calls — producing either the R code or the SDTM datasets directly.
Unlike conventional SDTM specifications (which define target-to-source mappings, one tab per domain), the {sdtm.oak} specification defines source-to-target relationships. For each source, the SDTM mapping, algorithm, and associated metadata are defined. This inversion is deliberate — it aligns with the automation engine's processing model, which iterates through raw sources and applies transformations.
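To make the idea concrete, here is a purely hypothetical sketch of what a source-to-target specification row and its machine-generated call might look like. The column names and the string-templating step are illustrative only, not part of the package:

```r
library(tibble)

# One hypothetical specification row: source-first, with the algorithm named.
spec <- tibble::tribble(
  ~raw_source, ~raw_variable, ~target_domain, ~target_variable, ~algorithm,
  "cm_raw",    "MDRAW",       "CM",           "CMTRT",          "assign_no_ct"
)

# An automation engine could iterate rows and dispatch to the named algorithm,
# emitting either R code (as here) or the mapped dataset directly.
generated_call <- sprintf(
  '%s(raw_dat = %s, raw_var = "%s", tgt_var = "%s")',
  spec$algorithm, spec$raw_source, spec$raw_variable, spec$target_variable
)
cat(generated_call)
# assign_no_ct(raw_dat = cm_raw, raw_var = "MDRAW", tgt_var = "CMTRT")
```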
The automation framework relies on two types of metadata:
Standards Metadata — Sourced from the CDISC Library or a sponsor's MDR. This includes the relationship between data collection standards (eCRF/eDT), SDTM mappings, controlled terminology, and the algorithms and parameters required for automation of standards.
Study Definition Metadata — Specific to each clinical study. This includes eCRF design metadata (forms, fields, data dictionaries, visits) fetched from the EDC system, as well as external data transfer metadata.
The metadata-driven code generation feature is planned for subsequent releases. The OAK team is working with CDISC to add algorithm metadata to the CDISC Library, which would enable automation of CDISC standard eCRFs. Sponsors will need to establish tools to generate the {sdtm.oak} SDTM specification from their own MDR to fully leverage the automation capabilities.
If you are a SAS programmer evaluating {sdtm.oak}, the conceptual mapping is straightforward. At its core, each {sdtm.oak} algorithm function is essentially a specialized dplyr::mutate() operation with built-in controlled terminology handling, record selection, and merging logic. If you understand SAS DATA step MERGE operations and conditional assignment statements, you will find the {sdtm.oak} workflow intuitive — the paradigm shift is from writing bespoke SAS macros to calling standardized R functions with parameters.
The pipe operator (%>%) in {sdtm.oak} is analogous to sequential DATA steps or PROC SQL operations — each step takes the result of the previous step and adds or modifies a variable.
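As an illustration of that equivalence, the assign_no_ct() call shown earlier does roughly what this simplified plain-dplyr sketch does. The real function additionally handles record selection, conditioned data frames, and merging onto previously mapped records by the OAK ID variables:

```r
library(dplyr)

# Roughly what assign_no_ct(raw_dat = cm_raw, raw_var = "MDRAW",
# tgt_var = "CMTRT") amounts to: carry the key variables forward and
# assign the raw value to the SDTM target name.
cm_manual <- cm_raw %>%
  select(oak_id, raw_source, patient_number, MDRAW) %>%
  mutate(CMTRT = MDRAW, .keep = "unused")
```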
Several factors are worth considering:
EDC Integration: {sdtm.oak} is designed to work with any EDC system, but your organization will need to establish the data extraction pipeline that produces raw data in a format {sdtm.oak} can consume. The package itself does not handle EDC data extraction.
Metadata Repository: To leverage the full automation potential, your organization will benefit from having a machine-readable metadata repository. Organizations already using structured SDTM specifications will have a head start.
Sponsor-Specific Logic: {sdtm.oak} explicitly acknowledges that it may not handle sponsor-specific details related to LAB test metadata management, unit conversions, and coding information (e.g., MedDRA, WHODrug). These processes vary significantly across companies and typically require custom solutions.
Validation Requirements: As with any open-source package used in a regulated environment, your organization will need to establish appropriate validation and qualification processes for {sdtm.oak}.
# Install from CRAN
install.packages("sdtm.oak")
# Or install the development version from GitHub
remotes::install_github("pharmaverse/sdtm.oak", ref = "main")
# Install companion test data package
install.packages("pharmaverseraw")
{sdtm.oak} ships with built-in example datasets and domain templates that are excellent for learning:
library(sdtm.oak)
# List available raw dataset examples
domain_example()
# Load a specific raw dataset example
cm_raw <- read_domain_example("cm_raw")
# Load controlled terminology example
study_ct <- read_ct_spec_example("ct_spec")
# Explore template scripts for domain creation
# Templates are available for CM, VS, AE, and more
The {sdtm.oak} team can be reached through the usual pharmaverse channels, including the package's GitHub repository (pharmaverse/sdtm.oak) for issues and feature requests.
The development team has outlined several priorities for upcoming releases, chief among them the metadata-driven code generation described above.
{sdtm.oak} represents a paradigm shift in how the pharmaceutical industry approaches SDTM programming. By abstracting SDTM mappings into reusable, parameterized algorithms and making the framework EDC-agnostic and data standards-agnostic, it addresses the fundamental challenge that has long prevented industry-level standardization of SDTM creation.
For statistical programmers, the immediate value is a modular, well-structured framework for SDTM programming in R that integrates cleanly with the broader pharmaverse ecosystem. The longer-term value — metadata-driven automation — has the potential to dramatically reduce the manual effort involved in SDTM dataset creation, freeing programmers to focus on complex derivations, quality control, and analysis rather than repetitive mapping tasks.
Whether you adopt {sdtm.oak} for its current programming framework or position your organization to leverage its future automation capabilities, understanding this package is increasingly essential for statistical programmers working in clinical data standards.
Note: All claims in this article are based on primary CDISC and pharmaverse documentation.