A working guide for statistical programmers: the build flow, the CDISC Library API, real-time study creation, and where it breaks

Most of us learned the trade backwards. We collect data on a CRF designed by one team, receive it from an EDC configured by another, then spend months mapping it into SDTM, writing value level metadata by hand, and reconciling the Define-XML against what the protocol actually said. Every handoff loses information, and every team rebuilds the same knowledge in its own format. A blood pressure is a blood pressure whether it sits on a CRF, in an EDC export, or in the VS domain, yet we re-describe it three or four times and then write specs to translate between the descriptions.

Biomedical Concepts (BCs) are CDISC's response to that inefficiency. A BC is a single, machine-readable definition of what a measurement is, independent of any one standard, with enough detail attached that the same definition can drive a CRF, configure an EDC, and produce SDTM with its value level metadata. The work is run through the COSMoS project (Conceptual and Operational Standards Metadata Services), published in the CDISC Library, and reachable through a REST API. This article walks the full path: what a BC contains, how it threads protocol design to CRF/EDC build to SDTM, how to pull it from the API, how it underpins a real-time study build, and the problems you will hit putting it into practice.

What a Biomedical Concept actually contains

It helps to stop thinking of a BC as a single object and to think of it instead as three connected layers. The top layer is conceptual and standards-agnostic. The middle layer breaks the concept into its properties. The bottom layer is the implementation that programmers care about: pre-built SDTM variable metadata you can drop into a spec.

Layer	What it is	Example (Systolic Blood Pressure)
Biomedical Concept	The semantic definition, anchored to an NCI Thesaurus concept. Names, synonyms, category, result scale, and external codes (LOINC) live here.	Concept C25298, category Vital Signs, result scale Quantitative
Data Element Concepts	The named properties the concept needs to be unambiguous, each typed and bound to controlled terminology.	Test, result value, unit, body position, laterality, timing
SDTM Dataset Specialization	The implementation layer: the BC expressed as SDTM variables with roles, codelists, assigned terms, relationships, and VLM targets.	Domain VS; VSTESTCD=SYSBP, VSORRES, VSORRESU=mmHg, VSPOS, VSDTC

Table 1. The three layers of a Biomedical Concept and where each one is used.

The split matters. The conceptual layer gives you a stable anchor that does not change when SDTM versions do; the NCIt code for systolic blood pressure is the same regardless of whether you submit under SDTMIG 3.2 or 3.4. The Dataset Specialization layer is where that stability turns into something you can submit. It is a configured building block: it already knows that the unit belongs in VSORRESU, that mmHg is the assigned term, that VSORRESU is the unit for the value in VSORRES, and that VSTESTCD takes the value SYSBP. That is the value level metadata you would otherwise type by hand for every Define-XML.


Figure 2. One concept resolves into typed properties and into ready-made SDTM variable metadata.
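One way to hold the three layers in code is a small set of dataclasses. This is a minimal sketch rather than the CDISC Library schema: the class and field names are illustrative assumptions, and the values are the Systolic Blood Pressure examples from Table 1.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BiomedicalConcept:
    """Top layer: the standards-agnostic definition, anchored to NCIt."""
    concept_id: str
    short_name: str
    category: str
    result_scale: str
    # Middle layer: the named properties (Data Element Concepts).
    properties: list[str] = field(default_factory=list)


@dataclass
class SdtmVariable:
    """One SDTM variable; assigned_term is set only when the value is fixed."""
    name: str
    assigned_term: Optional[str] = None


@dataclass
class DatasetSpecialization:
    """Bottom layer: the concept expressed as SDTM variables."""
    concept: BiomedicalConcept
    domain: str
    variables: list[SdtmVariable] = field(default_factory=list)

    def assigned_terms(self) -> dict[str, str]:
        """Return the variables whose value the specialization fixes."""
        return {v.name: v.assigned_term for v in self.variables if v.assigned_term}


sysbp = BiomedicalConcept(
    "C25298", "Systolic Blood Pressure", "Vital Signs", "Quantitative",
    properties=["Test", "Result value", "Unit", "Body position", "Laterality", "Timing"],
)
sysbp_vs = DatasetSpecialization(
    concept=sysbp,
    domain="VS",
    variables=[
        SdtmVariable("VSTESTCD", "SYSBP"),
        SdtmVariable("VSORRES"),
        SdtmVariable("VSORRESU", "mmHg"),
        SdtmVariable("VSPOS"),
        SdtmVariable("VSDTC"),
    ],
)
print(sysbp_vs.assigned_terms())
```

The specialization holds a reference to the concept rather than a copy of it, which mirrors the point of the split: the anchor stays put while the SDTM expression can be versioned separately.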

The full process: protocol to CRF/EDC to SDTM

The reason BCs are worth the trouble is that one definition feeds the whole study lifecycle instead of being re-authored at each stage. Read the flow left to right.

[Diagram: the flow from protocol to CRF/EDC to SDTM]

At protocol design, the schedule of activities names what gets measured and when. Each activity (vital signs, chemistry panel, ECG) decomposes into specific tests, and each test is a Biomedical Concept. Choosing the BC at this stage is the decision that propagates everywhere downstream.

The build steps then read from the same source:

CRF and EDC build. The BC's Data Element Concepts tell you exactly which fields a form needs: a result, a unit, a position, a timing. Because each property carries its data type and codelist, the form's question, allowed answers, and edit checks can be generated rather than hand-drawn. CDISC is publishing draft CRF Specializations to support this directly.
SDTM and value level metadata. The Dataset Specialization already maps the properties to VS (or LB, EG, and so on), assigns terms, and flags which variables are VLM targets. Generating the SDTM shell and the Define-XML value level metadata becomes a lookup rather than an authoring task.
Define-XML. Because the specialization traces each variable back to its concept and codelist, the origin and the controlled terminology for Define-XML come from the same record. Tools such as the open-source SAS Clinical Standards Toolkit have demonstrated building Define-XML v2.1 value level metadata straight from SDTM Dataset Specializations.
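The CRF and EDC step can be sketched as a function that turns Data Element Concepts into form field definitions. This is an illustration under stated assumptions: the input follows the shape of the trimmed BC response shown later in this article, while the widget names, the set of numeric type names, and the second property in the sample are hypothetical and not part of any CDISC or EDC vendor specification.

```python
# Assumed names for numeric data types; check the values your package actually uses.
NUMERIC_TYPES = {"integer", "decimal", "float"}


def form_fields(bc: dict) -> list[dict]:
    """Derive one form field per Data Element Concept of a BC."""
    fields = []
    for dec in bc.get("dataElementConcepts", []):
        choices = dec.get("exampleSet", [])
        if choices:
            widget = "select"  # enumerated answers become a pick list
        elif dec.get("dataType") in NUMERIC_TYPES:
            widget = "number"  # numeric types get a numeric input
        else:
            widget = "text"
        fields.append({"label": dec["shortName"], "widget": widget, "choices": choices})
    return fields


pulse_rate = {
    "shortName": "Pulse Rate",
    "dataElementConcepts": [
        {"conceptId": "C123975", "shortName": "Vital Signs Laterality",
         "dataType": "string", "exampleSet": ["Left", "Right"]},
        # Hypothetical property, included only to exercise the numeric branch.
        {"shortName": "Result (hypothetical)", "dataType": "decimal"},
    ],
}

fields = form_fields(pulse_rate)
for f in fields:
    print(f)
```

A real build would draw the allowed answers from the bound codelist rather than from an example set; the example set is used here only because it is what the trimmed response exposes.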

The payoff for a programmer is the removal of remapping. When the collection metadata and the submission metadata are two views of one concept, there is no ALB_SERUM_RES-to-LBORRES translation step to specify, validate, and maintain. The mapping was never created because the identity was never lost.

A worked example: systolic blood pressure

Take SYSBP through the layers. The BC carries the semantics and terminology; the Dataset Specialization carries the SDTM detail. Table 2 shows how the concept's properties land as VS variables.

BC property	SDTM variable	How the specialization fills it
Test	VSTESTCD / VSTEST	Assigned term SYSBP / Systolic Blood Pressure
Result value	VSORRES, VSSTRESN	Origin Collected; VSSTRESN is the numeric standardized result
Unit	VSORRESU	Assigned term mmHg; relationship: is the unit for the value
Body position	VSPOS	Codelist Position; e.g., SITTING, STANDING, SUPINE
Timing	VSDTC	ISO 8601 date/time of measurement

Table 2. Systolic Blood Pressure BC properties resolved to SDTM VS variables by the Dataset Specialization.

Nothing in that table was a judgement call by the study programmer. The assigned terms, the codelist references, and the unit-to-value relationship are all carried in the published specialization, which is what makes the metadata reproducible across studies and sponsors.
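To make the "lookup rather than authoring" point concrete, here is a hedged sketch that turns a specialization into value level metadata rows. It assumes the specialization arrives as a dictionary shaped like the trimmed JSON response shown later in this article; the row layout and the `EQ` where-clause text are illustrative choices, not a Define-XML serialisation.

```python
def vlm_rows(spec: dict) -> list[dict]:
    """Build one VLM row per variable flagged as a VLM target."""
    # The Topic variable supplies the where clause, e.g. VSTESTCD EQ SYSBP.
    topic = next(v for v in spec["variables"] if v.get("role") == "Topic")
    where = f'{topic["name"]} EQ {topic["assignedTerm"]["value"]}'
    rows = []
    for var in spec["variables"]:
        if not var.get("vlmTarget"):
            continue
        rows.append({
            "dataset": spec["domain"],
            "variable": var["name"],
            "where": where,
            "codelist": var.get("codelist", {}).get("submissionValue"),
            "assigned_term": var.get("assignedTerm", {}).get("value"),
        })
    return rows


sysbp = {
    "datasetSpecializationId": "SYSBP",
    "domain": "VS",
    "variables": [
        {"name": "VSTESTCD", "assignedTerm": {"value": "SYSBP"}, "role": "Topic"},
        {"name": "VSORRESU", "assignedTerm": {"value": "mmHg"},
         "codelist": {"conceptId": "C66770", "submissionValue": "VSRESU"},
         "vlmTarget": True},
    ],
}

rows = vlm_rows(sysbp)
print(rows)
```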

Getting it from the CDISC Library API

The Library is a REST service that returns JSON. Content is organised into dated packages, and you address a concept or a specialization inside a package. The current BC and Dataset Specialization content lives under API version 2; requests must include the version or the service returns 404.

Access and authentication

Authentication is by API key. Request an account through the CDISC Library API Portal, retrieve your subscription key, and send it in the api-key request header. Basic Auth was deprecated in 2020. The base URL for production is https://library.cdisc.org/api.

Purpose	Endpoint (relative to the base URL plus /cosmos/v2)
List BC packages	/mdr/bc/packages
List BCs in a package	/mdr/bc/packages/{package}/biomedicalconcepts
Get one BC	/mdr/bc/packages/{package}/biomedicalconcepts/{conceptId}
List SDTM specialization packages	/mdr/specializations/sdtm/packages
List specializations in a package	/mdr/specializations/sdtm/packages/{package}/datasetspecializations
Get one specialization	/mdr/specializations/sdtm/packages/{package}/datasetspecializations/{id}

Table 3. Core CDISC Library endpoints for Biomedical Concepts and SDTM Dataset Specializations.
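The six endpoints in Table 3 follow one pattern, so a small helper can build all of them. This is a sketch only: the `endpoint` function and its `kind` keys are hypothetical names, and the cosmos/v2 prefix is taken from the request example in this article and should be confirmed in the API Portal.

```python
from urllib.parse import quote

BASE = "https://library.cdisc.org/api/cosmos/v2"  # prefix as used in this article

# kind -> (path root, collection name inside a package), following Table 3
ROOTS = {
    "bc": ("mdr/bc", "biomedicalconcepts"),
    "sdtm": ("mdr/specializations/sdtm", "datasetspecializations"),
}


def endpoint(kind, package=None, item_id=None):
    """Build a Table 3 URL: package list, items in a package, or one item."""
    root, collection = ROOTS[kind]
    parts = [BASE, root, "packages"]
    if item_id and not package:
        raise ValueError("an item can only be addressed inside a package")
    if package:
        parts += [quote(package, safe=""), collection]
    if item_id:
        parts.append(quote(item_id, safe=""))
    return "/".join(parts)


print(endpoint("bc"))
print(endpoint("bc", "2022-10-26"))
print(endpoint("sdtm", "2022-10-26", "SYSBP"))
```

Quoting each path segment keeps an unexpected character in an identifier from changing the path.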

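A request looks like this. This one lists the available BC packages; append /{package}/biomedicalconcepts/{conceptId} from Table 3 to fetch a single BC. Note that BC content is served under the cosmos/v2 path; confirm the exact path in the API Portal console for your subscription. The command is written for PowerShell, where a backtick continues the line:

curl.exe -i -H "api-key: $env:CDISC_LIBRARY_KEY" `
    "https://library.cdisc.org/api/cosmos/v2/mdr/bc/packages"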

The response for a single BC (trimmed) shows the concept, its external coding, and its Data Element Concepts:

{
  "conceptId": "C49676",
  "shortName": "Pulse Rate",
  "category": ["Vital Signs"],
  "resultScale": "Quantitative",
  "coding": [
    { "code": "39156-5", "system": "http://loinc.org/", "systemName": "LOINC" }
  ],
  "dataElementConcepts": [
    { "conceptId": "C123975", "shortName": "Vital Signs Laterality",
      "dataType": "string", "exampleSet": ["Left", "Right"] }
  ]
}

The matching SDTM Dataset Specialization is where the submission metadata lives. Note the assigned term, the codelist reference, the variable relationship, and the VLM target flag:

{
  "datasetSpecializationId": "SYSBP",
  "domain": "VS",
  "shortName": "Systolic Blood Pressure",
  "variables": [
    { "name": "VSTESTCD", "assignedTerm": { "value": "SYSBP" },
      "role": "Topic", "mandatoryVariable": true },
    { "name": "VSORRESU", "assignedTerm": { "value": "mmHg" },
      "codelist": { "conceptId": "C66770", "submissionValue": "VSRESU" },
      "relationship": { "predicateTerm": "IS_UNIT_FOR", "object": "VSORRES" },
      "vlmTarget": true }
  ]
}

From a programmer's seat the practical move is to pull these once per study, cache them, and treat them as the source of your spec. A thin Python wrapper is enough:

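import os

import requests

BASE = "https://library.cdisc.org/api/cosmos/v2"
# Read the key from the environment so it never lands in source control.
HEADERS = {"api-key": os.environ["CDISC_LIBRARY_KEY"], "Accept": "application/json"}

def get_specialization(pkg, spec_id):
    """Return one SDTM Dataset Specialization from a dated package as a dict."""
    url = f"{BASE}/mdr/specializations/sdtm/packages/{pkg}/datasetspecializations/{spec_id}"
    r = requests.get(url, headers=HEADERS, timeout=30)
    r.raise_for_status()  # fail loudly on 401/404 rather than parse an error body
    return r.json()

# Keep only the variables flagged as value level metadata targets.
spec = get_specialization("2022-10-26", "SYSBP")
vlm = [v for v in spec["variables"] if v.get("vlmTarget")]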
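The "pull once, cache" advice can be sketched as a small file cache keyed by package and identifier. This is one possible approach, not a prescribed one: `load_or_fetch` and the directory layout are hypothetical, and the fetcher is passed in as an argument so the example runs without a key or a network call.

```python
import json
import tempfile
from pathlib import Path


def load_or_fetch(cache_dir, pkg, spec_id, fetch):
    """Return the cached JSON if present; otherwise fetch it once and store it."""
    path = Path(cache_dir) / pkg / f"{spec_id}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(pkg, spec_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data


# Stand-in for the API call that records how often it is invoked.
calls = []


def fake_fetch(pkg, spec_id):
    calls.append((pkg, spec_id))
    return {"datasetSpecializationId": spec_id, "variables": []}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_fetch(tmp, "2022-10-26", "SYSBP", fake_fetch)
    second = load_or_fetch(tmp, "2022-10-26", "SYSBP", fake_fetch)

print(len(calls))  # the second call is served from disk
```

In a study the `fetch` argument would be the `get_specialization` wrapper, and the cache directory would sit under version control or another controlled location so the pinned content is auditable.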

No subscription? Use the open exports

API access to Biomedical Concept and SDTM Dataset Specialization content sits behind a paid CDISC membership. An unentitled key returns a 401 with a members-only message even when the path and header are correct, which is a useful signal that your request reached the right resource. You do not need the live API to work with the content, though. CDISC publishes the same metadata as openly licensed (CC-BY-4.0) files in the COSMoS GitHub repository, kept current with the latest versions. For learning and prototyping these are a complete substitute; you only give up live, versioned calls.

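Biomedical Concepts: Excel and CSV exports
SDTM Dataset Specializations: Excel and CSV exports
Per-concept YAML, the same structure the API returns: the COSMoS GitHub repository

The URLs for the Excel exports and the YAML directory are listed under Sources.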
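Working from the CSV export needs nothing beyond the standard library. The sketch below reads rows and picks out the variables for one specialization. The column names (`vlm_group_id`, `sdtm_variable`, `vlm_target`) and the `Y` flag value are assumptions about the export layout, and the embedded sample is illustrative rather than copied from the published file, so check the header row of the file you download.

```python
import csv
import io

# Illustrative sample; the real export has more columns and rows.
SAMPLE = """vlm_group_id,domain,sdtm_variable,assigned_term,vlm_target
SYSBP,VS,VSTESTCD,SYSBP,
SYSBP,VS,VSORRESU,mmHg,Y
"""


def read_export(handle):
    """Read an export into a list of dicts keyed by the header row."""
    return list(csv.DictReader(handle))


def variables_for(rows, group_id, only_vlm_targets=False):
    """List the SDTM variables of one specialization, optionally VLM targets only."""
    return [
        r["sdtm_variable"]
        for r in rows
        if r["vlm_group_id"] == group_id
        and (not only_vlm_targets or r["vlm_target"] == "Y")
    ]


rows = read_export(io.StringIO(SAMPLE))
print(variables_for(rows, "SYSBP"))
print(variables_for(rows, "SYSBP", only_vlm_targets=True))
```

For a downloaded file, replace the `io.StringIO` handle with `open(path, newline="", encoding="utf-8")`.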

Building a study in real time

The same metadata that removes remapping also lets a study be configured close to the moment the protocol is finalised, rather than months later. This is the goal of CDISC's Digital Data Flow (DDF) work and its Unified Study Definitions Model (USDM), which carries the protocol as structured data. USDM v2.0 added a biomedical concept layer specifically to support CRF automation, so the digital protocol does not just describe the schedule of activities; it also points at the BCs being collected. This is also the foundation of CDISC 360i, which aims to digitise study design through analysis and automate the build of study artefacts from the digital protocol and its BCs.

With a protocol expressed in USDM and its activities linked to BCs, a study build becomes a series of selections and resolutions rather than hand authoring:

Select the BCs implied by the schedule of activities in the digital protocol.
Resolve each BC to its Dataset Specialization to obtain the SDTM-ready variable metadata.
Generate the EDC forms and edit checks from the BC properties, and emit the data contracts that downstream providers will return with their data.
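The three steps above can be sketched with plain dictionaries standing in for the digital protocol and the specialization library. Everything here is a simplified assumption: the schedule structure, the `collected` key, the study identifier, the `LOCAL_TEST` concept, and the slash-separated contract string are all hypothetical and do not reflect the USDM or DDF data models.

```python
def select_bcs(schedule):
    """Step 1: the distinct BCs implied by the schedule of activities."""
    return sorted({bc for bcs in schedule.values() for bc in bcs})


def resolve(bc_ids, library):
    """Step 2: look up each BC's specialization; report what is not covered."""
    found = {bc: library[bc] for bc in bc_ids if bc in library}
    missing = [bc for bc in bc_ids if bc not in library]
    return found, missing


def data_contracts(study, schedule, resolved):
    """Step 3: one address per collected variable, per BC, per timepoint."""
    contracts = []
    for visit, bcs in schedule.items():
        for bc in bcs:
            for variable in resolved.get(bc, {}).get("collected", []):
                contracts.append(f"{study}/{visit}/{bc}/{variable}")
    return contracts


# Hypothetical inputs: two visits, one covered test, one local test with no BC.
schedule = {"Screening": ["SYSBP", "LOCAL_TEST"], "Week 4": ["SYSBP"]}
library = {"SYSBP": {"domain": "VS", "collected": ["VSORRES", "VSORRESU"]}}

selected = select_bcs(schedule)
resolved, missing = resolve(selected, library)
contracts = data_contracts("STUDY01", schedule, resolved)
print(selected, missing)
print(contracts)
```

Returning the unresolved concepts separately, instead of raising an error, keeps the long tail of uncovered tests visible as a work list.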

A data contract is the idea that ties this together. It is a unique address for one data point: a specific property of a specific BC at a specific timepoint for a study. If the EDC and the lab return that identifier alongside each value, you no longer map ALB_SERUM_RES to LBORRES; you look up where the contract said the value belongs. The data4knowledge technology demonstrator built on the DDF prototype shows this end to end, loading values keyed by contract into a graph store and reading SDTM straight back out:

LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS row
MATCH (dc:DataContract {uri: row['DC_URI']})
MERGE (d:DataPoint {uri: row['DATAPOINT_URI'], value: row['VALUE']})
MERGE (s:Subject {identifier: row['USUBJID']})
MERGE (dc)<-[:FOR_DC_REL]-(d)
MERGE (d)-[:FOR_SUBJECT_REL]->(s)

The point is not the graph database, which is one implementation choice. The point is that when collection is keyed to the same concepts that define submission, a study can be stood up and its SDTM produced on demand, because the structure was decided when the protocol was, not rebuilt afterward.
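The same routing can be shown without a graph store. In this sketch each incoming value carries its contract identifier, and a lookup table says which SDTM variable it fills. The column names `DC_URI`, `USUBJID`, and `VALUE` follow the load script above; the contract identifiers, the contents of the contract table, and the subject identifier are hypothetical.

```python
# Hypothetical contract table: identifier -> where the value belongs.
CONTRACTS = {
    "dc/001": {"domain": "VS", "testcd": "SYSBP", "visit": "Screening", "variable": "VSORRES"},
    "dc/002": {"domain": "VS", "testcd": "SYSBP", "visit": "Screening", "variable": "VSORRESU"},
}


def to_sdtm(rows, contracts):
    """Assemble SDTM-style records by looking up each value's contract."""
    records = {}
    for row in rows:
        c = contracts[row["DC_URI"]]
        # Values for the same subject, test, and timepoint share one record.
        key = (row["USUBJID"], c["domain"], c["testcd"], c["visit"])
        record = records.setdefault(key, {
            "USUBJID": row["USUBJID"],
            c["domain"] + "TESTCD": c["testcd"],
            "VISIT": c["visit"],
        })
        record[c["variable"]] = row["VALUE"]
    return list(records.values())


incoming = [
    {"DC_URI": "dc/001", "USUBJID": "S1-001", "VALUE": "120"},
    {"DC_URI": "dc/002", "USUBJID": "S1-001", "VALUE": "mmHg"},
]

vs = to_sdtm(incoming, CONTRACTS)
print(vs)
```

No source-specific variable name appears anywhere in the function, which is the practical meaning of "the mapping was never created".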

Technical challenges

The vision is clean; the practice is not yet. These are the obstacles worth planning around.

Challenge	What it means in practice
Coverage gaps	The published BC and specialization library does not cover every test, domain, or therapeutic-area nuance. You will still author specializations for the long tail, and CRF Specializations are still draft.
Versioning and governance	Content is package-dated and concepts evolve. You have to pin a package per study, track what changed between packages, and decide when to adopt a new one mid-programme.
NCIt and CT dependence	Definitions are anchored to NCI Thesaurus and CDISC Controlled Terminology. Concepts awaiting codes, or local terms with no NCIt match, fall outside the model and need a fallback.
Tooling maturity	Much of the end-to-end automation lives in prototypes and open-source demonstrators rather than validated, GxP-ready production systems. Validation and qualification effort is real.
Subscription and access	The Library API gates BC and specialization content behind paid membership. Teams without that tier rely on the open COSMoS exports, which lack live, versioned calls.
EDC and vendor adoption	The mapping-free promise depends on EDC and lab vendors honouring data contracts or specialization identifiers. Without that, you reintroduce a mapping layer at the boundary.
Organisational change	The model assumes data managers, programmers, and protocol authors work from one shared definition. That is a process change, not just a technical one, and it crosses team boundaries that today are siloed.

Table 4. Practical obstacles to BC-driven automation and what each one costs.
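For the versioning row in particular, tracking what changed between packages can start as a plain comparison of two copies of the same specialization. This is a minimal sketch that assumes the `variables` list shape from the trimmed response earlier in this article; the "new" package content below is invented purely to show each kind of difference.

```python
def diff_specialization(old, new):
    """Report variables added, removed, or changed between two package versions."""
    old_vars = {v["name"]: v for v in old["variables"]}
    new_vars = {v["name"]: v for v in new["variables"]}
    shared = old_vars.keys() & new_vars.keys()
    return {
        "added": sorted(new_vars.keys() - old_vars.keys()),
        "removed": sorted(old_vars.keys() - new_vars.keys()),
        "changed": sorted(n for n in shared if old_vars[n] != new_vars[n]),
    }


pinned = {"variables": [
    {"name": "VSTESTCD", "assignedTerm": {"value": "SYSBP"}},
    {"name": "VSORRESU", "assignedTerm": {"value": "mmHg"}},
]}
# Invented later version: one variable gains a flag and one variable is new.
candidate = {"variables": [
    {"name": "VSTESTCD", "assignedTerm": {"value": "SYSBP"}},
    {"name": "VSORRESU", "assignedTerm": {"value": "mmHg"}, "vlmTarget": True},
    {"name": "VSPOS"},
]}

changes = diff_specialization(pinned, candidate)
print(changes)
```

Running this across every specialization a study uses gives a concrete basis for deciding whether to adopt a new package mid-programme.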

None of these is a reason to wait. Coverage, terminology, and tooling are all improving release over release, and the parts that are stable today, such as pulling published specializations to seed value level metadata and Define-XML, already remove real manual effort. The sensible posture is to adopt BCs where the library covers your tests, build a local process for the gaps, and pin versions deliberately so you are never surprised by a package change.

Where to start

If you want to get hands-on without committing a programme, download the open COSMoS exports listed under Sources, pull the SDTM Dataset Specializations for a handful of vital signs and chemistry tests you know well, and compare the generated value level metadata against a Define-XML you wrote by hand. The differences will tell you quickly where the library fits your work and where your local conventions diverge. From there, the move toward USDM-linked protocols and real-time builds is an extension of the same idea rather than a separate project: define the concept once, and let collection and submission be two readings of it.

Sources

CDISC Biomedical Concepts (overview). https://www.cdisc.org/cdisc-biomedical-concepts
CDISC 360i initiative. https://www.cdisc.org/standards/cdisc-360i
COSMoS project documentation. https://cdisc-org.github.io/COSMoS/
COSMoS open export - Biomedical Concepts (Excel). https://cdisc-org.github.io/COSMoS/export/cdisc_biomedical_concepts_latest.xlsx
COSMoS open export - SDTM Dataset Specializations (Excel). https://cdisc-org.github.io/COSMoS/export/cdisc_sdtm_dataset_specializations_latest.xlsx
COSMoS per-concept YAML files. https://github.com/cdisc-org/COSMoS/tree/main/yaml
BC Starter Package and API guidance. https://cdisc-org.github.io/COSMoS/bc_starter_package/
COSMoS OpenAPI definition (endpoints and schemas). https://github.com/cdisc-org/COSMoS/blob/main/openapi/cosmos.yaml
CDISC Library API Portal. https://api.developer.library.cdisc.org/
CDISC Library - Getting Started. https://www.cdisc.org/cdisc-library/getting-started
Digital Data Flow (DDF) and USDM. https://www.cdisc.org/ddf
USDM in action: from protocol to SDTM (data4knowledge). https://d4k.dk/2024/08/09/usdm-in-action_-from-protocol-to-sdtm/
Define-XML VLM from Dataset Specializations (PharmaSUG 2023, SS-140). https://www.lexjansen.com/pharmasug/2023/SS/PharmaSUG-2023-SS-140.pdf

CDISC Biomedical Concepts: One Metadata Spine from Protocol to Submission