A working guide for statistical programmers: the build flow, the CDISC Library API, real-time study creation, and where it breaks
Most of us learned the trade backwards. We collect data on a CRF designed by one team, receive it from an EDC configured by another, then spend months mapping it into SDTM, writing value level metadata by hand, and reconciling the Define-XML against what the protocol actually said. Every handoff loses information, and every team rebuilds the same knowledge in its own format. A blood pressure is a blood pressure whether it sits on a CRF, in an EDC export, or in the VS domain, yet we re-describe it three or four times and then write specs to translate between the descriptions.
Biomedical Concepts (BCs) are CDISC's response to that inefficiency. A BC is a single, machine-readable definition of what a measurement is, independent of any one standard, with enough detail attached that the same definition can drive a CRF, configure an EDC, and produce SDTM with its value level metadata. The work is run through the COSMoS project (Conceptual and Operational Standards Metadata Services), published in the CDISC Library, and reachable through a REST API. This article walks the full path: what a BC contains, how it threads protocol design to CRF/EDC build to SDTM, how to pull it from the API, how it underpins a real-time study build, and the problems you will hit putting it into practice.
It helps to think of a BC not as a single object but as three connected layers. The top layer is conceptual and standards-agnostic. The middle layer breaks the concept into its properties. The bottom layer is the implementation that programmers care about: pre-built SDTM variable metadata you can drop into a spec.
| Layer | What it is | Example (Systolic Blood Pressure) |
| --- | --- | --- |
| Biomedical Concept | The semantic definition, anchored to an NCI Thesaurus concept. Names, synonyms, category, result scale, and external codes (LOINC) live here. | Concept C25298, category Vital Signs, result scale Quantitative |
| Data Element Concepts | The named properties the concept needs to be unambiguous, each typed and bound to controlled terminology. | Test, result value, unit, body position, laterality, timing |
| SDTM Dataset Specialization | The implementation layer: the BC expressed as SDTM variables with roles, codelists, assigned terms, relationships, and VLM targets. | Domain VS; VSTESTCD=SYSBP, VSORRES, VSORRESU=mmHg, VSPOS, VSDTC |
Table 1. The three layers of a Biomedical Concept and where each one is used.
The split matters. The conceptual layer gives you a stable anchor that does not change when SDTM versions do; the NCIt code for systolic blood pressure is the same regardless of whether you submit under SDTMIG 3.2 or 3.4. The Dataset Specialization layer is where that stability turns into something you can submit. It is a configured building block: it already knows that the unit belongs in VSORRESU, that mmHg is the assigned term, that VSORRESU is the unit for the value in VSORRES, and that VSTESTCD takes the value SYSBP. That is the value level metadata you would otherwise type by hand for every Define-XML.
Figure 2. One concept resolves into typed properties and into ready-made SDTM variable metadata.
The reason BCs are worth the trouble is that one definition feeds the whole study lifecycle instead of being re-authored at each stage. Read the flow left to right.
At protocol design, the schedule of activities names what gets measured and when. Each activity (vital signs, chemistry panel, ECG) decomposes into specific tests, and each test is a Biomedical Concept. Choosing the BC at this stage is the decision that propagates everywhere downstream.
The build steps then read from the same source:
The payoff for a programmer is the removal of remapping. When the collection metadata and the submission metadata are two views of one concept, there is no ALB_SERUM_RES-to-LBORRES translation step to specify, validate, and maintain. The mapping was never created because the identity was never lost.
Take SYSBP through the layers. The BC carries the semantics and terminology; the Dataset Specialization carries the SDTM detail. Table 2 shows how the concept's properties land as VS variables.
| BC property | SDTM variable | How the specialization fills it |
| --- | --- | --- |
| Test | VSTESTCD / VSTEST | Assigned term SYSBP / Systolic Blood Pressure |
| Result value | VSORRES, VSSTRESN | Origin Collected; VSSTRESN is the numeric standardized result |
| Unit | VSORRESU | Assigned term mmHg; relationship: is the unit for the value |
| Body position | VSPOS | Codelist Position; e.g., SITTING, STANDING, SUPINE |
| Timing | VSDTC | ISO 8601 date/time of measurement |
Table 2. Systolic Blood Pressure BC properties resolved to SDTM VS variables by the Dataset Specialization.
Nothing in that table was a judgement call by the study programmer. The assigned terms, the codelist references, and the unit-to-value relationship are all carried in the published specialization, which is what makes the metadata reproducible across studies and sponsors.
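As a rough sketch of how that published metadata could be consumed, the snippet below flattens a specialization payload into value level metadata rows. The payload is a hand-trimmed sample, not a live API result, and `vlm_rows`, the row keys, and the `where` string format are illustrative names rather than anything defined by CDISC.

```python
from typing import Any, Dict, List

# Hand-trimmed sample shaped like a published SDTM Dataset Specialization.
# Illustrative only: a real payload carries more variables and more fields.
SAMPLE_SPEC: Dict[str, Any] = {
    "datasetSpecializationId": "SYSBP",
    "domain": "VS",
    "variables": [
        {"name": "VSTESTCD", "assignedTerm": {"value": "SYSBP"}, "role": "Topic"},
        {"name": "VSORRES", "role": "Qualifier", "vlmTarget": True},
        {"name": "VSORRESU", "assignedTerm": {"value": "mmHg"},
         "codelist": {"submissionValue": "VSRESU"}, "vlmTarget": True},
    ],
}


def vlm_rows(spec: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one row per VLM-target variable, conditioned on the topic test code."""
    variables = spec.get("variables", [])
    # The topic variable (here VSTESTCD) supplies the "where" condition.
    topic = next((v for v in variables if v.get("role") == "Topic"), None)
    if topic is None:
        raise ValueError("specialization has no Topic variable")
    where = f'{topic["name"]} EQ {topic["assignedTerm"]["value"]}'
    rows = []
    for v in variables:
        if not v.get("vlmTarget"):
            continue  # only variables flagged as VLM targets become rows
        rows.append({
            "dataset": spec["domain"],
            "variable": v["name"],
            "where": where,
            "assigned_value": v.get("assignedTerm", {}).get("value"),
            "codelist": v.get("codelist", {}).get("submissionValue"),
        })
    return rows


for row in vlm_rows(SAMPLE_SPEC):
    print(row)
```

The rows are plain dictionaries on purpose, so they can be written to a spec spreadsheet or fed to whatever Define-XML tooling you already use.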
The Library is a REST service that returns JSON. Content is organised into dated packages, and you address a concept or a specialization inside a package. The current BC and Dataset Specialization content lives under API version 2; requests must include the version or the service returns 404.
Authentication is by API key. Request an account through the CDISC Library API Portal, retrieve your subscription key, and send it in the api-key request header. Basic Auth was deprecated in 2020. The base URL for production is https://library.cdisc.org/api.
| Purpose | Endpoint (relative to the base URL plus /cosmos/v2) |
| --- | --- |
| List BC packages | /mdr/bc/packages |
| List BCs in a package | /mdr/bc/packages/{package}/biomedicalconcepts |
| Get one BC | /mdr/bc/packages/{package}/biomedicalconcepts/{conceptId} |
| List SDTM specialization packages | /mdr/specializations/sdtm/packages |
| List specializations in a package | /mdr/specializations/sdtm/packages/{package}/datasetspecializations |
| Get one specialization | /mdr/specializations/sdtm/packages/{package}/datasetspecializations/{id} |
Table 3. Core CDISC Library endpoints for Biomedical Concepts and SDTM Dataset Specializations.
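The endpoint patterns in Table 3 can be wrapped in two small URL builders. This is a minimal sketch that assumes the content is served under the `cosmos/v2` prefix; `bc_url` and `sdtm_spec_url` are hypothetical helper names, and the package date and identifiers in the usage lines are examples only.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: BC and specialization content sits under /cosmos/v2.
# Confirm the prefix in the API Portal for your own subscription.
BASE_URL = "https://library.cdisc.org/api/cosmos/v2"


def _join(*parts: str) -> str:
    """Append percent-encoded path segments to the base URL."""
    return "/".join([BASE_URL, *(quote(p, safe="") for p in parts)])


def bc_url(package: Optional[str] = None, concept_id: Optional[str] = None) -> str:
    """URL for the BC package list, the BCs in a package, or a single BC."""
    if package is None:
        if concept_id is not None:
            raise ValueError("concept_id requires a package")
        return _join("mdr", "bc", "packages")
    if concept_id is None:
        return _join("mdr", "bc", "packages", package, "biomedicalconcepts")
    return _join("mdr", "bc", "packages", package, "biomedicalconcepts", concept_id)


def sdtm_spec_url(package: Optional[str] = None, spec_id: Optional[str] = None) -> str:
    """URL for the SDTM specialization package list, one package, or one specialization."""
    root = ("mdr", "specializations", "sdtm", "packages")
    if package is None:
        if spec_id is not None:
            raise ValueError("spec_id requires a package")
        return _join(*root)
    if spec_id is None:
        return _join(*root, package, "datasetspecializations")
    return _join(*root, package, "datasetspecializations", spec_id)


print(bc_url("2022-10-26", "C49676"))
print(sdtm_spec_url("2022-10-26", "SYSBP"))
```

Keeping URL construction separate from the HTTP call makes it easy to log exactly which package a study pulled from.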
A request that lists the available BC packages looks like this (PowerShell syntax, so the line continuation is a backtick). Note that BC content is served under the cosmos/v2 path; confirm the exact path in the API Portal console for your subscription:

    curl.exe -i -H "api-key: $env:CDISC_LIBRARY_KEY" `
      "https://library.cdisc.org/api/cosmos/v2/mdr/bc/packages"

The response for a single BC (trimmed) shows the concept, its external coding, and its Data Element Concepts:

    {
      "conceptId": "C49676",
      "shortName": "Pulse Rate",
      "category": ["Vital Signs"],
      "resultScale": "Quantitative",
      "coding": [
        { "code": "39156-5", "system": "http://loinc.org/", "systemName": "LOINC" }
      ],
      "dataElementConcepts": [
        { "conceptId": "C123975", "shortName": "Vital Signs Laterality",
          "dataType": "string", "exampleSet": ["Left", "Right"] }
      ]
    }

The matching SDTM Dataset Specialization is where the submission metadata lives. Note the assigned term, the codelist reference, the variable relationship, and the VLM target flag:

    {
      "datasetSpecializationId": "SYSBP",
      "domain": "VS",
      "shortName": "Systolic Blood Pressure",
      "variables": [
        { "name": "VSTESTCD", "assignedTerm": { "value": "SYSBP" },
          "role": "Topic", "mandatoryVariable": true },
        { "name": "VSORRESU", "assignedTerm": { "value": "mmHg" },
          "codelist": { "conceptId": "C66770", "submissionValue": "VSRESU" },
          "relationship": { "predicateTerm": "IS_UNIT_FOR", "object": "VSORRES" },
          "vlmTarget": true }
      ]
    }

From a programmer's seat the practical move is to pull these once per study, cache them, and treat them as the source of your spec. A thin Python wrapper is enough:
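
    import os
    import requests

    BASE = "https://library.cdisc.org/api/cosmos/v2"
    # The key comes from the environment so it never lands in source control.
    H = {"api-key": os.environ["CDISC_LIBRARY_KEY"], "Accept": "application/json"}

    def get_specialization(pkg, spec_id):
        """Fetch one SDTM Dataset Specialization from a dated package."""
        url = f"{BASE}/mdr/specializations/sdtm/packages/{pkg}/datasetspecializations/{spec_id}"
        r = requests.get(url, headers=H, timeout=30)
        r.raise_for_status()  # fail loudly on 401/404 rather than parse an error body
        return r.json()

    # Keep only the variables flagged as value level metadata targets.
    vlm = [v for v in get_specialization("2022-10-26", "SYSBP")["variables"]
           if v.get("vlmTarget")]
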
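The "pull once, cache" step could look something like the sketch below. It assumes a simple one-file-per-specialization layout on disk, which is a choice made here for illustration. `cached_specialization` and `fake_fetch` are hypothetical names; in real use you would pass the `get_specialization` wrapper as the `fetch` argument.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def cached_specialization(cache_dir: str, pkg: str, spec_id: str,
                          fetch: Callable[[str, str], Dict[str, Any]]) -> Dict[str, Any]:
    """Return the cached JSON for (pkg, spec_id), calling fetch only on a cache miss."""
    path = Path(cache_dir) / pkg / f"{spec_id}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    data = fetch(pkg, spec_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Sorted keys keep the cached file stable, which makes diffs between pulls readable.
    path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
    return data


# Stand-in fetcher so the sketch runs offline; it records how often it is called.
calls = []

def fake_fetch(pkg: str, spec_id: str) -> Dict[str, Any]:
    calls.append((pkg, spec_id))
    return {"datasetSpecializationId": spec_id, "domain": "VS", "variables": []}


with tempfile.TemporaryDirectory() as tmp:
    first = cached_specialization(tmp, "2022-10-26", "SYSBP", fake_fetch)
    second = cached_specialization(tmp, "2022-10-26", "SYSBP", fake_fetch)

print(len(calls))  # number of times the fetcher was actually invoked
```

Because the package date is part of the cache path, pinning a study to one package and keeping the files under version control gives you a record of exactly what the spec was built from.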
API access to Biomedical Concept and SDTM Dataset Specialization content sits behind a paid CDISC membership. An unentitled key returns a 401 with a members-only message even when the path and header are correct, which is a useful signal that your request reached the right resource. You do not need the live API to work with the content, though. CDISC publishes the same metadata as openly licensed (CC-BY-4.0) files in the COSMoS GitHub repository, kept current with the latest versions. For learning and prototyping these are a complete substitute; you only give up live, versioned calls.
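When scripting against the API it can help to turn the two failure modes described in this article into readable errors. The sketch below is one way to do that under stated assumptions: the hints paraphrase the behaviour described here rather than quote any official error text, and `LibraryAccessError`, `STATUS_HINTS`, and `check_status` are hypothetical names.

```python
class LibraryAccessError(Exception):
    """Raised when a Library response status indicates the request will not succeed."""


# Hints paraphrase the behaviour described in this article; they are not official messages.
STATUS_HINTS = {
    401: "key missing, invalid, or not entitled to BC and specialization content",
    404: "resource not found; check that the path includes the API version",
}


def check_status(status_code: int) -> None:
    """Return quietly for 2xx responses; otherwise raise with a hint for known codes."""
    if 200 <= status_code < 300:
        return
    hint = STATUS_HINTS.get(status_code, "unexpected status; inspect the response body")
    raise LibraryAccessError(f"HTTP {status_code}: {hint}")


for code in (200, 401, 404):
    try:
        check_status(code)
        print(code, "ok")
    except LibraryAccessError as err:
        print(err)
```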
The same metadata that removes remapping also lets a study be configured close to the moment the protocol is finalised, rather than months later. This is the goal of CDISC's Digital Data Flow (DDF) work and its Unified Study Definitions Model (USDM), which carries the protocol as structured data. USDM v2.0 added a biomedical concept layer specifically to support CRF automation, so the digital protocol not only describes the schedule of activities but also points at the BCs being collected. This is also the foundation of CDISC 360i, which aims to digitise study design through analysis and automate the build of study artefacts from the digital protocol and its BCs.
With a protocol expressed in USDM and its activities linked to BCs, a study build becomes a series of selections and resolutions rather than hand authoring:
A data contract is the idea that ties this together. It is a unique address for one data point: a specific property of a specific BC at a specific timepoint for a study. If the EDC and the lab return that identifier alongside each value, you no longer map ALB_SERUM_RES to LBORRES; you look up where the contract said the value belongs. The data4knowledge technology demonstrator built on the DDF prototype shows this end to end, loading values keyed by contract into a graph store and reading SDTM straight back out:

    LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS row
    MATCH (dc:DataContract {uri: row['DC_URI']})
    MERGE (d:DataPoint {uri: row['DATAPOINT_URI'], value: row['VALUE']})
    MERGE (s:Subject {identifier: row['USUBJID']})
    MERGE (dc)<-[:FOR_DC_REL]-(d)
    MERGE (d)-[:FOR_SUBJECT_REL]->(s)

The point is not the graph database, which is one implementation choice. The point is that when collection is keyed to the same concepts that define submission, a study can be stood up and its SDTM produced on demand, because the structure was decided when the protocol was, not rebuilt afterward.
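The same lookup idea can be shown without a graph store. The sketch below is a deliberately small, in-memory illustration and not the demonstrator's implementation: the contract URIs, the registry layout, the subject identifier, and the `to_sdtm_records` function are all invented for the example.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical contract registry: each URI addresses one BC property at one timepoint.
CONTRACTS: Dict[str, Dict[str, str]] = {
    "urn:example:dc/1": {"domain": "VS", "testcd": "SYSBP",
                         "variable": "VSORRES", "visit": "SCREENING"},
    "urn:example:dc/2": {"domain": "VS", "testcd": "SYSBP",
                         "variable": "VSORRESU", "visit": "SCREENING"},
}

# Incoming data points as a source system might return them: a value plus its contract URI.
DATAPOINTS: List[Dict[str, str]] = [
    {"USUBJID": "STUDY-001", "DC_URI": "urn:example:dc/1", "VALUE": "120"},
    {"USUBJID": "STUDY-001", "DC_URI": "urn:example:dc/2", "VALUE": "mmHg"},
]


def to_sdtm_records(datapoints: List[Dict[str, str]],
                    contracts: Dict[str, Dict[str, str]]) -> List[Dict[str, Any]]:
    """Assemble one record per subject, test, and visit by looking up each contract."""
    records: Dict[Tuple[str, str, str, str], Dict[str, Any]] = {}
    for dp in datapoints:
        # A plain lookup: an unknown contract raises KeyError instead of being guessed at.
        c = contracts[dp["DC_URI"]]
        key = (dp["USUBJID"], c["domain"], c["testcd"], c["visit"])
        rec = records.setdefault(key, {
            "USUBJID": dp["USUBJID"],
            "DOMAIN": c["domain"],
            f'{c["domain"]}TESTCD': c["testcd"],
            "VISIT": c["visit"],
        })
        rec[c["variable"]] = dp["VALUE"]  # the contract says where the value belongs
    return list(records.values())


for record in to_sdtm_records(DATAPOINTS, CONTRACTS):
    print(record)
```

Nothing in the function knows a source field name such as ALB_SERUM_RES; the only thing it needs from the sending system is the contract identifier.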
The vision is clean; the practice is not yet. These are the obstacles worth planning around.
| Challenge | What it means in practice |
| --- | --- |
| Coverage gaps | The published BC and specialization library does not cover every test, domain, or therapeutic-area nuance. You will still author specializations for the long tail, and CRF Specializations are still draft. |
| Versioning and governance | Content is package-dated and concepts evolve. You have to pin a package per study, track what changed between packages, and decide when to adopt a new one mid-programme. |
| NCIt and CT dependence | Definitions are anchored to NCI Thesaurus and CDISC Controlled Terminology. Concepts awaiting codes, or local terms with no NCIt match, fall outside the model and need a fallback. |
| Tooling maturity | Much of the end-to-end automation lives in prototypes and open-source demonstrators rather than validated, GxP-ready production systems. Validation and qualification effort is real. |
| Subscription and access | The Library API gates BC and specialization content behind paid membership. Teams without that tier rely on the open COSMoS exports, which lack live, versioned calls. |
| EDC and vendor adoption | The mapping-free promise depends on EDC and lab vendors honouring data contracts or specialization identifiers. Without that, you reintroduce a mapping layer at the boundary. |
| Organisational change | The model assumes data managers, programmers, and protocol authors work from one shared definition. That is a process change, not just a technical one, and it crosses team boundaries that today are siloed. |
Table 4. Practical obstacles to BC-driven automation and what each one costs.
None of these is a reason to wait. Coverage, terminology, and tooling are all improving release over release, and the parts that are stable today, such as pulling published specializations to seed value level metadata and Define-XML, already remove real manual effort. The sensible posture is to adopt BCs where the library covers your tests, build a local process for the gaps, and pin versions deliberately so you are never surprised by a package change.
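Pinning versions deliberately is easier if you can see what changed between two packages before adopting the newer one. The sketch below compares two versions of one specialization variable by variable. The two payloads are invented to exercise the function and do not represent real differences between published packages; `diff_specializations` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Invented payloads for illustration only; not real package content.
OLD_SPEC: Dict[str, Any] = {
    "datasetSpecializationId": "SYSBP",
    "variables": [
        {"name": "VSTESTCD", "assignedTerm": {"value": "SYSBP"}},
        {"name": "VSORRESU", "assignedTerm": {"value": "mmHg"}},
    ],
}
NEW_SPEC: Dict[str, Any] = {
    "datasetSpecializationId": "SYSBP",
    "variables": [
        {"name": "VSTESTCD", "assignedTerm": {"value": "SYSBP"}},
        {"name": "VSORRESU", "assignedTerm": {"value": "mmHg"}, "vlmTarget": True},
        {"name": "VSPOS"},
    ],
}


def diff_specializations(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, List[str]]:
    """List variables added, removed, or changed between two versions of a specialization."""
    old_vars = {v["name"]: v for v in old.get("variables", [])}
    new_vars = {v["name"]: v for v in new.get("variables", [])}
    shared = old_vars.keys() & new_vars.keys()
    return {
        "added": sorted(new_vars.keys() - old_vars.keys()),
        "removed": sorted(old_vars.keys() - new_vars.keys()),
        # Any difference in a variable's metadata counts as a change.
        "changed": sorted(n for n in shared if old_vars[n] != new_vars[n]),
    }


print(diff_specializations(OLD_SPEC, NEW_SPEC))
```

Run over every specialization a study uses, a report like this gives the team something concrete to review when deciding whether to move to a newer package mid-programme.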
If you want to get hands-on without committing a programme, download the open COSMoS exports linked above, pull the SDTM Dataset Specializations for a handful of vital signs and chemistry tests you know well, and compare the generated value level metadata against a Define-XML you wrote by hand. The differences will tell you quickly where the library fits your work and where your local conventions diverge. From there, the move toward USDM-linked protocols and real-time builds is an extension of the same idea rather than a separate project: define the concept once, and let collection and submission be two readings of it.
