After more than 35 years of relying on SAS V5 XPORT (XPT) files for regulatory submissions, the clinical data landscape is on the cusp of a transformative change. CDISC Dataset-JSON v1.1 is poised to become the new standard for electronic study data exchange, bringing clinical data infrastructure into the modern era.
Since 1989, the SAS V5 XPORT format has been the mandated transport mechanism for submitting clinical trial data to regulatory agencies. While XPT served its purpose well during the early days of electronic submissions, its limitations have become increasingly problematic as clinical trials grow more complex and data volumes expand exponentially.
The FDA formally acknowledged these limitations in their 2022 assessment, which evaluated SAS version 8 XPT, XML, and JSON formats against agency requirements. The weighted evaluation identified JSON as the optimal modern format with the potential to serve as a replacement for SAS version 5 XPT.
Variable Name Restrictions: XPT imposes an 8-character limit on variable names, forcing programmers to use cryptic abbreviations that reduce code readability and increase documentation burden.
Character String Constraints: Variable values cannot exceed 200 characters when using US-ASCII encoding, and this limit shrinks further when attempting workarounds for other character sets.
Label Length Limits: Labels are restricted to 40 characters, often insufficient for providing meaningful descriptions of complex derived variables.
No Unicode Support: XPT was built for US-ASCII encoding and provides no native support for other character sets. This creates significant challenges for global trials involving data in Chinese, Japanese, European languages, or any non-English text.
Binary Format: The proprietary binary format limits interoperability and requires specialized software for reading and manipulation.
Inefficient Storage: Fixed-length character fields waste storage space by padding with blanks up to the maximum defined length, making XPT particularly unsuitable for modern data sources like ePRO or wearable devices.
Flat Structure Only: XPT supports only two-dimensional tabular data, constraining the structural possibilities for SDTM, ADaM, and SEND datasets.
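The storage waste from fixed-length fields is easy to quantify. The sketch below (the value and the 200-byte defined length are illustrative) compares XPT-style blank-padded storage with JSON's store-only-the-value approach:

```python
import json

value = "HEADACHE"              # a short character value, 8 characters
xpt_field_length = 200          # hypothetical defined length of the XPT field

# XPT-style fixed-width storage pads with blanks up to the defined length.
xpt_bytes = len(value.ljust(xpt_field_length).encode("ascii"))

# JSON stores only the quoted value itself.
json_bytes = len(json.dumps(value).encode("utf-8"))

assert xpt_bytes == 200
assert json_bytes == len(value) + 2   # the value plus two quote characters
```

Multiply that 96% overhead across millions of rows of sparse ePRO or device data and the file-size differences reported later in this article follow directly.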
On April 9, 2025, the FDA published a Federal Register notice requesting public comments on CDISC Dataset-JSON v1.1 as a new exchange standard. The comment period closed on June 9, 2025, with 42 comments submitted to the docket (FDA-2025-N-0129).
The FDA is exploring Dataset-JSON as a potential replacement for XPT in submissions to both CBER (Center for Biologics Evaluation and Research) and CDER (Center for Drug Evaluation and Research). This initiative aligns with the FDA's Data Modernization Action Plan and represents a significant step toward building a fully digital, standards-driven regulatory ecosystem.
As of this writing, Dataset-JSON has not yet been added to the FDA Data Standards Catalog as an accepted format. However, the successful pilot results and formal request for comments signal that official support is likely forthcoming. According to FDA guidance, any adoption of Dataset-JSON would include a lag period of at least two years before becoming a requirement, during which time both XPT and Dataset-JSON would be accepted.
From September 2023 through April 2024, the FDA collaborated with CDISC and PHUSE to execute a comprehensive pilot testing the feasibility of Dataset-JSON for regulatory submissions. The pilot demonstrated several key findings:
The pilot confirmed that Dataset-JSON can serve as a drop-in replacement for XPT with no disruption to existing business processes.
FDA (United States): As the primary driver of this initiative, the FDA has invested significantly in evaluating Dataset-JSON. The agency's Office of Computational Science conducted internal testing and coordinated external test submissions from industry participants.
CDISC: The Clinical Data Interchange Standards Consortium developed and maintains the Dataset-JSON specification. CDISC's Data Exchange Standards team, led by Sam Hume (Vice President of Data Science), has been instrumental in advancing this initiative.
PHUSE: The Pharmaceutical Users Software Exchange co-led the pilot program with CDISC and FDA. Stuart Malcom of Veramed served as PHUSE's pilot lead, coordinating industry participation and feedback.
Japan PMDA: As a Platinum Member of CDISC, Japan's Pharmaceuticals and Medical Devices Agency requires CDISC standards for regulatory submissions and maintains close alignment with FDA data standards requirements.
China NMPA: CDISC standards are the preferred standards for electronic data submission in China as stipulated in their eCTD Guidance (September 2019).
European Medicines Agency (EMA): While primarily focused on eCTD v4.0 transition, EMA participates in ICH harmonization efforts and coordinates with FDA and PMDA on data standards convergence.
The involvement of multiple regulatory agencies through organizations like the International Council for Harmonisation (ICH) supports eventual global adoption of Dataset-JSON.
Dataset-JSON produces significantly smaller files than both XPT and Dataset-XML. Comparative analysis using CDISC pilot datasets shows dramatic reductions:
| Dataset | XPT (KB) | Dataset-XML (KB) | Dataset-JSON (KB) |
| --- | --- | --- | --- |
| SDTM FT | 5,917 | 4,287 | 858 |
| SDTM LB | 2,699 | 4,104 | 640 |
| SDTM VS | 784 | 1,372 | 229 |
| ADaM ADLBC | 33,441 | 145,575 | 24,942 |
| ADaM ADVS | 13,313 | 51,646 | 8,257 |
These size reductions become particularly significant for large datasets approaching the FDA's 5GB limit, potentially reducing or eliminating the need for dataset splitting.
Dataset-JSON is based on UTF-8 encoding, providing native support for over 120,000 characters across virtually all languages. This enables proper handling of Japanese AETERM values, European special characters, Chinese patient names, and other international data without transcoding issues or character corruption.
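The difference is easy to demonstrate: any JSON library round-trips UTF-8 text without transcoding. A minimal sketch using Python's standard library (the Japanese AETERM value is illustrative):

```python
import json

# A hypothetical record with a Japanese AETERM value. XPT's US-ASCII
# encoding cannot represent this, but JSON (UTF-8) handles it natively.
row = {"USUBJID": "STUDY001-001-001", "AETERM": "頭痛"}  # "headache"

# Serialize without escaping non-ASCII characters, then parse back.
encoded = json.dumps(row, ensure_ascii=False).encode("utf-8")
decoded = json.loads(encoded.decode("utf-8"))

assert decoded["AETERM"] == "頭痛"  # survives the round trip intact
```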
Dataset-JSON can optionally reference Define-XML documents for complete metadata while including essential column-level metadata (names, labels, data types, lengths) within the file itself. This provides a balance between self-describing files and comprehensive metadata documentation.
Perhaps the most significant long-term benefit is enabling future versions of CDISC Foundational Standards (SDTM, ADaM, SEND) to evolve beyond the artificial constraints imposed by XPT limitations. Variable names could exceed 8 characters, labels could properly describe complex derivations, and data structures could become more sophisticated.
Multiple open-source tools are available for converting existing datasets to Dataset-JSON format. These tools were developed through CDISC's Dataset-JSON Hackathon and continue to be maintained by the community.
The most comprehensive SAS solution is available at the GitHub repository lexjansen/dataset-json-sas. This implementation provides macros for bidirectional conversion and has been updated for Dataset-JSON v1.1.
Writing Dataset-JSON from XPT:
```sas
%write_datasetjson(
  xptpath=/path/to/file.xpt,
  jsonpath=/path/to/output.json,
  usemetadata=N,
  datasetJSONVersion=1.1,
  fileOID=STUDY001,
  originator=Sponsor Name,
  sourceSystem=SAS,
  sourceSystemVersion=9.4
);
```
Reading Dataset-JSON back to SAS:
Starting with SAS 9.4 TS1M4, you can use the JSON engine to read Dataset-JSON files. A JSON map file defines the data set structures:
```sas
FILENAME jsonfile "/path/to/dataset.json";
FILENAME mapfile "/path/to/mapfile.map";
LIBNAME jsonfile JSON FILEREF=jsonfile MAP=mapfile AUTOMAP=CREATE;
```
Several Python tools exist for Dataset-JSON conversion:
swhume/dataset-json: A CLI utility that transforms XPT, SAS7BDAT, CSV, Pandas DataFrames, and Parquet files into Dataset-JSON format. Requires Define-XML for metadata:
```shell
python dsjconvert.py -x -v --define-xml /path/to/define.xml
```
dostiep/Dataset-JSON-Python: Provides a GUI application for Windows environments to convert SAS7BDAT or XPT files to Dataset-JSON.
The datasetjson R package provides functionality for reading and writing Dataset-JSON files, enabling integration with tidyverse workflows and R data frames.
A Dataset-JSON file contains a single dataset with the following structure:
```json
{
  "datasetJSONCreationDateTime": "2026-02-04T12:00:00",
  "datasetJSONVersion": "1.1",
  "fileOID": "STUDY001.ADAE",
  "originator": "Sponsor Name",
  "sourceSystem": "SAS",
  "sourceSystemVersion": "9.4",
  "studyOID": "STUDY001",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.ADAM.1.1",
  "metaDataRef": "define.xml",
  "itemGroupOID": "IG.ADAE",
  "records": 1523,
  "name": "ADAE",
  "label": "Adverse Events Analysis Dataset",
  "columns": [
    {"itemOID": "IT.ADAE.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12},
    {"itemOID": "IT.ADAE.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 20},
    {"itemOID": "IT.ADAE.AESEQ", "name": "AESEQ", "label": "Sequence Number", "dataType": "integer"}
  ],
  "rows": [
    ["STUDY001", "STUDY001-001-001", 1],
    ["STUDY001", "STUDY001-001-002", 1]
  ]
}
```
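Because each row is a plain array ordered by the `columns` list, turning a Dataset-JSON document into row records takes only a few lines. A minimal sketch using Python's standard library (the inline document is a trimmed version of the ADAE example above):

```python
import json

# A trimmed Dataset-JSON document, same shape as the ADAE example.
doc = json.loads("""
{
  "datasetJSONVersion": "1.1",
  "name": "ADAE",
  "columns": [
    {"name": "STUDYID", "dataType": "string"},
    {"name": "USUBJID", "dataType": "string"},
    {"name": "AESEQ",   "dataType": "integer"}
  ],
  "rows": [
    ["STUDY001", "STUDY001-001-001", 1],
    ["STUDY001", "STUDY001-001-002", 1]
  ]
}
""")

# Pair each row array with the column names to build dict records.
names = [c["name"] for c in doc["columns"]]
records = [dict(zip(names, row)) for row in doc["rows"]]
```

The same `records` list drops straight into `pandas.DataFrame(records)` for analysis.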
For streaming large datasets, the NDJSON (Newline Delimited JSON) format places metadata on line 1 and each data row on subsequent lines:
```json
{"datasetJSONVersion":"1.1","columns":[...],"name":"ADAE",...}
["STUDY001", "STUDY001-001-001", 1]
["STUDY001", "STUDY001-001-002", 1]
```
This format enables processing datasets larger than available memory by reading and processing one row at a time.
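A streaming reader only ever holds one row in memory: parse line 1 for the metadata, then parse each subsequent line as a row. A sketch using Python's standard library (the `io.StringIO` stands in for a real file handle):

```python
import io
import json

# NDJSON layout: metadata object on line 1, one JSON array per row after.
ndjson = io.StringIO(
    '{"datasetJSONVersion":"1.1","name":"ADAE",'
    '"columns":[{"name":"STUDYID"},{"name":"USUBJID"},{"name":"AESEQ"}]}\n'
    '["STUDY001", "STUDY001-001-001", 1]\n'
    '["STUDY001", "STUDY001-001-002", 1]\n'
)

metadata = json.loads(ndjson.readline())       # header line only
names = [c["name"] for c in metadata["columns"]]

count = 0
for line in ndjson:                            # one row at a time
    row = dict(zip(names, json.loads(line)))
    count += 1                                 # process `row` here
```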
Dataset-JSON v1.1 represents the most significant modernization of clinical data exchange formats since electronic submissions began. The successful pilot, FDA's formal request for comments, and broad industry support indicate that Dataset-JSON will likely become an accepted submission format in the near future.
For clinical data programmers, this transition offers an opportunity to move beyond the constraints that have shaped CDISC implementations for decades. While the immediate impact will be a simple format change—same content, different container—the long-term implications for standards development and data exchange are profound.
Organizations that begin preparing now will be well-positioned to adopt Dataset-JSON smoothly when FDA support becomes official, maintaining their competitive edge in an increasingly data-driven regulatory environment.