After more than 35 years of relying on SAS V5 XPORT (XPT) files for regulatory submissions, the clinical data landscape is on the cusp of a transformative change. CDISC Dataset-JSON v1.1 is poised to become the new standard for electronic study data exchange, bringing clinical data infrastructure into the modern era.
Since 1989, the SAS V5 XPORT format has been the mandated transport mechanism for submitting clinical trial data to regulatory agencies. While XPT served its purpose well during the early days of electronic submissions, its limitations have become increasingly problematic as clinical trials grow more complex and data volumes expand exponentially.
The FDA formally acknowledged these limitations in their 2022 assessment, which evaluated SAS version 8 XPT, XML, and JSON formats against agency requirements. The weighted evaluation identified JSON as the optimal modern format with the potential to serve as a replacement for SAS version 5 XPT.
Variable Name Restrictions: XPT imposes an 8-character limit on variable names, forcing programmers to use cryptic abbreviations that reduce code readability and increase documentation burden.
Character String Constraints: Variable values cannot exceed 200 characters when using US-ASCII encoding, and this limit shrinks further when attempting workarounds for other character sets.
Label Length Limits: Labels are restricted to 40 characters, often insufficient for providing meaningful descriptions of complex derived variables.
No Unicode Support: XPT was built for US-ASCII encoding and provides no native support for other character sets. This creates significant challenges for global trials involving data in Chinese, Japanese, European languages, or any non-English text.
Binary Format: The proprietary binary format limits interoperability and requires specialized software for reading and manipulation.
Inefficient Storage: Fixed-length character fields waste storage space by padding with blanks up to the maximum defined length, making XPT particularly unsuitable for modern data sources like ePRO or wearable devices.
Flat Structure Only: XPT supports only two-dimensional tabular data, constraining the structural possibilities for SDTM, ADaM, and SEND datasets.
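The storage waste from fixed-length fields is easy to quantify. The sketch below (the value and the 200-byte defined length are illustrative) compares XPT-style blank-padded storage with JSON's store-only-the-value approach:

```python
import json

value = "HEADACHE"              # a short character value, 8 characters
xpt_field_length = 200          # hypothetical defined length of the XPT field

# XPT-style fixed-width storage pads with blanks up to the defined length.
xpt_bytes = len(value.ljust(xpt_field_length).encode("ascii"))

# JSON stores only the quoted value itself.
json_bytes = len(json.dumps(value).encode("utf-8"))

assert xpt_bytes == 200
assert json_bytes == len(value) + 2   # the value plus two quote characters
```

Multiply that 96% overhead across millions of rows of sparse ePRO or device data and the file-size differences reported later in this article follow directly.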
On April 9, 2025, the FDA published a Federal Register notice requesting public comments on CDISC Dataset-JSON v1.1 as a new exchange standard. The comment period closed on June 9, 2025, with 42 comments submitted to the docket (FDA-2025-N-0129).
The FDA is exploring Dataset-JSON as a potential replacement for XPT in submissions to both CBER (Center for Biologics Evaluation and Research) and CDER (Center for Drug Evaluation and Research). This initiative aligns with the FDA's Data Modernization Action Plan and represents a significant step toward building a fully digital, standards-driven regulatory ecosystem.
As of this writing, Dataset-JSON has not yet been added to the FDA Data Standards Catalog as an accepted format. However, the successful pilot results and formal request for comments signal that official support is likely forthcoming. According to FDA guidance, any adoption of Dataset-JSON would include a lag period of at least two years before becoming a requirement, during which time both XPT and Dataset-JSON would be accepted.
From September 2023 through April 2024, the FDA collaborated with CDISC and PHUSE to execute a comprehensive pilot testing the feasibility of Dataset-JSON for regulatory submissions. The pilot demonstrated several key findings:
The pilot confirmed that Dataset-JSON can serve as a drop-in replacement for XPT with no disruption to existing business processes.
FDA (United States): As the primary driver of this initiative, the FDA has invested significantly in evaluating Dataset-JSON. The agency's Office of Computational Science conducted internal testing and coordinated external test submissions from industry participants.
CDISC: The Clinical Data Interchange Standards Consortium developed and maintains the Dataset-JSON specification. CDISC's Data Exchange Standards team, led by Sam Hume (Vice President of Data Science), has been instrumental in advancing this initiative.
PHUSE: The Pharmaceutical Users Software Exchange co-led the pilot program with CDISC and FDA. Stuart Malcom of Veramed served as PHUSE's pilot lead, coordinating industry participation and feedback.
Japan PMDA: As a Platinum Member of CDISC, Japan's Pharmaceuticals and Medical Devices Agency requires CDISC standards for regulatory submissions and maintains close alignment with FDA data standards requirements.
China NMPA: CDISC standards are the preferred standards for electronic data submission in China as stipulated in their eCTD Guidance (September 2019).
European Medicines Agency (EMA): While primarily focused on eCTD v4.0 transition, EMA participates in ICH harmonization efforts and coordinates with FDA and PMDA on data standards convergence.
The involvement of multiple regulatory agencies through organizations like the International Council for Harmonisation (ICH) supports eventual global adoption of Dataset-JSON.
Dataset-JSON produces significantly smaller files than both XPT and Dataset-XML. Comparative analysis using CDISC pilot datasets shows dramatic reductions:
| Dataset | XPT (KB) | Dataset-XML (KB) | Dataset-JSON (KB) |
| --- | --- | --- | --- |
| SDTM FT | 5,917 | 4,287 | 858 |
| SDTM LB | 2,699 | 4,104 | 640 |
| SDTM VS | 784 | 1,372 | 229 |
| ADaM ADLBC | 33,441 | 145,575 | 24,942 |
| ADaM ADVS | 13,313 | 51,646 | 8,257 |
These size reductions become particularly significant for large datasets approaching the FDA's 5GB limit, potentially reducing or eliminating the need for dataset splitting.
Dataset-JSON is based on UTF-8 encoding, providing native support for over 120,000 characters across virtually all languages. This enables proper handling of Japanese AETERM values, European special characters, Chinese patient names, and other international data without transcoding issues or character corruption.
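The difference is easy to demonstrate: any JSON library round-trips UTF-8 text without transcoding. A minimal sketch using Python's standard library (the Japanese AETERM value is illustrative):

```python
import json

# A hypothetical record with a Japanese AETERM value. XPT's US-ASCII
# encoding cannot represent this, but JSON (UTF-8) handles it natively.
row = {"USUBJID": "STUDY001-001-001", "AETERM": "頭痛"}  # "headache"

# Serialize without escaping non-ASCII characters, then parse back.
encoded = json.dumps(row, ensure_ascii=False).encode("utf-8")
decoded = json.loads(encoded.decode("utf-8"))

assert decoded["AETERM"] == "頭痛"  # survives the round trip intact
```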
Dataset-JSON can optionally reference Define-XML documents for complete metadata while including essential column-level metadata (names, labels, data types, lengths) within the file itself. This provides a balance between self-describing files and comprehensive metadata documentation.
Perhaps the most significant long-term benefit is enabling future versions of CDISC Foundational Standards (SDTM, ADaM, SEND) to evolve beyond the artificial constraints imposed by XPT limitations. Variable names could exceed 8 characters, labels could properly describe complex derivations, and data structures could become more sophisticated.
Multiple open-source tools are available for converting existing datasets to Dataset-JSON format. These tools were developed through CDISC's Dataset-JSON Hackathon and continue to be maintained by the community.
The most comprehensive SAS solution is available at the GitHub repository lexjansen/dataset-json-sas. This implementation provides macros for bidirectional conversion and has been updated for Dataset-JSON v1.1.
Writing Dataset-JSON from XPT:
```sas
%write_datasetjson(
  xptpath=/path/to/file.xpt,
  jsonpath=/path/to/output.json,
  usemetadata=N,
  datasetJSONVersion=1.1,
  fileOID=STUDY001,
  originator=Sponsor Name,
  sourceSystem=SAS,
  sourceSystemVersion=9.4
);
```
Reading Dataset-JSON back to SAS:
Starting with SAS 9.4 TS1M4, you can use the JSON engine to read Dataset-JSON files. A JSON map file defines the data set structures:
```sas
FILENAME jsonfile "/path/to/dataset.json";
FILENAME mapfile "/path/to/mapfile.map";
LIBNAME jsonfile JSON FILEREF=jsonfile MAP=mapfile AUTOMAP=CREATE;
```
Several Python tools exist for Dataset-JSON conversion:
swhume/dataset-json: A CLI utility that transforms XPT, SAS7BDAT, CSV, Pandas DataFrames, and Parquet files into Dataset-JSON format. Requires Define-XML for metadata:
```shell
python dsjconvert.py -x -v --define-xml /path/to/define.xml
```
dostiep/Dataset-JSON-Python: Provides a GUI application for Windows environments to convert SAS7BDAT or XPT files to Dataset-JSON.
The datasetjson R package provides functionality for reading and writing Dataset-JSON files, enabling integration with tidyverse workflows and R data frames.
A Dataset-JSON file contains a single dataset with the following structure:
```json
{
  "datasetJSONCreationDateTime": "2026-02-04T12:00:00",
  "datasetJSONVersion": "1.1",
  "fileOID": "STUDY001.ADAE",
  "originator": "Sponsor Name",
  "sourceSystem": "SAS",
  "sourceSystemVersion": "9.4",
  "studyOID": "STUDY001",
  "metaDataVersionOID": "MDV.MSGv2.0.SDTMIG.3.3.ADAM.1.1",
  "metaDataRef": "define.xml",
  "itemGroupOID": "IG.ADAE",
  "records": 1523,
  "name": "ADAE",
  "label": "Adverse Events Analysis Dataset",
  "columns": [
    {"itemOID": "IT.ADAE.STUDYID", "name": "STUDYID", "label": "Study Identifier", "dataType": "string", "length": 12},
    {"itemOID": "IT.ADAE.USUBJID", "name": "USUBJID", "label": "Unique Subject Identifier", "dataType": "string", "length": 20},
    {"itemOID": "IT.ADAE.AESEQ", "name": "AESEQ", "label": "Sequence Number", "dataType": "integer"}
  ],
  "rows": [
    ["STUDY001", "STUDY001-001-001", 1],
    ["STUDY001", "STUDY001-001-002", 1]
  ]
}
```
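Because each row is a plain array ordered by the `columns` list, turning a Dataset-JSON document into row records takes only a few lines. A minimal sketch using Python's standard library (the inline document is a trimmed version of the ADAE example above):

```python
import json

# A trimmed Dataset-JSON document, same shape as the ADAE example.
doc = json.loads("""
{
  "datasetJSONVersion": "1.1",
  "name": "ADAE",
  "columns": [
    {"name": "STUDYID", "dataType": "string"},
    {"name": "USUBJID", "dataType": "string"},
    {"name": "AESEQ",   "dataType": "integer"}
  ],
  "rows": [
    ["STUDY001", "STUDY001-001-001", 1],
    ["STUDY001", "STUDY001-001-002", 1]
  ]
}
""")

# Pair each row array with the column names to build dict records.
names = [c["name"] for c in doc["columns"]]
records = [dict(zip(names, row)) for row in doc["rows"]]
```

The same `records` list drops straight into `pandas.DataFrame(records)` for analysis.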
For streaming large datasets, the NDJSON (Newline Delimited JSON) format places metadata on line 1 and each data row on subsequent lines:
```json
{"datasetJSONVersion":"1.1","columns":[...],"name":"ADAE",...}
["STUDY001", "STUDY001-001-001", 1]
["STUDY001", "STUDY001-001-002", 1]
```
This format enables processing datasets larger than available memory by reading and processing one row at a time.
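A streaming reader only ever holds one row in memory: parse line 1 for the metadata, then parse each subsequent line as a row. A sketch using Python's standard library (the `io.StringIO` stands in for a real file handle):

```python
import io
import json

# NDJSON layout: metadata object on line 1, one JSON array per row after.
ndjson = io.StringIO(
    '{"datasetJSONVersion":"1.1","name":"ADAE",'
    '"columns":[{"name":"STUDYID"},{"name":"USUBJID"},{"name":"AESEQ"}]}\n'
    '["STUDY001", "STUDY001-001-001", 1]\n'
    '["STUDY001", "STUDY001-001-002", 1]\n'
)

metadata = json.loads(ndjson.readline())       # header line only
names = [c["name"] for c in metadata["columns"]]

count = 0
for line in ndjson:                            # one row at a time
    row = dict(zip(names, json.loads(line)))
    count += 1                                 # process `row` here
```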
Dataset-JSON v1.1 represents the most significant modernization of clinical data exchange formats since electronic submissions began. The successful pilot, FDA's formal request for comments, and broad industry support indicate that Dataset-JSON will likely become an accepted submission format in the near future.
For clinical data programmers, this transition offers an opportunity to move beyond the constraints that have shaped CDISC implementations for decades. While the immediate impact will be a simple format change—same content, different container—the long-term implications for standards development and data exchange are profound.
Organizations that begin preparing now will be well-positioned to adopt Dataset-JSON smoothly when FDA support becomes official, maintaining their competitive edge in an increasingly data-driven regulatory environment.