A guardrailed clinical table shell compiler for validated R and SAS output

What the Tool Does

TLF Compiler V2 supports a clinical programming workflow where table shells, ADaM metadata, and programming conventions must stay aligned. It separates interpretation from code writing: the LLM parses and proposes structure, but code is only generated after the recipe passes validation.

Architecture

The design uses a compiler-like pipeline. Instead of asking the model to write a full program directly, the system asks for a structured recipe. This gives the application a place to validate, repair, and fall back before emitting R or SAS.
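The control flow described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the tool's actual code: the function name `compile_shell`, the callables passed into it, and the `max_repairs` limit are all hypothetical.

```python
from typing import Callable, Dict, List

Recipe = Dict[str, object]


def compile_shell(
    shell_text: str,
    propose: Callable[[str], Recipe],          # LLM step: shell -> candidate recipe
    validate: Callable[[Recipe], List[str]],   # returns a list of issue messages
    repair: Callable[[Recipe, List[str]], Recipe],
    fallback: Callable[[str], Recipe],         # deterministic recipe builder
    emit: Callable[[Recipe], str],             # recipe -> R or SAS source text
    max_repairs: int = 1,                      # hypothetical limit
) -> Dict[str, object]:
    """Emit code only from a recipe that has passed validation."""
    recipe = propose(shell_text)
    issues = validate(recipe)
    attempts = 0
    # Try to repair the proposed recipe before giving up on it.
    while issues and attempts < max_repairs:
        recipe = repair(recipe, issues)
        issues = validate(recipe)
        attempts += 1
    used_fallback = False
    if issues:
        # The deterministic fallback is validated like any other recipe.
        recipe = fallback(shell_text)
        issues = validate(recipe)
        used_fallback = True
    if issues:
        raise ValueError(f"no valid recipe: {issues}")
    return {"code": emit(recipe), "repairs": attempts, "used_fallback": used_fallback}
```

The point of this shape is that `emit` is unreachable unless `validate` has returned an empty issue list, whichever path produced the recipe.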

Recipe Schema

{
  "approach": "tplyr",
  "dataset_var": "adae",
  "pre_filters": ["SAFFL == 'Y'", "TRTEMFL == 'Y'"],
  "derived_vars": [{"dataset_var": "adae", "name": "ANY_EVENT", "expr": "'Yes'"}],
  "tables": [{
    "table_var": "t1",
    "dataset_var": "adae",
    "treatment_var": "TRTP",
    "add_total": true,
    "layers": [
      {"type": "group_count", "var": "ANY_EVENT", "nested_var": null, "by_var": null, "distinct_by": "USUBJID"},
      {"type": "group_count", "var": "AEBODSYS", "nested_var": "AEDECOD", "by_var": null, "distinct_by": "USUBJID"}
    ]
  }],
  "combine_method": "bind_rows"
}

Guardrails and Repair

V2 validates at multiple levels. The system checks whether required fields exist, whether values are plausible variable names, whether treatment and layer variables are known, and whether adverse-event tables use the required nested SOC/PT pattern.
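The four checks listed above could look roughly like the following Python sketch. It is illustrative only: the function name `validate_recipe`, the `known_vars` and `is_ae_table` parameters, the required-field lists, and the identifier pattern (a SAS-compatible name of at most 32 characters) are assumptions, not the tool's actual rules.

```python
import re
from typing import Dict, List, Set

# Assumption: a plausible name is a SAS-compatible identifier of at most 32 characters.
VAR_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")
REQUIRED_TOP = ("approach", "dataset_var", "tables")
REQUIRED_TABLE = ("table_var", "dataset_var", "treatment_var", "layers")


def validate_recipe(recipe: Dict, known_vars: Set[str], is_ae_table: bool) -> List[str]:
    """Return human-readable issues; an empty list means the recipe passed."""
    issues: List[str] = []
    for key in REQUIRED_TOP:
        if key not in recipe:
            issues.append(f"missing field: {key}")
    # Variables created by the recipe itself count as known.
    known = known_vars | {d["name"] for d in recipe.get("derived_vars", [])}
    for t_idx, table in enumerate(recipe.get("tables", [])):
        for key in REQUIRED_TABLE:
            if key not in table:
                issues.append(f"tables[{t_idx}] missing field: {key}")
        layers = table.get("layers", [])
        names = [table.get("treatment_var")]
        for layer in layers:
            names += [layer.get(k) for k in ("var", "nested_var", "by_var", "distinct_by")]
        for name in names:
            if name is None:
                continue  # optional slots such as nested_var may be null
            if not isinstance(name, str) or not VAR_NAME.fullmatch(name):
                issues.append(f"implausible variable name: {name!r}")
            elif name not in known:
                issues.append(f"unknown variable: {name}")
        has_nested = any(
            lyr.get("var") == "AEBODSYS" and lyr.get("nested_var") == "AEDECOD"
            for lyr in layers
        )
        if is_ae_table and not has_nested:
            issues.append(f"tables[{t_idx}] lacks nested AEBODSYS/AEDECOD layer")
    return issues
```

Returning a list of issues instead of raising on the first problem gives a repair step the full picture, and lets issue counts be compared before and after repair.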

R and SAS Generation

The same validated recipe now drives both program outputs. R uses Tplyr-oriented assembly; SAS uses PROC FREQ, PROC MEANS, DATA steps, and PROC LIFETEST for survival-style outputs. Because both programs are generated from one recipe, a change to the table definition reaches R and SAS together, which reduces drift between the two implementations.
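To illustrate how a single recipe layer might feed both targets, here is a hedged Python sketch. The emitter names (`emit_r_layer`, `emit_sas_layer`) and the work dataset names `_dedup` and `_counts` are hypothetical, and the SAS text covers only the term-level subject count; a SOC-level subject count would need its own de-duplication pass.

```python
from typing import Dict


def emit_r_layer(layer: Dict) -> str:
    """Render one group_count layer as a Tplyr add_layer() call."""
    target = layer["var"]
    if layer.get("nested_var"):
        target = f"vars({layer['var']}, {layer['nested_var']})"
    lines = ["add_layer(", f"  group_count({target})"]
    if layer.get("distinct_by"):
        lines[-1] += " %>%"
        lines.append(f"    set_distinct_by({layer['distinct_by']})")
    lines.append(")")
    return "\n".join(lines)


def emit_sas_layer(layer: Dict, dataset: str, treatment_var: str) -> str:
    """Render the same layer as an optional de-duplicating sort plus PROC FREQ."""
    keys = [layer["var"]] + ([layer["nested_var"]] if layer.get("nested_var") else [])
    source = dataset
    steps = []
    if layer.get("distinct_by"):
        # Keep one record per subject and term so frequencies count subjects.
        by_vars = " ".join([treatment_var, layer["distinct_by"]] + keys)
        steps.append(f"proc sort data={dataset} out=_dedup nodupkey;\n  by {by_vars};\nrun;")
        source = "_dedup"
    crossing = "*".join([treatment_var] + keys)
    steps.append(f"proc freq data={source} noprint;\n  tables {crossing} / out=_counts;\nrun;")
    return "\n".join(steps)
```

Both emitters read the same `var`, `nested_var`, and `distinct_by` fields, so a recipe change cannot reach one language without reaching the other.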

Example: AE SOC/PT Table

A common failure mode is treating the visible PT examples on the shell as the complete output. V2 instead marks SOC and PT rows as dynamic and produces a nested layer so all observed SOC/PT values in ADAE can appear.

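add_layer(
  # Nested count: AEDECOD (PT) rows within each AEBODSYS (SOC)
  group_count(vars(AEBODSYS, AEDECOD)) %>%
    # Count each subject once per term rather than once per event record
    set_distinct_by(USUBJID) %>%
    # Count layers take an unnamed f_str; distinct_n and distinct_pct report the subject-level counts
    set_format_strings(f_str("xx (xx.x%)", distinct_n, distinct_pct))
)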

Evaluation Harness

The evaluation harness runs golden cases under controlled settings. It records route accuracy, recipe issue counts before and after repair, fallback use, and AE nested-layer accuracy when applicable.
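A small Python sketch of how per-case results might be rolled up into these metrics follows. The field names `route`, `expected_route`, `post_repair_recipe_issue_count`, and `used_deterministic_fallback` follow the log example in this document; `pre_repair_recipe_issue_count`, `ae_nested_ok`, and the function name `summarize_eval` are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def summarize_eval(results: List[Dict]) -> Dict[str, Optional[float]]:
    """Aggregate per-case results into harness-level metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no eval results to summarize")
    # AE nested-layer accuracy only applies to cases that recorded a verdict.
    ae_cases = [r for r in results if r.get("ae_nested_ok") is not None]
    return {
        "route_accuracy": sum(r["route"] == r["expected_route"] for r in results) / n,
        "mean_issues_pre_repair": sum(r["pre_repair_recipe_issue_count"] for r in results) / n,
        "mean_issues_post_repair": sum(r["post_repair_recipe_issue_count"] for r in results) / n,
        "fallback_rate": sum(bool(r["used_deterministic_fallback"]) for r in results) / n,
        "ae_nested_accuracy": (
            sum(bool(r["ae_nested_ok"]) for r in ae_cases) / len(ae_cases) if ae_cases else None
        ),
    }
```

Reporting `None` when no AE case is present keeps "not applicable" distinct from an accuracy of zero.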

Session Logs

The app logs session and run identifiers, event sequence, status, and structured details. When a case fails, the log includes route, expected route, recipe issue counts, repair attempts, fallback use, and error text.

{
  "event": "eval_case_completed",
  "status": "WARNING",
  "details": {
    "case_id": "ae_soc_pt_core",
    "route": "ae",
    "expected_route": "ae",
    "post_repair_recipe_issue_count": 1,
    "used_deterministic_fallback": true
  }
}
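A structured logger that produces records of this kind might look like the following Python sketch. The class name `SessionLogger` and the key names `session_id`, `run_id`, and `seq` are assumptions; the source names the concepts (session and run identifiers, event sequence) but not the exact keys.

```python
import itertools
import json
import uuid
from typing import Dict, List, Optional


class SessionLogger:
    """Collects JSON-serializable events for one session/run pair."""

    def __init__(self, session_id: Optional[str] = None, run_id: Optional[str] = None):
        # Generate identifiers when the caller does not supply them.
        self.session_id = session_id or uuid.uuid4().hex
        self.run_id = run_id or uuid.uuid4().hex
        self._seq = itertools.count(1)  # monotonically increasing event sequence
        self.events: List[Dict] = []

    def log(self, event: str, status: str, details: Dict) -> str:
        """Record one event and return it as a JSON line."""
        record = {
            "session_id": self.session_id,
            "run_id": self.run_id,
            "seq": next(self._seq),
            "event": event,
            "status": status,
            "details": details,
        }
        self.events.append(record)
        return json.dumps(record)
```

For example, `SessionLogger("s1", "r1").log("eval_case_completed", "WARNING", {"case_id": "ae_soc_pt_core"})` returns one JSON line whose `seq` is 1.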

Operational Recommendations

Keep the shell-parsing temperature flexible enough for visual interpretation, but run recipe normalization and repair at a lower temperature.
Use heuristic routing for baseline testing, then benchmark LLM or consensus routing once the recipe path is stable.
Treat deterministic fallback as a safety net, not a replacement for improving the LLM recipe prompt.
Add real sponsor/project shells to eval_cases over time; the harness becomes more valuable as the golden set becomes less toy-like.
Review both R and SAS outputs side by side because they share intent but differ in implementation details.
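As a baseline for the heuristic routing mentioned in these recommendations, a keyword matcher along the following lines could serve as a starting point. This Python sketch is an assumption throughout: the keyword table, the `survival` route name, and the function `heuristic_route` are hypothetical, and only the `ae` route appears in this document.

```python
from typing import Dict, Tuple

# Hypothetical keyword table; only the "ae" route is named in the source document.
ROUTE_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "ae": ("adverse event", "system organ class", "preferred term"),
    "survival": ("kaplan-meier", "time to"),
}


def heuristic_route(shell_title: str, default: str = "unknown") -> str:
    """Return the first route whose keywords appear in the shell title."""
    title = shell_title.lower()
    for route, keywords in ROUTE_KEYWORDS.items():
        if any(keyword in title for keyword in keywords):
            return route
    return default
```

A deterministic router like this gives the evaluation harness a stable reference against which LLM or consensus routing can later be benchmarked.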
