If you're a statistical programmer, your daily workflow probably looks something like this: copy raw data down from a shared drive or portal, run your SAS programs, save outputs with a version suffix in the filename, and email the results around for review.
This works. But it's manual, fragile, and hard to trace. If an auditor asks, "Which version of a_adae.sas produced the table in your CSR?" — can you answer that with certainty?
This article shows you a better setup. We're going to use three tools together: PROC PYTHON (Python inside your SAS session), AWS S3 with the boto3 library (cloud storage for data), and Git/GitHub (version control for code).
The end result: you write one SAS program that automatically pulls data from the cloud, processes it, pushes results back, and saves a record of your code changes. No manual copying, no email attachments, no filename versioning.
Here's the big picture. Your SAS session sits in the middle, and PROC PYTHON is the bridge that talks to both AWS and GitHub:
+------------------+
| AWS S3 | Cloud storage: raw data in, results out
| (Data Storage) |
+--------+---------+
|
| boto3 (Python library for AWS)
|
+--------+---------+
| SAS Session | Your home base: DATA steps,
| + PROC PYTHON | PROC SQL, macros — all normal SAS
+--------+---------+
|
| git commands (via Python subprocess)
|
+--------+---------+
| GitHub | Version control: tracks every change
| (Code Storage) | to your .sas files
+------------------+

The workflow runs in four steps:
| Step | What Happens | How |
|---|---|---|
| 1. Pull Data | Download raw data files from S3 to your SAS server | PROC PYTHON runs boto3 to download .xpt / .csv files from an S3 bucket |
| 2. Process | Run your normal SAS programs | Standard SAS: DATA steps, PROC SQL, macros — nothing changes here |
| 3. Push Results | Upload outputs back to S3 | PROC PYTHON runs boto3 to upload .xpt / .pdf / .rtf files to S3 |
| 4. Save Code | Commit and push your programs to GitHub | PROC PYTHON runs git commands to record what changed and sync to GitHub |
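Seen as code, the four steps are just four calls in sequence. Here is a minimal Python skeleton of that orchestration; the function names and stub bodies are mine, and the real implementations are built up over the rest of this article:

```python
# Stubs standing in for the real steps; in this article steps 1 and 3
# are implemented with boto3 and step 4 with git via subprocess.
def pull_raw_data(bucket, prefix):
    return 0          # would download raw files and return the count

def run_sas_programs():
    pass              # Step 2 stays plain SAS: DATA steps, PROC SQL, macros

def push_results(bucket, prefix):
    return 0          # would upload outputs and return the count

def commit_and_push_code(message):
    return 'PUSHED'   # would git add / commit / push

def run_pipeline(study_id: str, bucket: str) -> dict:
    """Run the four steps in order and return a summary for the log."""
    summary = {'pulled': pull_raw_data(bucket, f'studies/{study_id}/raw/')}
    run_sas_programs()
    summary['pushed'] = push_results(bucket, f'studies/{study_id}/adam/')
    summary['git'] = commit_and_push_code(f'[{study_id}] pipeline run')
    return summary

print(run_pipeline('ABC-001', 'pharma-clinical-data'))
# → {'pulled': 0, 'pushed': 0, 'git': 'PUSHED'}
```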
PROC PYTHON lets you write Python code directly inside a SAS program. It was introduced in SAS 9.4M7 and is available in SAS Viya. When you run it, SAS starts a Python session in the background, runs your code, and gives you back the results.
What you need to get started:
- The PYTHON_HOME option set in your SAS config to point to the Python installation
- The Python packages boto3 (for AWS) and pandas (for data handling)

Here's the simplest possible example. This just prints a message from Python inside your SAS log:
proc python;
submit;
print('Hello from Python inside SAS!')
endsubmit;
run;

Everything between submit; and endsubmit; is Python code. Outside of that, it's regular SAS. The output from print() shows up in your SAS log.
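Before going further, it can help to confirm that the Python session SAS launches actually has the packages you need. A quick check you could run inside PROC PYTHON (check_python_env is my own helper sketch, not a SAS or PROC PYTHON API):

```python
import importlib.util
import sys

def check_python_env(packages=('boto3', 'pandas')):
    """Report which required packages this Python session can see."""
    status = {}
    print(f'Python: {sys.version.split()[0]}')
    for pkg in packages:
        status[pkg] = importlib.util.find_spec(pkg) is not None
        note = 'OK' if status[pkg] else f'MISSING - pip install {pkg}'
        print(f'{pkg}: {note}')
    return status

check_python_env()
```

If a package shows MISSING, it must be installed into the Python that PYTHON_HOME points to, not just any Python on the server.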
This is where it gets useful. You can read SAS macro variables in Python and send values back.
/* Define SAS macro variables as usual */
%let study_id = ABC-001;
%let bucket_name = my-clinical-data;
proc python;
submit;
# Read SAS macro variables into Python variables
study = SAS.symget('study_id') # study = 'ABC-001'
bucket = SAS.symget('bucket_name') # bucket = 'my-clinical-data'
print(f'Working on study: {study}')
print(f'Data bucket: {bucket}')
endsubmit;
run;

Going the other direction, you can compute something in Python and send the result back to SAS:

proc python;
submit;
# Do something in Python...
file_count = 15
# Send the result back to SAS as a macro variable
SAS.symput('n_files', str(file_count))
endsubmit;
run;
/* Now use it in SAS */
%put NOTE: Python found &n_files files;

You can also move entire datasets back and forth. SAS.sd2df() converts a SAS dataset into a Python pandas DataFrame. SAS.df2sd() goes the other direction.
proc python;
submit;
import pandas as pd
# Pull a SAS dataset into Python
df = SAS.sd2df('WORK.ADSL')
print(f'ADSL has {len(df)} subjects')
# Do something with it in Python
# (here we filter to safety population)
df_safe = df[df['SAFFL'] == 'Y']
# Push the filtered data back to SAS
SAS.df2sd(df_safe, 'WORK.ADSL_SAFE')
endsubmit;
run;
/* ADSL_SAFE is now a regular SAS dataset in WORK */
proc freq data=work.adsl_safe;
tables trt01p;
run;

SAS.symget/symput is like %let for Python. SAS.sd2df/df2sd is like PROC COPY between SAS and Python.
Amazon S3 (Simple Storage Service) is cloud-based file storage. Think of it like a network drive, but hosted by Amazon. Your files live in "buckets" (like top-level folders), and each file has a "key" (its full path inside the bucket).
For example, a raw DM dataset might be stored at:
Bucket: pharma-clinical-data
Key: studies/ABC-001/raw/dm.xpt

Before you can read or write files on S3, your SAS server needs permission. There are three ways to set this up (your IT/cloud team will tell you which one to use):
If your SAS server runs on an AWS machine (EC2 instance), your IT team can attach an "IAM role" that gives it automatic S3 access. In this case, you don't need to do anything in your code — it just works:
proc python;
submit;
import boto3
s3 = boto3.client('s3') # automatically authenticated
endsubmit;
run;

Your SAS admin sets AWS credentials as environment variables on the server, typically in the shell profile or service configuration that launches SAS, so that the Python session inherits them. The variable names below are the standard ones boto3 looks for; the values are placeholders:

export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
export AWS_DEFAULT_REGION=us-east-1

A third option: a file at ~/.aws/credentials on the SAS server stores the keys (a small INI-style file with a [default] profile containing aws_access_key_id and aws_secret_access_key entries). The boto3 library reads it automatically.
This downloads one .xpt file from S3 to your local directory:
%let s3_bucket = pharma-clinical-data;
%let study_id = ABC-001;
proc python;
submit;
import boto3
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
# Download dm.xpt from S3 to local path
s3.download_file(
Bucket = bucket,
Key = f'studies/{study}/raw/dm.xpt',
Filename = '/sas/data/raw/dm.xpt'
)
print('Downloaded dm.xpt')
endsubmit;
run;
/* Now read it in SAS as usual */
libname raw xport '/sas/data/raw/dm.xpt';
proc contents data=raw.dm; run;

This downloads every file under a given S3 prefix (folder path):
%let s3_bucket = pharma-clinical-data;
%let s3_prefix = studies/ABC-001/raw/;
%let local_dir = /sas/data/raw/;
proc python;
submit;
import boto3, os
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
prefix = SAS.symget('s3_prefix')
local_dir = SAS.symget('local_dir')
os.makedirs(local_dir, exist_ok=True)
# List all files in the S3 "folder"
paginator = s3.get_paginator('list_objects_v2')
count = 0
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
for obj in page.get('Contents', []):
filename = obj['Key'].split('/')[-1]
if filename: # skip empty keys
local_path = os.path.join(local_dir, filename)
s3.download_file(bucket, obj['Key'], local_path)
count += 1
print(f' Downloaded: {filename}')
SAS.symput('files_downloaded', str(count))
print(f'\nTotal: {count} files downloaded')
endsubmit;
run;
%put NOTE: Downloaded &files_downloaded files from S3;

For smaller files (like a CSV), you can skip saving to disk entirely. This reads a CSV from S3 straight into a pandas DataFrame and then into a SAS dataset:
proc python;
submit;
import boto3, pandas as pd
from io import BytesIO
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
# Read CSV directly from S3 into a DataFrame
obj = s3.get_object(
Bucket=bucket,
Key='studies/ABC-001/raw/ae.csv'
)
df = pd.read_csv(BytesIO(obj['Body'].read()))
print(f'Read {len(df)} AE records from S3')
# Push it into SAS as a WORK dataset
SAS.df2sd(df, 'WORK.AE_RAW')
endsubmit;
run;
/* AE_RAW is now a regular SAS dataset */
proc freq data=work.ae_raw; tables aedecod; run;

After your SAS program creates an output dataset, export it to XPT and upload:
/* Step A: Export SAS dataset to XPT */
libname xptout xport '/sas/output/adsl.xpt';
proc copy in=work out=xptout;
select adsl;
run;
libname xptout clear;
/* Step B: Upload XPT to S3 */
proc python;
submit;
import boto3
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
s3.upload_file(
Filename = '/sas/output/adsl.xpt',
Bucket = bucket,
Key = f'studies/{study}/adam/adsl.xpt'
)
print('Uploaded adsl.xpt to S3')
endsubmit;
run;

This loops through your output folder and uploads everything, then creates a manifest file (a log of what was uploaded) for audit purposes:
proc python;
submit;
import boto3, os, json
from datetime import datetime
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
outdir = '/sas/output/'
# Track what we upload
manifest = []
for fname in os.listdir(outdir):
if fname.endswith(('.xpt', '.sas7bdat', '.pdf', '.rtf')):
local_path = os.path.join(outdir, fname)
s3_key = f'studies/{study}/output/{fname}'
s3.upload_file(local_path, bucket, s3_key)
manifest.append({
'file': fname,
's3_path': s3_key,
'size': os.path.getsize(local_path),
'uploaded': datetime.now().isoformat()
})
print(f' Uploaded: {fname}')
# Save manifest (audit trail)
s3.put_object(
Bucket=bucket,
Key=f'studies/{study}/output/_manifest.json',
Body=json.dumps(manifest, indent=2)
)
print(f'\nDone: {len(manifest)} files uploaded')
SAS.symput('files_uploaded', str(len(manifest)))
endsubmit;
run;

Git is a tool that tracks changes to files over time. GitHub is a website that hosts your Git-tracked code and makes it easy to collaborate. Here's why it matters for statistical programmers:
| Problem Today | How Git Solves It |
|---|---|
| "Which version of the program made this table?" | Every save ("commit") is timestamped and tied to a specific person. You can pull up the exact code used for any past run. |
| "Someone overwrote my changes" | Git tracks every change separately. Two people can edit the same file and merge their work. |
| "We need to go back to last week's version" | One command to revert to any previous version. Nothing is ever truly lost. |
| "Did anyone QC this program?" | GitHub Pull Requests let a reviewer approve code before it goes to production. |
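That first row, the auditor question, is answered with git log. As a sketch of how you might surface one program's full history using the same subprocess pattern the rest of this article uses (program_history and the file path are illustrative, not a standard API):

```python
import subprocess

def program_history(repo_dir: str, path: str) -> str:
    """Return the commit history (hash, author, date, message) for one file."""
    result = subprocess.run(
        ['git', '-C', repo_dir, 'log', '--follow',
         '--pretty=format:%h  %an  %ad  %s', '--date=short', '--', path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Usage (paths are illustrative):
# print(program_history('/sas/projects/study-ABC001', 'programs/adam/a_adae.sas'))
```

Every line of that output is a timestamped, attributed change to a_adae.sas, which is exactly the answer the auditor is asking for.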
Your SAS server needs to be able to talk to GitHub. The most common methods:
An SSH key is like a digital ID card for your server. Generate one on the SAS server, then add the public part to your GitHub account:
# Run these once on the SAS server (in a terminal, not SAS):
ssh-keygen -t ed25519 -C 'sas-server@company.com'
# Then copy the contents of ~/.ssh/id_ed25519.pub
# and add it to GitHub > Settings > SSH and GPG Keys

Go to GitHub → Settings → Developer Settings → Personal Access Tokens, create a token with "repo" access, and store it as an environment variable on your SAS server. Your admin can help with this.
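One common pattern with a token is to keep it in an environment variable and build the HTTPS remote URL at runtime. A hedged sketch (the variable name GITHUB_TOKEN and the helper are assumptions; note that a token embedded in a URL can end up stored in .git/config, so check your site's security policy first):

```python
import os

def token_remote(org_repo: str, env_var: str = 'GITHUB_TOKEN') -> str:
    """Build an HTTPS GitHub remote URL authenticated with a personal access token."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f'{env_var} is not set on this server')
    return f'https://{token}@github.com/{org_repo}.git'

# Usage (illustrative):
# subprocess.run(['git', 'clone', token_remote('myorg/study-ABC001'), repo_dir])
```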
This downloads the latest version of your study's code repository to the SAS server:
%let git_repo = /sas/projects/study-ABC001;
%let git_remote = git@github.com:myorg/study-ABC001.git;
proc python;
submit;
import subprocess, os
repo_dir = SAS.symget('git_repo')
remote = SAS.symget('git_remote')
if os.path.exists(os.path.join(repo_dir, '.git')):
# Already cloned before — just get the latest updates
result = subprocess.run(
['git', '-C', repo_dir, 'pull', 'origin', 'main'],
capture_output=True, text=True
)
print(f'Updated: {result.stdout}')
else:
# First time — clone (download) the entire repository
result = subprocess.run(
['git', 'clone', remote, repo_dir],
capture_output=True, text=True
)
print(f'Cloned: {result.stdout}')
endsubmit;
run;

After you've made changes to your SAS programs, this commits (saves a snapshot) and pushes (uploads) them to GitHub:
proc python;
submit;
import subprocess
from datetime import datetime
repo_dir = SAS.symget('git_repo')
study = SAS.symget('study_id')
def git(args):
"""Helper: run a git command in the repo directory"""
result = subprocess.run(
['git', '-C', repo_dir] + args,
capture_output=True, text=True
)
if result.returncode != 0:
print(f' Error: {result.stderr}')
return result
# Tell git who you are (only needed once per server)
git(['config', 'user.email', 'programmer@company.com'])
git(['config', 'user.name', 'Clinical Programmer'])
# Stage changes (tell git which files to include)
git(['add', 'programs/*.sas']) # all SAS programs
git(['add', 'macros/*.sas']) # all macros
# Commit (save a snapshot with a message)
ts = datetime.now().strftime('%Y-%m-%d %H:%M')
msg = f'[{study}] Updated ADAE derivation - {ts}'
result = git(['commit', '-m', msg])
if 'nothing to commit' in (result.stdout or ''):
print('No changes to save')
else:
# Push (upload to GitHub)
git(['push', 'origin', 'main'])
print(f'Code pushed to GitHub: {msg}')
endsubmit;
run;git add) = select which files to include in the snapshot. "Commit" = save the snapshot with a description. "Push" = upload your commits to GitHub. Think of it like: select → save → share.
When multiple programmers work on the same study, branches let everyone work in their own copy without stepping on each other. When your work is ready, you create a Pull Request on GitHub for someone to review before it gets merged into the main codebase.
proc python;
submit;
import subprocess
repo = SAS.symget('git_repo')
def git(args):
return subprocess.run(
['git', '-C', repo] + args,
capture_output=True, text=True
)
# Create your own branch (like a personal copy)
git(['checkout', '-b', 'feature/adae-update-john'])
# ... make your changes to .sas files ...
# Save and upload your branch
git(['add', '-A'])
git(['commit', '-m', 'Updated ADAE baseline flag logic'])
git(['push', '-u', 'origin', 'feature/adae-update-john'])
print('Branch pushed! Create a Pull Request on GitHub.')
endsubmit;
run;

Here's everything together in one SAS program. This is a template you can adapt for your studies. It runs all four steps in sequence:
/*************************************************************
* FULL PIPELINE: Pull from S3 → Process → Push to S3 → Git
* Study: ABC-001
*************************************************************/
/* === CONFIGURATION (edit these for your study) === */
%let study_id = ABC-001;
%let s3_bucket = pharma-clinical-data;
%let s3_raw = studies/ABC-001/raw/;
%let local_raw = /sas/data/raw;
%let local_output = /sas/output;
%let git_repo = /sas/projects/study-ABC001;
/* ==================================================
STEP 1: PULL RAW DATA FROM S3
================================================== */
proc python;
submit;
import boto3, os
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
prefix = SAS.symget('s3_raw')
local = SAS.symget('local_raw')
os.makedirs(local, exist_ok=True)
paginator = s3.get_paginator('list_objects_v2')
count = 0
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
for obj in page.get('Contents', []):
fname = obj['Key'].split('/')[-1]
if fname:
s3.download_file(bucket, obj['Key'],
os.path.join(local, fname))
count += 1
print(f' Downloaded: {fname}')
SAS.symput('pull_count', str(count))
print(f'STEP 1 DONE: {count} files pulled')
endsubmit;
run;
/* ==================================================
STEP 2: PROCESS IN SAS (your normal programs)
================================================== */
libname raw xport "&local_raw/dm.xpt";
data work.adsl;
set raw.dm;
length TRT01P $40 TRT01A $40;
TRT01P = ARM;
TRT01A = ARM;
SAFFL = 'Y';
ITTFL = 'Y';
run;
libname raw clear;
%put NOTE: STEP 2 DONE - ADSL created;
/* ==================================================
STEP 3: PUSH RESULTS TO S3
================================================== */
libname xptout xport "&local_output/adsl.xpt";
proc copy in=work out=xptout; select adsl; run;
libname xptout clear;
proc python;
submit;
import boto3, os
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
outdir = SAS.symget('local_output')
uploaded = 0
for f in os.listdir(outdir):
if f.endswith('.xpt'):
s3.upload_file(
os.path.join(outdir, f),
bucket,
f'studies/{study}/adam/{f}'
)
uploaded += 1
print(f' Uploaded: {f}')
SAS.symput('push_count', str(uploaded))
print(f'STEP 3 DONE: {uploaded} files pushed')
endsubmit;
run;
/* ==================================================
STEP 4: COMMIT & PUSH CODE TO GITHUB
================================================== */
proc python;
submit;
import subprocess
from datetime import datetime
repo = SAS.symget('git_repo')
study = SAS.symget('study_id')
def git(args):
return subprocess.run(
['git', '-C', repo] + args,
capture_output=True, text=True
)
git(['add', 'programs/*.sas', 'macros/*.sas'])
ts = datetime.now().strftime('%Y-%m-%d %H:%M')
git(['commit', '-m', f'[{study}] Pipeline run - {ts}'])
result = git(['push', 'origin', 'main'])
status = 'PUSHED' if result.returncode == 0 else 'FAILED'
SAS.symput('git_status', status)
print(f'STEP 4 DONE: Git {status}')
endsubmit;
run;
/* === PIPELINE SUMMARY === */
%put NOTE: ==========================================;
%put NOTE: PIPELINE COMPLETE FOR &study_id;
%put NOTE: Files pulled from S3 : &pull_count;
%put NOTE: Files pushed to S3 : &push_count;
%put NOTE: Git status : &git_status;
%put NOTE: ==========================================;

Organize your S3 bucket the way you'd organize a study on a shared drive — but cleaner:
s3://pharma-clinical-data/
studies/
ABC-001/
raw/ # Source data from EDC
dm.xpt
ae.xpt
lb.xpt
sdtm/ # SDTM datasets
dm.xpt
ae.xpt
adam/ # ADaM datasets
adsl.xpt
adae.xpt
output/ # TFLs
tables/
figures/
listings/
logs/ # SAS logs for each run

Your Git repository tracks only code, not data. Data stays in S3.
study-ABC001/ # One repo per study
programs/
sdtm/ # SDTM programs
s_dm.sas
s_ae.sas
adam/ # ADaM programs
a_adsl.sas
a_adae.sas
tfl/ # TFL programs
t_14_1_1.sas
f_14_2_1.sas
pipeline/ # This pipeline script
master_pipeline.sas
macros/ # Shared macros
specs/ # Mapping specs, SAPs
.gitignore # Tells git what to ignore
README.md # Study overview

This file tells Git to ignore data files and other things that shouldn't be tracked:
# .gitignore for SAS Clinical Projects
# DATA FILES (these live in S3, not Git)
*.sas7bdat
*.sas7bcat
*.xpt
# OUTPUTS (regenerated from code)
*.log
*.lst
*.pdf
*.rtf
# SYSTEM FILES
*.lck
*.bak
*.tmp
.aws/
__pycache__/

| What You See | What's Wrong | How to Fix It |
|---|---|---|
| "ERROR: Procedure PYTHON not found" | SAS version is too old, or Python isn't configured | Ask your SAS admin to verify SAS 9.4M7+ and set PYTHON_HOME in sasv9.cfg |
| ModuleNotFoundError: No module named 'boto3' | boto3 isn't installed in the Python that SAS uses | Run pip install boto3 in the Python pointed to by PYTHON_HOME |
| "AccessDenied" when downloading from S3 | The SAS server doesn't have permission to read the S3 bucket | Ask your cloud admin to check IAM permissions for s3:GetObject |
| "AccessDenied" when uploading to S3 | Missing upload permission | Need s3:PutObject permission in the IAM policy |
| git push rejected | Someone else pushed changes since your last pull | Run git pull first to get their changes, then push again |
| "Permission denied (publickey)" for GitHub | SSH key not set up, or wrong key | Test with: ssh -T git@github.com. If it fails, re-add your SSH key to GitHub. |
| SAS.sd2df() runs out of memory | Dataset is too large for Python's memory | Use file-based transfer (export to XPT) instead of an in-memory DataFrame |
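For the "git push rejected" row, the usual recovery is a pull with rebase before retrying the push. A sketch in the same subprocess style as the article's other Git examples (push_with_rebase is my own helper, and the default branch name main is an assumption):

```python
import subprocess

def git(repo_dir, *args):
    """Run a git command in repo_dir and return the CompletedProcess."""
    return subprocess.run(['git', '-C', repo_dir, *args],
                          capture_output=True, text=True)

def push_with_rebase(repo_dir: str, branch: str = 'main') -> bool:
    """Push; if rejected, replay local commits on top of the remote and retry once."""
    if git(repo_dir, 'push', 'origin', branch).returncode == 0:
        return True
    # Rejected: fetch the remote's new commits and rebase ours on top
    if git(repo_dir, 'pull', '--rebase', 'origin', branch).returncode != 0:
        return False          # rebase conflict: resolve manually
    return git(repo_dir, 'push', 'origin', branch).returncode == 0
```

If the rebase hits a conflict, this returns False and the conflict has to be resolved by hand, which is the safer behavior for production code anyway.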
Security and compliance rules are especially important in clinical settings where data is regulated: keep credentials out of your code (use IAM roles, environment variables, or the credentials file), keep data out of Git (that is what the .gitignore above enforces), and keep an audit trail of every run (the Git history and the upload manifest give you one).
Here's what we covered: PROC PYTHON lets you run Python code inside SAS. With the boto3 library, that Python code can download data from AWS S3 and upload results back. With Git commands, it can save your SAS programs to GitHub with a full change history.
The result is a single SAS program that handles the entire cycle: pull data, process it, push results, and version-control the code. No manual copying, no mystery file versions, no "which shared drive was that on?" moments.
You don't have to implement all of this at once. A good starting point is to pick one piece — maybe just the S3 pull step — and try it in a sandbox environment. Once you're comfortable with PROC PYTHON and boto3, the rest follows naturally.