If you're a statistical programmer, your daily workflow probably looks something like this: copy raw data down from a shared drive or portal, run your SAS programs, save outputs with a version suffix in the filename, and email the results around for review.
This works. But it's manual, fragile, and hard to trace. If an auditor asks, "Which version of a_adae.sas produced the table in your CSR?" — can you answer that with certainty?
This article shows you a better setup. We're going to use three tools together: PROC PYTHON (Python inside your SAS session), AWS S3 with the boto3 library (cloud storage for data), and Git/GitHub (version control for code).
The end result: you write one SAS program that automatically pulls data from the cloud, processes it, pushes results back, and saves a record of your code changes. No manual copying, no email attachments, no filename versioning.
Here's the big picture. Your SAS session sits in the middle, and PROC PYTHON is the bridge that talks to both AWS and GitHub:
+------------------+
| AWS S3 | Cloud storage: raw data in, results out
| (Data Storage) |
+--------+---------+
|
| boto3 (Python library for AWS)
|
+--------+---------+
| SAS Session | Your home base: DATA steps,
| + PROC PYTHON | PROC SQL, macros — all normal SAS
+--------+---------+
|
| git commands (via Python subprocess)
|
+--------+---------+
| GitHub | Version control: tracks every change
| (Code Storage) | to your .sas files
+------------------+

The workflow runs in four steps:
| Step | What Happens | How |
|---|---|---|
| 1. Pull Data | Download raw data files from S3 to your SAS server | PROC PYTHON runs boto3 to download .xpt / .csv files from an S3 bucket |
| 2. Process | Run your normal SAS programs | Standard SAS: DATA steps, PROC SQL, macros — nothing changes here |
| 3. Push Results | Upload outputs back to S3 | PROC PYTHON runs boto3 to upload .xpt / .pdf / .rtf files to S3 |
| 4. Save Code | Commit and push your programs to GitHub | PROC PYTHON runs git commands to record what changed and sync to GitHub |
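Seen as code, the four steps are just four calls in sequence. Here is a minimal Python skeleton of that orchestration; the function names and stub bodies are mine, and the real implementations are built up over the rest of this article:

```python
# Stubs standing in for the real steps; in this article steps 1 and 3
# are implemented with boto3 and step 4 with git via subprocess.
def pull_raw_data(bucket, prefix):
    return 0          # would download raw files and return the count

def run_sas_programs():
    pass              # Step 2 stays plain SAS: DATA steps, PROC SQL, macros

def push_results(bucket, prefix):
    return 0          # would upload outputs and return the count

def commit_and_push_code(message):
    return 'PUSHED'   # would git add / commit / push

def run_pipeline(study_id: str, bucket: str) -> dict:
    """Run the four steps in order and return a summary for the log."""
    summary = {'pulled': pull_raw_data(bucket, f'studies/{study_id}/raw/')}
    run_sas_programs()
    summary['pushed'] = push_results(bucket, f'studies/{study_id}/adam/')
    summary['git'] = commit_and_push_code(f'[{study_id}] pipeline run')
    return summary

print(run_pipeline('ABC-001', 'pharma-clinical-data'))
# → {'pulled': 0, 'pushed': 0, 'git': 'PUSHED'}
```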
PROC PYTHON lets you write Python code directly inside a SAS program. It was introduced in SAS 9.4M7 and is available in SAS Viya. When you run it, SAS starts a Python session in the background, runs your code, and gives you back the results.
What you need to get started:
- The PYTHON_HOME option set in your SAS config to point to the Python installation
- The Python packages boto3 (for AWS) and pandas (for data handling)

Here's the simplest possible example. This just prints a message from Python inside your SAS log:
proc python;
submit;
print('Hello from Python inside SAS!')
endsubmit;
run;

Everything between submit; and endsubmit; is Python code. Outside of that, it's regular SAS. The output from print() shows up in your SAS log.
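Before going further, it can help to confirm that the Python session SAS launches actually has the packages you need. A quick check you could run inside PROC PYTHON (check_python_env is my own helper sketch, not a SAS or PROC PYTHON API):

```python
import importlib.util
import sys

def check_python_env(packages=('boto3', 'pandas')):
    """Report which required packages this Python session can see."""
    status = {}
    print(f'Python: {sys.version.split()[0]}')
    for pkg in packages:
        status[pkg] = importlib.util.find_spec(pkg) is not None
        note = 'OK' if status[pkg] else f'MISSING - pip install {pkg}'
        print(f'{pkg}: {note}')
    return status

check_python_env()
```

If a package shows MISSING, it must be installed into the Python that PYTHON_HOME points to, not just any Python on the server.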
This is where it gets useful. You can read SAS macro variables in Python and send values back.
/* Define SAS macro variables as usual */
%let study_id = ABC-001;
%let bucket_name = my-clinical-data;
proc python;
submit;
# Read SAS macro variables into Python variables
study = SAS.symget('study_id') # study = 'ABC-001'
bucket = SAS.symget('bucket_name') # bucket = 'my-clinical-data'
print(f'Working on study: {study}')
print(f'Data bucket: {bucket}')
endsubmit;
run;

Going the other direction, you can compute something in Python and send the result back to SAS:

proc python;
submit;
# Do something in Python...
file_count = 15
# Send the result back to SAS as a macro variable
SAS.symput('n_files', str(file_count))
endsubmit;
run;
/* Now use it in SAS */
%put NOTE: Python found &n_files files;

You can also move entire datasets back and forth. SAS.sd2df() converts a SAS dataset into a Python pandas DataFrame. SAS.df2sd() goes the other direction.
proc python;
submit;
import pandas as pd
# Pull a SAS dataset into Python
df = SAS.sd2df('WORK.ADSL')
print(f'ADSL has {len(df)} subjects')
# Do something with it in Python
# (here we filter to safety population)
df_safe = df[df['SAFFL'] == 'Y']
# Push the filtered data back to SAS
SAS.df2sd(df_safe, 'WORK.ADSL_SAFE')
endsubmit;
run;
/* ADSL_SAFE is now a regular SAS dataset in WORK */
proc freq data=work.adsl_safe;
tables trt01p;
run;

SAS.symget/symput is like %let for Python. SAS.sd2df/df2sd is like PROC COPY between SAS and Python.
Amazon S3 (Simple Storage Service) is cloud-based file storage. Think of it like a network drive, but hosted by Amazon. Your files live in "buckets" (like top-level folders), and each file has a "key" (its full path inside the bucket).
For example, a raw DM dataset might be stored at:
Bucket: pharma-clinical-data
Key: studies/ABC-001/raw/dm.xpt

Before you can read or write files on S3, your SAS server needs permission. There are three ways to set this up (your IT/cloud team will tell you which one to use):
If your SAS server runs on an AWS machine (EC2 instance), your IT team can attach an "IAM role" that gives it automatic S3 access. In this case, you don't need to do anything in your code — it just works:
proc python;
submit;
import boto3
s3 = boto3.client('s3') # automatically authenticated
endsubmit;
run;

Your SAS admin sets AWS credentials as environment variables on the server, typically in the shell profile or service configuration that launches SAS, so that the Python session inherits them. The variable names below are the standard ones boto3 looks for; the values are placeholders:

export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
export AWS_DEFAULT_REGION=us-east-1

A third option: a file at ~/.aws/credentials on the SAS server stores the keys (a small INI-style file with a [default] profile containing aws_access_key_id and aws_secret_access_key entries). The boto3 library reads it automatically.
This downloads one .xpt file from S3 to your local directory:
%let s3_bucket = pharma-clinical-data;
%let study_id = ABC-001;
proc python;
submit;
import boto3
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
# Download dm.xpt from S3 to local path
s3.download_file(
Bucket = bucket,
Key = f'studies/{study}/raw/dm.xpt',
Filename = '/sas/data/raw/dm.xpt'
)
print('Downloaded dm.xpt')
endsubmit;
run;
/* Now read it in SAS as usual */
libname raw xport '/sas/data/raw/dm.xpt';
proc contents data=raw.dm; run;

This downloads every file under a given S3 prefix (folder path):
%let s3_bucket = pharma-clinical-data;
%let s3_prefix = studies/ABC-001/raw/;
%let local_dir = /sas/data/raw/;
proc python;
submit;
import boto3, os
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
prefix = SAS.symget('s3_prefix')
local_dir = SAS.symget('local_dir')
os.makedirs(local_dir, exist_ok=True)
# List all files in the S3 "folder"
paginator = s3.get_paginator('list_objects_v2')
count = 0
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
for obj in page.get('Contents', []):
filename = obj['Key'].split('/')[-1]
if filename: # skip empty keys
local_path = os.path.join(local_dir, filename)
s3.download_file(bucket, obj['Key'], local_path)
count += 1
print(f' Downloaded: {filename}')
SAS.symput('files_downloaded', str(count))
print(f'\nTotal: {count} files downloaded')
endsubmit;
run;
%put NOTE: Downloaded &files_downloaded files from S3;

For smaller files (like a CSV), you can skip saving to disk entirely. This reads a CSV from S3 straight into a pandas DataFrame and then into a SAS dataset:
proc python;
submit;
import boto3, pandas as pd
from io import BytesIO
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
# Read CSV directly from S3 into a DataFrame
obj = s3.get_object(
Bucket=bucket,
Key='studies/ABC-001/raw/ae.csv'
)
df = pd.read_csv(BytesIO(obj['Body'].read()))
print(f'Read {len(df)} AE records from S3')
# Push it into SAS as a WORK dataset
SAS.df2sd(df, 'WORK.AE_RAW')
endsubmit;
run;
/* AE_RAW is now a regular SAS dataset */
proc freq data=work.ae_raw; tables aedecod; run;

After your SAS program creates an output dataset, export it to XPT and upload:
/* Step A: Export SAS dataset to XPT */
libname xptout xport '/sas/output/adsl.xpt';
proc copy in=work out=xptout;
select adsl;
run;
libname xptout clear;
/* Step B: Upload XPT to S3 */
proc python;
submit;
import boto3
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
s3.upload_file(
Filename = '/sas/output/adsl.xpt',
Bucket = bucket,
Key = f'studies/{study}/adam/adsl.xpt'
)
print('Uploaded adsl.xpt to S3')
endsubmit;
run;

This loops through your output folder and uploads everything, then creates a manifest file (a log of what was uploaded) for audit purposes:
proc python;
submit;
import boto3, os, json
from datetime import datetime
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
outdir = '/sas/output/'
# Track what we upload
manifest = []
for fname in os.listdir(outdir):
if fname.endswith(('.xpt', '.sas7bdat', '.pdf', '.rtf')):
local_path = os.path.join(outdir, fname)
s3_key = f'studies/{study}/output/{fname}'
s3.upload_file(local_path, bucket, s3_key)
manifest.append({
'file': fname,
's3_path': s3_key,
'size': os.path.getsize(local_path),
'uploaded': datetime.now().isoformat()
})
print(f' Uploaded: {fname}')
# Save manifest (audit trail)
s3.put_object(
Bucket=bucket,
Key=f'studies/{study}/output/_manifest.json',
Body=json.dumps(manifest, indent=2)
)
print(f'\nDone: {len(manifest)} files uploaded')
SAS.symput('files_uploaded', str(len(manifest)))
endsubmit;
run;

Git is a tool that tracks changes to files over time. GitHub is a website that hosts your Git-tracked code and makes it easy to collaborate. Here's why it matters for statistical programmers:
| Problem Today | How Git Solves It |
|---|---|
| "Which version of the program made this table?" | Every save ("commit") is timestamped and tied to a specific person. You can pull up the exact code used for any past run. |
| "Someone overwrote my changes" | Git tracks every change separately. Two people can edit the same file and merge their work. |
| "We need to go back to last week's version" | One command to revert to any previous version. Nothing is ever truly lost. |
| "Did anyone QC this program?" | GitHub Pull Requests let a reviewer approve code before it goes to production. |
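That first row, the auditor question, is answered with git log. As a sketch of how you might surface one program's full history using the same subprocess pattern the rest of this article uses (program_history and the file path are illustrative, not a standard API):

```python
import subprocess

def program_history(repo_dir: str, path: str) -> str:
    """Return the commit history (hash, author, date, message) for one file."""
    result = subprocess.run(
        ['git', '-C', repo_dir, 'log', '--follow',
         '--pretty=format:%h  %an  %ad  %s', '--date=short', '--', path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Usage (paths are illustrative):
# print(program_history('/sas/projects/study-ABC001', 'programs/adam/a_adae.sas'))
```

Every line of that output is a timestamped, attributed change to a_adae.sas, which is exactly the answer the auditor is asking for.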
Your SAS server needs to be able to talk to GitHub. The most common methods:
An SSH key is like a digital ID card for your server. Generate one on the SAS server, then add the public part to your GitHub account:
# Run these once on the SAS server (in a terminal, not SAS):
ssh-keygen -t ed25519 -C 'sas-server@company.com'
# Then copy the contents of ~/.ssh/id_ed25519.pub
# and add it to GitHub > Settings > SSH and GPG Keys

Go to GitHub → Settings → Developer Settings → Personal Access Tokens, create a token with "repo" access, and store it as an environment variable on your SAS server. Your admin can help with this.
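One common pattern with a token is to keep it in an environment variable and build the HTTPS remote URL at runtime. A hedged sketch (the variable name GITHUB_TOKEN and the helper are assumptions; note that a token embedded in a URL can end up stored in .git/config, so check your site's security policy first):

```python
import os

def token_remote(org_repo: str, env_var: str = 'GITHUB_TOKEN') -> str:
    """Build an HTTPS GitHub remote URL authenticated with a personal access token."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f'{env_var} is not set on this server')
    return f'https://{token}@github.com/{org_repo}.git'

# Usage (illustrative):
# subprocess.run(['git', 'clone', token_remote('myorg/study-ABC001'), repo_dir])
```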
This downloads the latest version of your study's code repository to the SAS server:
%let git_repo = /sas/projects/study-ABC001;
%let git_remote = git@github.com:myorg/study-ABC001.git;
proc python;
submit;
import subprocess, os
repo_dir = SAS.symget('git_repo')
remote = SAS.symget('git_remote')
if os.path.exists(os.path.join(repo_dir, '.git')):
# Already cloned before — just get the latest updates
result = subprocess.run(
['git', '-C', repo_dir, 'pull', 'origin', 'main'],
capture_output=True, text=True
)
print(f'Updated: {result.stdout}')
else:
# First time — clone (download) the entire repository
result = subprocess.run(
['git', 'clone', remote, repo_dir],
capture_output=True, text=True
)
print(f'Cloned: {result.stdout}')
endsubmit;
run;

After you've made changes to your SAS programs, this commits (saves a snapshot) and pushes (uploads) them to GitHub:
proc python;
submit;
import subprocess
from datetime import datetime
repo_dir = SAS.symget('git_repo')
study = SAS.symget('study_id')
def git(args):
"""Helper: run a git command in the repo directory"""
result = subprocess.run(
['git', '-C', repo_dir] + args,
capture_output=True, text=True
)
if result.returncode != 0:
print(f' Error: {result.stderr}')
return result
# Tell git who you are (only needed once per server)
git(['config', 'user.email', 'programmer@company.com'])
git(['config', 'user.name', 'Clinical Programmer'])
# Stage changes (tell git which files to include)
git(['add', 'programs/*.sas']) # all SAS programs
git(['add', 'macros/*.sas']) # all macros
# Commit (save a snapshot with a message)
ts = datetime.now().strftime('%Y-%m-%d %H:%M')
msg = f'[{study}] Updated ADAE derivation - {ts}'
result = git(['commit', '-m', msg])
if 'nothing to commit' in (result.stdout or ''):
print('No changes to save')
else:
# Push (upload to GitHub)
git(['push', 'origin', 'main'])
print(f'Code pushed to GitHub: {msg}')
endsubmit;
run;git add) = select which files to include in the snapshot. "Commit" = save the snapshot with a description. "Push" = upload your commits to GitHub. Think of it like: select → save → share.
When multiple programmers work on the same study, branches let everyone work in their own copy without stepping on each other. When your work is ready, you create a Pull Request on GitHub for someone to review before it gets merged into the main codebase.
proc python;
submit;
import subprocess
repo = SAS.symget('git_repo')
def git(args):
return subprocess.run(
['git', '-C', repo] + args,
capture_output=True, text=True
)
# Create your own branch (like a personal copy)
git(['checkout', '-b', 'feature/adae-update-john'])
# ... make your changes to .sas files ...
# Save and upload your branch
git(['add', '-A'])
git(['commit', '-m', 'Updated ADAE baseline flag logic'])
git(['push', '-u', 'origin', 'feature/adae-update-john'])
print('Branch pushed! Create a Pull Request on GitHub.')
endsubmit;
run;

Here's everything together in one SAS program. This is a template you can adapt for your studies. It runs all four steps in sequence:
/*************************************************************
* FULL PIPELINE: Pull from S3 → Process → Push to S3 → Git
* Study: ABC-001
*************************************************************/
/* === CONFIGURATION (edit these for your study) === */
%let study_id = ABC-001;
%let s3_bucket = pharma-clinical-data;
%let s3_raw = studies/ABC-001/raw/;
%let local_raw = /sas/data/raw;
%let local_output = /sas/output;
%let git_repo = /sas/projects/study-ABC001;
/* ==================================================
STEP 1: PULL RAW DATA FROM S3
================================================== */
proc python;
submit;
import boto3, os
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
prefix = SAS.symget('s3_raw')
local = SAS.symget('local_raw')
os.makedirs(local, exist_ok=True)
paginator = s3.get_paginator('list_objects_v2')
count = 0
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
for obj in page.get('Contents', []):
fname = obj['Key'].split('/')[-1]
if fname:
s3.download_file(bucket, obj['Key'],
os.path.join(local, fname))
count += 1
print(f' Downloaded: {fname}')
SAS.symput('pull_count', str(count))
print(f'STEP 1 DONE: {count} files pulled')
endsubmit;
run;
/* ==================================================
STEP 2: PROCESS IN SAS (your normal programs)
================================================== */
libname raw xport "&local_raw/dm.xpt";
data work.adsl;
set raw.dm;
length TRT01P $40 TRT01A $40;
TRT01P = ARM;
TRT01A = ARM;
SAFFL = 'Y';
ITTFL = 'Y';
run;
libname raw clear;
%put NOTE: STEP 2 DONE - ADSL created;
/* ==================================================
STEP 3: PUSH RESULTS TO S3
================================================== */
libname xptout xport "&local_output/adsl.xpt";
proc copy in=work out=xptout; select adsl; run;
libname xptout clear;
proc python;
submit;
import boto3, os
s3 = boto3.client('s3')
bucket = SAS.symget('s3_bucket')
study = SAS.symget('study_id')
outdir = SAS.symget('local_output')
uploaded = 0
for f in os.listdir(outdir):
if f.endswith('.xpt'):
s3.upload_file(
os.path.join(outdir, f),
bucket,
f'studies/{study}/adam/{f}'
)
uploaded += 1
print(f' Uploaded: {f}')
SAS.symput('push_count', str(uploaded))
print(f'STEP 3 DONE: {uploaded} files pushed')
endsubmit;
run;
/* ==================================================
STEP 4: COMMIT & PUSH CODE TO GITHUB
================================================== */
proc python;
submit;
import subprocess
from datetime import datetime
repo = SAS.symget('git_repo')
study = SAS.symget('study_id')
def git(args):
return subprocess.run(
['git', '-C', repo] + args,
capture_output=True, text=True
)
git(['add', 'programs/*.sas', 'macros/*.sas'])
ts = datetime.now().strftime('%Y-%m-%d %H:%M')
git(['commit', '-m', f'[{study}] Pipeline run - {ts}'])
result = git(['push', 'origin', 'main'])
status = 'PUSHED' if result.returncode == 0 else 'FAILED'
SAS.symput('git_status', status)
print(f'STEP 4 DONE: Git {status}')
endsubmit;
run;
/* === PIPELINE SUMMARY === */
%put NOTE: ==========================================;
%put NOTE: PIPELINE COMPLETE FOR &study_id;
%put NOTE: Files pulled from S3 : &pull_count;
%put NOTE: Files pushed to S3 : &push_count;
%put NOTE: Git status : &git_status;
%put NOTE: ==========================================;

Organize your S3 bucket the way you'd organize a study on a shared drive — but cleaner:
s3://pharma-clinical-data/
studies/
ABC-001/
raw/ # Source data from EDC
dm.xpt
ae.xpt
lb.xpt
sdtm/ # SDTM datasets
dm.xpt
ae.xpt
adam/ # ADaM datasets
adsl.xpt
adae.xpt
output/ # TFLs
tables/
figures/
listings/
logs/ # SAS logs for each run

Your Git repository tracks only code, not data. Data stays in S3.
study-ABC001/ # One repo per study
programs/
sdtm/ # SDTM programs
s_dm.sas
s_ae.sas
adam/ # ADaM programs
a_adsl.sas
a_adae.sas
tfl/ # TFL programs
t_14_1_1.sas
f_14_2_1.sas
pipeline/ # This pipeline script
master_pipeline.sas
macros/ # Shared macros
specs/ # Mapping specs, SAPs
.gitignore # Tells git what to ignore
README.md # Study overview

This file tells Git to ignore data files and other things that shouldn't be tracked:
# .gitignore for SAS Clinical Projects
# DATA FILES (these live in S3, not Git)
*.sas7bdat
*.sas7bcat
*.xpt
# OUTPUTS (regenerated from code)
*.log
*.lst
*.pdf
*.rtf
# SYSTEM FILES
*.lck
*.bak
*.tmp
.aws/
__pycache__/

| What You See | What's Wrong | How to Fix It |
|---|---|---|
| "ERROR: Procedure PYTHON not found" | SAS version is too old, or Python isn't configured | Ask your SAS admin to verify SAS 9.4M7+ and set PYTHON_HOME in sasv9.cfg |
| ModuleNotFoundError: No module named 'boto3' | boto3 isn't installed in the Python that SAS uses | Run pip install boto3 in the Python pointed to by PYTHON_HOME |
| "AccessDenied" when downloading from S3 | The SAS server doesn't have permission to read the S3 bucket | Ask your cloud admin to check IAM permissions for s3:GetObject |
| "AccessDenied" when uploading to S3 | Missing upload permission | Need s3:PutObject permission in the IAM policy |
| git push rejected | Someone else pushed changes since your last pull | Run git pull first to get their changes, then push again |
| "Permission denied (publickey)" for GitHub | SSH key not set up, or wrong key | Test with: ssh -T git@github.com. If it fails, re-add your SSH key to GitHub. |
| SAS.sd2df() runs out of memory | Dataset is too large for Python's memory | Use file-based transfer (export to XPT) instead of an in-memory DataFrame |
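For the "git push rejected" row, the usual recovery is a pull with rebase before retrying the push. A sketch in the same subprocess style as the article's other Git examples (push_with_rebase is my own helper, and the default branch name main is an assumption):

```python
import subprocess

def git(repo_dir, *args):
    """Run a git command in repo_dir and return the CompletedProcess."""
    return subprocess.run(['git', '-C', repo_dir, *args],
                          capture_output=True, text=True)

def push_with_rebase(repo_dir: str, branch: str = 'main') -> bool:
    """Push; if rejected, replay local commits on top of the remote and retry once."""
    if git(repo_dir, 'push', 'origin', branch).returncode == 0:
        return True
    # Rejected: fetch the remote's new commits and rebase ours on top
    if git(repo_dir, 'pull', '--rebase', 'origin', branch).returncode != 0:
        return False          # rebase conflict: resolve manually
    return git(repo_dir, 'push', 'origin', branch).returncode == 0
```

If the rebase hits a conflict, this returns False and the conflict has to be resolved by hand, which is the safer behavior for production code anyway.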
Security and compliance rules are especially important in clinical settings where data is regulated: keep credentials out of your code (use IAM roles, environment variables, or the credentials file), keep data out of Git (that is what the .gitignore above enforces), and keep an audit trail of every run (the Git history and the upload manifest give you one).
Here's what we covered: PROC PYTHON lets you run Python code inside SAS. With the boto3 library, that Python code can download data from AWS S3 and upload results back. With Git commands, it can save your SAS programs to GitHub with a full change history.
The result is a single SAS program that handles the entire cycle: pull data, process it, push results, and version-control the code. No manual copying, no mystery file versions, no "which shared drive was that on?" moments.
You don't have to implement all of this at once. A good starting point is to pick one piece — maybe just the S3 pull step — and try it in a sandbox environment. Once you're comfortable with PROC PYTHON and boto3, the rest follows naturally.