Deep Dive Series - From Fundamentals to Advanced Clinical Trial Applications

1. Why Regression Matters for Statistical Programmers

Regression analysis is the quantitative backbone of nearly every efficacy and safety analysis in clinical trials. When you build an ADSL flag, derive an AVAL in ADaM, or generate a p-value in a TFL, you are almost always feeding data into — or reporting output from — a regression model. Yet many statistical programmers encounter these methods only as opaque PROC calls, rarely connecting the mathematical intuition to the SAS code they write every day.

This article bridges that gap. We start with the two most foundational regression techniques — linear regression for continuous outcomes and logistic regression for binary outcomes — then build toward advanced clinical trial applications including ANCOVA, interaction terms, and model diagnostics. Every concept is mapped directly to SAS procedures and ADaM datasets you already work with.

Who this article is for: Statistical programmers who run PROC REG, PROC GLM, PROC LOGISTIC, or PROC MIXED in production code and want a deeper understanding of what these procedures actually do — and why the statistician chose them.

Figure 6: Quick reference — Linear vs. Logistic Regression at a glance.

2. Linear Regression — The Foundation

2.1 What Is Linear Regression?

Linear regression models the relationship between a continuous outcome variable (Y) and one or more predictor variables (X). The goal is to find the straight line (or hyperplane, in multiple regression) that best fits the observed data by minimizing the sum of squared residuals — the vertical distances between observed and predicted values.

2.2 The Model Equation

Simple linear regression: Y = β₀ + β₁X + ε

Multiple linear regression: Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε

Where β₀ is the intercept (predicted Y when all X = 0), each βᵢ is the change in Y for a one-unit increase in Xᵢ holding all other predictors constant, and ε is the random error term assumed to follow a normal distribution with mean 0 and constant variance σ².
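For the simple one-predictor model, the least-squares criterion described in Section 2.1 has a closed-form solution. The standard textbook result is sketched below. The subject index i and the sample means x̄ and ȳ are notation introduced here for illustration; they do not appear elsewhere in this article.

```latex
\begin{aligned}
(\hat\beta_0,\hat\beta_1) &= \arg\min_{\beta_0,\beta_1}\sum_{i=1}^{n}\left(y_i-\beta_0-\beta_1 x_i\right)^2 \\[4pt]
\hat\beta_1 &= \frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2},
\qquad
\hat\beta_0 = \bar y-\hat\beta_1\,\bar x
\end{aligned}
```

The slope is the covariance of X and Y divided by the variance of X, and the intercept forces the fitted line through the point (x̄, ȳ).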

2.3 Clinical Example: Dose–Response in Blood Pressure

Consider a Phase 2 dose-ranging trial where 40 subjects received varying doses of an antihypertensive drug. The primary endpoint is change from baseline in systolic blood pressure (SBP). We want to quantify: for every additional milligram of drug, how much does SBP change on average?

Figure 1: Scatter plot with fitted regression line. Orange line represents the OLS estimate. Red dashed lines show residuals (prediction errors) for selected observations.

The fitted model yields the equation ŷ = 5.1 − 0.348 × Dose. For every additional 1 mg of drug, the change in systolic blood pressure is lower by approximately 0.35 mmHg; in other words, the reduction grows with dose. The intercept of 5.1 is the expected SBP change at dose zero (placebo level). Because it is positive, it corresponds to a slight rise in SBP at dose zero in this simulated example rather than a placebo-driven reduction.
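To make the fitted equation concrete, the following DATA step evaluates it at a few doses. This is only a sketch: the coefficients are the rounded values quoted in the text, and the dose levels 0, 10, 20, and 40 mg are hypothetical values chosen for illustration, not doses from the trial.

```sas
/* Predicted change in SBP from the fitted line */
data pred_demo;
  b0 = 5.1;                    /* intercept quoted in the text        */
  b1 = -0.348;                 /* slope per 1 mg, quoted in the text  */
  do dose = 0, 10, 20, 40;     /* hypothetical dose levels            */
    pred_chg = b0 + b1 * dose; /* y-hat = b0 + b1 * dose              */
    output;
  end;
run;

proc print data=pred_demo noobs;
  var dose pred_chg;
run;
```

With these inputs the predicted changes are 5.10, 1.62, −1.86, and −8.82 mmHg. Predictions are only trustworthy inside the dose range that was actually studied.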

2.4 SAS Implementation

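/* Simple Linear Regression */
proc reg data=adeff;
  model chg = dose / clb vif;            /* CLB: confidence limits; VIF is only informative with 2+ predictors */
  output out=predicted p=pred r=resid;   /* Save predicted values and residuals */
run;
quit;                                    /* PROC REG is interactive; QUIT ends it */

/* PROC GLM: use when the model needs CLASS variables or LS-means.
   Adding TRTPN makes this a different model from the PROC REG fit above.
   If each arm receives a single fixed dose, DOSE and TRTPN are aliased; keep only one. */
proc glm data=adeff;
  class trtpn;
  model chg = dose trtpn / solution;
  lsmeans trtpn / pdiff cl;              /* Pairwise differences with confidence limits */
run;
quit;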

2.5 Reading the Output

Table 1 shows the key parameter estimates from PROC REG. Each row corresponds to a coefficient in the regression equation.

Parameter	Estimate	Std Error	t Value	p-value	Interpretation
Intercept	5.12	1.84	2.78	0.0081	Expected change when dose = 0
Dose (mg)	-0.348	0.029	-12.01	<.0001	Each +1 mg → −0.35 mmHg SBP

Table 1: Linear regression parameter estimates — Dose vs. Change in SBP (simulated data).

The p-value for the Dose coefficient is <.0001, indicating a statistically significant linear relationship between dose and blood pressure reduction. The coefficient of −0.348 means that for each 1 mg increase in dose, SBP is expected to decrease by 0.348 mmHg, holding all else constant.
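The columns of Table 1 are linked by simple arithmetic, which the sketch below reproduces for the Dose row. It assumes the residual degrees of freedom are n − p − 1 = 38 (40 subjects, one predictor); because it starts from the rounded table values, results can differ from the table in the last digit.

```sas
/* Rebuild the t value, 95% CI, and p-value for the Dose coefficient */
data _null_;
  estimate = -0.348;                     /* Table 1                        */
  se       = 0.029;                      /* Table 1                        */
  n = 40;  p = 1;                        /* subjects, predictors (assumed) */
  df     = n - p - 1;                    /* residual degrees of freedom    */
  t_val  = estimate / se;                /* t = estimate / standard error  */
  t_crit = tinv(0.975, df);              /* two-sided 95% critical value   */
  lower  = estimate - t_crit * se;
  upper  = estimate + t_crit * se;
  p_val  = 2 * (1 - probt(abs(t_val), df));
  put t_val= 8.2 lower= 8.3 upper= 8.3 p_val= pvalue6.4;
run;
```

The interval comes out at roughly −0.41 to −0.29 mmHg per mg, which is what the CLB option in PROC REG reports directly.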

2.6 The R² Statistic

R² (coefficient of determination) represents the proportion of total variance in Y explained by the model. In clinical trials, R² is useful for understanding model fit but is rarely the primary inferential metric. An R² of 0.72 in our dose-response example means 72% of the variability in blood pressure change is explained by dose alone. The remaining 28% is attributable to inter-subject variability, measurement error, and unmeasured factors.

Adjusted R² penalizes the addition of unnecessary predictors and should be used whenever comparing models with different numbers of covariates. A decrease in adjusted R² when adding a variable suggests that the variable contributes little explanatory value and may only be fitting noise.
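The two statistics can be written out as follows. SSE (residual sum of squares), SST (total sum of squares), n (subjects), and p (number of predictors) are standard notation introduced here; the numeric line assumes n = 40 and p = 1 from the dose-response example together with the R² of 0.72 quoted above.

```latex
\begin{aligned}
R^2 &= 1-\frac{SSE}{SST}, \qquad
R^2_{\text{adj}} = 1-\left(1-R^2\right)\frac{n-1}{n-p-1} \\[4pt]
R^2_{\text{adj}} &= 1-(1-0.72)\cdot\frac{39}{38}\approx 0.713
\end{aligned}
```

With a single predictor and 40 subjects the penalty is small; it grows as predictors are added without a matching gain in fit.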

2.7 The LINE Assumptions

Linear regression relies on four assumptions, commonly remembered by the mnemonic LINE. Violating these assumptions can lead to biased estimates, invalid p-values, or both. Statistical programmers should understand them because diagnostic outputs from SAS (residual plots, normality tests) are evaluating these assumptions.

Figure 3: Visual diagnostics for each LINE assumption. (A) Linearity — data points cluster around the fitted line. (B) Independence — residuals show no systematic pattern over observation order. (C) Normality — residuals are approximately bell-shaped. (D) Equal Variance — residuals have constant spread across predicted values.

(L) Linearity: The relationship between X and Y is linear. A residual-vs-predicted plot should show no systematic curvature. If you see a U-shape or another curved pattern, consider adding polynomial terms or transforming X. (A fan shape points to unequal variance rather than non-linearity; see E.)

(I) Independence: Observations are independent of each other. In clinical trials, this assumption can be violated by repeated measures on the same subject (use PROC MIXED) or by site-level clustering (consider random effects).

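(N) Normality: The residuals (not the raw data) follow a normal distribution. Check with a QQ plot or the Shapiro-Wilk test (the NORMAL option in PROC UNIVARIATE). Mild non-normality is usually tolerable in larger samples (n > 30 is a common rule of thumb), because the Central Limit Theorem makes the sampling distribution of the coefficient estimates approximately normal.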

(E) Equal Variance (Homoscedasticity): The variance of residuals is constant across all levels of X. A fan-shaped residual plot signals heteroscedasticity. White's test or the Breusch-Pagan test can formally detect this. Consider heteroscedasticity-consistent (robust) standard errors (the HCC and HCCMETHOD= options on the MODEL statement in PROC REG) or a variance-stabilizing transformation.

/* Diagnostic checks in SAS */
proc reg data=adeff;
  model chg = dose / dw dwprob spec;         /* DW, DWPROB: Durbin-Watson test; SPEC: White's test */
  output out=diag r=resid p=pred;
run;
quit;

proc univariate data=diag normal;            /* NORMAL: Shapiro-Wilk and related tests */
  var resid;
  qqplot resid / normal(mu=est sigma=est);   /* Q-Q plot with a fitted reference line */
run;
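If the residual plot or White's test points to unequal variance, robust standard errors are one of the remedies named under (E). A minimal sketch, assuming the same ADEFF dataset and a SAS/STAT release in which PROC REG supports the HCC option; HCCMETHOD=3 is an illustrative choice here, and the method actually used should follow the SAP.

```sas
/* Heteroscedasticity-consistent (robust) standard errors in PROC REG */
proc reg data=adeff;
  model chg = dose / hcc hccmethod=3;   /* HCC: robust SEs; HCCMETHOD=3 is an assumed choice */
run;
quit;
```

The coefficient estimates are unchanged; only the standard errors, t values, and p-values in the added columns differ from the ordinary least-squares output.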

3. Logistic Regression — When the Outcome Is Binary

3.1 Why Not Just Use Linear Regression?

When the outcome is binary (e.g., responder/non-responder, adverse event yes/no), linear regression fails for two fundamental reasons. First, predicted values can fall outside the [0, 1] range — you might predict a −10% or 130% probability. Second, the assumption of normally distributed errors is violated because a binary outcome inherently follows a Bernoulli/binomial distribution.

Logistic regression solves both problems by modeling the log-odds of the event rather than the event itself, and using the logit link function to constrain predicted probabilities between 0 and 1.

3.2 The Logistic Model

The logit transformation: log(p / (1 − p)) = β₀ + β₁X₁ + β₂X₂ + ...

Here, p is the probability of the event occurring, and p / (1 − p) is the odds. The left side of the equation — the log of the odds — is called the logit, and it can range from −∞ to +∞, making it compatible with a linear predictor on the right side.
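Solving the logit equation for p shows why predicted probabilities always stay between 0 and 1. The symbol η for the linear predictor is shorthand introduced here for illustration.

```latex
\begin{aligned}
\eta &= \beta_0+\beta_1X_1+\beta_2X_2+\dots \\[2pt]
\log\frac{p}{1-p}=\eta
\;&\Longrightarrow\;
\frac{p}{1-p}=e^{\eta}
\;\Longrightarrow\;
p=\frac{e^{\eta}}{1+e^{\eta}}=\frac{1}{1+e^{-\eta}}
\end{aligned}
```

As η goes to −∞ the probability approaches 0, at η = 0 it equals 0.5, and as η goes to +∞ it approaches 1.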

3.3 Understanding Odds and Odds Ratios

Odds = p / (1 − p). If the probability of an adverse event is 0.20, the odds are 0.20 / 0.80 = 0.25, or "1 to 4." Odds express how likely the event is relative to the non-event.

Odds Ratio (OR) = exp(β). The odds ratio is the exponentiated regression coefficient. An OR of 1.073 for age means that for each additional year of age, the odds of the event increase by 7.3%. An OR of 0.287 for treatment means the treatment group has 71.3% lower odds compared to placebo.

Key rule of thumb: OR = 1 means no effect. OR > 1 means higher odds (risk factor). OR < 1 means lower odds (protective). The further from 1, the stronger the effect.
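The conversions above can be checked with a few lines of arithmetic. The sketch below reuses the numbers from this section (p = 0.20, OR = 1.073 for age, OR = 0.287 for treatment) and adds one extra step that is an illustration rather than something from the article: rescaling the age odds ratio to a 10-year difference.

```sas
/* Odds, odds ratios, and rescaling an odds ratio */
data _null_;
  p_event   = 0.20;
  odds      = p_event / (1 - p_event);   /* 0.20 / 0.80 = 0.25            */
  or_age    = 1.073;                     /* odds ratio per 1 year         */
  beta_age  = log(or_age);               /* back to the log-odds scale    */
  or_age10  = exp(10 * beta_age);        /* odds ratio per 10 years       */
  or_trt    = 0.287;
  pct_lower = 100 * (1 - or_trt);        /* percent reduction in the odds */
  put odds= or_age10= 6.3 pct_lower= 5.1;
run;
```

A per-year odds ratio of 1.073 compounds to roughly 2.0 over ten years, which is why the size of an odds ratio must always be read together with the units of the predictor.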

3.4 Clinical Example: Age and Adverse Events

Suppose we model the probability of experiencing a grade ≥3 adverse event as a function of age in a safety population. The S-shaped logistic curve (Figure 2) shows how predicted probability transitions smoothly from near-zero for younger subjects to near-one for the oldest subjects.

Figure 2: Logistic regression S-curve. The blue line shows predicted probability of adverse event. Green and red points indicate observed events (1) and non-events (0) with jitter for visibility.

3.5 SAS Implementation

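/* Basic Logistic Regression */
proc logistic data=adsafety descending;
  class trtpn (ref='0') / param=ref;
  model aefl = age trtpn / clodds=wald rsquare;
  oddsratio age;
  oddsratio trtpn;
  output out=logpred predicted=predprob;
run;

/* Note: PROC LOGISTIC models the lower ordered response value by default.
   DESCENDING reverses the order so that P(Y=1), or P(AEFL='Y') for a Y/N flag, is modeled.
   Stating the event explicitly, e.g. model aefl(event='Y') = ..., avoids relying on sort order.
   Records with a missing response are excluded, so a flag coded 'Y'/blank
   must be recoded to 'Y'/'N' before modeling. */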

3.6 Reading the Output

Parameter	Estimate	Std Error	Wald χ²	p-value	Odds Ratio	Interpretation
Intercept	-4.00	0.98	16.65	<.0001	—	Log-odds when age = 0
Age (years)	0.070	0.018	15.12	0.0001	1.073	Each +1 yr → 7.3% higher odds
Treatment	-1.25	0.41	9.30	0.0023	0.287	71% lower odds vs. placebo

Table 2: Logistic regression output — Age + Treatment as predictors of Grade ≥3 AE (simulated data).

The Wald chi-square test plays the role that the t-test plays in linear regression: it is the squared ratio of the estimate to its standard error. For age, the Wald χ² of 15.12 (that is, (0.070 / 0.018)²) with p = 0.0001 indicates a statistically significant association. The odds ratio of 1.073 means each additional year of age increases the odds of a grade ≥3 AE by approximately 7.3%, holding treatment constant. For treatment, an odds ratio of 0.287 indicates that active treatment reduces the odds of a grade ≥3 AE by about 71% compared to placebo at any given age.
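Odds ratios are easier to communicate once they are translated back to probabilities. The sketch below uses the Table 2 estimates to build a Wald 95% confidence interval for the treatment odds ratio and to compute predicted probabilities for one subject profile. The age of 60 is a hypothetical value chosen for illustration, and treatment is assumed to be coded 1 = active, 0 = placebo.

```sas
/* From Table 2 coefficients to an OR confidence interval and probabilities */
data _null_;
  b0 = -4.00;  b_age = 0.070;  b_trt = -1.25;  se_trt = 0.41;  /* Table 2 */
  z = probit(0.975);                       /* about 1.96                  */

  or_trt = exp(b_trt);                     /* point estimate              */
  or_lcl = exp(b_trt - z * se_trt);        /* Wald limits are built on    */
  or_ucl = exp(b_trt + z * se_trt);        /* the log-odds scale          */

  age     = 60;                            /* hypothetical subject        */
  eta_pbo = b0 + b_age * age;              /* linear predictor, placebo   */
  eta_act = eta_pbo + b_trt;               /* linear predictor, active    */
  p_pbo   = 1 / (1 + exp(-eta_pbo));       /* inverse logit               */
  p_act   = 1 / (1 + exp(-eta_act));
  put or_trt= 6.3 or_lcl= 6.3 or_ucl= 6.3 p_pbo= 6.3 p_act= 6.3;
run;
```

For this profile the predicted probability drops from about 0.55 on placebo to about 0.26 on active treatment, and the odds ratio interval runs from roughly 0.13 to 0.64. The same odds ratio corresponds to a different absolute risk difference at other ages.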

4. Advanced Applications in Clinical Trials

4.1 ANCOVA: Regression in Disguise

Analysis of Covariance (ANCOVA) is the most commonly used regression model in clinical trial efficacy analysis, and it is simply a linear regression with both categorical (treatment group) and continuous (baseline value) predictors. The SAP typically specifies the primary analysis as "change from baseline analyzed using ANCOVA with treatment as a factor and baseline value as a covariate."

The ANCOVA equation: CHG = β₀ + β₁(TRT) + β₂(BASE) + ε

Why adjust for baseline? Subjects enter a trial with different starting values. Without adjustment, a subject with baseline SBP of 180 mmHg has more room to improve than one at 130 mmHg, introducing noise. ANCOVA partitions this baseline variability out, yielding a more precise estimate of the treatment effect and reducing the residual variance — which translates directly to increased statistical power.
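The reason the treatment effect can be read off a single coefficient follows directly from the ANCOVA equation. The sketch below assumes TRT is coded 1 for active and 0 for control, and b denotes any fixed baseline value; both are notational assumptions made here.

```latex
\begin{aligned}
E[\mathrm{CHG}\mid \mathrm{TRT}=1,\ \mathrm{BASE}=b] &= \beta_0+\beta_1+\beta_2 b \\
E[\mathrm{CHG}\mid \mathrm{TRT}=0,\ \mathrm{BASE}=b] &= \beta_0+\beta_2 b \\
\text{difference} &= \beta_1 \quad \text{for every } b
\end{aligned}
```

Because the baseline term cancels, the two fitted lines are parallel and the adjusted treatment difference is the same at every baseline value. This is the quantity reported as the LS-means difference.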

Figure 4: ANCOVA as regression. (A) Raw scatter without adjustment. (B) Parallel regression lines show the treatment effect (vertical gap between lines) after adjusting for baseline. The orange arrow indicates the estimated treatment difference at any baseline value.

/* ANCOVA: the workhorse of clinical efficacy analysis */
proc glm data=adeff;
  class trtpn;
  model chg = trtpn base / solution clparm;
  lsmeans trtpn / pdiff cl;   /* LS-means difference = treatment effect */
run;
quit;

/* Repeated-measures extension in PROC MIXED (all post-baseline visits in one model) */
proc mixed data=adeff;
  class trtpn avisitn subjid;
  model chg = trtpn base avisitn trtpn*avisitn / solution ddfm=kr;
  repeated avisitn / subject=subjid type=un;
  lsmeans trtpn*avisitn / diff cl slice=avisitn;
run;

4.2 Interaction Terms

An interaction term tests whether the effect of one predictor depends on the level of another. In clinical trials, interactions appear in subgroup analyses — for example, does the treatment effect differ between males and females?

Model with interaction: CHG = β₀ + β₁(TRT) + β₂(SEX) + β₃(TRT × SEX) + ε

If β₃ is statistically significant, the treatment effect is not the same across sexes; the data suggest effect modification. If β₃ is not significant, there is no evidence that the treatment effect differs between sexes, although a non-significant result does not prove consistency because interaction tests have low statistical power. For that reason, interaction tests in regulatory submissions are commonly run at the α = 0.10 level (rather than 0.05) for subgroup analyses.
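Writing out the treatment effect within each subgroup shows what β₃ measures. The sketch assumes 0/1 coding for both TRT and SEX, which is a notational assumption; with CLASS variables in SAS the reference levels determine which group plays the role of 0.

```latex
\begin{aligned}
\text{Treatment effect when } \mathrm{SEX}=0:&\quad
(\beta_0+\beta_1)-\beta_0=\beta_1 \\
\text{Treatment effect when } \mathrm{SEX}=1:&\quad
(\beta_0+\beta_1+\beta_2+\beta_3)-(\beta_0+\beta_2)=\beta_1+\beta_3 \\
\text{Difference between subgroups}:&\quad \beta_3
\end{aligned}
```

Once the interaction term is in the model, β₁ is no longer the overall treatment effect; it is the effect in the reference subgroup only.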

/* Interaction test for subgroup analysis */
proc glm data=adeff;
  class trtpn sex;
  model chg = trtpn sex trtpn*sex base / solution;
  lsmeans trtpn*sex / pdiff cl slice=sex;   /* SLICE=: treatment effect within each sex */
run;
quit;

4.3 Model Discrimination: The c-Statistic and ROC Curve

For logistic regression, the c-statistic (concordance statistic) measures how well the model discriminates between events and non-events. It equals the area under the ROC curve (AUC). A c-statistic of 0.5 means the model is no better than random guessing; 1.0 means perfect discrimination.

Figure 5: ROC curve. The blue curve represents model performance (AUC ≈ 0.84). The diagonal dashed line represents random guessing (AUC = 0.50). The shaded area is the AUC — larger is better.

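/* Generate ROC curve and c-statistic */
/* The ROC curve is requested through PLOTS=; the ROC statement is
   intended for comparing ROC curves of alternative models. */
ods graphics on;
proc logistic data=adsafety descending plots(only)=roc;   /* Plot ROC curve */
  class trtpn (ref='0') / param=ref;
  model aefl = age trtpn bmi / ctable pprob=(0.1 to 0.9 by 0.1);
  ods output Association=c_stat;   /* Extract c-statistic */
run;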

In clinical trials, the c-statistic is commonly used to assess predictive models for composite endpoints (e.g., MACE in cardiovascular trials) and for prognostic index development. A c-statistic above 0.70 is generally considered acceptable; above 0.80 is considered excellent.
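The c-statistic has a direct pairwise interpretation: across all (event, non-event) pairs, it is the proportion in which the event has the higher predicted probability, counting ties as one half. The sketch below computes it by brute force. It assumes the LOGPRED dataset and PREDPROB variable from Section 3.5 and that AEFL is coded 'Y'/'N'; the result may differ slightly from the PROC LOGISTIC value if the procedure bins predicted probabilities.

```sas
/* Brute-force concordance: compare every event with every non-event */
proc sql;
  select mean(case
                when e.predprob > n.predprob then 1     /* concordant */
                when e.predprob = n.predprob then 0.5   /* tied       */
                else 0                                  /* discordant */
              end) as c_stat format=6.4
  from logpred(where=(aefl='Y' and not missing(predprob))) as e,
       logpred(where=(aefl='N' and not missing(predprob))) as n;
quit;
```

The cross join grows with the product of events and non-events, so this is a teaching device for small datasets rather than a replacement for the Association table.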

4.4 Multicollinearity: When Predictors Overlap

Multicollinearity occurs when two or more predictor variables are highly correlated with each other. This inflates the standard errors of regression coefficients, making individual predictors appear non-significant even when the overall model is significant. In clinical trial programming, this can arise when including both the raw baseline value and a derived variable (e.g., baseline category) in the same model.

The Variance Inflation Factor (VIF) quantifies multicollinearity. A VIF above 10 signals a serious problem; above 5 warrants investigation. PROC REG outputs VIF values directly with the VIF option.
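The thresholds are easier to interpret from the definition. Here R_j² denotes the R² obtained by regressing predictor j on all the other predictors, which is standard notation introduced for this illustration.

```latex
\begin{aligned}
\mathrm{VIF}_j &= \frac{1}{1-R_j^2}, \qquad
\mathrm{TOL}_j = 1-R_j^2=\frac{1}{\mathrm{VIF}_j} \\[4pt]
\mathrm{VIF}_j = 5 &\iff R_j^2 = 0.80, \qquad
\mathrm{VIF}_j = 10 \iff R_j^2 = 0.90
\end{aligned}
```

A VIF of 10 therefore means that 90% of the variation in that predictor is already explained by the other predictors in the model.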

/* Check multicollinearity */
/* VIF > 10 or TOL < 0.1 → multicollinearity concern */
proc reg data=adeff;
  model chg = dose age bmi base / vif tol;
run;
quit;

5. Linear vs. Logistic Regression — Side-by-Side Comparison

Table 3 summarizes the key differences between linear and logistic regression across every dimension a statistical programmer needs to understand.

Feature	Linear Regression	Logistic Regression
Outcome Type	Continuous (e.g., mmHg, kg)	Binary (e.g., responder Y/N)
Model Equation	y = β₀ + β₁x₁ + ε	log(p/(1−p)) = β₀ + β₁x₁
Link Function	Identity (direct)	Logit (log-odds)
Error Distribution	Normal (Gaussian)	Binomial
Estimation Method	Ordinary Least Squares (OLS)	Maximum Likelihood (ML)
Key Output	β coefficients, R², F-test	Odds Ratios, c-statistic, Wald χ²
Assumptions	LINE: Linearity, Independence, Normality, Equal variance	Linearity of log-odds, Independence, No multicollinearity
Primary SAS Procedure	PROC REG, PROC GLM	PROC LOGISTIC
Clinical Example	Change from baseline in SBP	Proportion achieving ≥50% reduction

Table 3: Comprehensive comparison of linear and logistic regression.

6. SAS Procedure Map for Regression Analysis

One of the most practical questions a statistical programmer faces is: which PROC should I use? Table 4 maps each SAS procedure to its regression family, typical use case, and key options.

SAS Procedure	Regression Type	When To Use	Key Options
PROC REG	Simple/Multiple Linear	Continuous outcome, OLS	MODEL, SELECTION=, VIF, DW
PROC GLM	General Linear Model	ANOVA, ANCOVA, unbalanced designs	CLASS, LSMEANS, CONTRAST
PROC MIXED	Linear Mixed Model	Repeated measures, random effects	RANDOM, REPEATED, TYPE=
PROC LOGISTIC	Binary/Ordinal Logistic	Binary or ordinal outcome	CLASS, ODDSRATIO, ROC, CTABLE
PROC GENMOD	Generalized Linear Model	Non-normal distributions, GEE	DIST=, LINK=, REPEATED
PROC PHREG	Cox Proportional Hazards	Time-to-event outcome	MODEL, STRATA, HAZARDRATIO

Table 4: SAS procedure mapping for regression analysis in clinical trials.

Selection guidance: For a continuous endpoint with a simple design, start with PROC REG. If you need CLASS variables or LS-means, move to PROC GLM. For repeated measures or random effects, use PROC MIXED. For binary outcomes, PROC LOGISTIC is the standard. PROC GENMOD is the Swiss army knife for non-standard distributions or GEE estimation. PROC PHREG handles time-to-event data exclusively.
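As an illustration of the PROC GENMOD remark, the binary model from Section 3.5 can be expressed as a generalized linear model by naming the distribution and link explicitly. This is a sketch that reuses the ADSAFETY dataset and variable names from that section; the maximum likelihood estimates should match PROC LOGISTIC, while the output tables differ.

```sas
/* Logistic regression written as a generalized linear model */
proc genmod data=adsafety descending;
  class trtpn (ref='0') / param=ref;
  model aefl = age trtpn / dist=binomial link=logit;   /* binomial errors, logit link */
run;
```

Changing DIST= and LINK= is what turns the same template into, for example, a Poisson model with a log link for count data.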

6.1 ADaM Dataset Considerations

Regression models in clinical trials consume ADaM datasets, and the dataset structure directly affects model specification. Here are the most common mappings:

ADSL: Subject-level covariates (AGE, SEX, RACE, BMIBL) used as regression predictors. These are typically merged onto analysis datasets via USUBJID.

ADEFF / ADLB / ADVS: The AVAL (analysis value) and CHG (change from baseline) columns feed the dependent variable. BASE (baseline) feeds the covariate. ANL01FL identifies the records selected for analysis, while population flags such as ITTFL identify the analysis population.

ADAE: For logistic regression of safety endpoints, the outcome is typically derived as a binary flag (AEFL, ATEFFL) at the subject level before modeling.

ADTTE: Time-to-event datasets feed PROC PHREG (Cox regression), not standard linear or logistic models. The AVAL is the time and CNSR is the censoring indicator.
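The ADSL mapping above usually translates into a join on USUBJID before the model is run. A minimal sketch, assuming the covariates listed for ADSL are not already present on ADEFF; the output dataset name ADEFF_COV is hypothetical.

```sas
/* Bring subject-level covariates from ADSL onto the analysis dataset */
proc sql;
  create table adeff_cov as
  select e.*,
         s.age, s.sex, s.race, s.bmibl      /* covariates named in the text */
  from adeff as e
       left join adsl as s
       on e.usubjid = s.usubjid;
quit;
```

A left join keeps every analysis record even if a subject is missing from ADSL. Such records then carry missing covariates and drop out of the model, so a count of missing values after the merge is a useful check.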

7. Putting It All Together: A Complete Analysis Flow

To solidify the concepts, let us walk through a realistic analysis flow that a statistical programmer would encounter in a Phase 3 trial.

Scenario: A double-blind, placebo-controlled trial in hypertension. The primary efficacy endpoint is change from baseline in SBP at Week 12 (continuous → linear regression / ANCOVA). A key secondary endpoint is the proportion of subjects achieving SBP < 140 mmHg (binary → logistic regression).

Step 1: Primary Analysis (ANCOVA)

/* Primary efficacy: Change in SBP at Week 12 */
proc mixed data=adeff (where=(paramcd='SBP' and avisitn=12
                              and anl01fl='Y' and ittfl='Y'));
  class trtpn (ref='0') region;
  model chg = trtpn base region / solution ddfm=kr cl;
  lsmeans trtpn / diff cl alpha=0.05;
  ods output Diffs=lsm_diff LSMeans=lsm_est;
run;

Step 2: Key Secondary (Logistic Regression)

/* Key secondary: Responder analysis (SBP < 140 mmHg) */
proc logistic data=adeff (where=(paramcd='SBP' and avisitn=12
                                 and anl01fl='Y' and ittfl='Y'))
              descending;
  class trtpn (ref='0') region / param=ref;
  model respfl = trtpn base region;
  oddsratio trtpn;
  output out=resp_pred predicted=pred_prob;
  ods output OddsRatios=or_trt ParameterEstimates=parms;
run;
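The responder model depends on how RESPFL is derived, which the scenario does not spell out. The sketch below is one possible derivation, not the study's actual rule: it assumes AVAL holds the Week 12 SBP value, treats a missing value as a non-responder, and writes to a hypothetical dataset ADEFF_RESP. The handling of missing data in particular must follow the SAP.

```sas
/* One possible derivation of the responder flag (SBP < 140 mmHg) */
data adeff_resp;
  set adeff;
  where paramcd = 'SBP' and avisitn = 12 and anl01fl = 'Y';
  length respfl $1;
  /* In SAS a missing numeric value compares as smaller than any number,
     so an unguarded AVAL < 140 would count missing values as responders. */
  if missing(aval) then respfl = 'N';    /* assumption: non-responder imputation */
  else if aval < 140 then respfl = 'Y';
  else respfl = 'N';
run;
```

Coding the flag as 'Y'/'N' rather than 'Y'/blank also matters for the model itself, because PROC LOGISTIC drops records with a missing response.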

Step 3: Subgroup Analysis (Interaction)

/* Subgroup: Treatment effect by sex */
proc mixed data=adeff (where=(paramcd='SBP' and avisitn=12
                              and anl01fl='Y' and ittfl='Y'));
  class trtpn sex region;
  model chg = trtpn sex trtpn*sex base region / solution ddfm=kr;
  lsmeans trtpn*sex / diff cl slice=sex;
  ods output Diffs=sub_diffs;
run;

Step 4: Diagnostics

/* Residual diagnostics for primary model (same records and population as Step 1) */
proc mixed data=adeff (where=(paramcd='SBP' and avisitn=12
                              and anl01fl='Y' and ittfl='Y'));
  class trtpn region;
  model chg = trtpn base region / solution residual outpm=resid_out;
run;

proc univariate data=resid_out normal plot;
  var resid;   /* Resid is the residual variable written by OUTPM= */
run;

8. Key Takeaways

1. Linear regression (PROC REG, PROC GLM, PROC MIXED) models continuous outcomes. The coefficients represent direct unit changes. ANCOVA is linear regression with treatment + baseline covariate.

2. Logistic regression (PROC LOGISTIC) models binary outcomes. Coefficients are on the log-odds scale; exponentiate to get odds ratios. Use DESCENDING or the EVENT= response option so that the model targets P(Y=1), and confirm the modeled event in the Response Profile output.

3. Assumptions matter. Learn to read residual plots (LINE assumptions) and VIF output. When assumptions fail, the model still gives you numbers, but those numbers may be wrong.

4. ANCOVA is the most important regression model in clinical trials. If you program efficacy tables, you are running regression. Understanding the model helps you catch data issues, verify SAP alignment, and communicate confidently with statisticians.

5. The SAS procedure choice follows from the data type and design: continuous + simple → PROC REG; continuous + CLASS variables → PROC GLM; repeated measures → PROC MIXED; binary → PROC LOGISTIC; non-standard distributions → PROC GENMOD; time-to-event → PROC PHREG.

clinstandards.org — Technical depth for statistical programmers.

Linear & Logistic Regression for Statistical Programmers
