23 SAS for R Programmers
An April 2026 survey of 22 peer US biostatistics MS programs (see the Peer programs block in references.bib) found that 17 of the 22 require or strongly recommend SAS in their core or analysis-track coursework. Industry pharma, CRO, and regulatory-filing positions in the United States frequently filter screening on SAS experience. R-only graduates from this programme have reported being excluded at the screening stage.
This chapter is the minimum credible response: enough SAS to read code, read a SAS log, move data between SAS and R, and reproduce the five or six procedures that appear in most clinical study reports (CSRs).
23.1 Prerequisites
Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 23.16.
- What is the DATA step in SAS, and what is its relationship to PROC steps?
- Given a SAS dataset work.demo, what SAS procedure and what R function would each produce a summary of continuous variables analogous to R’s summary()?
- What file format is typically used to move datasets between SAS and non-SAS systems (including R), and what R package reads it?
23.2 Learning objectives
By the end of this chapter you should be able to:
- Register for a free SAS OnDemand for Academics account and log in to SAS Studio.
- Write a short SAS program that reads a CSV, summarises it, fits a simple model, and writes output.
- Map the common R statistical procedures (lm, glm, lmer, survfit, coxph) onto their SAS equivalents.
- Read a SAS log and find the error or warning that broke the job.
- Move data between SAS and R using transport files (.xpt) and the R haven package.
- Recognise the points where SAS and R defaults differ (categorical reference coding, mixed-models covariance structure).
23.3 Orientation
R and SAS each dominate a different sector. R dominates academic biostatistics, much of the clinical-epidemiology literature, and any work that benefits from a vibrant open-source ecosystem. SAS dominates FDA submissions, CDISC pipelines, clinical trial statistical programming, and a substantial fraction of pharmaceutical sponsor and contract research organisation (CRO) workflows.
The question is not which is better but which you will encounter on your first job. For an MS biostatistician entering industry clinical trials, the answer is usually both. This chapter aims at R programmers who need a working reading and writing competence in SAS, not fluency.
23.4 The statistician’s contribution
The languages are mechanical. The judgements:
Match the tool to the audience. A regulatory submission to FDA requires SAS-formatted deliverables (CDISC datasets, statistical analysis output). An academic paper accepts R. The choice of tool follows the audience, not analyst preference.
Verify cross-language equivalence. When you translate an R analysis to SAS (or vice versa), the answers should agree to several decimal places. They sometimes do not, because of default differences (reference levels, sums of squares, mixed-model covariance structures). The discipline is to spot disagreements and trace them to a default mismatch, not to shrug.
Read SAS logs carefully. SAS produces a hierarchy of messages: NOTE, WARNING, ERROR. A program that ‘finished’ may have silently dropped half its observations because of a NOTE-level missing-value message. Reading the full log is non-optional.
Document the cross-language workflow. A CSR that comes from a mix of R cleaning and SAS analysis is reproducible only if the handoff is documented: file format, variable names, encoding. The discipline is the same as for single-language workflows but with one extra boundary.
These judgements are what distinguish a useful SAS competence from a fragile one.
23.5 Getting access: SAS OnDemand for Academics
SAS OnDemand for Academics (SAS ODA) is SAS Institute’s free, cloud-hosted SAS Studio environment for educators, students, and independent learners. It runs entirely in a browser, requires no install, and is sufficient for everything in this chapter and for most coursework in peer MS programmes.
23.5.1 Step-by-step registration
1. In a browser, go to the welcome page: https://welcome.oda.sas.com/.
2. Click Register for an account. If you have an existing SAS profile (e.g., from a past training course), you can sign in instead and skip to step 7.
3. Fill in the registration form: email address, first name, last name, country, and organisation (enter your university, e.g., UC San Diego).
4. Create a SAS profile password. Accept the SAS terms and conditions.
5. SAS sends a confirmation email. Click the activation link.
6. Return to https://welcome.oda.sas.com/ and sign in with your SAS profile email and password.
7. On first login, SAS assigns a numeric userid of the form u99999999. Record this id: it is also your SAS Studio home folder name (/home/u99999999/).
8. You are redirected to your region’s SAS Studio server (for West Coast US users, this looks like https://odamid-usw2-2.oda.sas.com/SASStudio/). SAS Studio opens with an empty Program 1 tab.
23.5.2 The SAS Studio interface
The SAS Studio layout is three panes:
| Pane | Purpose |
|---|---|
| Server Files and Folders (left) | File browser for your home directory. Upload data here, save programs here. |
| Program editor (main, CODE tab) | Write SAS code. |
| LOG and RESULTS (main, tabs) | SAS log (errors, warnings, notes) and rendered output (listings, tables, graphs). |
The bottom of the left sidebar also exposes Tasks and Utilities (point-and-click SAS procedures that emit code you can inspect), Snippets (reusable code blocks), Libraries (SAS librefs for data access), and File Shortcuts.
23.5.3 Practical limits
SAS ODA is free, so it is bounded:
- Memory: approximately 5 GB per session.
- Idle timeout: approximately 90 minutes; unsaved programs are lost when the session ends.
- Storage: approximately 5 GB per user home directory.
- Procedures: the standard Base, Stat, Graph, and IML products are available; SAS/CONNECT, SAS/ACCESS to most databases, and Enterprise Miner are not.
- Computing: shared infrastructure; long batch jobs are impractical. Use local SAS or AWS for production runs.
23.5.4 Uploading data
- In the Server Files and Folders pane, navigate to Files (Home).
- Click the upload icon (third from the left in the file toolbar).
- Select a .csv, .xlsx, or .sas7bdat file from your laptop.
The file appears in your home directory and can be read from your SAS programs by path:
proc import datafile = '/home/u99999999/penguins.csv'
out = work.penguins
dbms = csv replace;
getnames = yes;
run;
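One detail on the R side is worth checking before upload. With default settings, write.csv() encodes missing values as the text NA, and PROC IMPORT may then read an affected numeric column as character. The following is a minimal sketch of an export that writes empty fields instead; the data frame `toy` and its values are hypothetical stand-ins for the real data.

```r
# Hypothetical toy data standing in for the file to be uploaded
toy <- data.frame(
  species     = c("Adelie", "Gentoo", "Chinstrap"),
  body_mass_g = c(3750, NA, 3800)
)

csv_path <- file.path(tempdir(), "toy.csv")

# na = "" writes missing values as empty fields rather than the text "NA";
# row.names = FALSE avoids an extra unnamed first column
write.csv(toy, csv_path, row.names = FALSE, na = "")

# Inspect the raw file to confirm that no "NA" text was written
csv_lines <- readLines(csv_path)
print(csv_lines)
```

After import, PROC CONTENTS shows whether each column arrived with the type you expected.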
23.5.5 Saving programs
In the Program editor, click the save icon (floppy disk) and choose a location in your home directory. Programs save as .sas files and can be reopened in subsequent sessions.
23.5.6 Alternatives
- Full SAS install via a university site license. Rare outside large pharma or government labs.
- SAS University Edition. Discontinued in August 2021. Do not use.
- WPS Workbench (World Programming Ltd.). A paid third-party interpreter that runs most SAS code. Useful for sites that cannot use a cloud service but is not free.
For this chapter and course, ODA is the default.
23.6 SAS basics for R users
The DATA step and PROC steps are the two fundamental constructs.
DATA step. Reads, creates, or transforms a SAS dataset, processing one row at a time with an implicit loop. Conceptually similar to a dplyr::mutate plus dplyr::filter chain, but with imperative row-at-a-time semantics:
data work.adults;
set work.demo; /* read each row */
if age >= 18; /* filter (subsetting if) */
age_group = ifn(age < 65, 1, 2);
log_bmi = log(bmi);
run;
In R:
adults <- demo |>
filter(age >= 18) |>
mutate(age_group = if_else(age < 65, 1L, 2L),
log_bmi = log(bmi))
PROC steps. Consume an existing dataset and produce listings, tables, graphs, or statistical output. Each PROC has its own syntax but most share an OUT = for results and a BY / CLASS / MODEL clause structure:
proc means data = work.adults n mean std median q1 q3 maxdec = 2;
class age_group;
var log_bmi;
run;
proc freq data = work.adults;
tables age_group * sex / chisq;
run;
Libraries. A library (libname) is a directory containing SAS datasets:
libname mylib '/home/u99999999/data';
data mylib.demo_archive;
set work.demo;
run;
Datasets in WORK (the temporary library) vanish at session end; named libraries persist.
Formats and informats. A format displays a value (e.g., 1 displays as 'Active'); an informat reads input. SAS’s proc format defines them:
proc format;
value sexfmt 1 = 'Male' 2 = 'Female';
run;
data work.demo2;
set work.demo;
format sex sexfmt.;
run;
The R analogue is a factor with explicit labels.
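A minimal sketch of that analogue, using a hypothetical vector of coded values rather than a real dataset:

```r
# Hypothetical coded variable, as it might arrive from a SAS dataset
sex_code <- c(1, 2, 2, 1, NA)

# levels are the stored codes; labels are what gets displayed,
# much as sexfmt. displays 1 as 'Male' and 2 as 'Female'
sex <- factor(sex_code, levels = c(1, 2), labels = c("Male", "Female"))

# Tabulate by label, keeping missing values visible
table(sex, useNA = "ifany")
```

One difference to keep in mind: a SAS format changes only the display and leaves the stored values at 1 and 2, whereas an R factor stores integer level positions, which here happen to coincide with the original codes.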
Common procedures to know. PROC SORT, PROC PRINT, PROC CONTENTS, PROC FREQ, PROC MEANS, PROC UNIVARIATE. These correspond roughly to arrange, head/tail, str/glimpse, table/count, summary/mean, summary/quantile. Knowing these six is enough to read 90% of SAS data-cleaning code.
23.7 Procedure correspondence table
The R-to-SAS map for the procedures a clinical biostatistician uses most:
| R | SAS |
|---|---|
| head(df), tail(df) | PROC PRINT DATA = df (OBS=10); |
| str(df) / skim(df) | PROC CONTENTS DATA = df; |
| summary(df$x), mean(df$x) | PROC MEANS DATA = df N MEAN STD; |
| table(df$x, df$y) | PROC FREQ DATA = df; TABLES x*y; |
| t.test(y ~ g, data = df) | PROC TTEST DATA = df; CLASS g; VAR y; |
| lm(y ~ x1 + x2, data = df) | PROC REG DATA = df; MODEL y = x1 x2; |
| aov(y ~ g, data = df) | PROC GLM DATA = df; CLASS g; MODEL y = g; |
| glm(y ~ x, family = binomial) | PROC LOGISTIC DATA = df; MODEL y (EVENT = '1') = x; |
| glm(y ~ x, family = poisson) | PROC GENMOD DATA = df; MODEL y = x / DIST = POISSON LINK = LOG; |
| lme4::lmer(y ~ x + (1 \| id)) | PROC MIXED DATA = df; CLASS id; MODEL y = x; RANDOM INTERCEPT / SUBJECT = id; |
| survival::survfit(Surv(t, e) ~ g) | PROC LIFETEST DATA = df; TIME t*e(0); STRATA g; |
| survival::coxph(Surv(t, e) ~ x) | PROC PHREG DATA = df; MODEL t*e(0) = x; |
The EVENT = '1' option in the PROC LOGISTIC row matters: without it, PROC LOGISTIC models the probability of the lower-ordered response value (here 0), which flips the sign of every coefficient relative to R.
A few of these procedures merit a worked example.
23.7.1 PROC GLM for ANOVA / linear regression
proc glm data = work.penguins;
class species;
model body_mass_g = species flipper_length_mm;
lsmeans species / pdiff cl;
run;
quit;
R equivalent:
fit <- lm(body_mass_g ~ species + flipper_length_mm,
data = penguins_clean)
emmeans::emmeans(fit, "species", contr = "pairwise",
infer = TRUE)
The lsmeans statement in SAS produces estimated marginal means (least-squares means); the emmeans package provides the R counterpart.
23.7.2 PROC LOGISTIC for binary outcome
proc logistic data = work.trial;
class treatment (param = ref ref = 'placebo')
sex (param = ref ref = 'M');
model outcome (event = '1') = treatment age sex;
oddsratio treatment;
run;
R equivalent:
trial$treatment <- relevel(factor(trial$treatment),
ref = "placebo")
trial$sex <- relevel(factor(trial$sex), ref = "M")
fit <- glm(outcome ~ treatment + age + sex,
family = binomial, data = trial)
broom::tidy(fit, exponentiate = TRUE, conf.int = TRUE)
The param = ref ref = 'X' syntax is SAS’s way of setting reference coding (matching R’s default). Without it, PROC LOGISTIC uses effect coding (sum-to-zero), which gives different coefficients for the same fitted model.
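The effect of the coding choice can be seen without leaving R. The sketch below fits the same logistic model twice on simulated data, once with reference coding and once with sum-to-zero coding; all variable names and values are hypothetical, and R's contr.sum is used here only as a stand-in for SAS effect coding.

```r
set.seed(1)

# Simulated data: all names and values here are hypothetical
n   <- 200
trt <- factor(rep(c("placebo", "active"), each = n / 2),
              levels = c("placebo", "active"))
y   <- rbinom(n, 1, ifelse(trt == "active", 0.6, 0.4))

# Reference (treatment) coding: R's default, SAS's param = ref
fit_ref <- glm(y ~ trt, family = binomial,
               contrasts = list(trt = "contr.treatment"))

# Effect (sum-to-zero) coding: first level +1, last level -1
fit_eff <- glm(y ~ trt, family = binomial,
               contrasts = list(trt = "contr.sum"))

coef(fit_ref)   # trtactive = log odds ratio, active vs placebo
coef(fit_eff)   # trt1 = half that log odds ratio, with opposite sign

# Same model, different parameterisation: fitted values agree
all.equal(fitted(fit_ref), fitted(fit_eff), tolerance = 1e-6)
```

A coefficient that is off by a factor of two, or by a sign, between an R fit and a SAS fit is therefore a strong hint that the two programs are using different coding schemes rather than different data.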
23.7.3 PROC MIXED for mixed-effects model
proc mixed data = work.long covtest;
class subject visit treatment;
model y = treatment visit treatment*visit / solution;
random intercept / subject = subject type = un;
run;
R equivalent:
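fit <- lme4::lmer(y ~ treatment * visit + (1 | subject), data = long)
Note: PROC MIXED’s default covariance structure is VC (variance components). With a single random intercept, VC and UN coincide, so the type = un above is optional and the fit matches lme4’s default either way. With more than one random effect the choice matters: VC assumes the random effects are uncorrelated, whereas lme4 estimates their covariance, so specify type = un explicitly. Random slopes in SAS (assuming visit is numeric here and not listed in CLASS):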
random intercept visit / subject = subject type = un;
23.7.4 PROC LIFETEST and PROC PHREG for survival
proc lifetest data = work.adtte plots = (s lls);
time aval * cnsr (1); /* CDISC: cnsr=1 means censored */
strata trt01p;
run;
proc phreg data = work.adtte;
class trt01p (param = ref ref = 'Placebo');
model aval * cnsr (1) = trt01p age sex / risklimits;
run;
R equivalent:
library(survival)
fit_km <- survfit(Surv(AVAL, 1L - CNSR) ~ TRT01P, data = adtte)
fit_cox <- coxph(Surv(AVAL, 1L - CNSR) ~ TRT01P + AGE + SEX,
data = adtte)
The SAS time aval * cnsr (1) syntax means ‘censoring code is 1’; this aligns with the CDISC CNSR = 1 convention. R’s Surv(time, 1L - CNSR) inverts to match its own event = 1 convention.
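The inversion is easy to get backwards, so it is worth checking on a handful of rows before fitting anything. A minimal sketch, assuming a hypothetical four-row data frame `adtte_toy` with made-up values:

```r
library(survival)  # a recommended package, shipped with standard R installs

# Hypothetical ADTTE-style records; CNSR = 1 means censored (CDISC)
adtte_toy <- data.frame(
  AVAL = c(5, 8, 12, 20),
  CNSR = c(0, 1, 0, 1)
)

# Surv() wants an event indicator, so invert the censoring flag
event <- 1L - adtte_toy$CNSR
s <- Surv(adtte_toy$AVAL, event)

print(s)            # censored times carry a trailing "+"
sum(s[, "status"])  # number of events
```

If the number of events printed here does not equal the number of rows with CNSR = 0, the flag has been passed through without inversion.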
23.8 Reading SAS logs
SAS produces three tiers of message in the LOG window:
- NOTE (informational): events the program performed. Most NOTEs are benign; some are important (NOTE: missing values were generated, NOTE: merged with missing values).
- WARNING: something unexpected but the program continued. Always inspect.
- ERROR: the step failed. Later steps may still run, possibly on stale or missing data. Fix and rerun.
The most insidious are silent NOTEs. A merge that produces missing keys generates a NOTE; the program ‘completes’ but with rows silently dropped or duplicated:
NOTE: There were 100 observations read from data set WORK.A.
NOTE: There were 95 observations read from data set WORK.B.
NOTE: The data set WORK.MERGED has 95 observations.
If you expected 100 (left join), 95 indicates 5 rows were lost. The NOTE is informational; SAS does not flag it as a problem.
The discipline: read the LOG end to end after every run. Look for unexpected counts, NOTEs about missing values, and WARNINGs about implicit conversions.
For automated checking, proc sql with explicit row-count assertions catches drops:
proc sql noprint;
select count(*) into :n_a from work.a;
select count(*) into :n_merged from work.merged;
quit;
%if &n_a ne &n_merged %then %do;
%put ERROR: Merge dropped rows: &n_a -> &n_merged;
%end;
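Logs saved to a file can also be screened from R. The sketch below reuses the three NOTE lines from the merge example above, plus the missing-value NOTE as worded earlier in this section; the list of patterns is an assumption about what is worth flagging, not an exhaustive rule set, and `flag_log` is a hypothetical helper name.

```r
# Log text copied from the merge example above, plus one line using
# the wording this chapter gives for a missing-value NOTE
log_lines <- c(
  "NOTE: There were 100 observations read from data set WORK.A.",
  "NOTE: There were 95 observations read from data set WORK.B.",
  "NOTE: The data set WORK.MERGED has 95 observations.",
  "NOTE: missing values were generated"
)

# Assumed patterns worth flagging; extend for your own projects
suspect <- c("^ERROR", "^WARNING", "missing values",
             "uninitialized", "converted to")

# Return every log line that matches at least one pattern
flag_log <- function(lines, patterns = suspect) {
  hit <- grepl(paste(patterns, collapse = "|"), lines, ignore.case = TRUE)
  lines[hit]
}

flag_log(log_lines)

# Pull the observation counts out of the "observations read" NOTEs
read_notes <- grep("observations read", log_lines, value = TRUE)
n_read <- as.integer(sub(".*There were ([0-9]+) observations.*", "\\1",
                         read_notes))
n_read
```

A screen like this complements reading the log; it does not replace it, because a count that is wrong for your design (95 where you expected 100) matches no fixed pattern.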
23.9 Moving data between SAS and R
R to SAS via SAS transport (.xpt):
library(haven)
write_xpt(df, "data.xpt", version = 5)
In SAS:
libname x xport '/home/u99999999/data.xpt';
proc copy in = x out = work; run;
libname x clear;
The version = 5 argument matches the FDA-required transport version; it must be set explicitly because haven’s default is version 8.
SAS to R:
df <- haven::read_sas("path/to/file.sas7bdat")
df <- haven::read_xpt("path/to/file.xpt")
haven preserves SAS labels and formats as R attributes (label attribute on each column). Inspect with attr(df$x, "label").
FDA submissions still require .xpt v5 as the interchange format. Modern SAS proc copy plus libname xport produces it; haven::write_xpt matches.
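Version 5 transport files carry old naming limits, most notably variable names of at most 8 characters. A minimal sketch of a pre-export check in base R follows; the helper name `xpt_v5_bad_names` and the toy data frame are hypothetical, and the check covers names only (not label or character-value lengths).

```r
# Report columns whose names break the v5 transport naming rules:
# at most 8 characters, letters/digits/underscore, not starting with a digit
xpt_v5_bad_names <- function(df) {
  nm <- names(df)
  ok <- nchar(nm) <= 8 & grepl("^[A-Za-z_][A-Za-z0-9_]*$", nm)
  nm[!ok]
}

# Hypothetical data frame with one compliant and two non-compliant names
toy <- data.frame(
  usubjid           = c("01", "02"),
  flipper_length_mm = c(181, 186),
  body_mass_g       = c(3750, 3800)
)

xpt_v5_bad_names(toy)
```

Running a check like this before export, and renaming deliberately, is safer than relying on whatever the writing tool does with over-long names.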
For more elaborate workflows (R cleans data, SAS analyses, R produces tables), the round trip is:
# R: clean and export
data |>
janitor::clean_names() |>
filter(...) |>
haven::write_xpt("clean.xpt", version = 5)
* SAS: read, analyse, export results;
libname x xport '/path/clean.xpt';
proc copy in = x out = work;
run;
proc logistic data = work.clean;
/* ... */
ods output ParameterEstimates = pe;
run;
libname out xport '/path/results.xpt';
data out.pe; set pe; run;
libname out clear;
# R: read SAS results, format as table
results <- haven::read_xpt("results.xpt")
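gt::gt(results)
The handoff is the friction. Document column names and types at each boundary, and remember that v5 transport limits variable names to 8 characters, so any ODS output column with a longer name needs renaming before export.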
23.10 When SAS is required and when it is not
Required. FDA submissions (CDISC SDTM + ADaM; Chapter 20), most sponsor-driven clinical trials, many CRO positions, legacy pharma analytics teams.
Not required. Academic biostatistics, most NIH-funded investigator-initiated trials, data science, biotechs founded in the last decade, most epidemiological cohort studies.
Either acceptable. Collaborative research teams that let the statistician choose, most epidemiology and health-services research, academic clinical trials units that have adopted R.
For an MS biostatistician, the practical question is: what does the first job require? Industry pharma and CRO: SAS, often with R as secondary. Academic biostatistics: R, with SAS nice-to-have. Government (FDA, CMS, BARDA): SAS, heavily.
23.11 A side-by-side example
A simple penguin analysis in both languages, producing matching output:
R:
library(palmerpenguins)
fit <- lm(body_mass_g ~ flipper_length_mm + species,
data = na.omit(penguins))
broom::tidy(fit, conf.int = TRUE)
SAS:
proc import datafile = '/home/u99999999/penguins.csv'
out = work.penguins dbms = csv replace;
getnames = yes;
run;
data work.penguins_clean;
set work.penguins;
if cmiss(of _all_) = 0; /* drop rows with any missing */
run;
proc glm data = work.penguins_clean;
class species (ref = 'Adelie');  /* PROC GLM's CLASS takes REF= but has no PARAM= option */
model body_mass_g = flipper_length_mm species / clparm solution;
run;
quit;
Coefficient point estimates and 95% CIs should match between the two within rounding. If they do not, suspect the reference level first: R defaults to the first level in alphabetical order (‘Adelie’), whereas SAS defaults to the last, hence the explicit ref = 'Adelie'.
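What a reference-level mismatch looks like can be reproduced entirely in R. The sketch below uses the built-in iris data as a stand-in for the penguins and fits the same model with the first and then the last level as reference.

```r
# Built-in iris data stands in for the penguins; the point is the coding
fit_first <- lm(Sepal.Length ~ Species, data = iris)  # reference: setosa

# Re-fit with the last level as reference
iris_last <- transform(iris, Species = relevel(Species, ref = "virginica"))
fit_last  <- lm(Sepal.Length ~ Species, data = iris_last)

coef(fit_first)  # contrasts against setosa
coef(fit_last)   # contrasts against virginica

# The two fits describe the same model: fitted values are identical
all.equal(unname(fitted(fit_first)), unname(fitted(fit_last)))
```

The intercept and the species coefficients change while the fitted values do not, which is the signature to look for when an R table and a SAS table disagree.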
23.12 Collaborating with an LLM on SAS
LLMs handle SAS basics well, but their cross-language translations tend to ignore default differences.
Prompt 1: translating a SAS log error. Paste the LOG and ask: ‘what’s the error and how to fix?’
What to watch for. Common SAS errors are easy (missing semicolon, undefined macro variable). Subtle ones (NOTE-level merge issues, format warnings) need deliberate prompting: ask the LLM to flag every NOTE that could indicate a problem.
Verification. Apply the fix and rerun. If the NOTE persists, push for further analysis.
Prompt 2: R to SAS translation. Paste an R function or analysis and ask the LLM to produce the SAS equivalent.
What to watch for. Reference coding for CLASS variables (the default mismatch). Mixed-models covariance structures (R’s lme4 default vs. SAS PROC MIXED’s VC default). Survival censoring conventions (R’s Surv() takes an event indicator with 1 = event; SAS’s time*status(values) syntax lists the censoring values; CDISC’s CNSR uses 1 = censored).
Verification. Run both versions on the same data; coefficients should agree to several decimal places. Diagnose any disagreement as a default mismatch.
Prompt 3: PROC choice. Describe the analysis and ask: ‘which SAS procedure is most appropriate?’
What to watch for. Multiple PROCs handle the same problem (PROC GLM vs. PROC REG vs. PROC GENMOD for linear regression; PROC MIXED vs. PROC GLIMMIX for mixed models). The LLM should pick the modern, full-featured choice (GLIMMIX over MIXED for non-normal outcomes; GENMOD for GLMs).
Verification. SAS documentation; cross-reference against domain conventions.
23.13 Principle in use
Three habits define defensible cross-language work:
- Verify equivalence. When translating between R and SAS, confirm the outputs match to several decimal places. Disagreements are default mismatches, not noise.
- Read the SAS LOG. NOTE-level messages can hide silent data loss. Make full-log review a habit, not an exception.
- Use .xpt for handoffs. SAS transport v5 is the durable interchange format; haven::read_xpt and haven::write_xpt make the boundary explicit.
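The first habit can be made routine with a small comparison helper. The following is a minimal sketch: `compare_estimates` is a hypothetical helper, the tolerance is an arbitrary choice, and the "SAS" values are fabricated by perturbing the R estimates purely to show what a disagreement looks like.

```r
# Compare two named coefficient vectors term by term
compare_estimates <- function(r_est, sas_est, tol = 1e-4) {
  common <- intersect(names(r_est), names(sas_est))
  diff   <- abs(r_est[common] - sas_est[common])
  data.frame(term     = common,
             r        = unname(r_est[common]),
             sas      = unname(sas_est[common]),
             abs_diff = unname(diff),
             agree    = unname(diff <= tol))
}

# R side: built-in iris data as a stand-in
r_est <- coef(lm(Sepal.Length ~ Petal.Length, data = iris))

# SAS side: HYPOTHETICAL values, made by perturbing the R estimates;
# in practice these would be read from the SAS results file
sas_est <- r_est + c(0, 1e-3)

res <- compare_estimates(r_est, sas_est)
res
```

Matching on term names rather than position means that a term present in only one program drops out of the comparison, so check the number of rows returned as well as the agree column.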
23.14 Exercises
- Register for SAS OnDemand for Academics. Upload a CSV (use palmerpenguins::penguins written out via write.csv(), with na = "" so that missing values are written as blanks rather than the text NA). Run PROC MEANS on the numeric variables and PROC FREQ on species. Compare to the equivalent R summaries.
- Translate a simple lm() fit from one of your previous analyses into PROC GLM. Confirm that the coefficient table and residual standard error match to at least four decimal places.
- Write a SAS transport file (.xpt) from R using haven::write_xpt(). Read it into SAS with libname xpt xport. Verify that the row and column counts match.
- Translate a glm(..., family = binomial) logistic regression from R to PROC LOGISTIC. Match reference levels explicitly. Verify that odds ratios agree.
- Read a SAS log from one of your runs end to end. List every NOTE that could indicate a problem; investigate each.
23.15 Further reading
- Delwiche and Slaughter (2019), The Little SAS Book, 6th ed., canonical introduction.
- Cody (2018), Learning SAS by Example, 2nd ed., task-oriented introduction.
- SAS documentation at documentation.sas.com, reference.
- The haven package documentation for the R side of the boundary.
23.16 Prerequisites answers
- A DATA step reads, creates, or transforms a SAS dataset row by row, using an implicit loop over records. PROC steps consume existing datasets and produce listings, tables, graphs, or statistical output; a PROC step does not create new observations (with rare exceptions). The typical workflow alternates: DATA step to shape the data, PROC step to analyse it.
- In SAS, PROC MEANS or PROC SUMMARY produces analogous output (N, mean, SD, min, max, quantiles). In R, summary(df) gives a mixed-type summary across columns; skimr::skim(df) is closer to PROC MEANS output. PROC UNIVARIATE gives more detail (quantiles, skewness, kurtosis).
- The SAS transport file (.xpt, SAS transport v5) is the long-standing interchange format and is still required for FDA submissions. In R, the haven package reads and writes .xpt and reads native .sas7bdat: haven::read_xpt(), haven::write_xpt(), haven::read_sas().