14 Data Wrangling Essentials
Stat 545 Chapters 5–9 (Jenny Bryan, UBC) and blog posts 04-lowercasingdataframes, 15-piping, 43-dynamic-column-names.
14.1 Prerequisites
Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 14.17.
- What is ‘tidy data’ as defined by Wickham (2014), and how does it differ from a typical ‘wide’ clinical data layout?
- Write the dplyr pipeline that keeps only rows where age >= 18, selects age, sex, and outcome, and summarises outcome by sex.
- What does tidyr::pivot_longer() do, and when should you reach for it rather than pivot_wider()?
14.2 Learning objectives
By the end of this chapter you should be able to:
- Identify tidy and non-tidy data layouts.
- Use the core dplyr verbs (filter, select, mutate, arrange, summarise, group_by) idiomatically with the native pipe |>.
- Reshape data with pivot_longer() and pivot_wider(), including using names_sep and names_pattern for compound column names.
- Pass dynamic column names to dplyr verbs using {{ }} and .data[[ ]].
- Clean column names with janitor::clean_names().
- Recognise common wrangling antipatterns and replace them with tidy equivalents.
14.3 Orientation
Real biomedical data arrive wide, irregular, and full of inconsistent coding. Most of a biostatistician’s time is spent reshaping, joining, cleaning, and type-converting before a single model is fit. The tidyverse stack is the dominant toolkit for this work; fluency with it pays back on every project.
This chapter is not exhaustive; the tidyverse is too large for that. It covers the verbs and patterns that recur in clinical and epidemiological work.
14.4 The statistician’s contribution
The verbs are mechanical. The judgements are not:
Tidy first, model later. Untidy data leads to joins that are subtly wrong, summaries that mean the wrong thing, and models that conflate observations. Reshaping into tidy form before modelling is not optional; it is the precondition for the modelling to be correct.
Keep raw data raw. Never overwrite the input file with cleaned data. The cleaning script is the authoritative record of what you did; the cleaned data is a build product. If data/raw/visits.csv is the original, write to data/derived/visits.rds, not back to visits.csv.
Document the cleaning rules. ‘Removed implausible values’ is not a documented rule. ‘Removed BP > 250 or BP < 50 (n = 12 rows, 0.6% of data)’ is. The cleaning script’s comments are the audit trail.
Recognise the cost of pivoting. Pivoting is expensive on large data: a wide dataset of \(10^6\) rows with 100 timepoints becomes a long dataset of \(10^8\) rows. Sometimes the analysis demands long; sometimes keeping it wide is appropriate. Pivot when the operation requires it, not as ritual.
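The arithmetic above can be sketched directly (the numbers are illustrative, not from a real dataset):

```r
# Back-of-envelope size check before pivoting wide -> long
n_rows <- 1e6        # rows in the wide table (e.g., patients)
n_timepoints <- 100  # measurement columns per row
long_rows <- n_rows * n_timepoints   # 1e8 rows after pivoting

# At roughly 3 columns x 8 bytes per value in the long form, that is
# about 1e8 * 3 * 8 bytes -- on the order of a couple of gigabytes
approx_gb <- long_rows * 3 * 8 / 1e9
approx_gb
```

A minute spent on this estimate tells you whether the pivot fits in memory before you run it.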
These judgements determine whether the wrangling script produces correct, auditable data.
14.5 Tidy data: three rules
Wickham (2014) defines tidy data by three rules:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a separate table.
Rule 1 violation: ‘blood pressure at visit 2’ is encoded as a column rather than as a value of a ‘visit’ variable.
Rule 2 violation: one row contains data from multiple patients, or one patient’s data is spread across multiple rows without a clear unit.
Rule 3 violation: patient demographics are mixed with visit-level outcomes in one table; both should be separate (linked by a patient ID).
A typical wide clinical layout violates rules 1 and 2:
patient_id bp_v1 bp_v2 bp_v3 hr_v1 hr_v2 hr_v3
001 140 138 136 72 70 68
002 150 148 146 78 76 74
The same data tidied:
patient_id visit measure value
001 1 bp 140
001 1 hr 72
001 2 bp 138
001 2 hr 70
...
The tidy form makes group operations natural: ‘mean BP by visit’ is filter(measure == "bp") |> group_by(visit) |> summarise(mean(value)), not a custom column-traversal.
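To make that concrete, a minimal sketch using a toy version of the tidy table above (values taken from the small excerpt shown, padded with the second patient’s BP readings):

```r
library(dplyr)

# A tiny version of the tidy table above
tidy <- tibble::tribble(
  ~patient_id, ~visit, ~measure, ~value,
  "001", 1, "bp", 140,
  "001", 1, "hr", 72,
  "001", 2, "bp", 138,
  "001", 2, "hr", 70,
  "002", 1, "bp", 150,
  "002", 2, "bp", 148
)

# 'Mean BP by visit' is a filter plus a grouped summary
by_visit <- tidy |>
  filter(measure == "bp") |>
  group_by(visit) |>
  summarise(mean_bp = mean(value), .groups = "drop")
by_visit
#> visit 1: (140 + 150) / 2 = 145; visit 2: (138 + 148) / 2 = 143
```

The same computation on the wide layout would mean picking out bp_v1, bp_v2, ... by hand and averaging each column separately.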
14.6 Core dplyr verbs
The vocabulary of tidy wrangling:
library(dplyr)
library(palmerpenguins)
data(penguins)
# filter rows
penguins |> filter(species == "Adelie")
# select columns
penguins |> select(species, body_mass_g, flipper_length_mm)
# create new columns
penguins |> mutate(mass_kg = body_mass_g / 1000)
# sort
penguins |> arrange(desc(body_mass_g))
# group and summarise
penguins |>
group_by(species, island) |>
summarise(mean_mass = mean(body_mass_g, na.rm = TRUE),
n = n(),
.groups = "drop")
The pipe |> (R 4.1+) chains operations left-to-right. Each verb takes a tibble and returns a tibble. The pipeline reads as a sequence of transformations: ‘filter, then select, then group, then summarise’.
The magrittr %>% is the older alternative, still widely used. Functionally equivalent for most code; |> has slightly different placeholder semantics (_ instead of .). For new code, prefer |>.
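A minimal illustration of the placeholder difference (the `_` placeholder requires R 4.2 or later):

```r
# Native pipe placeholder: '_' must be passed to a *named* argument
fit <- mtcars |> lm(mpg ~ wt, data = _)
coef(fit)[["wt"]]   # negative: heavier cars, lower mpg

# The magrittr equivalent would be: mtcars %>% lm(mpg ~ wt, data = .)
# where '.' can appear in any argument position, named or not.

# Without a placeholder, both pipes insert the value as the first argument
c(1, 4, 9) |> sqrt()
```

For straight first-argument chaining, the two pipes behave identically; the placeholder semantics only matter when the piped value is not the first argument.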
14.7 Common dplyr patterns
Renaming:
penguins |> rename(mass = body_mass_g, flipper = flipper_length_mm)
Counting:
penguins |> count(species, sex, sort = TRUE)
Distinct values:
penguins |> distinct(species, island)
Slicing:
penguins |> slice_max(body_mass_g, n = 5) # 5 heaviest
penguins |> slice_sample(n = 100)      # random 100
Conditional updates with case_when:
penguins |> mutate(
size = case_when(
body_mass_g < 3500 ~ "small",
body_mass_g < 5000 ~ "medium",
body_mass_g >= 5000 ~ "large"
)
)
Across multiple columns:
penguins |> summarise(
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
)
across() applies a function across multiple columns matched by where() (predicate), starts_with(), matches(), etc. It replaces the older _at and _if suffixes.
14.8 Reshaping with tidyr
library(tidyr)
# wide to long
wide |> pivot_longer(
cols = starts_with("bp_v"),
names_to = "visit",
names_prefix = "bp_v",
values_to = "bp"
)
# more complex: bp_v1, hr_v1, bp_v2, hr_v2 ...
wide |> pivot_longer(
cols = matches("_v"),
names_to = c("measure", "visit"),
names_pattern = "(.+)_v(.+)",
values_to = "value"
)
# now has columns: patient_id, measure, visit, value
# long to wide
long |> pivot_wider(
names_from = visit,
values_from = bp,
names_prefix = "bp_v"
)
names_pattern and names_sep parse complex column names into multiple new columns. The first version of this is often wrong; print intermediate results and verify.
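A self-contained check of the compound-name pivot above (the toy data here is invented to match the bp_v*/hr_v* layout from Section 14.5):

```r
library(tidyr)

# Toy wide table matching the earlier layout
wide <- data.frame(
  patient_id = c("001", "002"),
  bp_v1 = c(140, 150), bp_v2 = c(138, 148),
  hr_v1 = c(72, 78),   hr_v2 = c(70, 76)
)

long <- wide |>
  pivot_longer(
    cols = matches("_v"),
    names_to = c("measure", "visit"),
    names_pattern = "(.+)_v(.+)",
    values_to = "value"
  )

# Verify the shape: 2 patients x 2 measures x 2 visits = 8 rows,
# and the regex split the names into the expected pieces
nrow(long)             # 8
unique(long$measure)   # "bp" "hr"
unique(long$visit)     # "1" "2"
```

This is the ‘print intermediate results and verify’ step as code: the row count and the unique values of the parsed columns either match your mental model of the data or they do not.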
14.9 janitor::clean_names()
Clinical CRF exports come with column names like Patient ID, Age (years), SBP at baseline (mmHg). These are unworkable in R: spaces, parentheses, units in names. janitor::clean_names() standardises:
library(janitor)
d <- read.csv("crf.csv") |> clean_names()
names(d)
#> [1] "patient_id" "age_years" "sbp_at_baseline_mm_hg"
Run it on every dataset on arrival. It saves quoting and backtick-escaping for the rest of the project.
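The same transformation works on an in-memory data frame, which makes it easy to test without a file (the messy names below are invented to match the CRF examples above):

```r
library(janitor)

# Messy names typical of a CRF export
d <- data.frame(check.names = FALSE,
  `Patient ID`             = c("001", "002"),
  `Age (years)`            = c(64, 71),
  `SBP at baseline (mmHg)` = c(142, 151)
)

names(clean_names(d))
#> [1] "patient_id" "age_years" "sbp_at_baseline_mm_hg"
```

Note check.names = FALSE: without it, data.frame() would mangle the names itself and you would never see what clean_names() actually does to the raw export.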
14.10 Dynamic column names
When you want to write functions that take column names as arguments, two idioms apply.
{{ }} (embrace) for bare names:
summarise_col <- function(df, col, fn = mean) {
df |> summarise(result = fn({{ col }}, na.rm = TRUE))
}
summarise_col(penguins, body_mass_g, fn = median)
.data[[name]] for character strings:
summarise_col_str <- function(df, col_name, fn = mean) {
df |> summarise(result = fn(.data[[col_name]], na.rm = TRUE))
}
summarise_col_str(penguins, "body_mass_g", fn = median)
The two approaches reflect different sources of the column name. The bare-name interface ({{ }}) is what most users expect; the string interface (.data[[ ]]) is useful for programmatic generation of names.
rlang::sym() and !! are older patterns; {{ }} is preferred in modern code.
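The embrace operator can also name the *output* column dynamically, via dplyr’s glue-style := syntax (the suffix "_mean" here is a fixed label chosen for illustration, even if fn is not mean):

```r
library(dplyr)

# Output column named after the input column
summarise_named <- function(df, col, fn = mean) {
  df |> summarise("{{ col }}_mean" := fn({{ col }}, na.rm = TRUE))
}

summarise_named(mtcars, mpg)   # one column, named mpg_mean
```

Without :=, the output column would be the literal string "{{ col }}_mean"; the walrus operator is what triggers the glue interpolation on the left-hand side.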
14.11 Common antipatterns
Loop over rows:
# bad
result <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
result[i] <- compute_something(df[i, ])
}
# good (vectorised)
df |> mutate(result = compute_something_vectorised(...))
R’s vectorised operations are 10–100× faster than row loops, and are usually clearer.
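A base-R illustration of the gap (timings vary by machine; the toy computation is invented for the comparison):

```r
x <- runif(1e6)

# Row-by-row: one R-level iteration per element
f_loop <- function(x) {
  out <- numeric(length(x))
  for (i in seq_along(x)) out[i] <- x[i]^2 + 1
  out
}

# Vectorised: one call over the whole vector
f_vec <- function(x) x^2 + 1

system.time(f_loop(x))   # typically orders of magnitude slower than:
system.time(f_vec(x))

identical(f_loop(x), f_vec(x))   # TRUE: same result, different cost
```

The results are identical; only the per-element interpreter overhead differs, and that overhead dominates for large n.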
Hardcoded column indices:
# bad
df[, c(2, 4, 7)]
# good
df |> select(age, sex, outcome)
Indices break when columns are reordered or added; names are robust.
Manipulating raw data files:
# bad
write.csv(d, "data.csv", row.names = FALSE) # overwrites input
# good
write_rds(d, "data/derived/cleaned.rds")
Raw data should be read-only.
stringsAsFactors paranoia:
# unnecessary in R 4.0+
options(stringsAsFactors = FALSE)
The default changed in R 4.0; the option is a no-op now.
14.12 Worked example: cohort cleanup
library(tidyverse)
library(janitor)
# read raw export
raw <- read_csv("data/raw/visits.csv", show_col_types = FALSE)
# clean names, restrict to adults, derive age groups
clean <- raw |>
clean_names() |>
filter(age >= 18, !is.na(outcome)) |>
mutate(
age_group = cut(age, breaks = c(18, 40, 65, Inf),
right = FALSE,
labels = c("18-39", "40-64", "65+")),
visit_date = as.Date(visit_date)
)
# pivot to long for repeated measures
long <- clean |>
pivot_longer(
cols = starts_with("bp_"),
names_to = "visit",
names_prefix = "bp_",
values_to = "bp"
) |>
filter(!is.na(bp))
# save the derived data
saveRDS(long, "data/derived/visits_long.rds")
The script is the audit trail: clean column names, filter inclusions documented as code, derived variables computed and named, output written to a non-raw location. A reviewer can read the script and recover exactly what cleaning happened.
14.13 Collaborating with an LLM on wrangling
LLMs handle dplyr/tidyr well; the trap is producing working code on the wrong data shape.
Prompt 1: tidying a dataset. Paste the column names and 5 rows of head, ask: ‘tidy this with pivot_longer. Show me the result.’
What to watch for. The LLM will produce code that runs. Whether it produces the correct tidy form depends on understanding the semantics of the columns (are bp_v1 and bp_v2 two visits of the same BP measurement, or are they different measurement types?).
Verification. Run the code; inspect the long form; check that one observation per row holds.
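The ‘one observation per row’ check can itself be code, not just inspection (a minimal sketch, assuming a long table keyed by patient, visit, and measure as in the earlier example):

```r
library(dplyr)

long <- tibble::tribble(
  ~patient_id, ~visit, ~measure, ~value,
  "001", 1, "bp", 140,
  "001", 1, "hr", 72,
  "001", 2, "bp", 138
)

# After a correct pivot, each (patient, visit, measure) key appears once;
# any rows returned here indicate a wrong pivot spec or duplicated input
dupes <- long |>
  count(patient_id, visit, measure) |>
  filter(n > 1)
nrow(dupes)   # 0 means the tidy-data property holds
```

Running this on LLM-generated pivot output catches the most common failure mode, silently duplicated or collapsed observations, without eyeballing the whole table.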
Prompt 2: writing a helper function with dynamic column names. Describe the function (takes a column name, applies a verb) and ask the LLM to use modern tidy-eval idioms.
What to watch for. { } for bare names, .data[[ ]] for strings, enquo/!! for backwards-compatible code. The LLM may use older patterns; modern is preferred.
Verification. Test the function with both bare names and strings to confirm it handles the input form you expect.
Prompt 3: cleaning column names. Paste a list of messy names and ask: ‘clean these names; explain the transformation rules.’
What to watch for. The LLM may apply a one-off transformation rather than recommending janitor::clean_names(). The package solution is better because it is consistent across all your files.
Verification. Compare the LLM output to janitor::clean_names()’s output on the same input.
14.14 Principle in use
Three habits define defensible wrangling:
- Tidy data first. Wide-to-long when the analysis needs long; long-to-wide when it needs wide. Tidy is not the goal; correctness is. Tidy is usually the path to correctness.
- Keep raw data read-only. Cleaning scripts produce derived data; the raw file is sacrosanct.
- Use
clean_names()on every read. The cost is one line; the benefit is the rest of the script becomes easier to write.
14.15 Exercises
- Using palmerpenguins::penguins, compute the mean bill length for each species-island combination, dropping rows with missing bill length. Return a tidy tibble with columns species, island, mean_bill, n.
- Reshape a synthetic wide clinical dataset (one row per patient, columns bp_v1, bp_v2, bp_v3) into long format with one row per patient-visit.
- Write a function summarise_col(df, col, fn) that accepts col as a bare column name and fn as a function; apply it to the penguins data with col = bill_length_mm and fn = median.
- Take a real CRF export (or simulate one) with messy column names and apply janitor::clean_names(). Document any rename decisions you make manually beyond what janitor does.
- Using case_when, derive a clinical classification variable from three or four continuous predictors. Verify the classification covers all rows (no NAs introduced unexpectedly).
14.16 Further reading
- (Wickham et al., 2023) Chapters 3–7, canonical applied wrangling introduction.
- (Wickham, 2014), the original Tidy Data paper; essential for the three rules.
- (Bryan & Stephens, 2019) Chapters 5–9, slower-paced alternative.
- The dplyr and tidyr cheat sheets at posit.co/resources/cheatsheets.
14.17 Prerequisites answers
- Tidy data satisfies three properties: each variable forms a column, each observation forms a row, and each type of observational unit forms a separate table. A typical wide clinical layout has one row per patient and one column per (variable, timepoint) pair, which violates the first two properties: ‘blood pressure at visit 2’ is not a variable, and one patient’s row contains many observations. Tidying makes group operations natural and prevents whole categories of bugs.
- df |> filter(age >= 18) |> select(age, sex, outcome) |> group_by(sex) |> summarise(mean_outcome = mean(outcome, na.rm = TRUE)). The pipe-chained verbs read as a sequence of transformations.
- pivot_longer() reshapes from wide to long: columns become values in a new column. Use it when columns encode a variable (like visit number) that should be a row index. Use pivot_wider() when rows encode a variable that should be a column (e.g., computing a difference between two visit values). The right shape depends on the next operation.