21  Testing Data Analysis Workflows

Note: Sources

Adapted from author’s lecture notes and supporting materials for a graduate practicum in biostatistics.

21.1 Prerequisites

Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 21.17.

  1. Why would you test a one-off data analysis script that will only ever run on a single dataset?
  2. What is the difference between a unit test, an integration test, and an end-to-end test in the context of a data analysis pipeline?
  3. What does testthat::expect_snapshot() capture, and when is a snapshot test preferable to a direct value-comparison test?

21.2 Learning objectives

By the end of this chapter you should be able to:

  • Write unit tests for analytic helper functions with testthat.
  • Add an integration test that exercises the full data pipeline on a small synthetic dataset.
  • Use expect_snapshot() to capture complex printed outputs (model summaries, formatted tables).
  • Run the test suite locally via testthat::test_local() and in CI via GitHub Actions.
  • Identify test smells (tests that pass for the wrong reason, tests that hide bugs, flaky tests).
  • Apply visual regression testing to ggplot output via vdiffr.

21.3 Orientation

Tests are how you convince a reader (including future-you) that your code actually does what you think it does. Analysis code rarely gets tests, and the lack of them is a major source of reproducibility failures. This chapter brings the habits of software engineering to bear on the practice of data analysis.

The framework is testthat (3rd edition). The companion textbook covers testthat in the package-development context (chapter 20 of Statistical Computing in the Age of AI). This chapter focuses on applying it to non-package analysis code.

21.4 The statistician’s contribution

Tests are mechanical to write. The judgement lies in deciding what to test:

Test the statistical content, not just the plumbing. A test that ‘returns a tibble of the right shape’ is a structural check, not a test of correctness. A test that the regression coefficient on a known synthetic dataset matches the closed-form answer is a test of correctness.

Edge cases are where bugs live. Empty input, single observation, all-NA, perfect collinearity, boundary value (zero variance, all-zero counts, class with one observation). Test these deliberately.
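
What deliberate edge-case tests look like, sketched with an invented z-score helper:

library(testthat)

# invented helper: z-score standardization
standardize <- function(x) (x - mean(x)) / stats::sd(x)

test_that("standardize recovers mean 0 and sd 1 on ordinary input", {
  z <- standardize(c(1, 2, 3, 4))
  expect_equal(mean(z), 0)
  expect_equal(sd(z), 1)
})

test_that("standardize flags degenerate inputs instead of silently passing", {
  expect_true(all(is.nan(standardize(c(5, 5, 5)))))   # zero variance: 0/0
  expect_true(is.na(standardize(5)))                  # single observation: sd() is NA
})

The first test checks the statistical content (the defining property of a z-score); the second pins down behaviour at two of the edges listed above.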

Test the pipeline, not just the helpers. Unit tests catch helper-function bugs. Integration tests on synthetic data catch pipeline bugs, including ones where helpers work individually but the composition fails.

Don’t test what you don’t trust. Mocking out a function call inside a test of that function makes the test tautological. The test passes because you made it pass, not because the code is correct.

These judgements are what make tests useful rather than performative.

21.5 Why test analyses?

Three concrete benefits:

Catch regressions on refactor. You restructure the cleaning script to use dplyr instead of base::merge. The output should be identical. A test that compares the output to a saved baseline catches any regression.
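
A hedged sketch of such a baseline test; run_pipeline() and the fixture paths are assumptions, and the fixtures would be saved once, before the refactor, with saveRDS():

# tests/testthat/test-baseline.R (hypothetical)
test_that("refactored pipeline reproduces the pre-refactor output", {
  raw      <- readRDS(testthat::test_path("fixtures", "raw.rds"))
  baseline <- readRDS(testthat::test_path("fixtures", "baseline.rds"))
  expect_equal(run_pipeline(raw), baseline)
})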

Catch upstream data changes. Your script reads a CSV from an institutional data warehouse. The warehouse changes a column from days to seconds. The script runs without error. A test that asserts typical values for the column flags the change.
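
A sketch of such a guard; the file path, column name, and threshold are invented for illustration:

test_that("length_of_stay is still measured in days", {
  d <- readr::read_csv("data/warehouse_extract.csv", show_col_types = FALSE)
  # plausible stays are under a year; a silent switch to seconds blows past this
  expect_true(all(d$length_of_stay >= 0 & d$length_of_stay < 366, na.rm = TRUE))
})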

Document expected behaviour. The test ‘with input X, the function returns Y’ is executable documentation. Future-you reads the test to understand what the function is supposed to do.

The cost is small (5 minutes per test). The cost of silent miscalculation in a published paper is much larger.

21.6 Unit tests with testthat

For a function in R/clean.R:

# R/clean.R
#' Compute age groups from numeric age
#' @param age numeric vector
age_group <- function(age) {
  cut(age,
      breaks = c(0, 18, 40, 65, Inf),
      right = FALSE,
      labels = c("under-18", "18-39", "40-64", "65+"))
}

Tests in tests/testthat/test-clean.R:

test_that("age_group bins typical adult ages correctly", {
  expect_equal(as.character(age_group(c(20, 50, 75))),
               c("18-39", "40-64", "65+"))
})

test_that("age_group respects boundary values", {
  expect_equal(as.character(age_group(c(18, 40, 65))),
               c("18-39", "40-64", "65+"))
  expect_equal(as.character(age_group(c(17, 39, 64))),
               c("under-18", "18-39", "40-64"))
})

test_that("age_group handles NA", {
  expect_true(is.na(age_group(NA)))
})

test_that("age_group errors on non-numeric input", {
  expect_error(age_group("twenty"))
})

Each test_that() is one named test with one or more expect_*() assertions. Run all tests:

testthat::test_local()
# or, if the project is a package:
devtools::test()

For non-package analysis projects, place tests in tests/testthat/ and source the relevant scripts at the top of the test file.
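
One way to do that sourcing, assuming the here package is available (a plain relative path, as in the worked example of Section 21.12, also works):

# tests/testthat/test-clean.R, in a non-package project
source(here::here("R", "clean.R"))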

21.7 Integration tests on synthetic data

Unit tests catch helper-function bugs. They do not catch bugs where the helpers work individually but the pipeline composition fails.

Integration tests run the full pipeline on a small synthetic dataset:

# tests/testthat/test-pipeline.R

test_that("cleaning pipeline produces expected analytic dataset", {
  # synthetic data that exercises every branch
  raw <- tibble::tibble(
    patient_id = c(1, 2, 3, 4),
    age        = c(25, 50, 75, 17),     # one excluded
    bp_v1      = c(120, 140, 160, NA),
    bp_v2      = c(118, 138, NA,  NA),
    sex        = c("M", "F", "F", "M"),
    treatment  = c("placebo", "active", "placebo", "active")
  )

  clean <- run_pipeline(raw)

  # structural assertions
  expect_s3_class(clean, "tbl_df")
  expect_named(clean, c("patient_id", "age_group", "sex",
                        "treatment", "visit", "bp"))

  # row count: 3 adults x 2 visits, minus NAs
  expect_equal(nrow(clean), 5)    # patient 4 excluded; patient 3 has 1 NA bp

  # specific values
  expect_equal(clean[clean$patient_id == 1 & clean$visit == 1, "bp"][[1]],
               120)
})

The synthetic dataset is small (4 patients, 2 visits) but exercises each cleaning rule: age exclusion (patient 4), missing values (patient 3 visit 2). A test that runs in a fraction of a second covers the pipeline’s logic.

For each cleaning rule (filter, derive, pivot, recode), the synthetic dataset should have at least one row that exercises it.

Question. Your test asserts expect_equal(nrow(result), 100). The test passes. Does this mean the result is correct?

Answer.

No. The test asserts only that the result has 100 rows. It says nothing about which rows, what values, or whether the right rows were filtered. A bug that produces 100 rows of the wrong patients passes the test silently. To test correctness:

expect_equal(nrow(result), 100)            # structural
expect_setequal(result$patient_id,         # correctness
                expected_ids)
expect_equal(result$age, expected_ages)

The combination tests structure and content. The structural test alone manufactures false confidence. This pattern, ‘looks right at the surface, wrong underneath’, is the most common testing failure mode in data analysis.

21.8 Snapshot tests

For complex outputs that are hard to assert literally, snapshots:

test_that("summary table renders correctly", {
  fit <- lm(mpg ~ wt + hp, data = mtcars)
  expect_snapshot(summary(fit))
})

test_that("Table 1 has expected structure", {
  expect_snapshot(gtsummary::tbl_summary(d, by = treatment))
})

On first run, the output is captured to tests/testthat/_snaps/. On subsequent runs, output is compared to the snapshot; if it differs, the test fails and you decide whether to accept (testthat::snapshot_accept()) or investigate.
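
The review loop, using testthat’s own helpers:

testthat::test_local()        # a changed snapshot fails the affected test
testthat::snapshot_review()   # inspect old and new output side by side
testthat::snapshot_accept()   # adopt the new output as the baseline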

Snapshots are useful when:

  • Output is complex and tedious to assert literally.
  • Output is human-readable formatting (tables, printed summaries, plot text).
  • You care more about ‘is this what was reviewed’ than ‘is this exactly the right value’.

Snapshots are not useful when:

  • Output changes for trivial reasons (whitespace, random seed, system locale). These produce spurious failures; one mitigation, scrubbing volatile values, is sketched after this list.
  • The snapshot is too large to read manually.
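
When the output contains one volatile value you still want to snapshot around, expect_snapshot()’s transform argument can scrub it before comparison. A minimal sketch (the report text is invented):

test_that("report header is stable apart from the run date", {
  expect_snapshot(
    cat("Report generated:", format(Sys.Date()), "\n"),
    transform = function(lines) gsub("\\d{4}-\\d{2}-\\d{2}", "<date>", lines)
  )
})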

21.9 Visual regression testing with vdiffr

For ggplot output, vdiffr saves an SVG of the rendered plot and compares on subsequent runs:

library(vdiffr)

test_that("regression plot is unchanged", {
  fit <- lm(mpg ~ wt, data = mtcars)
  p <- ggplot(broom::augment(fit), aes(.fitted, .resid)) +
        geom_point() +
        geom_hline(yintercept = 0)
  expect_doppelganger("regression-residual-plot", p)
})

vdiffr handles cross-platform rendering issues (font differences, anti-aliasing) better than raw image comparison. Useful for catching unintended changes to plot code (an axis label change, a geom swap).
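
vdiffr plugs into testthat’s snapshot machinery, so reviewing a failure works as in Section 21.8:

testthat::snapshot_review()   # side-by-side widget for the old and new SVGs
testthat::snapshot_accept()   # or keep the baseline and fix the plot code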

21.10 Continuous integration

usethis::use_github_action("check-standard") (in older usethis, use_github_action_check_standard(), now deprecated) sets up GitHub Actions to run R CMD check on every push and pull request, on multiple OS / R-version combinations. For a non-package analysis project, the equivalent is a workflow that runs tests:

# .github/workflows/test.yaml
name: tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          # local::. installs the project as a package, which assumes a
          # DESCRIPTION file; a plain analysis project lists packages directly
          # (any::tibble here stands in for whatever the scripts use)
          packages: |
            any::testthat
            any::tibble
      - name: Run tests
        run: Rscript -e 'testthat::test_local()'

Push the workflow; on every commit, GitHub runs the tests and reports pass/fail. Regressions caught at push time are far cheaper to fix than regressions discovered after publication.

21.11 Test smells

Common patterns that look like tests but produce false confidence:

Tautological tests.

test_that("foo works", {
  expect_equal(foo(1), foo(1))    # always passes
})

The test cannot fail because it tests foo() against itself. Useless.
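
A useful version pins the output to an independently derived value (assuming, for illustration, that foo() is meant to double its input):

test_that("foo doubles its input", {
  expect_equal(foo(1), 2)   # expected value worked out by hand, not by calling foo()
})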

Tests that mock the tested function.

test_that("fit_model returns a list", {
  with_mocked_bindings(
    code = expect_type(fit_model(d), "list"),
    fit_model = function(d) list(...)
  )
})

You replaced the function under test with a stub; the test passes because the stub returns a list, not because fit_model works.

Brittle snapshots.

test_that("output looks right", {
  expect_snapshot(do_thing(today()))     # date in output
})

The snapshot includes the current date; the test fails every day. Either remove the date from the output, or pass a transform function to expect_snapshot() that scrubs it before comparison (as sketched in Section 21.8).

Overly permissive tolerances.

expect_equal(result, 0.5, tolerance = 0.5)   # loose enough to accept badly wrong values

A tolerance as large as the expected value itself is no test: it exists to make the assertion pass, not to catch bugs.
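
A defensible tolerance is derived from the numerical accuracy of the method, not widened until the test passes. For example (both numbers are illustrative):

# tolerance justified by the method: roughly the Monte Carlo error of the
# simulation that produced result
expect_equal(result, 0.5, tolerance = 1e-3)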

Tests that test the test framework, not the code.

expect_true(TRUE)

These are placeholders that should be filled in. Watch for them in code reviews.

21.12 Worked example: testing a cleaning pipeline

# R/cleaning.R
clean_visits <- function(raw) {
  raw |>
    janitor::clean_names() |>
    dplyr::filter(age >= 18) |>
    tidyr::pivot_longer(
      cols = dplyr::starts_with("bp_"),
      names_to = "visit",
      names_prefix = "bp_v",
      values_to = "bp"
    ) |>
    dplyr::filter(!is.na(bp))
}

# tests/testthat/test-cleaning.R
source("../../R/cleaning.R")

test_that("cleaning excludes minors", {
  raw <- tibble::tibble(
    patient_id = 1:3,
    age        = c(25, 17, 30),
    bp_v1      = c(120, 110, 130),
    bp_v2      = c(122, 112, 128)
  )
  result <- clean_visits(raw)
  expect_setequal(result$patient_id, c(1, 3))
})

test_that("cleaning drops NA blood pressures", {
  raw <- tibble::tibble(
    patient_id = c(1, 1, 2),
    age        = c(25, 25, 30),
    bp_v1      = c(120, NA, 130),
    bp_v2      = c(NA,  118, 128)
  )
  result <- clean_visits(raw)
  expect_equal(nrow(result), 4)        # 6 cells, 2 NA → 4
  expect_true(all(!is.na(result$bp)))
})

test_that("cleaning produces long format", {
  raw <- tibble::tibble(
    patient_id = 1,
    age        = 25,
    bp_v1      = 120,
    bp_v2      = 118
  )
  result <- clean_visits(raw)
  expect_named(result,
               c("patient_id", "age", "visit", "bp"),
               ignore.order = TRUE)
  expect_equal(nrow(result), 2)
})

The tests cover: the age filter, the NA filter, the pivot. Each test is a single named scenario; each asserts specific values, not just structure. Running the test suite catches any regression to the cleaning logic.

21.13 Collaborating with an LLM on tests

LLMs draft tests well; the judgement about what to test needs human input.

Prompt 1: drafting tests. Paste the function and ask: ‘write testthat tests covering happy path, edge cases, and error handling. Make the assertions specific to values, not just structure.’

What to watch for. Default LLM tests tend to be shape-checks. Push for value-checks. Edge cases: empty, NA, single observation, type mismatch.

Verification. Introduce a bug into the function and re-run the tests. If the bug is caught, the tests are useful. If not, add a more specific test.
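
A concrete mutation check against the worked example of Section 21.12 (the edit is hypothetical):

# hypothetical mutation: in clean_visits(), change
#   dplyr::filter(age >= 18)   to   dplyr::filter(age > 18)
# then re-run the suite:
testthat::test_local()
# the synthetic ages in test-cleaning.R are 25, 17, and 30; none is exactly 18,
# so this boundary mutation survives every test. The fix is a test row with age 18.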

Prompt 2: integration test on synthetic data. Describe the pipeline; ask the LLM to generate a small synthetic dataset that exercises every branch.

What to watch for. The LLM may generate a dataset that exercises the happy path but not the edges. Verify each branch (filter rules, missing-data paths, factor levels) is covered.

Verification. Add a deliberate bug to one branch; the test should fail. If it does not, the synthetic data does not exercise that branch.

Prompt 3: diagnosing a flaky test. Paste the test and the failure; ask: ‘what’s the source of non-determinism?’

What to watch for. Common causes: random number generation without seed, parallel processing without seed-aware RNG, system-time-dependent output, locale issues. The LLM should know these.

Verification. Apply the fix and run the test several times; if the failures stop, it is fixed.
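
A minimal sketch of the most common fix, pinning the RNG locally with the withr package (the bootstrap itself is invented for illustration):

test_that("bootstrap draws are reproducible under a pinned seed", {
  withr::local_seed(2024)   # seeds the RNG and restores .Random.seed on exit
  boot1 <- replicate(200, mean(sample(mtcars$mpg, replace = TRUE)))

  withr::local_seed(2024)   # re-pin so the second run sees identical draws
  boot2 <- replicate(200, mean(sample(mtcars$mpg, replace = TRUE)))

  expect_equal(boot1, boot2)   # any difference means hidden non-determinism
})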

21.14 Principle in use

Three habits define defensible testing for analyses:

  1. Test the math, not just the shape. Structural checks alone produce false confidence.
  2. Integration tests on synthetic data. A 20-line synthetic dataset that exercises every branch catches pipeline bugs that unit tests miss.
  3. CI as a tripwire. GitHub Actions running tests on every push catches regressions at the commit, not at the publication.

21.15 Exercises

  1. Add unit tests for two functions from an existing analysis of yours. Aim for at least: a happy-path test, an edge-case test, and an error-path test.
  2. Write an integration test: a 20-line synthetic dataset that exercises every branch of your pipeline. Run the full pipeline in under a second.
  3. Set up GitHub Actions with a workflow that runs testthat::test_local() on every push. Push and verify the workflow runs green.
  4. Introduce a deliberate bug into your pipeline. Verify the test suite catches it. If it does not, add a test that does.
  5. Apply vdiffr::expect_doppelganger to one of your project’s plots. Modify the plot’s theme slightly; verify the test fails and you can accept or reject the change.

21.16 Further reading

  • (Wickham & Bryan, 2023) testing chapters, the canonical testthat reference.
  • vdiffr documentation on CRAN, visual regression testing for ggplot2.
  • The testthat package vignettes.

21.17 Prerequisites answers

  1. Even a one-off analysis benefits from tests: they catch regressions when you refactor mid-analysis, they serve as executable documentation of expected behaviour, and they detect silent failures when upstream data changes format. The cost of writing a handful of tests is small; the cost of silently mis-analysing data is much larger.
  2. A unit test exercises a single function with known inputs and checks the output. An integration test runs a pipeline segment (multiple functions) on synthetic data and checks the pipeline’s output. An end-to-end test runs the full analysis pipeline from raw data to final figures/tables. Unit tests are fast and narrow; end-to-end tests are slow and broad; integration tests sit in between, with good cost-benefit for analytic pipelines.
  3. expect_snapshot() captures the printed output of an expression (including messages, warnings, and errors) to a file on first run. On subsequent runs it compares the new output against the saved snapshot. Use it when the output is complex (a printed object, a model summary, a formatted table) and not easily expressed as a literal value; graphical output goes through vdiffr or expect_snapshot_file() instead. Review changes manually when they occur; do not auto-accept.