11 The zzcollab Framework
This chapter covers the in-house zzcollab framework (version 0.1.x), developed at ~/prj/sfw/07-zzcollab/zzcollab/ and introduced in blog posts 14-penguins1zzcollab and 42-zzedcindependence. The chapter describes zzcollab as it exists at the time of writing; check the framework's CHANGELOG.md for current status.
11.1 Prerequisites
Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 11.16.
- What are the Five Pillars that `zzcollab` identifies as essential for a reproducible research compendium, and what does each pillar capture that the other four do not?
- How does `zzcollab` relate to `rrtools`, `renv`, and Docker? Does it replace them, wrap them, or compose them?
- What is the difference between the `analysis` and `modeling` profiles shipped with `zzcollab`, and when would you pick one over the other?
11.2 Learning objectives
By the end of this chapter you should be able to:
- Scaffold a new research project with `zzc <profile>` in one command.
- Explain and audit the Five Pillars of a `zzcollab` compendium.
- Choose the appropriate profile (minimal, analysis, modeling, publishing, shiny) for a given study.
- Use the project `Makefile` targets (`make r`, `make check-renv`, `make render`) to interact with the container.
- Extend `zzcollab` with a custom bundle that encodes your research group's conventions in `templates/bundles.yaml`.
- Recognise when `zzcollab` is overkill and when it is exactly the right level of infrastructure.
11.3 Orientation
zzcollab is an opinionated research-compendium framework. It packages the ideas of rrtools, renv, and Docker into a single command-line workflow, trading flexibility for speed of setup and consistency across a research group’s projects. You get less control over the directory layout than with bare rrtools; in exchange, every project in your group looks the same, collaborators know where things live, and you can go from mkdir to a reproducible container in under five minutes.
The framework exists because the manual setup (Dockerfile, renv, rrtools layout, Makefile targets, check scripts) is roughly 30–60 minutes of error-prone work per project. For a research group that produces many projects, automating that setup pays back quickly. For a one-off analysis, plain rrtools (chapter 7) suffices.
11.4 The statistician’s contribution
zzcollab automates the mechanical setup. The judgements remain yours:
Choose the right profile. A modelling project that uses lme4 will fail in an analysis profile container that lacks libgsl-dev. A simple descriptive analysis in a modeling profile waits five minutes longer for an unnecessary container build. The profile is a substantive choice; defaulting to the heaviest one wastes time, the lightest one wastes more time when it fails to support what you need.
Decide when to extend. A research group has local conventions: a preferred database client, a set of in-house packages, journal-specific Quarto templates. Extending zzcollab with a custom bundle codifies these conventions and saves every downstream project from re-enumerating them. The extension is a one-time investment; the benefit compounds.
Audit the Pillars. make check-renv and related commands verify that the compendium has all five pillars and that they are mutually consistent. Skipping the audit before deposition or paper submission is the kind of small omission that breaks reproducibility years later.
Recognise the limits. zzcollab is not the answer to every project. For interactive exploration, RStudio alone is faster. For a one-off script, scaffolding a compendium is overhead. The skill is knowing which projects warrant the framework.
These decisions are what distinguish using zzcollab as a tool from using it as a ritual.
11.5 The Five Pillars
zzcollab identifies five components as jointly necessary for reproducibility. A compendium missing any one is incomplete.
| Pillar | Role |
|---|---|
| `Dockerfile` | Operating system, system libraries, R build |
| `renv.lock` | Exact versions of every R package |
| `.Rprofile` | Project-local R configuration (activates renv) |
| Source code | Analysis scripts and functions |
| Data | Raw and processed data (or DOI pointers) |
Pillar 1: Dockerfile. Captures the OS and system libraries below R. Without it, an analysis that depends on libgdal or a specific BLAS implementation reproduces only on machines with those libraries pre-installed. The zzcollab-generated Dockerfile pins the rocker base image (e.g., rocker/r-ver:4.4.0) and the profile-specific apt packages.
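A generated Dockerfile along these lines captures both pins. This is a sketch of the pattern, not the framework's verbatim output; the exact apt package list for each profile lives in `templates/bundles.yaml`:

```dockerfile
# Pinned base image: the OS and R build are fixed, never "latest"
FROM rocker/r-ver:4.4.0

# Profile-specific system libraries (analysis profile: XML/SSL deps)
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl libcurl4-openssl-dev libssl-dev libxml2-dev \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /project
```

Pinning `rocker/r-ver:4.4.0` rather than `rocker/r-ver:latest` is the point: a rebuild five years from now should produce the same OS and R version.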
Pillar 2: renv.lock. Captures R itself plus every R package version, including transitive dependencies. Without it, package updates between the analysis run and the reproduction silently change behaviour.
Pillar 3: .Rprofile. A project-local R startup file that activates renv’s project library. Without it, opening the project in a fresh R session loads the user’s global library, defeating renv’s isolation.
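The file itself is tiny; renv generates essentially one line, and any R session started in the project root runs it automatically:

```r
# .Rprofile -- sourced by R on startup in this directory.
# Swaps in renv's project-local library for the user's global library.
source("renv/activate.R")
```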
Pillar 4: Source code. The analysis itself, in R/ (helpers) and analysis/ (scripts and narrative). Version-controlled in git from day one.
Pillar 5: Data. Raw and processed data, or a documented procedure for obtaining them. Sensitive data often cannot live in the compendium directly; in that case, the compendium contains a README documenting the access path (chapter 2 has the workflow).
The five pillars are a checklist. Every project should satisfy each; make check-renv and a quick manual audit verify that they do.
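The quick manual audit can itself be scripted. A minimal sketch (a hypothetical helper, not a zzcollab command) that checks each pillar has a file artifact in the project root:

```shell
#!/bin/sh
# pillar-check.sh -- report any of the Five Pillars missing on disk.
for f in Dockerfile renv.lock .Rprofile; do
  [ -f "$f" ] || echo "missing pillar: $f"
done
[ -d R ] && [ -d analysis ] || echo "missing pillar: source tree (R/ and analysis/)"
# Data may live elsewhere; a missing data/ directory is a prompt, not a failure.
[ -d data ] || echo "check pillar: no data/ directory (is the access path documented?)"
```

Run from the project root, it prints nothing when all five artifacts are present.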
11.6 Profiles and bundles
The command zzc <profile> scaffolds a project using one of five profiles. Each profile is defined in templates/bundles.yaml as a combination of a system library bundle (Linux packages installed via apt-get) and an R package bundle (R packages installed via renv).
Current profiles (abbreviated from bundles.yaml):
| Profile | System libs | R packages |
|---|---|---|
| `minimal` | git, curl | base R only |
| `analysis` | + XML/SSL deps | tidyverse, palmerpenguins |
| `modeling` | + libgsl-dev | + glmnet, lme4, survival |
| `publishing` | + texlive, pandoc, libv8-dev | + Quarto-specific tooling |
| `shiny` | + nodejs | + shiny, bslib, reactlog |
Choosing among them:
- `minimal`. When you want to add packages one at a time. Useful for teaching examples or for projects with tight package budgets.
- `analysis`. Default for descriptive, exploratory, and tidyverse-heavy work. The fastest container build that still has the common tools.
- `modeling`. When you will fit non-trivial statistical models (GLMMs, survival, penalised regression). The compiled-package dependencies (`libgsl-dev`, `libssh2-dev`) are the slow part; baking them in saves later trouble.
- `publishing`. When the deliverable is a rendered manuscript. The TeX Live and Pandoc layers are large but unavoidable for paper rendering.
- `shiny`. For Shiny apps. Adds Node.js for building front-end assets and the Shiny package family.
A profile mismatch is recoverable: edit the Dockerfile, apt-get install whatever is missing, make docker-build. But getting it right the first time saves the rebuild.
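For example, if an `analysis`-profile project later needs `lme4`, the recovery is a one-line Dockerfile addition followed by `make docker-build`. A sketch of the added layer (the surrounding Dockerfile layout is assumed):

```dockerfile
# Added after a profile mismatch: compiled modelling packages need GSL headers.
RUN apt-get update && apt-get install -y --no-install-recommends libgsl-dev \
    && rm -rf /var/lib/apt/lists/*
```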
11.7 The first-project walkthrough
The minimum viable zzcollab session:

```shell
mkdir penguins-analysis
cd penguins-analysis
zzc analysis     # scaffold (Five Pillars appear)
make r           # enter R session inside Docker
# ... do analysis ...
make check-renv  # audit the compendium
make render      # build final Quarto output
```

What happens at each step:
zzc analysis runs the framework’s scaffolding. After it completes, the directory contains a Dockerfile (rocker/r-ver:4.4.0 base + XML/SSL deps), an initial renv.lock referencing tidyverse and palmerpenguins, an .Rprofile that activates renv, an analysis/ and R/ directory tree, and a Makefile. Git is initialised; the first commit captures the scaffolding.
make r builds the Docker image (slow on first run; cached subsequently) and drops you into an interactive R session inside the container. The container’s R session sees the project’s renv library; the working directory maps to the host’s project directory, so files edited inside the container persist.
make check-renv runs the audit suite: verify that all packages mentioned in the code are in the lockfile, that the lockfile and library are in sync, that the Dockerfile builds, that the .Rprofile activates renv correctly. Run it before deposition or paper submission.
make render builds the final output. For a Quarto-based project, this renders the analysis documents to HTML and PDF.
11.8 The project Makefile
Every zzcollab project gets a Makefile with a small set of targets:
| Target | Action |
|---|---|
| `make help` | List all targets |
| `make r` | Drop into R inside the project's container |
| `make rstudio` | Launch RStudio Server (port 8787) |
| `make render` | Render `analysis/` to HTML and PDF |
| `make check-renv` | Validate `renv.lock` against the library |
| `make docker-build` | Rebuild the container (use after adding system deps) |
| `make clean` | Remove generated outputs (preserves caches) |
The non-obvious ones:
make rstudio for browser-based interactive work. After `make rstudio`, open http://localhost:8787 in a browser; log in as rstudio (the password is printed in `make rstudio`'s output). Useful for collaborators who do not want to use the terminal.
make docker-build rebuilds the image after you add system dependencies. If you add a new apt-get install line to the Dockerfile, make r will silently use the old image; make docker-build forces a rebuild.
make clean removes derived outputs (rendered HTML, PDFs, Quarto’s _freeze/ cache). Useful before pushing a release; not useful day to day.
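Under the hood, these targets are thin wrappers over the docker CLI. A sketch of what `make docker-build` and `make r` plausibly expand to; the image-naming convention and mount flags here are assumptions for illustration, not zzcollab's verbatim Makefile:

```makefile
# Image named after the project directory (assumed convention)
IMAGE := $(notdir $(CURDIR))

docker-build:
	docker build -t $(IMAGE) .

# Mount the host project directory so edits inside the container persist
r:
	docker run --rm -it -v "$(CURDIR)":/project -w /project $(IMAGE) R
```

The `-v "$(CURDIR)":/project` mount is what makes `make r` feel like local work: the container supplies the environment, the host supplies the files.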
11.9 Worked example: adding a project bundle
Suppose your group works heavily with the Palmer Penguins data and has a local style package. Define a custom bundle:
```yaml
# templates/bundles.yaml (excerpt)
penguin_research:
  description: "Palmer Penguins analysis with house style"
  inherits: analysis
  apt_packages:
    - libgsl-dev           # for any modelling extras
  r_packages:
    - palmerpenguins
    - lme4
    - mygroup_styles_pkg   # in-house package
```

Then:

```shell
zzc penguin_research my-new-project
```

Future projects scaffold with the bundle preconfigured. Adding the bundle once saves every team member from re-enumerating dependencies.
For a personal customisation that does not need to be shared, the same procedure applies, but to your own fork of zzcollab.
11.10 Extending zzcollab
For a research group with local conventions (a preferred database client, a set of in-house packages, journal-specific Quarto templates), fork zzcollab and add to templates/bundles.yaml:
- Add a new bundle. Follow the YAML schema: `description`, `inherits` (a parent bundle to extend), `apt_packages`, `r_packages`.
- Test on a scratch project. `zzc <bundle> tmp-test`; verify the container builds and the packages install.
- Document. Add a paragraph to the group's README explaining the bundle's purpose.
- Distribute. A shared fork or branch lets the team use the new bundle across projects.
The framework is intentionally easy to extend: bundles are a few lines of YAML, and existing ones serve as templates.
11.11 When zzcollab is not the right tool
zzcollab is overkill for:
- Interactive exploration. Just open RStudio.
- One-off scripts. Just write a script.
- Teaching examples whose point is the code, not the infrastructure. Less infrastructure is clearer.
- Quick reproductions of someone else’s analysis. Their existing structure should control.
It is the right tool for: collaborative analyses that will be passed between team members, analyses that must survive for months or years after the lead analyst has moved on, and any analysis subject to federal reproducibility requirements (chapter 2).
The ‘right level of infrastructure’ is itself a judgement. Excessive infrastructure on a small project is busywork; insufficient infrastructure on a large one is technical debt. zzcollab makes the high-infrastructure path cheap, which shifts the right answer toward more infrastructure on the margin.
11.12 Collaborating with an LLM on zzcollab
LLMs handle the framework well; the trap is generic ‘set up a research project’ suggestions that ignore the framework’s existence.
Prompt 1: auditing a non-zzcollab analysis. Paste a directory listing and ask: ‘map this project onto the Five Pillars. What is missing, and what would I need to add to make it zzcollab-compliant?’
What to watch for. The LLM should identify each pillar concretely (Dockerfile present? renv.lock? .Rprofile? Source code? Data?). If it generates abstract guidance instead, push for the file-by-file inventory.
Verification. Run zzc analysis in a scratch copy; compare the resulting structure to your existing project. The diff is the work to do.
Prompt 2: comparing frameworks. Describe a project type and ask the LLM to compare zzcollab, rrtools, and workflowr.
What to watch for. Each framework has a niche. zzcollab is opinionated and fast; rrtools is flexible; workflowr emphasises the analysis website. The LLM should not declare a winner; it should match each to a use case.
Verification. Cross-reference each framework’s README; if the LLM’s claims about features are correct, the comparison is reliable.
Prompt 3: designing a bundle. Paste bundles.yaml and describe a niche (e.g., neuroimaging, RNA-seq, electrophysiology). Ask: ‘design a bundle covering the standard dependencies for this niche.’
What to watch for. The system libraries (neuroimaging needs FSL or AFNI tooling; RNA-seq often needs Bioconductor packages plus their apt-get dependencies). The LLM should produce plausible candidates; verify against domain documentation.
Verification. Build the proposed bundle on a scratch project; install the packages. The build either succeeds or surfaces missing dependencies (which you then add).
11.13 Principle in use
Three habits define defensible zzcollab use:
- Pick the right profile up front. Profile mismatches are recoverable but slow. Match profile to project type from day one.
- Audit before deposition. `make check-renv` plus a manual Pillar inventory before submitting or archiving.
- Extend deliberately, not promiscuously. A custom bundle for genuine group conventions; not for every one-off project.
11.14 Exercises
- Scaffold a `zzcollab` analysis project for the Palmer Penguins dataset. Run `make r`, load the data, produce a single scatter plot, and save it to `analysis/`.
- Take an existing analysis of yours and migrate it into a `zzcollab` compendium. Confirm every Pillar is present and run `make check-renv` to validate.
- Design a custom bundle for your research niche (e.g., 'survival analysis with heavy tidymodels use'). Add it to a fork of `zzcollab` and use it for a real project.
- Compare the time to set up a project from scratch with `rrtools` vs. with `zzc analysis`. Time both. Document the difference.
- Use `make rstudio` to launch RStudio Server in a `zzcollab` container; access it from a browser; render an analysis from inside the browser session. Confirm the result matches `make render` from the terminal.
11.15 Further reading
- The `zzcollab` README at github.com/rgt47/zzcollab, the authoritative reference.
- (Marwick et al., 2018), the broader conceptual framing that `zzcollab` inherits.
- The `rocker` project documentation for the base images `zzcollab` builds on.
11.16 Prerequisites answers
- The Five Pillars are: (1) a Dockerfile (captures OS and system libs), (2) an renv.lock (captures R package versions), (3) an .Rprofile (activates renv and sets project-local R options), (4) source code (the actual analysis), and (5) data (raw and processed, or DOI pointers when data cannot be shared). Each pillar captures a layer that the others do not: Docker covers the OS below R, `renv` covers R itself, `.Rprofile` covers R's runtime configuration, source code is the work product, and data is the input. Remove any one and the analysis cannot be reproduced.
- `zzcollab` composes the other three. It uses `rrtools` for the package-backed compendium layout, `renv` for R package management, and Docker for environment reproduction. On top of these it adds profile-based shortcuts (five ready-made bundles), a build system (Makefile), and validation commands (`make check-renv`). It does not replace any of the three tools; it removes the manual wiring between them.
- The `analysis` profile provides tidyverse-focused R packages and is the right default for descriptive analyses, data wrangling, and straightforward reporting. The `modeling` profile adds `glmnet`, `lme4`, `survival`, and `libgsl-dev` system support, and is appropriate for regression, mixed-effects, or survival analyses that need compiled model-fitting packages. Choose `modeling` when you will fit non-trivial models; otherwise choose `analysis` for faster container builds.