15 Factors, Strings, and Dates
Stat 545 Chapters 10–13 (Jenny Bryan, UBC); the forcats, stringr, and lubridate packages.
15.1 Prerequisites
Answer the following questions to see if you can bypass this chapter. You can find the answers at the end of the chapter in Section 15.15.
- Why has stringsAsFactors = FALSE been the default since R 4.0, and what problems did the old default cause?
- What does stringr::str_detect() return, and how does it differ from stringr::str_extract()?
- Given c('2026-04-23', '04/23/2026', 'April 23, 2026'), write one lubridate/readr pipeline that parses all three into Date objects.
15.2 Learning objectives
By the end of this chapter you should be able to:
- Manage factors idiomatically with forcats (reorder, lump, relevel, drop, infreq).
- Use stringr verbs confidently for detection, extraction, replacement, splitting, and interpolation.
- Parse dates robustly with lubridate, including mixed formats and time zones.
- Diagnose and fix the common ‘character vs. factor vs. ordered factor’ decision that causes modelling errors downstream.
- Recognise locale, time-zone, and Excel-date pitfalls.
15.3 Orientation
After the numeric columns, the next most common source of analysis bugs is the handling of text, categories, and dates. This chapter covers the three packages that make each category tractable. Master them and roughly one-third of biomedical data cleaning becomes routine.
The packages are:
- forcats for factors (categorical variables with defined levels).
- stringr for character operations.
- lubridate for dates and times.
All three are part of the tidyverse, with consistent naming and pipe-friendly interfaces.
15.4 The statistician’s contribution
Mechanics are mechanical. The judgements:
Factor or character? A character vector is flexible: any value, any order, any new level later. A factor is constrained: a fixed set of levels, a defined order, attached metadata. For analysis, factors are correct: the model needs to know the reference level and the level set. For data cleaning and joining, characters are usually safer (no surprises when a new value arrives). Convert to factor late, near the modelling step.
Reference levels matter. A logistic regression with factor treatment and reference ‘high-dose’ produces coefficients with a different interpretation from one with reference ‘placebo’. The default (alphabetical) is rarely what you want. Set the reference deliberately, in writing, in the data-cleaning script.
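A minimal sketch of how the reference level changes what the coefficients mean. It uses base R only, a linear model rather than a logistic one for brevity, and simulated data; the names trt and y and the simulated values are illustrative assumptions, not part of any real trial.

```r
# Simulated data; variable names and values are illustrative only
set.seed(1)
trt <- factor(rep(c("placebo", "low-dose", "high-dose"), each = 20))
y <- rnorm(60)

# Default level order is alphabetical, so "high-dose" becomes the reference
levels(trt)
default_fit <- lm(y ~ trt)
names(coef(default_fit))   # contrasts: low-dose and placebo, each vs high-dose

# Set the reference deliberately
trt_ref <- relevel(trt, ref = "placebo")
explicit_fit <- lm(y ~ trt_ref)
names(coef(explicit_fit))  # contrasts: high-dose and low-dose, each vs placebo
```

Both fits give the same fitted values; only the parameterisation, and therefore the reading of each coefficient, changes.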
Time zones are tricky. A ‘date’ value ‘2026-04-23’ without a time zone is fine for daily-resolution analyses. A ‘time’ value ‘2026-04-23 02:00:00’ is ambiguous: which time zone? UTC? US/Pacific? The local time zone of the machine that recorded it? Time-zone errors propagate quietly through analyses; daylight-saving transitions cause spurious 1-hour gaps.
Excel dates. When data come from Excel, dates may arrive as 5-digit serial numbers (‘45405’) rather than date strings. Interpreting them with the wrong origin produces nonsense: counted from the Unix epoch of 1970-01-01, 45405 lands in 2094, about 124 years after the epoch, rather than in 2024. The fix is janitor::excel_numeric_to_date() or careful import. Verify any column whose name suggests dates but whose values are numeric.
These judgements are what distinguish a working data cleaning script from one that silently mishandles a common type.
15.5 Factors with forcats
The standard operations:
library(forcats)
f <- factor(c("low", "high", "low", "medium", "high"))
# levels in current order
levels(f)
#> [1] "high" "low" "medium"
# alphabetical by default; reorder by frequency
fct_infreq(f)
# reorder by another variable
# (df stands for any data frame with these columns, e.g. palmerpenguins::penguins)
df |> dplyr::mutate(species = fct_reorder(species, body_mass_g, mean))
# specify order explicitly
fct_relevel(f, "low", "medium", "high")
# combine rare levels
fct_lump_n(f, n = 2) # keep top 2; rest -> 'Other'
fct_lump_min(f, min = 5) # keep levels with >= 5 obs
# drop unused levels
fct_drop(f)
# rename levels
fct_recode(f, "L" = "low", "M" = "medium", "H" = "high")
When to convert character to factor:
- Just before modelling: ensures the model knows the level set and the reference.
- Just before plotting: enables fct_reorder for ordered axis labels.
When to keep as character:
- During data cleaning: avoids ‘invalid factor level’ surprises when a new value arrives.
- For joining: factor matching can fail when levels differ between tables.
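The ‘clean in character, convert late’ advice can be sketched as follows. The dose values are made up for illustration; the point is only that a character vector accepts an unseen value while a factor turns it into NA.

```r
library(forcats)

# During cleaning: a character vector accepts a value not seen before
dose_chr <- c("low", "high", "low", "medium")
dose_chr[5] <- "very high"

# The same assignment on a factor yields NA (R warns 'invalid factor level')
dose_fct <- factor(c("low", "high", "low", "medium"))
suppressWarnings(dose_fct[5] <- "very high")
is.na(dose_fct[5])

# Convert late, with the level order stated explicitly
dose <- fct_relevel(factor(dose_chr), "low", "medium", "high", "very high")
levels(dose)
```

Stating all levels in fct_relevel() is a deliberate choice here: it makes the intended order visible in the script instead of relying on the alphabetical default.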
15.6 Strings with stringr
Pattern detection, extraction, replacement, and splitting:
library(stringr)
s <- c("ABC123", "ABC", "XYZ-789", NA)
# detect: returns logical
str_detect(s, "[0-9]")
#> [1] TRUE FALSE TRUE NA
# extract: returns first match per input, or NA
str_extract(s, "[0-9]+")
#> [1] "123" NA "789" NA
# extract all matches per input (returns list)
str_extract_all(s, "[0-9]+")
# replace
str_replace(s, "[0-9]+", "***")
# split
str_split("a,b,c,d", ",", simplify = TRUE)
#> [,1] [,2] [,3] [,4]
#> [1,] "a" "b" "c" "d"
# interpolate
name <- "World"
str_glue("Hello, {name}!")
#> Hello, World!
# case
str_to_lower("ABC") # "abc"
str_to_upper("abc") # "ABC"
str_to_title("hello world") # "Hello World"
# trim whitespace
str_trim(" abc ") # "abc"
str_squish(" a b c ") # "a b c"
The regex fundamentals you need:
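- [abc]: any of a, b, c.
- [a-z], [A-Z], [0-9]: ranges.
- \\d, \\w, \\s: digit, word character, whitespace (shown as typed inside an R string, hence the doubled backslash).
- *, +, ?: zero-or-more, one-or-more, zero-or-one.
- {n}, {n,m}: exactly n, between n and m.
- ^, $: start, end of string.
- (): capture group; (?:...): non-capturing group.
- |: alternation.
- \\: escape (a single regex backslash, doubled because R strings also use backslash escapes); to match a literal backslash, write \\\\.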
For complex regex, build incrementally and test each step.
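One way to build a pattern incrementally is sketched below. The ID format (three letters, an optional hyphen, three digits) and the sample values are assumptions made up for this illustration.

```r
library(stringr)

# Hypothetical sample of subject IDs, including awkward cases
ids <- c("ABC-123", "abc123", "XYZ-7", "no id", NA)

# Step 1: three letters somewhere in the string
step1 <- "[A-Za-z]{3}"
str_detect(ids, step1)

# Step 2: allow an optional hyphen after the letters
step2 <- str_c(step1, "-?")
str_detect(ids, step2)

# Step 3: require exactly three digits and anchor both ends
step3 <- str_c("^", step2, "\\d{3}$")
str_detect(ids, step3)
```

Checking the logical vector after each step shows which inputs the newest piece of the pattern excluded, which is easier to debug than a single long expression.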
15.7 Dates with lubridate
Parsing:
library(lubridate)
# fixed format
ymd("2026-04-23")
mdy("4/23/2026")
dmy("23 April 2026")
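# mixed formats: try each order
parse_date_time(c("2026-04-23", "04/23/2026", "April 23, 2026"),
                orders = c("ymd", "mdy", "Bdy"))
#> [1] "2026-04-23 UTC" "2026-04-23 UTC" "2026-04-23 UTC"
# parse_date_time() returns POSIXct; wrap the call in as_date() to get Date objects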
# date-time
ymd_hms("2026-04-23 14:30:00")
ymd_hm("2026-04-23 14:30")
Components:
d <- ymd("2026-04-23")
year(d) # 2026
month(d) # 4
day(d) # 23
wday(d) # 5 (1 = Sunday by default)
wday(d, label = TRUE) # Thu
Arithmetic:
d + days(7) # 2026-04-30
d %m+% months(3) # 2026-07-23 (handles month length safely)
# differences
ymd("2026-12-31") - ymd("2026-01-01") # Time difference of 364 days
as.numeric(ymd("2026-12-31") - ymd("2026-01-01"), units = "days")
Time zones:
# parse as a specific time zone
t <- ymd_hms("2026-04-23 14:30:00", tz = "America/Los_Angeles")
# convert
with_tz(t, "UTC") # same instant, different zone
force_tz(t, "UTC") # same wall clock, different zone
with_tz changes the displayed zone without changing the instant; force_tz changes the instant by declaring the wall clock was always in the new zone. Confusing them is a common source of 1- to 24-hour errors.
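The difference can be checked numerically. The sketch below uses the same timestamp as above; the object names are illustrative.

```r
library(lubridate)

t_la <- ymd_hms("2026-04-23 14:30:00", tz = "America/Los_Angeles")

same_instant <- with_tz(t_la, "UTC")   # 21:30 UTC, because PDT is UTC-7
same_clock   <- force_tz(t_la, "UTC")  # 14:30 UTC, a different instant

# with_tz() leaves the underlying instant untouched
as.numeric(same_instant) == as.numeric(t_la)

# force_tz() moved the instant by the zone offset
difftime(t_la, same_clock, units = "hours")
```

Comparing the underlying numeric values, rather than the printed strings, is a quick way to confirm which of the two operations a pipeline actually performed.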
15.8 Common pitfalls
Excel dates as serial numbers. A date column from Excel may arrive as 45405 (= 2024-04-23 in Excel’s 1900-epoch numbering). Parse with:
janitor::excel_numeric_to_date(45405)
#> [1] "2024-04-23"
Verify: any ‘date’ column whose values are 5-digit numbers should ring an alarm.
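For readers without janitor installed, the same conversion can be sketched in base R, together with a rough screening check. The helper looks_like_excel_date() and its thresholds are hypothetical, chosen only to cover roughly 1954 to 2119; adjust them to the study period.

```r
# The conventional origin for Excel's 1900 date system is 1899-12-30
serial <- c(45405, 45406)
as.Date(serial, origin = "1899-12-30")

# Hypothetical helper: TRUE when a numeric column sits entirely in a
# plausible serial-date range (thresholds are an assumption)
looks_like_excel_date <- function(x, lower = 20000, upper = 80000) {
  is.numeric(x) && any(!is.na(x)) && all(x >= lower & x <= upper, na.rm = TRUE)
}

# Illustrative data frame: one suspicious column, one ordinary numeric column
df <- data.frame(visit_date = c(45405, 45406), age = c(34, 61))
flags <- vapply(df, looks_like_excel_date, logical(1))
flags
```

A flag from a check like this is a prompt to inspect the column, not proof that it holds dates.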
Mixed-locale month names. lubridate::dmy() can read ‘avril’ (French April) only if the locale supports it. For multi-language data:
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "fr_FR.UTF-8")
dmy("23 avril 2026")
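Sys.setlocale("LC_TIME", lct) # restore
Or pass locale = "fr_FR.UTF-8" to the parser itself, which avoids changing the global setting, or use explicit format strings. Either way, the named locale must be installed on the system.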
Time-zone-naïve datetimes. A datetime read without a time zone defaults to UTC (lubridate) or the system zone (base R). Be explicit; never assume.
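A short sketch of the two defaults and of the explicit alternative. The timestamp is illustrative, and the base R result depends on the machine running the code.

```r
library(lubridate)

naive <- "2026-04-23 14:30:00"

# lubridate assumes UTC when no zone is given
parsed_utc <- ymd_hms(naive)
tz(parsed_utc)

# base R records an empty zone, meaning 'whatever the system zone is'
attr(as.POSIXct(naive), "tzone")

# stating the zone removes the ambiguity
explicit <- ymd_hms(naive, tz = "America/Los_Angeles")
tz(explicit)
```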
Daylight-saving transitions. Adding 24 elapsed hours and adding 1 calendar day are not always the same. In 2026, US clocks spring forward on 8 March:
ymd_hms("2026-03-08 01:00:00", tz = "US/Pacific") + days(1)
#> [1] "2026-03-09 01:00:00 PDT"
ymd_hms("2026-03-08 01:00:00", tz = "US/Pacific") + dhours(24)
#> [1] "2026-03-09 02:00:00 PDT"
The first adds a calendar day (a period), so the wall-clock time is preserved; the second adds 24 elapsed hours (a duration), which crosses the DST boundary and lands an hour later on the clock. Use whichever is appropriate for the analysis.
15.9 Common antipatterns
Hand-coded ifelse chains for date parsing:
# bad: hard to extend, and ifelse() drops the Date class, returning bare numbers
ifelse(grepl("/", x),
       as.Date(x, format = "%m/%d/%Y"),
       as.Date(x, format = "%Y-%m-%d"))
# good
parse_date_time(x, orders = c("mdy", "ymd"))
Letting characters become factors silently in old code:
# always specify in code that may run on R < 4.0
read.csv("file.csv", stringsAsFactors = FALSE)
read_csv("file.csv") # readr never converts strings to factors
Comparing factors by value when levels differ:
f1 <- factor("a", levels = c("a", "b"))
f2 <- factor("a", levels = c("a", "c"))
f1 == f2 # Error: level sets of factors are different
# better: convert to character for cross-table comparisons
as.character(f1) == as.character(f2)
15.10 Worked example: cleaning a CRF export
library(tidyverse)
library(janitor)
library(lubridate)
library(forcats)
raw <- read_csv("data/raw/crf_export.csv")
clean <- raw |>
clean_names() |>
mutate(
    # parse mixed-format date column; as_date() turns the POSIXct result
    # into Date so that both if_else() branches below have the same type
    visit_date = as_date(parse_date_time(visit_date,
                                         orders = c("ymd", "mdy"))),
    # convert numeric Excel dates if any
    visit_date = if_else(is.na(visit_date) & !is.na(visit_date_excel),
                         excel_numeric_to_date(visit_date_excel),
                         visit_date),
# standardise treatment text
treatment = str_to_title(treatment),
treatment = str_replace(treatment, "Placebo|Pbo", "Placebo"),
# convert to factor with deliberate reference
treatment = factor(treatment,
levels = c("Placebo", "Low Dose", "High Dose")),
# collapse small categories
site = fct_lump_min(site, min = 10, other_level = "Other"),
    # recode missing-as-unknown (read_csv() reads empty cells as NA, not "")
    sex = fct_na_value_to_level(factor(sex), level = "Unknown")
) |>
  filter(!is.na(visit_date), !is.na(treatment))
The cleaned data has dates parsed from mixed formats, treatment standardised and explicitly factored, small sites collapsed, and missing sex coded explicitly. The script documents every transformation as code.
15.11 Collaborating with an LLM on types
LLMs handle these packages well; the trap is silent type coercion.
Prompt 1: parsing dates. Paste a sample of date strings (10 examples covering the formats present) and ask: ‘parse these robustly, flagging any that fail.’
What to watch for. parse_date_time() with multiple orders is the canonical answer. If the LLM produces nested ifelse or case_when, push back.
Verification. Run on the full column. Count NAs produced; investigate any unexpected ones.
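The verification step might look like the sketch below. The sample vector is invented for illustration, and month names are assumed to be in English, matching the session locale.

```r
library(lubridate)

# Illustrative input: three valid formats, one junk value, one genuine NA
x <- c("2026-04-23", "04/23/2026", "April 23, 2026", "not a date", NA)

# quiet = TRUE suppresses the 'failed to parse' warning; failures become NA
parsed <- as_date(parse_date_time(x, orders = c("ymd", "mdy", "Bdy"), quiet = TRUE))

# A failure is a value that was present but could not be parsed
failed <- !is.na(x) & is.na(parsed)
sum(failed)
x[failed]
```

Separating ‘missing on input’ from ‘failed to parse’ keeps genuine NAs from hiding parsing problems.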
Prompt 2: regex extraction. Describe the pattern you want and provide examples. Ask the LLM to write a regex.
What to watch for. Test the regex on edge cases the LLM may not have considered (empty strings, unusual separators, mixed case). LLM regexes often fail at the margins.
Verification. str_detect and str_extract on a test set; count mismatches.
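A small test set makes ‘count mismatches’ concrete. Everything in this sketch is hypothetical: the inputs, the expected values, the candidate pattern standing in for an LLM suggestion, and the helper count_mismatches().

```r
library(stringr)

# Hypothetical test set: inputs paired with the value we expect to extract
input    <- c("ID: 123", "id:45", "", "ID 6789", "no digits")
expected <- c("123", "45", NA, "6789", NA)

# Hypothetical helper: number of inputs where extraction disagrees with expectation
count_mismatches <- function(pattern) {
  got <- str_extract(input, pattern)
  same <- (is.na(got) & is.na(expected)) |
    (!is.na(got) & !is.na(expected) & got == expected)
  sum(!same)
}

# A candidate that only handles the exact prefix 'ID: '
narrow <- count_mismatches("(?<=ID: )\\d+")

# A looser pattern that takes the first run of digits
loose <- count_mismatches("\\d+")

c(narrow = narrow, loose = loose)
```

The narrow pattern misses the lower-case and no-colon variants, which is the kind of marginal failure the test set is meant to expose.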
Prompt 3: factor reference level. Paste a modelling formula and ask: ‘is the default reference level for treatment clinically sensible? If not, how should it be set?’
What to watch for. The LLM should know that alphabetical reference is rarely what you want. Specifying via factor(..., levels = ...) or fct_relevel is the canonical fix.
Verification. Re-fit the model with the explicit reference; check coefficients match the intended interpretation.
15.12 Principle in use
Three habits define defensible type handling:
- Convert to factor late. Clean in character; model in factor. Specify reference levels deliberately.
- Parse dates with named formats. parse_date_time with explicit orders, not guesses. Watch for Excel serial numbers.
- Be explicit about time zones. A datetime without a zone is a bug waiting to happen.
15.13 Exercises
- Using palmerpenguins::penguins, reorder the species factor by mean body mass (ascending) using fct_reorder(). Produce a bar plot that reflects the new order.
- A character column contains hospital names like 'General Hospital Main Campus', 'Gen. Hospital', 'GENHOSP'. Write a regex-based cleaning pipeline that collapses these into a single canonical level.
- A dataset has a visit_date column in mixed formats. Write a pipeline that parses everything with parse_date_time() and flags rows that fail to parse.
- Write a function that takes any character column and returns a tibble of value, count, and proportion. Use it to audit the categorical variables in a recent analysis of yours.
- Create a dataset with a datetime in PST and convert it to UTC two ways: with with_tz and force_tz. Explain the difference.
15.14 Further reading
- (Wickham et al., 2023) Chapters 14–18, forcats, stringr, lubridate.
- (Bryan & Stephens, 2019) Chapters 10–13, detailed type-specific coverage.
- The regular-expressions.info website: a comprehensive regex reference.
15.15 Prerequisites answers
- Before R 4.0, read.csv() and data.frame() converted character columns to factors by default, causing surprise when levels were invented, ordered, or appeared where a character string was expected (e.g., as merge keys). Downstream analysts had to remember stringsAsFactors = FALSE or suffer silent coercion. R 4.0 made FALSE the default; the old default now matters only when reading legacy code.
- str_detect(string, pattern) returns a logical vector of the same length as string: TRUE where the pattern matched, FALSE elsewhere, and NA where the input is NA. str_extract() returns a character vector: the first matching substring per input, or NA_character_ where there was no match. str_detect is for filtering; str_extract is for pulling out parts.
- as_date(parse_date_time(x, orders = c('ymd', 'mdy', 'Bdy'))). lubridate::parse_date_time() tries each order against each input and returns the first that parses; it returns POSIXct, so as_date() is needed to obtain Date objects. Inputs that match no format produce NA, allowing you to flag and inspect them. A readr-only alternative needs one readr::parse_date() call per format string, because each call takes a single format.