Operational Utilities: Setup, Diagnostics, and Pipeline Tracking

Overview

The ops_* functions are a set of lightweight utilities that sit outside the main analysis pipeline. They help you verify your environment before starting, explore data quality, and track how your cohort changes at each processing step.

Function          Purpose
ops_setup()       Check dx CLI, RAP authentication, and R package dependencies
ops_toy()         Generate synthetic UKB-like data for development and testing
ops_na()          Summarise missing values (NA and "") across all columns
ops_snapshot()    Record pipeline checkpoints and track dataset changes

ops_setup() may query the dx CLI and RAP authentication status as part of its health check. All other functions operate entirely locally: ops_toy() creates a new in-memory table and ops_na() only reads its input; ops_snapshot() and its companions track and optionally clean up columns; ops_withdraw() removes withdrawn participants in-place. None of these local functions reads from or writes to RAP storage.


ops_setup() — Environment Health Check

Run ops_setup() once after installing ukbflow to confirm that all required components are in place before starting a real analysis.

library(ukbflow)

ops_setup()
#> ── ukbflow environment check ──────────────────────────────────────────────
#> ℹ ukbflow 0.1.0  |  R 4.4.1  |  2026-03-09
#> ── 1. dx-toolkit ──────────────────────────────────────────────────────────
#> ✔ dx: /usr/local/bin/dx  (dx-toolkit v0.375.0)
#> ── 2. RAP authentication ───────────────────────────────────────────────────
#> ✔ user: evan.zhou
#> ✔ project: project-GXk9...
#> ── 3. R packages ───────────────────────────────────────────────────────────
#> ✔ cli  3.6.3  [core]
#> ✔ data.table  1.15.4  [core]
#> ✔ survival  3.7.0  [assoc_coxph]
#> ✔ forestploter  1.1.1  [plot_forest]
#> ...
#> ───────────────────────────────────────────────────────────────────────────
#> ✔ 15 passed
#> ! 2 optional / warning

For programmatic use (e.g. inside scripts or CI), set verbose = FALSE and inspect the returned list:

result <- ops_setup(verbose = FALSE)
result$summary
#> $pass
#> [1] 15
#> $warn
#> [1] 2
#> $fail
#> [1] 0

# Gate the rest of your script on a clean environment
stopifnot(result$summary$fail == 0)

Individual checks can be disabled when only a subset is needed:

# Check R package dependencies only (skip dx and RAP auth)
ops_setup(check_dx = FALSE, check_auth = FALSE)
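
As a rough illustration of what such a health check involves, the sketch below approximates two of the checks in base R: locating the dx executable and confirming that packages can be loaded. It is not ukbflow's implementation; check_env_sketch() and its return structure are hypothetical and only mirror the pass/fail summary shown above.

```r
# Hypothetical helper, not part of ukbflow: approximate an environment check.
check_env_sketch <- function(pkgs = c("stats", "utils")) {
  # Sys.which() returns "" when the executable is not on the PATH
  dx_path <- unname(Sys.which("dx"))

  # requireNamespace() returns TRUE/FALSE without attaching the package
  pkg_ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

  list(
    dx_found = nzchar(dx_path),
    packages = pkg_ok,
    summary  = list(pass = sum(pkg_ok), fail = sum(!pkg_ok))
  )
}

env <- check_env_sketch()
env$summary$fail == 0   # TRUE when every listed package is installed
```

Returning a plain list keeps the result easy to test with stopifnot(), in the same way as the ops_setup(verbose = FALSE) example above.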

ops_toy() — Synthetic UKB Data

ops_toy() generates a realistic but entirely synthetic dataset that mimics the structure of UKB phenotype data on the RAP. Use it to develop and test derive_*, assoc_*, and plot_* functions without needing real UKB data access.

Cohort scenario

The default "cohort" scenario produces a wide participant-level table that covers all major UKB data domains:

dt <- ops_toy()
#> ✔ ops_toy: 1000 participants | 75 columns | scenario = "cohort" | seed = 42

dim(dt)
#> [1] 1000   75

names(dt)
#>  [1] "eid"          "p31"          "p34"          "p53_i0"
#>  [5] "p21022"       "p21001_i0"    "p20116_i0"    "p1558_i0"
#>  ...

Column groups included:

Group                  Columns
Demographics           eid, p31, p34, p53_i0, p21022
Covariates             p21001_i0, p20116_i0, p1558_i0, p21000_i0, p22189, p54_i0
Genetic PCs            p22009_a1–p22009_a10
Self-report disease    p20002_i0_a0–a4, p20008_i0_a0–a4
Self-report cancer     p20001_i0_a0–a4, p20006_i0_a0–a4
HES                    p41270 (JSON array), p41280_a0–a8
Cancer registry        p40006_i0–i2, p40011_i0–i2, p40012_i0–i2, p40005_i0–i2
Death registry         p40001_i0, p40002_i0_a0–a2, p40000_i0
First occurrence       p131742
GRS columns            grs_bmi, grs_raw, grs_finngen
Messy columns          messy_allna, messy_empty, messy_label

The messy columns deliberately stress-test derive_missing() and ops_na() against common data quality issues (all-NA columns, empty strings, non-standard missing labels).

Feed the output directly into the derive pipeline:

dt <- ops_toy()
dt <- derive_missing(dt)
dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)

Forest scenario

The "forest" scenario returns a results table matching the output of assoc_coxph(), useful for developing and testing plot_forest() without running a real Cox model:

dt_forest <- ops_toy(scenario = "forest")
#> ✔ ops_toy: 24 rows | 11 columns | scenario = "forest" | seed = 42

plot_forest(
  data  = dt_forest[model == "Fully adjusted"],
  est   = dt_forest[model == "Fully adjusted", HR],
  lower = dt_forest[model == "Fully adjusted", CI_lower],
  upper = dt_forest[model == "Fully adjusted", CI_upper]
)

Reproducibility

Results are reproducible by default (seed = 42). Pass seed = NULL for a different dataset on every call:

dt1 <- ops_toy(seed = 1)
dt2 <- ops_toy(seed = 1)
identical(dt1, dt2)   # TRUE

dt_random <- ops_toy(seed = NULL)   # different every call
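
The seed behaviour described above follows a common R pattern: set the RNG seed only when one is supplied. The sketch below illustrates that pattern with a hypothetical toy_sketch() generator; it makes no claim about how ops_toy() is implemented internally.

```r
# Hypothetical generator, not part of ukbflow: seed only when one is given.
toy_sketch <- function(n = 5, seed = 42) {
  if (!is.null(seed)) set.seed(seed)   # seed = NULL leaves the RNG state alone
  data.frame(
    eid = seq_len(n),
    bmi = round(rnorm(n, mean = 27, sd = 4), 1)
  )
}

a <- toy_sketch(seed = 1)
b <- toy_sketch(seed = 1)
identical(a, b)   # TRUE: same seed, same draws
```

Note that calling set.seed() inside a function changes the global RNG state, which is worth keeping in mind when a generator like this is mixed with other random draws in the same session.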

ops_na() — Missing Value Diagnostics

ops_na() scans every column for NA and empty strings (""), returning counts and percentages sorted by missingness. Counting "" as missing is intentional — UKB exports frequently use empty strings as placeholders for absent text values, so ops_na() reports effective missingness rather than a plain is.na() count. It is designed to be called before derive_missing() to understand the data quality profile of a freshly extracted UKB dataset.

dt <- ops_toy()
ops_na(dt)
#> ── ops_na ──────────────────────────────────────────────────────────────────
#> ℹ 1000 rows | 65 columns | threshold = 0%
#> ✖ messy_allna   1000 / 1000  (100.00%)
#> ✖ p41280_a4     1000 / 1000  (100.00%)
#> ✖ p20002_i0_a4   976 / 1000  ( 97.60%)
#> ✖ p131742        916 / 1000  ( 91.60%)
#> ...
#> ────────────────────────────────────────────────────────────────────────────
#> ✖ 41 columns ≥ 10% missing
#> ✔ 24 columns complete (0% missing)

Columns with ≥ 10% missing are flagged in red (✖); those between 0% and 10% in yellow (!). The summary block (totals) is always printed regardless of the threshold setting.
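
To make the idea of effective missingness concrete, the base R sketch below counts NA and "" together for each column and sorts the result by missingness. The helper names is_missing() and na_summary_sketch() are hypothetical, and the code is an approximation of the concept rather than the ops_na() source.

```r
# Hypothetical sketch, not ukbflow's implementation.
is_missing <- function(x) {
  # Empty strings only make sense for character columns
  if (is.character(x)) is.na(x) | x == "" else is.na(x)
}

na_summary_sketch <- function(df) {
  n_na <- vapply(df, function(x) sum(is_missing(x)), integer(1))
  out  <- data.frame(
    column = names(df),
    n_na   = unname(n_na),
    pct_na = round(100 * unname(n_na) / nrow(df), 2)
  )
  out[order(-out$n_na), ]   # most-missing columns first
}

toy <- data.frame(
  id    = 1:4,
  label = c("a", "", NA, "b"),
  allna = NA,
  stringsAsFactors = FALSE
)
res <- na_summary_sketch(toy)
res$column   # "allna" "label" "id"
res$n_na     # 4 2 0
```

A plain is.na() count would report only one missing value in label; treating "" as missing raises it to two, which is the distinction the section describes.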

Controlling CLI output with threshold

Use threshold to silence low-missingness columns from the per-column listing when the dataset has many columns. The summary block and returned data.table are always complete.

# Only list columns with > 50% missing in the console output
ops_na(dt, threshold = 50)

# Suppress almost all per-column lines: only columns with > 99% missing are listed
ops_na(dt, threshold = 99)

Programmatic use

ops_na() returns a data.table invisibly, regardless of threshold:

result <- ops_na(dt, verbose = FALSE)
result
#>           column  n_na pct_na
#>           <char> <int>  <num>
#>  1:  messy_allna  1000  100.0
#>  2:    p41280_a4  1000  100.0
#>  ...

# Identify columns to drop before modelling
cols_to_drop <- result[pct_na > 90, column]
dt[, (cols_to_drop) := NULL]

ops_snapshot() — Pipeline Checkpoints

ops_snapshot() records a lightweight summary of your dataset at each processing step and stores it in the session cache. Each subsequent call automatically computes deltas (Δ) against the previous snapshot, making it easy to track how rows, columns, and missingness change through the pipeline.

Recording snapshots

dt <- ops_toy()
ops_snapshot(dt, label = "raw")
#> ── snapshot: raw ───────────────────────────────────────────────────────────
#>   rows      1,000
#>   cols         65
#>   NA cols      41
#>   size       0.61 MB
#> ────────────────────────────────────────────────────────────────────────────

dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")
#> ── snapshot: after_derive_missing ──────────────────────────────────────────
#>   rows      1,000  (= 0)
#>   cols         65  (= 0)
#>   NA cols      43  (+2)
#>   size       0.61 MB  (= 0)
#> ────────────────────────────────────────────────────────────────────────────

dt <- dt[p31 == "Female"]
ops_snapshot(dt, label = "female_only")
#> ── snapshot: female_only ───────────────────────────────────────────────────
#>   rows        570  (-430)
#>   cols         65  (= 0)
#>   NA cols      43  (= 0)
#>   size       0.36 MB  (-0.25 MB)
#> ────────────────────────────────────────────────────────────────────────────

When label is omitted, snapshots are named snapshot_1, snapshot_2, etc. automatically. Labels should be unique within a session: if the same label is used twice, the history row is appended again but the stored column list is overwritten — which can cause ops_snapshot_cols() and ops_snapshot_diff() to behave unexpectedly.
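
The delta mechanism can be pictured as a small session cache that remembers the previous summary. The sketch below is a simplified, hypothetical version in base R (snap_cache and snapshot_sketch() are illustrative names, and only row and column counts are tracked); the real function records more fields.

```r
# Hypothetical sketch, not ukbflow's implementation: an environment acts as
# the session cache, and each new snapshot is compared with the previous one.
snap_cache <- new.env()
snap_cache$history <- list()

snapshot_sketch <- function(df, label) {
  current <- list(label = label, nrow = nrow(df), ncol = ncol(df))
  n_prev  <- length(snap_cache$history)

  # Deltas are only defined once a previous snapshot exists
  delta <- if (n_prev > 0) {
    previous <- snap_cache$history[[n_prev]]
    c(nrow = current$nrow - previous$nrow, ncol = current$ncol - previous$ncol)
  } else {
    NULL
  }

  snap_cache$history[[n_prev + 1]] <- current
  invisible(delta)
}

df <- data.frame(eid = 1:10, sex = rep(c("Female", "Male"), 5))
snapshot_sketch(df, "raw")
d <- snapshot_sketch(df[df$sex == "Female", ], "female_only")
d   # nrow = -5, ncol = 0
```

Because the cache lives in an environment rather than on disk, it disappears with the R session, which matches the session-scoped behaviour of the snapshot history.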

Viewing the full history

Call ops_snapshot() with no arguments to print and return the complete history data.table:

ops_snapshot()
#> ── ops_snapshot history ────────────────────────────────────────────────────
#>    idx                label timestamp  nrow  ncol n_na_cols size_mb
#>  1:  1                  raw  14:30:01  1000    65        41    0.61
#>  2:  2 after_derive_missing  14:30:05  1000    65        43    0.61
#>  3:  3          female_only  14:30:08   570    65        43    0.36
#> ────────────────────────────────────────────────────────────────────────────

Silent recording

Set verbose = FALSE to record a snapshot without printing anything — useful inside functions or automated scripts:

ops_snapshot(dt, label = "pre_assoc", verbose = FALSE)

Resetting history

ops_snapshot(reset = TRUE)
#> ✔ Snapshot history cleared.

Session scope: the snapshot history lives in ukbflow’s session cache and is cleared when the R session ends or when ops_snapshot(reset = TRUE) is called. It is not written to disk.


Snapshot Helpers

ops_snapshot_cols() — column names at a checkpoint

Returns the column names recorded at a given snapshot label, minus protected columns (eid, sex, age, age_at_recruitment, and any registered via ops_set_safe_cols()). The primary use is building a drop vector after the raw columns are no longer needed.

raw_cols <- ops_snapshot_cols("raw")
# raw_cols is a character vector of droppable column names

Pass keep to protect additional columns beyond the defaults:

raw_cols <- ops_snapshot_cols("raw", keep = "p53_i0")

ops_snapshot_diff() — compare two checkpoints

Returns lists of columns added and removed between two snapshots — useful for auditing what derive_* functions produced.

result <- ops_snapshot_diff("raw", "after_derive_missing")
result$added    # columns added in this step
result$removed  # columns dropped in this step
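
Conceptually, this comparison is a pair of set differences over the stored column names. A minimal base R sketch, using made-up column lists rather than real snapshot data:

```r
# Illustrative column lists; in practice these would come from two snapshots.
cols_before <- c("eid", "p31", "p21001_i0")
cols_after  <- c("eid", "p31", "p21001_i0", "bmi")   # "bmi" is a made-up derived column

diff_sketch <- list(
  added   = setdiff(cols_after, cols_before),
  removed = setdiff(cols_before, cols_after)
)
diff_sketch$added     # "bmi"
diff_sketch$removed   # character(0)
```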

ops_snapshot_remove() — drop raw columns after deriving

Removes the raw columns captured at a snapshot from data, keeping any derived columns added since. Built-in safe columns (eid, etc.) and columns supplied in keep are always retained.

# After deriving, drop the original raw columns
dt <- ops_snapshot_remove(dt, from = "raw")
#> ✔ ops_snapshot_remove: dropped 60 raw columns, 15 remaining.

For data.table input the operation is by reference (in-place); for data.frame input a new data.table is returned and the original is not modified.
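
The interaction between raw columns, safe columns, and keep can be sketched in base R as follows. remove_raw_sketch() is a hypothetical helper that always returns a new data.frame; it does not reproduce the by-reference behaviour described above for data.table input.

```r
# Hypothetical sketch, not ukbflow's implementation.
remove_raw_sketch <- function(df, raw_cols, safe = "eid", keep = character()) {
  to_drop <- setdiff(raw_cols, c(safe, keep))      # protected columns are never dropped
  df[, !(names(df) %in% to_drop), drop = FALSE]    # returns a new data.frame
}

df <- data.frame(eid = 1:3, p21001_i0 = c(22.1, 30.4, 27.8), p53_i0 = "2008-01-01")
df$bmi_cat <- cut(df$p21001_i0, breaks = c(0, 25, 30, Inf))   # a derived column

out <- remove_raw_sketch(df, raw_cols = c("eid", "p21001_i0", "p53_i0"), keep = "p53_i0")
names(out)   # "eid" "p53_i0" "bmi_cat"
```

The derived column survives because it was never part of the raw column list, while eid and p53_i0 survive because they are protected.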

ops_set_safe_cols() — register study-specific protected columns

Adds column names to the session safe list so they are never dropped by ops_snapshot_cols() or ops_snapshot_remove().

ops_set_safe_cols(c("date_baseline", "age_at_recruitment"))

# Clear registered safe cols
ops_set_safe_cols(reset = TRUE)

ops_withdraw() — Exclude Withdrawn Participants

UK Biobank periodically issues withdrawal files listing participants who have revoked consent. ops_withdraw() reads the headerless single-column CSV supplied by UKB and removes matching rows from your dataset. Two snapshots (before_withdraw / after_withdraw) are recorded automatically.

dt <- ops_withdraw(dt, file = "w854944_20260310.csv")
#> ── snapshot: before_withdraw ───────────────────────────────────────────────
#>   rows      502,492
#>   ...
#> ── snapshot: after_withdraw ────────────────────────────────────────────────
#>   rows      502,489  (-3)
#>   ...
#> ℹ Withdrawal file: w854944_20260310.csv (312 IDs)
#> ✖ Excluded: 3 participants found in data
#> ✔ Remaining: 502,489 participants

Run this immediately after loading your extracted dataset, before any derive_* steps, so withdrawn participants never enter the analysis.
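
The core of the exclusion step is a filter against a headerless, single-column ID file. The base R sketch below uses a temporary stand-in file and invented participant IDs; it illustrates the logic only and is not the ops_withdraw() source.

```r
# Hypothetical sketch, not ukbflow's implementation.
# Create a stand-in withdrawal file: one participant ID per line, no header.
withdraw_file <- tempfile(fileext = ".csv")
writeLines(c("1000003", "1000007", "9999999"), withdraw_file)

cohort <- data.frame(eid = 1000001:1000008, p31 = rep(c("Female", "Male"), 4))

# header = FALSE because the file has no column names
withdrawn <- read.csv(withdraw_file, header = FALSE, col.names = "eid")$eid

n_before <- nrow(cohort)
cohort   <- cohort[!(cohort$eid %in% withdrawn), ]
n_before - nrow(cohort)   # 2: only IDs present in the data are removed
```

As in the example output above, the withdrawal file can list more IDs than are actually found in the dataset, so the number excluded is usually smaller than the number of IDs in the file.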


Typical Workflow

The four main ops_* functions bracket the core pipeline:

library(ukbflow)

# 1. Verify environment before starting
ops_setup()

# 2. Generate test data (or extract real data from RAP)
dt <- ops_toy()

# 3. Inspect data quality before processing
ops_na(dt)

# 4. Run pipeline with checkpoints
ops_snapshot(dt, label = "raw")

dt <- derive_missing(dt)
ops_snapshot(dt, label = "after_derive_missing")

dt <- derive_covariate(dt,
  as_numeric = "p21001_i0",
  as_factor  = c("p31", "p20116_i0")
)
ops_snapshot(dt, label = "after_derive_covariate")

# 5. Review full pipeline history
ops_snapshot()

Getting Help