---
title: "Using the missForest Package"
author: "Daniel J. Stekhoven"
date: "2025-10-22"
output:
  pdf_document:
    number_sections: true
    toc: true
fontsize: 11pt
geometry: "margin=2.5cm, top=3cm, bottom=2.5cm"
lang: en
bibliography: myBib.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Using the missForest Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
# Introduction

This package vignette is a practical, application-focused
user guide for the R package missForest. We’ll walk through
the workflow on real datasets, discuss argument choices with a keen eye
on feasibility and accuracy, and keep
an occasional smile. Don’t be alarmed by the length — most of it is
friendly R output for illustration.
This document is not a theoretical primer on the foundations of the algorithm, nor is it a comparative study. For the theory and evaluations, see @stekhoven11.
# The missForest algorithm (with ranger by default)

missForest is a nonparametric imputation method for basically any kind of tabular data. It handles mixed types (numeric and categorical), nonlinear relations, interactions, and even high dimensionality ($p \gg n$). For each variable with missing values, it fits a random forest on the observed part and predicts the missing part, iterating until a stopping rule is met (or maxiter says "enough").
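In pseudocode, one iteration looks roughly like this (a sketch of the idea, not the package's literal internals):

# Sketch of the missForest idea (not the package's literal code):
# 1. make an initial guess for all missing values (e.g., mean/mode)
# 2. sort the variables by amount of missingness, ascending
# 3. for each variable x_j with missing values:
#      fit a random forest of x_j on all other variables,
#      using only the rows where x_j is observed;
#      predict the missing entries of x_j and plug them in
# 4. repeat step 3 until the imputed values stop changing (or maxiter is hit)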
missForest() now uses the ranger backend by default, for speed and multithreading; the legacy behaviour is available via backend = "randomForest". The out-of-bag (OOB) error from the backend is transformed into an imputation error estimate: one for numeric variables (NRMSE, normalized root mean squared error) and one for factors (PFC, proportion of falsely classified entries). This estimate has been shown to be a good proxy for the true imputation error [@stekhoven11].
# Installation

From CRAN:
install.packages("missForest", dependencies = TRUE)
The default backend (ranger) is used automatically if it is installed; otherwise, or when you request it explicitly, the package falls back to randomForest.
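If you want the same behaviour across machines, you can pin the backend yourself. A minimal sketch using the backend argument mentioned above (your_data is a placeholder):

# Pin the backend explicitly instead of relying on auto-detection (sketch)
backend <- if (requireNamespace("ranger", quietly = TRUE)) "ranger" else "randomForest"
# imp <- missForest(your_data, backend = backend)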
# missForest

We'll start with a small walk-through on iris, sprinkle in performance hints, and then get fancy with parallelization for big jobs. Three datasets appear along the way:

- the iris data, data(iris) [@anderson35];
- the esoph data, data(esoph) [@breslow80];
- the musk data, whose code chunks are not run during the CRAN build (eval = FALSE); see @UCI10 for details.

## missForest in a nutshell

Load the package:
library(missForest)
Create 10% missing values completely at random (not a lifestyle choice we endorse, but very educational):
set.seed(81)
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.00 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.60 1st Qu.:0.300
## Median :5.750 Median :3.000 Median :4.40 Median :1.300
## Mean :5.826 Mean :3.055 Mean :3.78 Mean :1.147
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.10 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.90 Max. :2.500
## NA's :10 NA's :18 NA's :17 NA's :17
## Species
## setosa :46
## versicolor:47
## virginica :44
## NA's :13
##
##
##
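prodNA removes entries completely at random across the whole data frame; a quick sanity check of the overall fraction:

mean(is.na(iris.mis))  # should be close to the requested 0.1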
Impute:
set.seed(81)
iris.imp <- missForest(iris.mis) # default backend = "ranger"
The result is a list with:

- iris.imp$ximp – the imputed data matrix,
- iris.imp$OOBerror – the estimated imputation error(s).

A common gotcha (we've all done it): use iris.imp$ximp (not iris.imp) in subsequent analyses.
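For example (a sketch; any complete-case method works from here):

fit <- lm(Sepal.Length ~ ., data = iris.imp$ximp)  # completed data, no NAs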
iris.imp$OOBerror
## NRMSE PFC
## 0.14266179 0.05109489
Because iris has both numeric and categorical variables,
you see two numbers: NRMSE (numeric) and
PFC (factors). Both are better when closer to
0.
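For reference, the true-error versions of these measures (what mixError computes later in this vignette), following the definitions in @stekhoven11, are easy to write down. A back-of-the-envelope sketch over the originally missing entries:

# NRMSE over missing numeric entries, PFC over missing factor entries (sketch)
num     <- sapply(iris, is.numeric)
mis_num <- is.na(iris.mis[, num])
x_true  <- as.matrix(iris[, num])
x_imp   <- as.matrix(iris.imp$ximp[, num])
nrmse   <- sqrt(mean((x_true[mis_num] - x_imp[mis_num])^2) / var(x_true[mis_num]))
mis_fac <- is.na(iris.mis$Species)
pfc     <- mean(iris$Species[mis_fac] != iris.imp$ximp$Species[mis_fac])
c(NRMSE = nrmse, PFC = pfc)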
If you prefer per-variable diagnostics (for that post-imputation
feature selection debate), use variablewise = TRUE:
imp_var <- missForest(iris.mis, variablewise = TRUE)
imp_var$OOBerror
## MSE MSE MSE MSE PFC
## 0.11881330 0.08162039 0.08368251 0.03025632 0.05839416
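The entries follow the column order of the input data (note that numeric variables are reported as plain MSE here, not NRMSE); labelling them helps readability:

setNames(imp_var$OOBerror, colnames(iris.mis))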
## verbose = TRUE

Want to watch it think? Switch on diagnostics:
set.seed(81)
imp_verbose <- missForest(iris.mis, verbose = TRUE)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1521036 0.05839416
## difference(s): 0.006262057 0.06
## time: 0.023 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1404773 0.05109489
## difference(s): 2.362534e-05 0
## time: 0.022 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1426618 0.05109489
## difference(s): 5.312229e-06 0
## time: 0.022 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1428649 0.05109489
## difference(s): 9.117668e-06 0
## time: 0.021 seconds
imp_verbose$OOBerror
## NRMSE PFC
## 0.14266179 0.05109489
You'll see the estimated error(s), the difference(s) between successive iterations, and the time per iteration. When the differences increase for the first time (tracked separately per variable type), the algorithm stops and returns the previous iteration's imputation.
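For the record, the difference for the numeric part is defined in @stekhoven11 as

$$\Delta_N = \frac{\sum_{j \in \mathbf{N}} \left(\mathbf{X}^{imp}_{new} - \mathbf{X}^{imp}_{old}\right)^2}{\sum_{j \in \mathbf{N}} \left(\mathbf{X}^{imp}_{new}\right)^2},$$

where $\mathbf{N}$ indexes the numeric variables; for factors it is the proportion of imputed entries that changed between the two iterations.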
## maxiter

Sometimes the stopping rule is slow to trigger (data are complicated; it happens). You can cap the run time with maxiter, or deliberately pick an earlier iteration.
set.seed(96)
data(esoph)
esoph.mis <- prodNA(esoph, noNA = 0.05)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.5278622 0.7538558
## difference(s): 0.003172176 0.03787879
## time: 0.023 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.5455244 0.7138864
## difference(s): 0.0005775373 0.003787879
## time: 0.022 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.5573711 0.733632
## difference(s): 0.0002471342 0
## time: 0.022 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.5175258 0.7378849
## difference(s): 6.804298e-05 0
## time: 0.022 seconds
##
## missForest iteration 5 in progress...done!
## estimated error(s): 0.5278932 0.7258845
## difference(s): 0.0001609878 0
## time: 0.022 seconds
esoph.imp$OOBerror
## NRMSE PFC
## 0.5175258 0.7378849
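Here the returned imputation is the one from iteration 4 (its estimated errors match the OOB output above). If the trace convinces you that an even earlier iteration is preferable, rerun with the same seed and maxiter set accordingly; a sketch:

set.seed(96)
esoph.imp2 <- missForest(esoph.mis, maxiter = 2)  # stop after iteration 2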
## ntree and mtry

- Computation time scales linearly with ntree. It defaults to 100; values in the tens often work well.
- mtry = floor(sqrt(p)) is a robust default, but tuning can pay off on complex data.

Demonstration on a bigger matrix (timings only):
# musk <- ... # (not fetched during CRAN build)
# musk.mis <- prodNA(musk, 0.05)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 100)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)
As you might guess, fewer trees → fewer minutes, at a modest cost in error.
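You can get a feel for the trade-off on any dataset; a quick sketch with system.time on the iris example:

system.time(missForest(iris.mis, ntree = 100))  # default
system.time(missForest(iris.mis, ntree = 20))   # faster, slightly less accurate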
## replace = FALSE

If you set replace = FALSE, trees are grown on subsamples drawn without replacement, using about 0.632 * n observations each (otherwise no out-of-bag observations would remain). Sometimes it helps, sometimes not:
set.seed(81)
imp_sub <- missForest(iris.mis, replace = FALSE, verbose = TRUE)
## missForest iteration 1 in progress...done!
## estimated error(s): 0.1546552 0.04379562
## difference(s): 0.006307625 0.06
## time: 0.019 seconds
##
## missForest iteration 2 in progress...done!
## estimated error(s): 0.1423046 0.05109489
## difference(s): 1.459414e-05 0
## time: 0.018 seconds
##
## missForest iteration 3 in progress...done!
## estimated error(s): 0.1444149 0.05109489
## difference(s): 6.297579e-06 0
## time: 0.018 seconds
##
## missForest iteration 4 in progress...done!
## estimated error(s): 0.1426693 0.04379562
## difference(s): 2.360911e-05 0
## time: 0.018 seconds
imp_sub$OOBerror
## NRMSE PFC
## 0.14441492 0.05109489
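Side by side with the earlier run (reusing the objects from above):

rbind(with_replacement = iris.imp$OOBerror, without_replacement = imp_sub$OOBerror)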
## classwt, cutoff, strata, and sampsize

These let you focus the classifier for factor variables (and the sampling for both types). Each is a list with one entry per variable (use NULL or 1 where not applicable).

A quick note on backends: for the cutoff example we explicitly use the legacy randomForest backend. The default ranger backend handles cutoffs by fitting a probability forest and then post-thresholding, but its predict() method requires passing the training data for non-quantile prediction, and a faithful OOB probability-based estimate is more involved to reproduce in a vignette. The randomForest backend natively supports per-class cutoffs and gives a clean, portable example the reader can run without extra plumbing.
# Per-variable sample sizes: numeric variables take a single integer; factors need one value per class
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
imp_ss <- missForest(iris.mis, sampsize = iris.sampsize)
# Per-class cutoffs (factor only). With ranger backend, cutoffs are emulated via probability forests.
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
imp_co <- missForest(iris.mis, cutoff = iris.cutoff, backend = "randomForest")
# Class weights (factor only)
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
imp_cw <- missForest(iris.mis, classwt = iris.classwt)
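Did the cutoffs change anything? A quick cross-tabulation of the imputed Species labels (reusing objects from above):

mis <- is.na(iris.mis$Species)
table(default = iris.imp$ximp$Species[mis], with_cutoff = imp_co$ximp$Species[mis])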
## nodesize and maxnodes

nodesize is a length-2 vector: the first entry is for numeric variables, the second for factors. Our package defaults: c(5, 1) (yes: numeric = 5, factor = 1).

- With backend = "ranger", nodesize maps to min.bucket; maxnodes is ignored (consider ranger's max.depth if needed).
- With backend = "randomForest", both behave as in randomForest.

imp_nodes <- missForest(iris.mis, nodesize = c(5, 1))
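Larger terminal nodes give shallower trees and faster fits, at a possible cost in accuracy; for instance (sketch):

imp_coarse <- missForest(iris.mis, nodesize = c(20, 10))  # coarser, quicker trees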
## xtrue and mixError

If you have a ground truth (or simulate one), supply xtrue to log the true error per iteration. The return value then includes $error.
set.seed(81)
imp_bench <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
## missForest iteration 1 in progress...done!
## error(s): 0.1500059 0.07692308
## estimated error(s): 0.1521036 0.05839416
## difference(s): 0.006262057 0.06
## time: 0.023 seconds
##
## missForest iteration 2 in progress...done!
## error(s): 0.1435611 0.07692308
## estimated error(s): 0.1404773 0.05109489
## difference(s): 2.362534e-05 0
## time: 0.025 seconds
##
## missForest iteration 3 in progress...done!
## error(s): 0.1449059 0.07692308
## estimated error(s): 0.1426618 0.05109489
## difference(s): 5.312229e-06 0
## time: 0.021 seconds
##
## missForest iteration 4 in progress...done!
## error(s): 0.1422927 0.07692308
## estimated error(s): 0.1428649 0.05109489
## difference(s): 9.117668e-06 0
## time: 0.021 seconds
imp_bench$error
## NRMSE PFC
## 0.14490585 0.07692308
# Or compute it later:
err_manual <- mixError(imp_bench$ximp, iris.mis, iris)
err_manual
## NRMSE PFC
## 0.14490585 0.07692308
## parallelize and num.threads

We offer two modes:

- parallelize = "variables": different variables are imputed in parallel using a registered foreach backend. To avoid nested oversubscription, per-variable ranger calls use num.threads = 1 internally.
- parallelize = "forests": a single variable's forest is built with ranger multithreading (set num.threads) or, with randomForest, by combining sub-forests via foreach.
Register a backend first (example with doParallel):
library(doParallel)
registerDoParallel(2)
# Variables mode
imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)
# Forests mode (ranger threading)
imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE, num.threads = 2)
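When you're done, release the workers (doParallel sketch):

stopImplicitCluster()  # undo registerDoParallel(2)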
Which one is faster? It depends on your data and machine. Try both when in doubt (and coffee is brewing).
# Concluding remarks

Imputation with missForest is straightforward, and OOB
errors help you judge quality at a glance. Do remember: imputation does
not add information; it helps retain partially observed
rows for downstream analyses that prefer complete cases. For broader
perspectives, see @schafer97 and @little87.
# Acknowledgements

We thank Steve Weston for contributions regarding parallel computation ideas and tools in the R ecosystem.