---
title: "Using the missForest Package"
author: "Daniel J. Stekhoven"
date: "2025-10-22"
output:
  pdf_document:
    number_sections: true
    toc: true
fontsize: 11pt
geometry: margin=2.5cm, top=3cm, bottom=2.5cm
lang: en
bibliography: myBib.bib
link-citations: true
vignette: >
  %\VignetteIndexEntry{Using the missForest Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

Introduction

What is this document? (And what it isn’t!)

This package vignette is a practical, application-focused user guide for the R package missForest. We’ll walk through the workflow on real datasets, discuss argument choices with a keen eye on feasibility and accuracy, and keep an occasional smile. Don’t be alarmed by the length — most of it is friendly R output for illustration.

This document is not a theoretical primer on the foundations of the algorithm, nor is it a comparative study. For the theory and evaluations, see @stekhoven11.

The missForest algorithm (with ranger by default)

missForest is a nonparametric imputation method for basically any kind of tabular data. It handles mixed types (numeric + categorical), nonlinear relations, interactions, and even high dimensionality ($p \gg n$). For each variable with missingness, it fits a random forest on the observed part and predicts the missing part, iterating until a stopping rule is met (or maxiter says “enough”).

  • By default, missForest() now uses the ranger backend for speed and multithreading.
  • For legacy/compatibility, you can select the classic randomForest backend via backend = "randomForest".
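
To make the loop concrete, here is a minimal sketch for numeric-only data, written directly against ranger. It is an illustration under simplifying assumptions (fixed number of sweeps, no stopping rule; the helper name sketch_impute is ours), not the package implementation:

library(ranger)

sketch_impute <- function(X, sweeps = 3, num.trees = 100) {
  X <- as.data.frame(X)
  na_idx <- is.na(X)
  # Initialize: mean-impute every column
  for (j in seq_along(X)) X[na_idx[, j], j] <- mean(X[[j]], na.rm = TRUE)
  for (s in seq_len(sweeps)) {
    # Visit variables in order of increasing missingness
    for (j in order(colSums(na_idx))) {
      if (!any(na_idx[, j])) next
      obs <- !na_idx[, j]
      # Fit a forest on the observed part of variable j ...
      fit <- ranger(dependent.variable.name = names(X)[j],
                    data = X[obs, , drop = FALSE], num.trees = num.trees)
      # ... and predict its missing part
      X[!obs, j] <- predict(fit, X[!obs, , drop = FALSE])$predictions
    }
  }
  X
}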

The out-of-bag (OOB) error from the backend is transformed into an imputation error estimate — one for numeric variables (NRMSE) and one for factors (PFC). This estimate has been shown to be a good proxy for the true imputation error [@stekhoven11].
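
For reference, the definitions from @stekhoven11, with mean and variance computed over the missing entries only:

$$
\mathrm{NRMSE} = \sqrt{\frac{\operatorname{mean}\bigl((\mathbf{X}^{\mathrm{true}} - \mathbf{X}^{\mathrm{imp}})^2\bigr)}{\operatorname{var}\bigl(\mathbf{X}^{\mathrm{true}}\bigr)}},
\qquad
\mathrm{PFC} = \frac{\#\{\text{falsely classified entries}\}}{\#\{\text{missing entries in factor variables}\}}
$$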

Installation

From CRAN:

install.packages("missForest", dependencies = TRUE)

The default backend (ranger) is used automatically if it is installed; the classic randomForest backend remains available via backend = "randomForest".
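
If ranger is not yet on your machine, it installs the same way:

install.packages("ranger")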


Missing value imputation with missForest

We’ll start with a small walk-through on iris, sprinkle in performance hints, and then get fancy with parallelization for big jobs.

Data for illustrations

  • Iris: Five variables, one categorical with three levels; base R: data(iris) [@anderson35].
  • Esoph: Oesophageal cancer case-control study; base R: data(esoph) [@breslow80].
  • Musk: A larger, high-dimensional example used only for timing demonstrations. Because CRAN build machines shouldn’t access the internet, the corresponding chunks are eval = FALSE. See @UCI10 for details.

missForest in a nutshell

Load the package:

library(missForest)

Create 10% missing values completely at random (not a lifestyle choice we endorse, but very educational):

set.seed(81)
data(iris)
iris.mis <- prodNA(iris, noNA = 0.1)
summary(iris.mis)
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.00   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.60   1st Qu.:0.300  
##  Median :5.750   Median :3.000   Median :4.40   Median :1.300  
##  Mean   :5.826   Mean   :3.055   Mean   :3.78   Mean   :1.147  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.10   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.90   Max.   :2.500  
##  NA's   :10      NA's   :18      NA's   :17     NA's   :17     
##        Species  
##  setosa    :46  
##  versicolor:47  
##  virginica :44  
##  NA's      :13  
##                 
##                 
## 
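
A quick sanity check on the amount of missingness never hurts; mean() over the logical NA matrix gives the fraction of missing cells:

mean(is.na(iris.mis))  # should be close to the requested 0.1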

Impute:

set.seed(81)
iris.imp <- missForest(iris.mis)  # default backend = "ranger"

The result is a list with:

  • iris.imp$ximp – the imputed data matrix,
  • iris.imp$OOBerror – estimated imputation error(s).

A common gotcha (we’ve all done it): use iris.imp$ximp (not iris.imp) in subsequent analyses.
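
So, to extract the completed data:

iris.comp <- iris.imp$ximp  # the completed data frame
head(iris.comp)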

iris.imp$OOBerror
##      NRMSE        PFC 
## 0.14266179 0.05109489

Because iris has both numeric and categorical variables, you see two numbers: NRMSE (numeric) and PFC (factors). Both are better when closer to 0.

If you prefer per-variable diagnostics (for that post-imputation feature selection debate), use variablewise = TRUE:

imp_var <- missForest(iris.mis, variablewise = TRUE)
imp_var$OOBerror
##        MSE        MSE        MSE        MSE        PFC 
## 0.11881330 0.08162039 0.08368251 0.03025632 0.05839416
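
The entries follow the column order of the input, so labelling them is a one-liner:

setNames(imp_var$OOBerror, names(iris.mis))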

Additional iteration output with verbose = TRUE

Want to watch it think? Switch on diagnostics:

set.seed(81)
imp_verbose <- missForest(iris.mis, verbose = TRUE)
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.1521036 0.05839416 
##     difference(s): 0.006262057 0.06 
##     time: 0.023 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.1404773 0.05109489 
##     difference(s): 2.362534e-05 0 
##     time: 0.022 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.1426618 0.05109489 
##     difference(s): 5.312229e-06 0 
##     time: 0.022 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.1428649 0.05109489 
##     difference(s): 9.117668e-06 0 
##     time: 0.021 seconds
imp_verbose$OOBerror
##      NRMSE        PFC 
## 0.14266179 0.05109489

You’ll see estimated error(s), difference(s) between iterations, and time per iteration. As soon as the differences increase (with respect to both variable types, if present), the algorithm stops and returns the previous iteration’s imputation.
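
The difference measures are those of @stekhoven11, for the sets $\mathcal{N}$ of numeric and $\mathcal{F}$ of factor variables:

$$
\Delta_N = \frac{\sum_{j \in \mathcal{N}} \bigl(\mathbf{X}_{\mathrm{new}}^{\mathrm{imp}} - \mathbf{X}_{\mathrm{old}}^{\mathrm{imp}}\bigr)_j^2}{\sum_{j \in \mathcal{N}} \bigl(\mathbf{X}_{\mathrm{new}}^{\mathrm{imp}}\bigr)_j^2},
\qquad
\Delta_F = \frac{\sum_{j \in \mathcal{F}} \sum_{i=1}^{n} \mathbf{1}\bigl\{(\mathbf{X}_{\mathrm{new}}^{\mathrm{imp}})_{ij} \neq (\mathbf{X}_{\mathrm{old}}^{\mathrm{imp}})_{ij}\bigr\}}{\#\mathrm{NA}_{\mathcal{F}}}
$$

where $\#\mathrm{NA}_{\mathcal{F}}$ is the number of missing entries in the factor variables.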

Changing the number of iterations: maxiter

Sometimes the stopping rule is slow to trigger (data are complicated; it happens). You can cap the number of iterations with maxiter, or deliberately stop at an earlier iteration.

set.seed(96)
data(esoph)
esoph.mis <- prodNA(esoph, noNA = 0.05)
esoph.imp <- missForest(esoph.mis, verbose = TRUE, maxiter = 6)
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.5278622 0.7538558 
##     difference(s): 0.003172176 0.03787879 
##     time: 0.023 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.5455244 0.7138864 
##     difference(s): 0.0005775373 0.003787879 
##     time: 0.022 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.5573711 0.733632 
##     difference(s): 0.0002471342 0 
##     time: 0.022 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.5175258 0.7378849 
##     difference(s): 6.804298e-05 0 
##     time: 0.022 seconds
## 
##   missForest iteration 5 in progress...done!
##     estimated error(s): 0.5278932 0.7258845 
##     difference(s): 0.0001609878 0 
##     time: 0.022 seconds
esoph.imp$OOBerror
##     NRMSE       PFC 
## 0.5175258 0.7378849

Speed/accuracy trade-off: ntree and mtry

  • Computation time scales linearly with ntree. The default is 100; values in the tens often work well.
  • mtry = floor(sqrt(p)) is a robust default, but tuning can pay off on complex data.
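
On iris, trading trees for speed looks like this (the values are illustrative, not tuned):

set.seed(81)
imp_fast <- missForest(iris.mis, ntree = 20, mtry = 2)
imp_fast$OOBerror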

Demonstration on a bigger matrix (timings only):

# musk <- ...  # (not fetched during CRAN build)
# musk.mis <- prodNA(musk, 0.05)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 100)
# missForest(musk.mis, verbose = TRUE, maxiter = 3, ntree = 20)

As you might guess, fewer trees → fewer minutes, at a modest cost in error.

Subsampling instead of bootstrapping: replace = FALSE

If you set replace = FALSE, each tree is grown on a subsample of about 0.632 * n observations (otherwise no OOB observations would be left). Sometimes it helps, sometimes not:

set.seed(81)
imp_sub <- missForest(iris.mis, replace = FALSE, verbose = TRUE)
##   missForest iteration 1 in progress...done!
##     estimated error(s): 0.1546552 0.04379562 
##     difference(s): 0.006307625 0.06 
##     time: 0.019 seconds
## 
##   missForest iteration 2 in progress...done!
##     estimated error(s): 0.1423046 0.05109489 
##     difference(s): 1.459414e-05 0 
##     time: 0.018 seconds
## 
##   missForest iteration 3 in progress...done!
##     estimated error(s): 0.1444149 0.05109489 
##     difference(s): 6.297579e-06 0 
##     time: 0.018 seconds
## 
##   missForest iteration 4 in progress...done!
##     estimated error(s): 0.1426693 0.04379562 
##     difference(s): 2.360911e-05 0 
##     time: 0.018 seconds
imp_sub$OOBerror
##      NRMSE        PFC 
## 0.14441492 0.05109489

Imbalanced data & sampling controls: classwt, cutoff, strata, sampsize

These let you focus the classifier for factor variables (and sampling for both types). Each is a list with one entry per variable (use NULL/1 where not applicable).

A quick note on backends: for the cutoff example we explicitly use the legacy randomForest backend. The default ranger backend handles cutoffs by fitting a probability forest and then post-thresholding, but its predict() method requires passing the training data for non-quantile prediction, and a faithful OOB probability–based estimate is more involved to reproduce in a vignette. The randomForest backend natively supports per-class cutoffs and gives a clean, portable example the reader can run without extra plumbing.

# Per-variable samples: numeric use single integers; factors need a vector per class
iris.sampsize <- list(12, 12, 12, 12, c(10, 15, 10))
imp_ss <- missForest(iris.mis, sampsize = iris.sampsize)

# Per-class cutoffs (factor only). With ranger backend, cutoffs are emulated via probability forests.
iris.cutoff <- list(1, 1, 1, 1, c(0.3, 0.6, 0.1))
imp_co <- missForest(iris.mis, cutoff = iris.cutoff, backend = "randomForest")

# Class weights (factor only)
iris.classwt <- list(NULL, NULL, NULL, NULL, c(10, 30, 20))
imp_cw <- missForest(iris.mis, classwt = iris.classwt)

Tree shape controls: nodesize and maxnodes

  • nodesize is a length-2 vector: the first entry applies to numeric variables, the second to factors. The package defaults are c(5, 1) (yes: numeric = 5, factor = 1).
  • With backend = "ranger", nodesize maps to min.bucket; maxnodes is ignored (consider ranger’s max.depth if needed).
  • With backend = "randomForest", both behave as in randomForest.

imp_nodes <- missForest(iris.mis, nodesize = c(5, 1))
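
A maxnodes example consequently needs the legacy backend (the value 20 is purely illustrative):

imp_mn <- missForest(iris.mis, maxnodes = 20, backend = "randomForest")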

Benchmarking with a complete matrix: xtrue and mixError

If you have a ground truth (or simulate one), supply xtrue to log the true error per iteration. The return value then includes $error.

set.seed(81)
imp_bench <- missForest(iris.mis, xtrue = iris, verbose = TRUE)
##   missForest iteration 1 in progress...done!
##     error(s): 0.1500059 0.07692308 
##     estimated error(s): 0.1521036 0.05839416 
##     difference(s): 0.006262057 0.06 
##     time: 0.023 seconds
## 
##   missForest iteration 2 in progress...done!
##     error(s): 0.1435611 0.07692308 
##     estimated error(s): 0.1404773 0.05109489 
##     difference(s): 2.362534e-05 0 
##     time: 0.025 seconds
## 
##   missForest iteration 3 in progress...done!
##     error(s): 0.1449059 0.07692308 
##     estimated error(s): 0.1426618 0.05109489 
##     difference(s): 5.312229e-06 0 
##     time: 0.021 seconds
## 
##   missForest iteration 4 in progress...done!
##     error(s): 0.1422927 0.07692308 
##     estimated error(s): 0.1428649 0.05109489 
##     difference(s): 9.117668e-06 0 
##     time: 0.021 seconds
imp_bench$error
##      NRMSE        PFC 
## 0.14490585 0.07692308
# Or compute it later:
err_manual <- mixError(imp_bench$ximp, iris.mis, iris)
err_manual
##      NRMSE        PFC 
## 0.14490585 0.07692308

Parallelization: parallelize and num.threads

We offer two modes:

  1. parallelize = "variables" Different variables are imputed in parallel using a registered foreach backend. To avoid nested oversubscription, per-variable ranger calls use num.threads = 1 internally.

  2. parallelize = "forests" A single variable’s forest is built with ranger multithreading (set num.threads) or, with randomForest, by combining sub-forests via foreach.

Register a backend first (example with doParallel):

library(doParallel)
registerDoParallel(2)
# Variables mode
imp_vars <- missForest(iris.mis, parallelize = "variables", verbose = TRUE)

# Forests mode (ranger threading)
imp_fors <- missForest(iris.mis, parallelize = "forests", verbose = TRUE, num.threads = 2)

Which one is faster? It depends on your data and machine. Try both when in doubt (and coffee is brewing).
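
Once you are done, release the workers doParallel spun up:

stopImplicitCluster()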


Concluding remarks

Imputation with missForest is straightforward, and OOB errors help you judge quality at a glance. Do remember: imputation does not add information; it helps retain partially observed rows for downstream analyses that prefer complete cases. For broader perspectives, see @schafer97 and @little87.

We thank Steve Weston for contributions regarding parallel computation ideas and tools in the R ecosystem.

References