Species Distribution Modeling with NCEAS Data

Julien Vollering

Introduction

This vignette provides a practical introduction to species distribution modeling using the MIAmaxent package. Primarily, it demonstrates the main functionality of the package. Secondarily, it examines some central issues around model selection and evaluation. It is designed to be suitable as a learning exercise, including for those with limited experience in R.

We will model the distribution of a tree species in Switzerland using occurrence data and environmental predictors from the disdat package. These data are part of a benchmark dataset for species distribution modeling, which was compiled by the National Center for Ecological Analysis and Synthesis (NCEAS) and first used in Elith et al. (2006). The full dataset is described in detail in Elith et al. (2020).

Setup Swiss tree data

First, load the required packages:

library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(ggplot2)
library(terra)
library(sf)
library(disdat)
library(MIAmaxent)

The disdat package contains distribution data for 30 tree species in Switzerland, along with 13 environmental variables. We’ll work with one species, but you can easily switch to another.
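
To see which anonymized species codes are available, you can list them from the presence-only table (a quick aside, not part of the original workflow; the spid column is also used further below):

# List the anonymized species codes in the Swiss presence-only data
unique(disPo("SWI")$spid)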

# Choose a species (change this to model a different species!)
selected_species <- "swi01" # `disdat` has anonymized species codes: swi01 to swi30

# Define predictor variables
predictors <- c(
  "bcc", "calc", "ccc", "ddeg", "nutri", "pday",
  "precyy", "sfroyy", "slope", "sradyy", "swb", "tavecc", "topo"
)

# Extract presence-only (PO) data for the selected species
swi_po <- disPo("SWI") # function from `disdat` package
po_records <- swi_po[swi_po$spid == selected_species, ]
cat("Number of presence records:", nrow(po_records), "\n")
#> Number of presence records: 482

# Background data (10,000 random locations)
background <- disBg("SWI") # function from `disdat` package

# Combine presence and background into training data frame
# RV = 1 for presences, NA for background
train_data <- rbind(
  data.frame(RV = 1, po_records[, predictors]),
  data.frame(RV = NA, background[, predictors])
)

cat("Training presences:", sum(train_data$RV == 1, na.rm = TRUE), "\n")
#> Training presences: 482

Let’s briefly plot the coordinates of the presence records to get a sense of their spatial distribution across the study area.

studyarea <- disBorder("SWI")
po_sf <- st_as_sf(po_records, coords = c("x", "y"), crs = 21781)
ggplot() +
  geom_sf(data = studyarea, fill = "lightgrey") +
  geom_sf(data = po_sf, color = "blue", size = 0.5) +
  labs(
    title = paste("Presence records for species", selected_species),
    x = "Easting (m)", y = "Northing (m)"
  ) +
  theme_minimal()

The data include 13 environmental predictors:

Variable  Description                            Type         Units
bcc       Broadleaved continuous cover           Continuous   % cover
calc      Bedrock strictly calcareous            Categorical  0/1
ccc       Coniferous continuous cover            Continuous   % cover
ddeg      Growing degree-days above 0°C          Continuous   °C·days
nutri     Soil nutrients index                   Continuous   D mval/cm²
pday      Days with rainfall > 1 mm              Continuous   days
precyy    Average yearly precipitation sum       Continuous   mm
sfroyy    Summer frost frequency                 Continuous   days
slope     Slope                                  Continuous   degrees ×10
sradyy    Potential yearly global radiation      Continuous   kJ·m⁻²·day⁻¹
swb       Site water balance                     Continuous   mm
tavecc    Average temperature of coldest month   Continuous   °C
topo      Topographic position                   Continuous   dimensionless

# Rename columns to more informative names
col_name_map <- c(
  RV = "RV",
  bcc = "broadleaf.cover",
  calc = "bedrock.calcareous",
  ccc = "conifer.cover",
  ddeg = "growing.degree.days",
  nutri = "soil.nutrients",
  pday = "precip.days",
  precyy = "annual.precip",
  sfroyy = "summer.frost.freq",
  slope = "slope",
  sradyy = "solar.radiation",
  swb = "water.balance",
  tavecc = "temp.coldest.month",
  topo = "topographic.position"
)
names(train_data) <- col_name_map[names(train_data)]

# Convert the categorical variable to factor
train_data$bedrock.calcareous <- as.factor(train_data$bedrock.calcareous)

Examine occurrence–environment relationships

Before building a model, we should explore how environmental variables relate to species occurrence. The plotFOP() function shows Frequency of Observed Presence across the range of each variable. For example:

plotFOP(train_data, "temp.coldest.month")

The points show how frequently presences are observed at different values of the variable, and the red line is a smoothed trend across these. The grey distribution shows the density of values in the training data.

Let’s examine all continuous explanatory variables. We’ll extract the FOP data and create a faceted plot to compare them side by side. The vignette’s own code for this is not shown, but it gets FOP data for each variable and plots these using ggplot2 rather than base R graphics (the default); a rough sketch of the same idea follows.
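
This sketch does not rely on plotFOP()’s return value; instead it bins each continuous variable manually (20 equal-width bins, an arbitrary choice), treats background rows (RV = NA) as non-presences, and facets the per-bin frequency of observed presence with ggplot2.

# Sketch: approximate FOP curves for all continuous EVs, faceted side by side
continuous_EVs <- setdiff(names(train_data), c("RV", "bedrock.calcareous"))

fop_data <- train_data %>%
  mutate(presence = ifelse(is.na(RV), 0, RV)) %>% # background rows count as 0
  pivot_longer(all_of(continuous_EVs), names_to = "EV", values_to = "value") %>%
  group_by(EV) %>%
  mutate(bin = cut(value, breaks = 20)) %>% # 20 equal-width bins per EV
  group_by(EV, bin) %>%
  summarise(
    mean_value = mean(value),
    FOP = mean(presence), # frequency of observed presence in the bin
    .groups = "drop"
  )

ggplot(fop_data, aes(x = mean_value, y = FOP)) +
  geom_point(size = 0.8) +
  geom_smooth(se = FALSE, color = "red", linewidth = 0.7) +
  facet_wrap(~EV, scales = "free_x") + # shared Y-axis eases comparison
  labs(x = "Explanatory variable value", y = "Frequency of observed presence") +
  theme_minimal()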

By using a shared Y-axis, we can more easily compare the strength of each variable’s relationship with species occurrence. Variables with larger FOP ranges (along the Y-axis) typically have greater explanatory power.

Exercise: Which environmental variables appear most strongly related to the species’ occurrence? How would you describe the species’ relationships to these variables in ecological terms?

Transform and select environmental variables

Transform to derived variables

To model different types of occurrence–environment relationships, we transform the original explanatory variables (EVs) into “derived variables” (DVs). This allows us to fit various linear, unimodal, and step-like responses.

# Create derived variables using multiple transformation types
# L = Linear, M = Monotonic, D = Deviation (unimodal),
# HF = Forward hinge, HR = Reverse hinge, T = Threshold
train_DVs <- deriveVars(train_data,
  transformtype = c("L", "M", "D", "HF", "HR", "T"),
  allsplines = TRUE,
  quiet = TRUE
)
# Example: some of the DVs created for temperature
head(names(train_DVs$dvdata$temp.coldest.month))
#> [1] "temp.coldest.month_L"   "temp.coldest.month_M"   "temp.coldest.month_D05"
#> [4] "temp.coldest.month_D1"  "temp.coldest.month_D2"  "temp.coldest.month_HF1"

DV names indicate the transformation type. For example, temp.coldest.month_D05 is a deviation transformation (deviation from the optimum, with exponent 0.5), while temp.coldest.month_HF1 is a forward hinge transformation (with the knot at the first position).

With derived variables created, we can proceed to variable selection, where we choose which DVs to include in the final model. Variable selection in MIAmaxent follows a forward stepwise procedure, where models with additional variables are compared to simpler models. The selection process is broken down into two stages.

Selection stage 1: Select DVs for each EV

# Select the best DVs for each individual EV
train_DV_select <- selectDVforEV(train_DVs$dvdata,
  alpha = 0.001,
  quiet = TRUE
)

The alpha = 0.001 parameter sets a threshold: only DVs explaining significant variation at this level are retained.

# Example: DVs selected for temperature
names(train_DV_select$dvdata$temp.coldest.month)
#> [1] "temp.coldest.month_D2"

Selection stage 2: Select EVs for the final model

# Select which EVs (represented by their best DVs) to include
train_model <- selectEV(train_DV_select$dvdata,
  alpha = 0.001,
  interaction = FALSE,
  quiet = TRUE
)

# View the selection trail, only the top model from each round
selection_trail <- train_model$selection[!duplicated(train_model$selection$round), ]
knitr::kable(selection_trail, digits = c(0, 0, 0, 3, 1, 0, 3)) # Print the data.frame as a table
   round  variables                                                      m    Dsq  Chisq  df  P
1      1  broadleaf.cover                                                3  0.097  815.1   3  0
13     2  broadleaf.cover + growing.degree.days                          5  0.127  246.0   2  0
24     3  broadleaf.cover + growing.degree.days + slope                  6  0.138   97.2   1  0
33     4  broadleaf.cover + growing.degree.days + slope + precip.days    8  0.146   62.3   2  0

In the selection trail, variables indicates which explanatory variables are represented in the model, while m indicates the number of derived variables. Dsq is the fraction of deviance explained, ranging from 0 to 1. Chisq, df, and P summarize the test comparing each model against the best model from the previous round (the null model in round 1).

# View the final model
train_model$selectedmodel$formula
#> RV ~ broadleaf.cover_HR6 + broadleaf.cover_M + broadleaf.cover_D1 + 
#>     growing.degree.days_HR12 + growing.degree.days_D05 + slope_HR8 + 
#>     precip.days_HF16 + precip.days_T7
#> <environment: 0x000001f8982b6158>

Exercise: Compare the selection trail to the formula of the selected model. Do you understand how they correspond?

Explore the model with response curves

Response curves show how the model predicts species occurrence across the range of each variable (holding others constant).

# Get the most important variable (first selected)
top_EV <- train_model$selection$variables[1]

# Plot response curve for the most important variable
plotResp(
  train_model$selectedmodel,
  train_DVs$transformations,
  top_EV
)

Notice that the Y-axis in the response curve is on a relative scale, bounded below by zero but unbounded above: with presence-only data, the model can only estimate relative suitability, not absolute probability of occurrence. Now compare the above response curve with the FOP plot to see how the model captures the observed pattern:

plotFOP(train_data, top_EV)

Exercise: Create response curves for the other variables in your model. Compare the shapes of the response curves to the formula of the model. Do you see how the transformations correspond to the shapes?
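
If you want a shortcut for that exercise, here is one possible sketch: it recovers the EV names from the DV names in the model formula (the renamed EVs contain no underscores, so stripping the suffix after the last underscore leaves the EV name) and loops plotResp() over them.

# Sketch: response curves for every EV in the selected model
dv_names <- labels(terms(train_model$selectedmodel$formula))
ev_names <- unique(sub("_[^_]+$", "", dv_names)) # e.g. "broadleaf.cover_HR6" -> "broadleaf.cover"
for (ev in ev_names) {
  plotResp(train_model$selectedmodel, train_DVs$transformations, ev)
}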

Evaluate the model with AUC

To evaluate how well the model performs, we will use an independent validation set (presence-absence data).

# Extract presence-absence (PA) data
swi_pa <- disPa("SWI") # function from `disdat` package

# Merge with environmental data by location
swi_env <- disEnv("SWI") # function from `disdat` package
pa_merged <- merge(swi_pa, swi_env)

# Create validation data frame with RV and environmental variables (parallel to train_data)
validation_data <- data.frame(
  RV = pa_merged[, selected_species],
  pa_merged[, predictors]
)

# Rename columns to match training data
names(validation_data) <- col_name_map[names(validation_data)]

# Convert the categorical variable to factor
validation_data$bedrock.calcareous <- as.factor(validation_data$bedrock.calcareous)

cat("Validation presences:", sum(validation_data$RV == 1), "\n")
#> Validation presences: 107
cat("Validation absences:", sum(validation_data$RV == 0), "\n")
#> Validation absences: 9906
pa_sf <- st_as_sf(swi_pa, coords = c("x", "y"), crs = 21781)
# Convert to factor for discrete colors (presence/absence)
pa_sf[[selected_species]] <- as.factor(pa_sf[[selected_species]])
ggplot() +
  geom_sf(data = studyarea, fill = "lightgrey") +
  geom_sf(data = pa_sf, aes(color = .data[[selected_species]]), size = 0.5) +
  scale_color_manual(
    values = c("0" = "red", "1" = "blue"),
    labels = c("0" = "Absence", "1" = "Presence"),
    name = "Record type"
  ) +
  labs(
    title = paste("Presence-absence records for species", selected_species),
    x = "Easting (m)", y = "Northing (m)"
  ) +
  theme_minimal()

Using the independent validation set (presence-absence data), we will calculate a commonly used discrimination metric: AUC (Area Under the Curve of the Receiver Operating Characteristic). AUC measures how well the model is able to separate presences and absences. Values typically range from 0.5 (no better than random) to 1.0 (perfect discrimination).

# Calculate validation AUC
auc_result <- testAUC(
  model = train_model$selectedmodel,
  transformations = train_DVs$transformations,
  data = validation_data
)

The ROC plot shows false positive rate vs. true positive rate at different thresholds — low thresholds in the upper right corner, and high thresholds in the lower left corner.

Exercise: How does your model perform? If it is important that the model identifies most presences (high sensitivity), would you choose a high or low threshold? What are the trade-offs?

Summary and extended exercises

What we’ve learned:

  1. How to explore species-environment relationships with FOP plots
  2. How to transform variables to fit different response shapes
  3. How to select variables using nested model comparison
  4. How to evaluate model performance with an independent validation set (presence-absence data)

Exercise: Use the testAUC() function again, but this time supply the presence-only data as the evaluation data. How does the AUC change compared to when it is calculated on the independent validation set (presence-absence data)? Why might this be?

Exercise: Experiment with the alpha parameter in selectDVforEV() and selectEV(). Try alpha = 0.01 or alpha = 0.0001. How does this affect the number of variables selected and the model complexity?

Advanced: The bias-variance tradeoff in model selection

This section explores a fundamental challenge in statistical modeling: balancing model complexity to optimize predictive performance (Hastie, Tibshirani, and Friedman 2009). Simple models (high bias) may fail to capture important patterns, while complex models (high variance) may fit noise in the training data and perform poorly on new data.

In MIAmaxent, this tradeoff is controlled primarily by the alpha parameter. Higher alpha values permit more complex models with more variables, while lower values enforce simplicity. We’ll systematically compare three alpha levels to understand their effects on model performance.

Terminology note. In this section we use standard machine learning terminology for data division: the training set (70% of the presence-background data) is used to fit the models, the test set (the remaining 30%) is held out for internal evaluation, and the independent presence-absence data serve as an external validation set.

Set up the alpha comparison

First, we’ll create a data structure to organize our comparison:

# Create a tibble to store results for three alpha levels
model_comparison <- tibble(
  alpha = c(1e-1, 1e-3, 1e-9)
)

We will split our presence-background data into training (70%) and test (30%) sets using random partitioning, so that we can evaluate internal performance. The model will not see the 30% test set during training, but this test set is nonetheless part of the same dataset and likely contains similar biases. The independent validation set (presence-absence data), on the other hand, provides a more stringent test of generalization.

set.seed(123) # For reproducibility

# Create indices for 70/30 split
n_total <- nrow(train_data)
train_idx <- sample(1:n_total, size = floor(0.7 * n_total))

# Split the data into training and test sets
train_70 <- train_data[train_idx, ]
test_30 <- train_data[-train_idx, ]

# Derive variables for the training set
train_DVs_70 <- deriveVars(train_70,
  transformtype = c("L", "M", "D", "HF", "HR", "T"),
  allsplines = TRUE,
  quiet = TRUE
)

Fit and evaluate the models

Now we’ll fit models by repeating variable selection at different levels of alpha:

# Add model fitting results using map
model_comparison <- model_comparison %>%
  mutate(
    # Step 1: Select DVs for each EV
    dv_selection = map(alpha, \(a) selectDVforEV(train_DVs_70$dvdata,
      alpha = a,
      quiet = TRUE
    )),
    # Step 2: Select EVs for final model
    ev_selection = map2(
      dv_selection, alpha,
      \(dv, a) selectEV(dv$dvdata,
        alpha = a,
        interaction = FALSE,
        quiet = TRUE
      )
    ),
    # Extract number of DVs in final model
    n_dvs = map_int(ev_selection, \(ev) length(ev$selectedmodel$coefficients) - 1)
  )

# Display intermediate results
model_comparison %>%
  select(alpha, n_dvs)
#> # A tibble: 3 × 2
#>         alpha n_dvs
#>         <dbl> <int>
#> 1 0.1            21
#> 2 0.001           8
#> 3 0.000000001     2

Notice how model complexity decreases as alpha decreases. Lower alpha values (more stringent thresholds) allow fewer variables into the model.

Now we’ll calculate AUC on two different datasets:

# Add AUC calculations
model_comparison <- model_comparison %>%
  mutate(
    # AUC on 30% test set (internal evaluation)
    auc_test = map_dbl(
      ev_selection,
      \(ev) testAUC(
        model = ev$selectedmodel,
        transformations = train_DVs_70$transformations,
        data = test_30
      )
    ),
    # AUC on independent validation set (external evaluation)
    auc_validation = map_dbl(
      ev_selection,
      \(ev) testAUC(
        model = ev$selectedmodel,
        transformations = train_DVs_70$transformations,
        data = validation_data
      )
    )
  )

# Display final results table
results_table <- model_comparison %>%
  select(alpha, n_dvs, auc_test, auc_validation)

knitr::kable(results_table, digits = c(10, 0, 3, 3))
alpha  n_dvs  auc_test  auc_validation
1e-01     21     0.917           0.793
1e-03      8     0.914           0.798
1e-09      2     0.900           0.750

Interpret results

The bias-variance tradeoff manifests in several observable patterns. The most complex model (alpha = 0.1) scores highest on the internal test set but not on the independent validation set, suggesting that part of its complexity fits dataset-specific noise rather than generalizable structure. The simplest model (alpha = 1e-9) scores lowest on both, suggesting underfitting. The intermediate model (alpha = 0.001) achieves the best validation AUC, indicating the most favorable balance for generalization in this example.
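
To see the pattern at a glance, we can reshape the results table and plot internal versus external AUC across the alpha levels (a small visualization sketch using only objects created above):

# Plot internal (test) vs. external (validation) AUC for each alpha level
results_table %>%
  pivot_longer(c(auc_test, auc_validation),
    names_to = "evaluation", values_to = "AUC"
  ) %>%
  ggplot(aes(x = factor(alpha), y = AUC, colour = evaluation, group = evaluation)) +
  geom_point() +
  geom_line() +
  labs(x = "alpha", y = "AUC") +
  theme_minimal()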

Further reading

Ultimately, the optimal level of complexity depends on the modeling objective. Models intended for interpolation within the study region may justify higher complexity, while models for extrapolation to new regions or time periods often benefit from greater simplicity. Independent validation data (preferably presence-absence) are valuable for identifying an appropriate level of complexity. In the absence of independent data, splitting the training data into training and test sets may be the best available option.

The 70/30 split we used above employed random partitioning of the training data, which has an important limitation: it does not account for spatial autocorrelation. Because nearby locations tend to have similar environmental conditions and species occurrence patterns, randomly selected test points may be highly similar to nearby training points. This spatial dependence inflates performance estimates and can lead to overly complex models that appear to generalize well internally but fail when applied to new regions (Roberts et al. 2017).

Spatial partitioning strategies address this issue by creating geographic separation between training and test data. For example, spatial blocking methods partition the study area into geographic blocks that are assigned to either training or testing (Valavi et al. 2019). This better mimics the challenge of predicting to unsurveyed areas and may provide more realistic estimates of model transferability.
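
As an illustration only, here is a minimal, package-free sketch of the idea. It assumes the presence and background records carry coordinates in columns x and y (meters, EPSG:21781) and uses an arbitrary 25 km block size; dedicated tools such as the blockCV package provide far more complete implementations.

# Minimal spatial-block split: assign each presence/background point to a
# coarse geographic block, then hold out ~30% of blocks as a spatial test set.
# Rows of train_data were created as rbind(presences, background), so they
# align with rbind(po_records, background) below.
coords <- rbind(po_records[, c("x", "y")], background[, c("x", "y")])
block_id <- paste(floor(coords$x / 25000), floor(coords$y / 25000), sep = "_")

set.seed(123)
test_blocks <- sample(unique(block_id), size = round(0.3 * length(unique(block_id))))
in_test <- block_id %in% test_blocks

train_spatial <- train_data[!in_test, ] # spatially separated training set
test_spatial <- train_data[in_test, ]   # spatially separated test set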

References

Elith, Jane, Catherine H. Graham, Robert P. Anderson, Miroslav Dudík, Simon Ferrier, Antoine Guisan, Robert J. Hijmans, et al. 2006. “Novel Methods Improve Prediction of Species’ Distributions from Occurrence Data.” Ecography 29 (2): 129–51. https://doi.org/10.1111/j.2006.0906-7590.04596.x.
Elith, Jane, Catherine Graham, Roozbeh Valavi, Meinrad Abegg, Caroline Bruce, Simon Ferrier, Andrew Ford, et al. 2020. “Presence-Only and Presence-Absence Data for Comparing Species Distribution Modeling Methods.” Biodiversity Informatics 15 (2): 69–80. https://doi.org/10.17161/bi.v15i2.13384.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer Series in Statistics. New York, NY: Springer. https://doi.org/10.1007/978-0-387-84858-7.
Roberts, David R., Volker Bahn, Simone Ciuti, Mark S. Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, et al. 2017. “Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure.” Ecography 40 (8): 913–29. https://doi.org/10.1111/ecog.02881.
Valavi, Roozbeh, Jane Elith, José J. Lahoz-Monfort, and Gurutzeta Guillera-Arroita. 2019. “blockCV: An R Package for Generating Spatially or Environmentally Separated Folds for k-Fold Cross-Validation of Species Distribution Models.” Methods in Ecology and Evolution 10 (2): 225–32. https://doi.org/10.1111/2041-210X.13107.