| Title: | Bounded Outcome Risk Guard for Model Evaluation |
| Version: | 0.3.1 |
| Description: | Comprehensive toolkit for valid spatial, temporal, and grouped model evaluation. Automatically detects data dependencies (spatial autocorrelation, temporal structure, clustered observations), generates appropriate cross-validation schemes (spatial blocking, checkerboard, hexagonal, KNNDM, environmental blocking, leave-location-out, purged CV), and validates evaluation pipelines for leakage. Includes area of applicability (AOA) assessment following Meyer & Pebesma (2021) <doi:10.1111/2041-210X.13650>, forward feature selection with blocked CV, spatial thinning, block-permutation variable importance, extrapolation detection, and interactive visualizations. Integrates with 'tidymodels', 'caret', 'mlr3', 'ENMeval', and 'biomod2'. Based on evaluation principles described in Roberts et al. (2017) <doi:10.1111/ecog.02881>, Kaufman et al. (2012) <doi:10.1145/2382577.2382579>, Kapoor & Narayanan (2023) <doi:10.1016/j.patter.2023.100804>, and Linnenbrink et al. (2024) <doi:10.5194/gmd-17-5897-2024>. |
| License: | MIT + file LICENSE |
| Language: | en-US |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| LinkingTo: | Rcpp |
| Imports: | Rcpp, methods, stats, utils |
| Suggests: | caret, ggplot2 (≥ 3.4.0), leaflet, rsample, tidyselect, recipes, mlr3, sf, terra, tidyterra, ranger, parsnip, workflows, xgboost, lightgbm, future.apply, testthat (≥ 3.0.0), knitr, rmarkdown |
| VignetteBuilder: | knitr |
| URL: | https://github.com/gcol33/BORG, https://gillescolling.com/BORG/ |
| BugReports: | https://github.com/gcol33/BORG/issues |
| Depends: | R (≥ 4.1.0) |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | yes |
| Packaged: | 2026-03-29 08:09:31 UTC; Gilles Colling |
| Author: | Gilles Colling |
| Maintainer: | Gilles Colling <gilles.colling051@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-03-29 15:50:40 UTC |
BORG: Bounded Outcome Risk Guard for Model Evaluation
Description
Comprehensive toolkit for valid spatial, temporal, and grouped model evaluation. Automatically detects data dependencies (spatial autocorrelation, temporal structure, clustered observations), generates appropriate cross-validation schemes (spatial blocking, checkerboard, hexagonal, KNNDM, environmental blocking, leave-location-out, purged CV), and validates evaluation pipelines for leakage. Includes area of applicability (AOA) assessment following Meyer & Pebesma (2021) doi:10.1111/2041-210X.13650, forward feature selection with blocked CV, spatial thinning, block-permutation variable importance, extrapolation detection, and interactive visualizations. Integrates with 'tidymodels', 'caret', 'mlr3', 'ENMeval', and 'biomod2'. Based on evaluation principles described in Roberts et al. (2017) doi:10.1111/ecog.02881, Kaufman et al. (2012) doi:10.1145/2382577.2382579, Kapoor & Narayanan (2023) doi:10.1016/j.patter.2023.100804, and Linnenbrink et al. (2024) doi:10.5194/gmd-17-5897-2024.
BORG automatically detects and enforces valid model evaluation by identifying information reuse between training and evaluation data. It guards against:
- Data leakage through preprocessing (normalization, imputation, PCA)
- Look-ahead bias in temporal evaluation
- Spatial autocorrelation violations in block CV
- Target leakage through features derived from outcomes
- Train-test contamination through shared identifiers
Value
No return value. This is a package-level documentation page.
Main Functions
borg: Primary interface for guarding evaluation pipelines
borg_diagnose: Diagnose data dependency structure
borg_cv: Generate valid CV schemes based on diagnosis
borg_inspect: Inspect R objects for leakage signals
borg_validate: Validate a complete evaluation workflow
borg_assimilate: Assimilate leaky pipelines into compliance
Risk Classification
BORG classifies evaluation risks as:
- hard_violation: Evaluation is fundamentally invalid and must be blocked. Examples: preprocessing on full data, train-test ID overlap, target leakage.
- soft_inflation: Results are biased but bounded. Performance estimates are misleading, but model ranking may be preserved. Examples: insufficient spatial block size, post-hoc subgroup analysis.
Supported Frameworks
BORG integrates with:
- caret: trainControl, train, preProcess
- rsample: vfold_cv, initial_split, rolling_origin
- recipes: recipe, prep, bake
- mlr3: Task, Learner, Resampling
- Base R: manual index-based splitting
Options
BORG respects the following options:
borg.auto_check: If TRUE, automatically validate splits when using supported frameworks. Default: FALSE.
borg.strict: If TRUE, throw errors on hard violations. If FALSE, return warnings. Default: TRUE.
borg.verbose: If TRUE, print diagnostic messages. Default: FALSE.
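These options are set through base R's standard options() mechanism, for example:

```r
# Configure BORG for the current session using the documented options
options(borg.strict = FALSE)   # warn on hard violations instead of erroring
options(borg.verbose = TRUE)   # print diagnostic messages

# Inspect the current auto-check setting (FALSE unless enabled)
getOption("borg.auto_check", default = FALSE)
```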
Author(s)
Maintainer: Gilles Colling <gilles.colling051@gmail.com> [copyright holder]
See Also
Useful links:
Report bugs at https://github.com/gcol33/BORG/issues
BorgDiagnosis S4 Class
Description
Holds the result of borg_diagnose: a structured assessment
of data dependency patterns that affect cross-validation validity.
Usage
## S4 method for signature 'BorgDiagnosis'
show(object)
Arguments
object |
A BorgDiagnosis object. |
Value
The BorgDiagnosis object, returned invisibly.
Called for the side effect of printing a diagnostic summary to the console.
Slots
dependency_type: Character. Primary dependency type detected: "none", "spatial", "temporal", "clustered", or "mixed".
severity: Character. Overall severity: "none", "moderate", or "severe".
recommended_cv: Character. Recommended CV strategy: "random", "spatial_block", "temporal_block", "group_fold", or "spatial_temporal".
spatial: List. Spatial autocorrelation diagnostics with elements: detected (logical), morans_i (numeric), morans_p (numeric), range_estimate (numeric), effective_n (numeric), coords_used (character).
temporal: List. Temporal autocorrelation diagnostics with elements: detected (logical), acf_lag1 (numeric), ljung_box_p (numeric), decorrelation_lag (integer), embargo_minimum (integer), time_col (character).
clustered: List. Clustered structure diagnostics with elements: detected (logical), icc (numeric), n_clusters (integer), cluster_sizes (numeric), design_effect (numeric), group_col (character).
inflation_estimate: List. Estimated metric inflation from random CV with elements: auc_inflation (numeric, proportion), rmse_deflation (numeric), confidence (character: "low"/"medium"/"high"), basis (character).
n_obs: Integer. Number of observations in the dataset.
timestamp: POSIXct. When the diagnosis was performed.
call: Language object. The original call that triggered diagnosis.
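Examples
A short sketch of slot access; the data-generating code mirrors the clustered-data example used elsewhere in this manual, and the slot names are those documented above:

```r
# Diagnose clustered data and inspect individual slots
set.seed(42)
d <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
diag <- borg_diagnose(d, groups = "site", target = "value")
diag@dependency_type   # primary dependency type, e.g. "clustered"
diag@recommended_cv    # suggested CV strategy, e.g. "group_fold"
diag@clustered$icc     # intraclass correlation from the clustered diagnostics
```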
BorgRisk S4 Class
Description
Holds the result of borg_inspect or borg_validate:
a structured assessment of evaluation risks detected in a workflow or object.
This class stores identified risks, their classification (hard violation vs soft inflation), affected data indices, and recommended remediation actions.
Usage
## S4 method for signature 'BorgRisk'
show(object)
Arguments
object |
A BorgRisk object. |
Value
The BorgRisk object, returned invisibly.
Called for the side effect of printing a risk assessment summary to the
console.
Slots
risks: A list of detected risk objects, each containing:
- type: Character string. Risk category (e.g., "preprocessing_leak").
- severity: Character string. Either "hard_violation" or "soft_inflation".
- description: Character string. Human-readable description.
- affected_indices: Integer vector. Row/column indices affected.
- source_object: Character string. Name of the leaky object.
n_hard: Integer. Count of hard violations detected.
n_soft: Integer. Count of soft inflation risks detected.
is_valid: Logical. TRUE if no hard violations detected.
train_indices: Integer vector. Row indices in the training set.
test_indices: Integer vector. Row indices in the test set.
timestamp: POSIXct. When the inspection was performed.
call: Language object. The original call that triggered inspection.
See Also
borg_inspect, borg_validate, borg
Examples
# Create an empty BorgRisk object (no risks detected)
show(new("BorgRisk",
risks = list(),
n_hard = 0L,
n_soft = 0L,
is_valid = TRUE,
train_indices = 1:80,
test_indices = 81:100,
timestamp = Sys.time(),
call = quote(borg_inspect(x))
))
Coerce BorgDiagnosis to Data Frame
Description
Converts a BorgDiagnosis object into a one-row data frame
of diagnostic results for programmatic access.
Usage
## S3 method for class 'BorgDiagnosis'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
Arguments
x |
A BorgDiagnosis object. |
row.names |
Optional row names for the output data frame. |
optional |
Logical. Passed to data.frame(). |
... |
Additional arguments passed to data.frame(). |
Value
A one-row data frame with columns: dependency_type,
severity, recommended_cv, n_obs,
spatial_detected, morans_i, temporal_detected,
acf_lag1, clustered_detected, icc.
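Examples
A short sketch of programmatic access to diagnosis results via the one-row data frame:

```r
# Convert a diagnosis to a one-row data frame for programmatic use
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
diag <- borg_diagnose(d, coords = c("x", "y"), target = "z")
df <- as.data.frame(diag)
df$recommended_cv

# Diagnoses for several datasets can be collected by row-binding
# the one-row frames (list_of_diagnoses is a hypothetical list):
# do.call(rbind, lapply(list_of_diagnoses, as.data.frame))
```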
Coerce BorgRisk to Data Frame
Description
Converts a BorgRisk object into a data frame of detected risks.
Usage
## S3 method for class 'BorgRisk'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
Arguments
x |
A BorgRisk object. |
row.names |
Optional row names for the output data frame. |
optional |
Logical. Passed to data.frame(). |
... |
Additional arguments passed to data.frame(). |
Value
A data frame where each row corresponds to a detected risk. Columns are:
type, severity, description, source_object,
n_affected.
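Examples
A short sketch; the deliberately overlapping split guarantees at least one detected risk to tabulate:

```r
# Inspect a faulty split, then filter the resulting risk table
data <- data.frame(x = 1:100, y = rnorm(100))
risk <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)  # indices overlap
risks_df <- as.data.frame(risk)

# Keep only hard violations (severity values as documented above)
subset(risks_df, severity == "hard_violation")
```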
Audit Feature Importance Calculations
Description
Detects when feature importance (SHAP, permutation importance, etc.) is computed using test data, which can lead to biased feature selection and data leakage.
Usage
audit_importance(
importance,
data,
train_idx,
test_idx,
method = "auto",
model = NULL
)
Arguments
importance |
A vector, matrix, or data frame of importance values. |
data |
The data used to compute importance. |
train_idx |
Integer vector of training indices. |
test_idx |
Integer vector of test indices. |
method |
Character indicating the importance method. One of "shap", "permutation", "gain", "impurity", or "auto" (default). |
model |
Optional fitted model object for additional validation. |
Details
Feature importance computed on test data is a form of data leakage because:
- SHAP values computed on test data reveal test set structure
- Permutation importance on test data uses test labels
- Feature selection based on test importance leads to overfit models
This function checks if the data used for importance calculation includes test indices and flags potential violations.
Value
A BorgRisk object with audit results.
Examples
set.seed(42)
data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))
train_idx <- 1:70
test_idx <- 71:100
# Simulate importance values
importance <- c(x1 = 0.6, x2 = 0.4)
# Good: importance computed on training data
result <- audit_importance(importance, data[train_idx, ], train_idx, test_idx)
# Bad: importance computed on full data (includes test)
result_bad <- audit_importance(importance, data, train_idx, test_idx)
Audit Predictions for Data Leakage
Description
Validates that predictions were generated correctly without data leakage. Checks that predictions correspond to test data only and that the prediction process did not use information from the test set.
Usage
audit_predictions(
predictions,
train_idx,
test_idx,
actual = NULL,
data = NULL,
model = NULL
)
Arguments
predictions |
Vector of predictions (numeric or factor). |
train_idx |
Integer vector of training indices. |
test_idx |
Integer vector of test indices. |
actual |
Optional vector of actual values for comparison. |
data |
Optional data frame containing the original data. |
model |
Optional fitted model object for additional checks. |
Value
A BorgRisk object with audit results.
Examples
# Create data and split
set.seed(42)
data <- data.frame(y = rnorm(100), x = rnorm(100))
train_idx <- 1:70
test_idx <- 71:100
# Fit model and predict
model <- lm(y ~ x, data = data[train_idx, ])
predictions <- predict(model, newdata = data[test_idx, ])
# Audit predictions
result <- audit_predictions(predictions, train_idx, test_idx)
Autoplot Method for BorgDiagnosis Objects
Description
Creates a summary panel of detected dependency diagnostics.
Usage
## S3 method for class 'BorgDiagnosis'
autoplot(object, type = c("summary", "variogram"), ...)
Arguments
object |
A BorgDiagnosis object. |
type |
Character. Plot type: "summary" (default) or "variogram". |
... |
Additional arguments (currently unused). |
Details
Requires the ggplot2 package.
Value
A ggplot object.
Examples
if (requireNamespace("ggplot2", quietly = TRUE)) {
set.seed(42)
d <- data.frame(
site = rep(1:20, each = 10),
value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
diag <- borg_diagnose(d, groups = "site", target = "value")
ggplot2::autoplot(diag)
}
Autoplot Method for BorgRisk Objects
Description
Creates a ggplot2 visualization of detected risks from a BORG validation.
Usage
## S3 method for class 'BorgRisk'
autoplot(object, max_risks = 10, show_fixes = TRUE, ...)
Arguments
object |
A BorgRisk object. |
max_risks |
Integer. Maximum number of risks to display. Default: 10. |
show_fixes |
Logical. If TRUE (default), annotate each risk with a suggested fix. |
... |
Additional arguments (currently unused). |
Details
Requires the ggplot2 package.
Value
A ggplot object.
Examples
if (requireNamespace("ggplot2", quietly = TRUE)) {
data <- data.frame(x = 1:100, y = 101:200)
result <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
ggplot2::autoplot(result)
}
Autoplot Method for borg_comparison Objects
Description
Visualizes random vs blocked CV comparison results.
Usage
## S3 method for class 'borg_comparison'
autoplot(object, type = c("boxplot", "density", "paired"), ...)
Arguments
object |
A borg_comparison object. |
type |
Character. Plot type: "boxplot" (default), "density", or "paired". |
... |
Additional arguments (currently unused). |
Details
Requires the ggplot2 package.
Value
A ggplot object.
Examples
if (requireNamespace("ggplot2", quietly = TRUE)) {
set.seed(42)
d <- data.frame(
site = rep(1:20, each = 10),
x = rnorm(200),
y = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
comp <- borg_compare_cv(d, formula = y ~ x, groups = "site", repeats = 3)
ggplot2::autoplot(comp)
}
Autoplot Method for borg_cv Objects
Description
Visualizes cross-validation fold structure.
Usage
## S3 method for class 'borg_cv'
autoplot(
object,
type = c("folds", "spatial", "sizes"),
data = NULL,
coords = NULL,
raster = NULL,
...
)
Arguments
object |
A borg_cv object. |
type |
Character. Plot type: "folds" (default), "spatial", or "sizes". |
data |
Optional data frame (or sf/SpatVector) for spatial plots. Required for type = "spatial". |
coords |
Character vector of coordinate column names. Required for type = "spatial" when data is a plain data frame. |
raster |
Optional raster layer (e.g., a terra SpatRaster) drawn as a background for spatial plots. |
... |
Additional arguments (currently unused). |
Details
Requires the ggplot2 package.
Value
A ggplot object.
Examples
if (requireNamespace("ggplot2", quietly = TRUE)) {
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
ggplot2::autoplot(cv, type = "sizes")
ggplot2::autoplot(cv, type = "spatial", data = d, coords = c("x", "y"))
}
Autoplot Method for borg_fold_perf Objects
Description
Autoplot Method for borg_fold_perf Objects
Usage
## S3 method for class 'borg_fold_perf'
autoplot(object, ...)
Arguments
object |
A |
... |
Additional arguments (currently unused). |
Value
A ggplot object. If spatial centroids are available,
shows a map of per-fold performance. Otherwise, a bar chart.
Autoplot Method for borg_result Objects
Description
Creates ggplot2 visualizations of BORG diagnosis + CV results.
Usage
## S3 method for class 'borg_result'
autoplot(
object,
type = c("split", "spatial", "temporal", "groups"),
fold = 1,
data = NULL,
coords = NULL,
raster = NULL,
time = NULL,
groups = NULL,
...
)
Arguments
object |
A borg_result object. |
type |
Character. Plot type: "split" (default), "spatial", "temporal", or "groups". |
fold |
Integer. Which fold to plot. Default: 1. |
data |
Optional data frame (or sf/SpatVector) for spatial plots. Required for type = "spatial". |
coords |
Character vector of coordinate column names. Required for type = "spatial" when data is a plain data frame. |
raster |
Optional raster layer (e.g., a terra SpatRaster) drawn as a background for spatial plots. |
time |
Character. Name of the time column in data. Required for type = "temporal". |
groups |
Character. Name of the grouping column in data. Required for type = "groups". |
... |
Additional arguments (currently unused). |
Details
Requires the ggplot2 package. For spatial plots, the sf package is recommended for proper map projections.
Value
A ggplot object.
Examples
if (requireNamespace("ggplot2", quietly = TRUE)) {
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
result <- borg(d, coords = c("x", "y"), target = "z")
ggplot2::autoplot(result)
ggplot2::autoplot(result, type = "spatial", data = d, coords = c("x", "y"))
}
BORG: Guard Your Model Evaluation
Description
The main entry point for BORG. Diagnoses data dependencies, generates valid cross-validation schemes, and validates evaluation workflows.
Usage
borg(
data,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
formula = NULL,
v = 5,
train_idx = NULL,
test_idx = NULL,
buffer = NULL,
env = NULL,
repeats = 1L,
output = c("list", "rsample", "caret", "mlr3"),
...
)
Arguments
data |
A data frame (or sf/SpatVector object). |
coords |
Character vector of length 2 specifying coordinate column names (e.g., c("x", "y")). Triggers spatial autocorrelation detection. |
time |
Character string specifying the time column name. Triggers temporal autocorrelation detection. |
groups |
Character string specifying the grouping column name (e.g., "site_id", "patient_id"). Triggers clustered structure detection. |
target |
Character string specifying the response variable column name. Used for more accurate autocorrelation diagnostics. |
formula |
A model formula (e.g., y ~ x1 + x2). |
v |
Integer. Number of CV folds. Default: 5. |
train_idx |
Integer vector of training indices. If provided together with test_idx, switches borg() to validation mode. |
test_idx |
Integer vector of test indices. Required if train_idx is provided. |
buffer |
Numeric. Spatial buffer distance (in coordinate units) applied around test-fold observations. Training points within this buffer are removed to reduce spatial autocorrelation leakage. Default: NULL (no buffer). |
env |
Environmental covariates used for environmental blocking, given as columns of data or a separate data frame. |
repeats |
Integer. Number of times to repeat the CV fold generation with different random seeds. Default: 1 (no repetition). |
output |
Character. CV output format: "list" (default), "rsample", "caret", "mlr3". Ignored when validating an existing split. |
... |
Additional arguments passed to underlying functions. |
Details
borg() operates in two modes:
Diagnosis Mode (Recommended)
When called with structure hints (coords, time, groups) but without train_idx/test_idx, BORG:
- Diagnoses data dependencies (spatial, temporal, clustered)
- Estimates how much random CV would inflate metrics
- Generates appropriate CV folds that respect the dependency structure
- Returns everything needed to proceed with valid evaluation
This is the recommended workflow. Let BORG tell you how to split your data.
Validation Mode
When called with train_idx and test_idx, BORG validates the existing split:
- Checks for index overlap
- Validates group isolation (if groups is specified)
- Validates temporal ordering (if time is specified)
- Checks spatial separation (if coords is specified)
- Detects preprocessing leakage, target leakage, etc.
Use this mode to verify splits you've created yourself.
Value
Depends on usage mode:
Diagnosis mode (no train_idx/test_idx): A list with class "borg_result" containing:
- diagnosis: A BorgDiagnosis object with dependency analysis
- cv: A borg_cv object with valid cross-validation folds
- folds: Shortcut to cv$folds for convenience
Validation mode (with train_idx/test_idx): A BorgRisk object containing the risk assessment of the provided split.
See Also
borg_diagnose for diagnosis only,
borg_cv for CV generation only,
borg_inspect for detailed object inspection.
Examples
# ===== DIAGNOSIS MODE (recommended) =====
# Spatial data: let BORG create valid folds
set.seed(42)
spatial_data <- data.frame(
x = runif(200, 0, 100),
y = runif(200, 0, 100),
response = rnorm(200)
)
result <- borg(spatial_data, coords = c("x", "y"), target = "response")
result$diagnosis
result$folds[[1]] # First fold's train/test indices
# Clustered data
clustered_data <- data.frame(
site = rep(1:20, each = 10),
value = rep(rnorm(20), each = 10) + rnorm(200, sd = 0.5)
)
result <- borg(clustered_data, groups = "site", target = "value")
result$diagnosis@recommended_cv # "group_fold"
# Temporal data
temporal_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = 200),
value = cumsum(rnorm(200))
)
result <- borg(temporal_data, time = "date", target = "value")
# Get rsample-compatible output for tidymodels (requires rsample package)
result <- borg(spatial_data, coords = c("x", "y"), output = "rsample")
# ===== VALIDATION MODE =====
# Validate an existing split
data <- data.frame(x = 1:100, y = rnorm(100))
borg(data, train_idx = 1:70, test_idx = 71:100)
# Validate with group constraint
data$patient <- rep(1:10, each = 10)
borg(data, train_idx = 1:50, test_idx = 51:100, groups = "patient")
BORG-Guarded Cross-Validation Functions
Description
These functions wrap common cross-validation functions from popular ML frameworks, adding automatic BORG validation. They block random CV when data dependencies are detected.
Details
BORG provides guarded versions of:
- borg_vfold_cv(): wraps rsample::vfold_cv()
- borg_group_vfold_cv(): wraps rsample::group_vfold_cv()
- borg_initial_split(): wraps rsample::initial_split()
When dependencies are detected, these functions either:
- Block the operation and suggest borg_cv() instead
- Automatically switch to an appropriate blocked CV strategy
Value
No return value. This page documents the family of guarded CV wrapper functions; see individual functions for their return values.
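Examples
A hedged sketch of a guarded wrapper on clustered data; only the v argument is documented on this page, so any further arguments should be checked against the individual function pages:

```r
# Guarded v-fold CV on clustered data. A plain rsample::vfold_cv() call
# would scatter each site across folds; the guarded wrapper is expected
# to detect the clustering and either block the random split or switch
# to a grouped strategy.
set.seed(42)
d <- data.frame(
  site = rep(1:20, each = 10),
  value = rep(rnorm(20), each = 10) + rnorm(200, sd = 0.5)
)
folds <- borg_vfold_cv(d, v = 5)  # requires the rsample package
```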
Adversarial Validation
Description
Trains a binary classifier to distinguish training data from prediction locations. High classification accuracy (AUC > 0.7) indicates the prediction domain differs substantially from training, and spatial/blocked CV is essential. Low accuracy (AUC ~ 0.5) means random CV may suffice.
Usage
borg_adversarial(train, prediction, predictors = NULL, v = 5L)
Arguments
train |
Data frame of training data. |
prediction |
Data frame of prediction locations (same predictor columns). |
predictors |
Character vector of predictor column names. If NULL, all columns shared by train and prediction are used. |
v |
Integer. Number of CV folds for adversarial classifier. Default: 5. |
Details
Uses logistic regression as the adversarial classifier (no external dependencies). The AUC is computed via cross-validation on the combined train+prediction dataset with a binary label (0=train, 1=prediction).
Interpretation:
AUC < 0.6: Low dissimilarity. Random CV likely adequate.
AUC 0.6-0.8: Moderate dissimilarity. Spatial CV recommended.
AUC > 0.8: High dissimilarity. Spatial CV essential; check AOA.
Value
A list with class "borg_adversarial" containing:
- auc
Cross-validated AUC of the adversarial classifier
- dissimilarity
Dissimilarity score (0 to 100 percent)
- recommendation
Suggested CV strategy based on dissimilarity
- importance
Variable importance for distinguishing domains
Examples
set.seed(42)
train <- data.frame(a = rnorm(100), b = rnorm(100))
pred <- data.frame(a = rnorm(50, mean = 2), b = rnorm(50))
av <- borg_adversarial(train, pred)
av
Area of Applicability
Description
Determines where a spatial prediction model can be trusted, following Meyer & Pebesma (2021). Combines the dissimilarity index (DI) with a threshold derived from cross-validated DI values to produce a binary applicability mask.
Usage
borg_aoa(
train,
new,
predictors = NULL,
coords = NULL,
weights = NULL,
folds = NULL,
threshold = NULL
)
Arguments
train |
Data frame of training data. |
new |
Data frame of prediction locations (same predictor columns). |
predictors |
Character vector. Predictor column names. |
coords |
Character vector of length 2. Coordinate column names in new, used for mapping the AOA. |
weights |
Numeric vector. Variable importance weights. |
folds |
Optional borg_cv object used to derive the DI threshold from cross-validated DI values. |
threshold |
Numeric. Manual DI threshold override. If NULL, the threshold is derived from cross-validated DI values. |
Value
A data frame with class "borg_aoa" containing:
- di: Dissimilarity index for each prediction point
- aoa: Logical. TRUE if inside the AOA (DI <= threshold)
- x, y: Coordinates (if coords provided)
Has an autoplot() method showing the AOA map.
References
Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12(9), 1620-1633. doi:10.1111/2041-210X.13650
Examples
set.seed(42)
train <- data.frame(x = runif(80, 0, 50), y = runif(80, 0, 50),
a = rnorm(80), b = rnorm(80))
pred <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100),
a = rnorm(200), b = rnorm(200))
aoa <- borg_aoa(train, pred, predictors = c("a", "b"), coords = c("x", "y"))
table(aoa$aoa)
Assimilate Leaky Evaluation Pipelines
Description
borg_assimilate() attempts to automatically fix detected evaluation risks
by restructuring the pipeline to eliminate information leakage.
Usage
borg_assimilate(workflow, risks = NULL, fix = "all")
Arguments
workflow |
A list containing the evaluation workflow (same structure as expected by borg_validate). |
risks |
Optional BorgRisk object from a previous validation. |
fix |
Character vector specifying which risk types to attempt to fix. Default: "all". |
Details
borg_assimilate() can automatically fix certain types of leakage:
- Preprocessing on full data
Refits preprocessing objects using only training indices
- Feature engineering leaks
Recomputes target encodings, embeddings, and derived features using train-only data
- Threshold optimization
Moves threshold selection to training/validation data
Some violations cannot be automatically fixed:
- Train-test index overlap (requires a new split)
- Target leakage in original features (requires domain intervention)
- Temporal look-ahead in features (requires feature re-engineering)
Value
A list containing:
- workflow: The rewritten workflow (modified in place where possible)
- fixed: Character vector of risk types that were successfully fixed
- unfixable: Character vector of risk types that could not be fixed
- report: A BorgRisk object from post-rewrite validation
See Also
borg_validate for validation without assimilation,
borg for proactive enforcement.
Examples
# Attempt to fix a leaky workflow
workflow <- list(
data = data.frame(x = rnorm(100), y = rnorm(100)),
train_idx = 1:70,
test_idx = 71:100
)
result <- borg_assimilate(workflow)
if (length(result$unfixable) > 0) {
message("Some risks require manual intervention:")
print(result$unfixable)
}
Enable/Disable BORG Auto-Check Mode
Description
Configures BORG to automatically validate train/test splits when using supported ML frameworks. When enabled, BORG will intercept common modeling functions and validate indices before training proceeds.
Usage
borg_auto_check(enable = TRUE, strict = TRUE, verbose = FALSE)
Arguments
enable |
Logical. If TRUE, enable auto-check mode. If FALSE, disable. |
strict |
Logical. If TRUE, throw errors on violations. If FALSE, warn. |
verbose |
Logical. If TRUE, print diagnostic messages. |
Value
Invisibly returns the previous state of auto-check options.
Examples
# Enable auto-checking with strict mode
borg_auto_check(TRUE)
# Disable auto-checking
borg_auto_check(FALSE)
# Enable with warnings instead of errors
borg_auto_check(TRUE, strict = FALSE)
Best Subset Variable Selection with Blocked CV
Description
Evaluates all 2^p combinations of predictor variables using blocked CV. Exhaustive search — only feasible for small p (< 15).
Usage
borg_best_subset(
data,
target,
predictors,
folds,
metric = c("rmse", "mae", "rsq"),
fit_fun = stats::lm,
max_vars = NULL,
verbose = FALSE
)
Arguments
data |
Data frame. |
target |
Character. Response variable. |
predictors |
Character vector. Candidate predictors. |
folds |
A borg_cv object defining the cross-validation folds. |
metric |
Character. One of "rmse", "mae", or "rsq". Default: "rmse". |
fit_fun |
Function used to fit candidate models. Default: stats::lm. |
max_vars |
Integer. Maximum number of variables in a subset. Default: all. |
verbose |
Logical. Default: FALSE. |
Value
A data frame with class "borg_bss": variables, n_vars,
metric_value, rank.
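Examples
A short sketch combining borg_cv() folds with the exhaustive subset search; argument usage follows the signatures documented above:

```r
# Exhaustive subset search over three candidate predictors with spatial folds
set.seed(42)
d <- data.frame(
  x = runif(100, 0, 100), y = runif(100, 0, 100),
  a = rnorm(100), b = rnorm(100), c = rnorm(100)
)
d$z <- 2 * d$a + rnorm(100, sd = 0.5)
folds <- borg_cv(d, coords = c("x", "y"), target = "z")
bss <- borg_best_subset(d, target = "z",
                        predictors = c("a", "b", "c"),
                        folds = folds)
head(bss)  # subsets ranked by blocked-CV RMSE
```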
Optimize Spatial Block Size
Description
Tests multiple block sizes and selects the one that minimizes residual spatial autocorrelation between CV folds. For each candidate size, generates spatial block folds and computes the mean Moran's I of residuals within test sets.
Usage
borg_block_size(
data,
coords,
target,
v = 5,
n_sizes = 10,
range = NULL,
formula = NULL,
verbose = FALSE
)
Arguments
data |
Data frame with coordinate and target columns. |
coords |
Character vector of length 2. Coordinate column names. |
target |
Character. Target variable name. |
v |
Integer. Number of folds. Default: 5. |
n_sizes |
Integer. Number of candidate block sizes to test. Default: 10. |
range |
Numeric vector of length 2. Minimum and maximum block sizes to test. If NULL, derived from the data extent and the variogram range estimate. |
formula |
Optional model formula for computing residual autocorrelation. If NULL, autocorrelation is computed on the raw target values. |
verbose |
Logical. Default: FALSE. |
Value
A list with class "borg_block_opt" containing:
- optimal
Optimal block size
- results
Data frame with columns: block_size, mean_morans_i, mean_test_size, n_empty_folds
- range_estimate
Variogram-based range estimate
Has an autoplot() method showing the optimization curve.
Examples
set.seed(42)
d <- data.frame(
x = runif(200, 0, 100), y = runif(200, 0, 100),
z = rnorm(200)
)
opt <- borg_block_size(d, coords = c("x", "y"), target = "z")
opt$optimal
Block Bootstrap Confidence Intervals for CV Metrics
Description
Estimates confidence intervals for cross-validation performance metrics
using spatial block bootstrap. Standard bootstrap assumes independent
observations; block bootstrap preserves the dependency structure
detected by borg_diagnose().
Usage
borg_bootstrap(
model,
data,
target,
coords = NULL,
formula = NULL,
fit_fun = NULL,
folds = NULL,
metric = c("rmse", "mae", "rsq"),
n_boot = 200,
n_blocks = 10,
conf_level = 0.95,
seed = 42
)
Arguments
model |
A fitted model with a predict() method. |
data |
Data frame with predictors and target. |
target |
Character. Target variable name. |
coords |
Character vector of length 2. Coordinate column names. If provided, uses spatial blocking. Otherwise, uses random blocks. |
formula |
Model formula. Required if it cannot be extracted from model. |
fit_fun |
Function. Model fitting function. If NULL, inferred from the class of model. |
folds |
Optional borg_cv object. |
metric |
Character. Performance metric: "rmse" (default), "mae", or "rsq". |
n_boot |
Integer. Number of bootstrap replicates. Default: 200. |
n_blocks |
Integer. Number of spatial blocks for resampling. Default: 10. |
conf_level |
Numeric. Confidence level. Default: 0.95. |
seed |
Integer. Random seed. Default: 42. |
Details
Spatial block bootstrap
Data is partitioned into spatial blocks using k-means clustering on coordinates. Each bootstrap replicate resamples blocks (with replacement), then includes all observations from selected blocks. This preserves within-block spatial correlation while generating valid resamples.
CI methods
Returns bias-corrected and accelerated (BCa) intervals when possible, falling back to percentile intervals.
Value
A list with class "borg_bootstrap" containing:
- estimate
Point estimate of the metric
- ci_lower
Lower confidence bound
- ci_upper
Upper confidence bound
- conf_level
Confidence level
- boot_distribution
Numeric vector of bootstrap estimates
- se
Bootstrap standard error
- bias
Bootstrap bias estimate
- metric
Metric name
- n_boot
Number of bootstrap replicates
- method
"spatial_block" or "random_block"
Has print() and autoplot() methods.
References
Lahiri, S. N. (2003). Resampling Methods for Dependent Data. Springer.
Examples
set.seed(42)
d <- data.frame(x = runif(150), y = runif(150), a = rnorm(150))
d$z <- 2 * d$a + rnorm(150, sd = 0.5)
model <- lm(z ~ a, data = d)
boot <- borg_bootstrap(model, d, target = "z", coords = c("x", "y"),
n_boot = 50)
boot
Cache and Retrieve BORG Diagnoses
Description
Caches BorgDiagnosis objects keyed by a hash of the
input data, so expensive computations (variograms, distance matrices,
autocorrelation tests) are not repeated across iterations.
Usage
borg_cache_get(data, coords = NULL, target = NULL, envir = .borg_cache_env)
borg_cache_set(
data,
diagnosis,
coords = NULL,
target = NULL,
envir = .borg_cache_env
)
borg_cache_clear(envir = .borg_cache_env)
borg_cache_info(envir = .borg_cache_env)
Arguments
data |
A data frame. Used to compute the cache key. |
coords |
Character vector. Included in the cache key to distinguish diagnoses of the same data with different coordinate columns. |
target |
Character. Included in the cache key. |
envir |
Environment for in-memory cache. Default: the package namespace cache. |
diagnosis |
A BorgDiagnosis object to cache. |
Value
borg_cache_get: A BorgDiagnosis object, or NULL if not cached.
borg_cache_set: Invisible NULL. Stores the diagnosis.
borg_cache_clear: Invisible NULL. Clears all cached diagnoses.
borg_cache_info: A data frame with cache key, timestamp, and data dimensions for all cached entries.
Examples
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
diag <- borg_diagnose(d, coords = c("x", "y"), target = "z")
# Cache it
borg_cache_set(d, diag, coords = c("x", "y"), target = "z")
# Retrieve (fast, no recomputation)
cached <- borg_cache_get(d, coords = c("x", "y"), target = "z")
identical(diag, cached) # TRUE
# Clear all
borg_cache_clear()
Model Calibration Diagnostics
Description
Assesses whether a model's predictions are well-calibrated. For classification, checks if predicted probabilities match observed frequencies. For regression, checks if predicted quantiles have correct coverage. Optionally uses spatial-aware binning to avoid autocorrelation artifacts in calibration curves.
Usage
borg_calibration(
predicted,
actual,
type = NULL,
n_bins = 10,
coords = NULL,
strategy = c("uniform", "quantile", "spatial")
)
Arguments
predicted |
Numeric vector. Predicted values (probabilities for classification, point predictions for regression). |
actual |
Numeric or factor vector. Observed outcomes (0/1 for classification, continuous for regression). |
type |
Character. "classification" or "regression". If NULL, inferred from actual. |
n_bins |
Integer. Number of bins for calibration curve. Default: 10. |
coords |
Data frame or matrix with coordinate columns. If provided, uses spatial-aware binning (ensures bins are not spatially clustered). |
strategy |
Character. Binning strategy: "uniform", "quantile", or "spatial". Default: "uniform". |
Value
A list with class "borg_calibration" containing:
- calibration_curve
Data frame with columns: bin_midpoint, observed_freq, predicted_mean, n, ci_lower, ci_upper
- ece
Expected Calibration Error (weighted mean |observed - predicted|)
- mce
Maximum Calibration Error (worst bin)
- brier_score
Brier score (classification) or calibration MSE (regression)
- type
Model type
- reliability_slope
Slope of observed ~ predicted regression (1.0 = perfect calibration)
- reliability_intercept
Intercept (0.0 = no bias)
- n_bins
Number of bins used
- assessment
Character: "well_calibrated", "moderate", or "poorly_calibrated"
Has print() and autoplot() methods.
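The ECE and MCE quantities defined above can be sketched in a few lines of base R with uniform binning (a hedged illustration, not the borg_calibration() implementation):

```r
# Hedged sketch of the ECE / MCE arithmetic from the Value section, using
# uniform binning in base R (not the borg_calibration() implementation).
set.seed(1)
p <- runif(1000)
yobs <- rbinom(1000, 1, p)
bins <- cut(p, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE)
obs  <- tapply(yobs, bins, mean)     # observed frequency per bin
pred <- tapply(p, bins, mean)        # mean predicted probability per bin
n    <- tapply(yobs, bins, length)   # bin sizes (the ECE weights)
ece  <- sum(n / sum(n) * abs(obs - pred), na.rm = TRUE)  # weighted mean gap
mce  <- max(abs(obs - pred), na.rm = TRUE)               # worst bin
c(ece = ece, mce = mce)
```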
Examples
# Classification
set.seed(42)
probs <- runif(500)
outcomes <- rbinom(500, 1, probs^0.8) # slightly miscalibrated
cal <- borg_calibration(probs, outcomes)
cal
# Regression
x <- rnorm(200)
y <- 2 * x + rnorm(200, sd = 0.5)
preds <- 2.1 * x # slightly biased
cal_reg <- borg_calibration(preds, y, type = "regression")
cal_reg
Create Validation Certificate
Description
Generate a structured validation certificate documenting the BORG analysis for reproducibility and audit trails.
Usage
borg_certificate(diagnosis, data, comparison = NULL, cv = NULL)
Arguments
diagnosis |
A BorgDiagnosis object from borg_diagnose(). |
data |
The data frame that was analyzed. |
comparison |
Optional. A borg_comparison object from borg_compare_cv(). |
cv |
Optional. A CV scheme object from borg_cv(). |
Value
A borg_certificate object containing:
- meta: Package version, R version, timestamp
- data: Data characteristics and hash
- diagnosis: Dependency type, severity, recommended CV
- cv_strategy: CV type and fold count
- inflation: Theoretical and empirical estimates
See Also
borg_export for writing certificates to file.
Examples
set.seed(42)
data <- data.frame(
x = runif(100, 0, 100),
y = runif(100, 0, 100),
response = rnorm(100)
)
diagnosis <- borg_diagnose(data, coords = c("x", "y"), target = "response",
verbose = FALSE)
cert <- borg_certificate(diagnosis, data)
print(cert)
Quick Leakage Check (Pipe-Friendly)
Description
Single-verb entry point that runs the full BORG pipeline: diagnose dependencies, generate valid CV, validate the split, and return a tidy summary. Designed for use in pipelines.
Usage
borg_check(
data,
model = NULL,
target = NULL,
coords = NULL,
time = NULL,
groups = NULL,
train_idx = NULL,
test_idx = NULL,
v = 5L,
verbose = FALSE
)
Arguments
data |
A data frame. |
model |
A fitted model object (optional). If provided, predictions are evaluated under blocked CV. |
target |
Character. Response variable name. |
coords |
Character vector of length 2. Coordinate column names (optional, for spatial data). |
time |
Character. Time column name (optional, for temporal data). |
groups |
Character. Grouping column name (optional). |
train_idx |
Integer vector. Training indices (optional, for validating an existing split). |
test_idx |
Integer vector. Test indices (optional). |
v |
Integer. Number of CV folds. Default: 5. |
verbose |
Logical. Default: FALSE. |
Details
borg_check() is intentionally simple: one call, one result.
For finer control, use borg_diagnose(), borg_cv(),
and borg_validate() individually.
Value
A data frame with class "borg_check" containing one
row per detected risk (or zero rows if clean), with columns:
- risk_type
Character. Type of risk detected.
- severity
Character. "hard_inflation", "soft_inflation", or "info".
- description
Character. Plain-language description.
- n_affected
Integer. Number of observations affected.
- source
Character. Object/step that triggered the risk.
Also has attributes: diagnosis, cv, risks
(the full objects for further inspection).
See Also
borg, borg_diagnose,
borg_explain_risk
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
# Pipe-friendly
result <- borg_check(d, target = "z", coords = c("x", "y"))
result
nrow(result) # 0 if clean
# Check an existing split
result <- borg_check(d, target = "z", coords = c("x", "y"),
train_idx = 1:70, test_idx = 71:100)
Check Geographic Representativeness of CV Folds
Description
Evaluates whether each fold's test set covers a representative portion of the study area. Flags folds that are geographically isolated or biased toward one region.
Usage
borg_check_coverage(folds, data, coords, threshold = 0.2)
Arguments
folds |
A borg_cv object or a list of fold index vectors. |
data |
Data frame with coordinate columns. |
coords |
Character vector of length 2. Coordinate column names. |
threshold |
Numeric. Minimum proportion of geographic extent that each fold should cover (0-1). Default: 0.2 (each fold covers at least 20 percent of the x and y extent). |
Value
A data frame with class "borg_geo_strat" containing
per-fold coverage metrics:
- fold
Fold index
- x_coverage
Proportion of x-extent covered by test set
- y_coverage
Proportion of y-extent covered by test set
- area_ratio
Convex hull area ratio (test / total)
- centroid_x, centroid_y
Test set centroid
- balanced
Whether fold meets the threshold
Examples
set.seed(42)
d <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100), z = rnorm(200))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
strat <- borg_check_coverage(cv, d, coords = c("x", "y"))
strat
Check Nested CV for Leakage
Description
Validates a nested cross-validation setup for data leakage between outer and inner CV loops, and detects strategy mismatches where the inner CV uses random resampling despite data dependencies.
Usage
borg_check_nested_cv(
inner_resamples,
outer_train_idx,
outer_test_idx,
data,
coords = NULL,
time = NULL,
groups = NULL
)
Arguments
inner_resamples |
The inner resampling object, or a character string naming the inner CV strategy (e.g. "cv" for random k-fold). |
outer_train_idx |
Integer vector. Indices of the outer training set. |
outer_test_idx |
Integer vector. Indices of the outer test set. |
data |
Data frame used for modeling. |
coords |
Character vector of coordinate column names (for spatial check). |
time |
Character. Time column name (for temporal check). |
groups |
Character. Group column name (for clustered check). |
Value
A BorgRisk object with any detected risks.
Examples
# Check if inner random CV is appropriate given grouped data
d <- data.frame(
site = rep(1:20, each = 10),
x = rnorm(200),
y = rep(rnorm(20), each = 10) + rnorm(200, sd = 0.5)
)
result <- borg_check_nested_cv(
inner_resamples = "cv",
outer_train_idx = 1:160,
outer_test_idx = 161:200,
data = d,
groups = "site"
)
Check Residual Spatial Autocorrelation
Description
After fitting a model, checks whether residuals still exhibit spatial autocorrelation. If they do, the model has not fully captured the spatial process and predictions may be biased.
Usage
borg_check_residuals(model, data = NULL, coords = NULL, alpha = 0.05)
Arguments
model |
A fitted model with a residuals() method. |
data |
Data frame with coordinate columns (needed if the coordinates cannot be recovered from the model). |
coords |
Character vector of length 2. Coordinate column names. |
alpha |
Numeric. Significance level for Moran's I test. Default: 0.05. |
Value
A list with class "borg_residual_check" containing:
- morans_i
Moran's I of residuals
- p_value
P-value from Moran's I test
- significant
Logical. Whether residual autocorrelation is significant
- variogram
Residual variogram data frame (if computed)
- assessment
Character. "clean", "mild", or "strong"
Has an autoplot() method showing the residual variogram.
Examples
set.seed(42)
d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100))
d$z <- sin(d$x / 10) + rnorm(100, sd = 0.5)
model <- lm(z ~ x + y, data = d)
check <- borg_check_residuals(model, d, coords = c("x", "y"))
check
Compare Random vs Blocked Cross-Validation
Description
Runs both random and blocked cross-validation on the same data and model, providing empirical evidence of metric inflation from ignoring data dependencies.
Usage
borg_compare_cv(
data,
formula,
model_fn = NULL,
predict_fn = NULL,
metric = NULL,
diagnosis = NULL,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
v = 5,
repeats = 10,
seed = NULL,
verbose = TRUE
)
Arguments
data |
A data frame containing predictors and response. |
formula |
A formula specifying the model (e.g., response ~ x + y). |
model_fn |
A function that fits a model. Should accept formula and data arguments and return a fitted model object. |
predict_fn |
A function to generate predictions. Should accept a fitted model and newdata and return a numeric vector of predictions. |
metric |
A character string specifying the metric to compute. One of
"rmse", "mae", "rsq", "auc", "tss", or "kappa". If NULL, chosen automatically from the response type. |
diagnosis |
A BorgDiagnosis object from borg_diagnose(). If NULL, the diagnosis is computed automatically. |
coords |
Character vector of length 2 specifying coordinate column names. |
time |
Character string specifying the time column name. |
groups |
Character string specifying the grouping column name. |
target |
Character string specifying the response variable name. If NULL, extracted from formula. |
v |
Integer. Number of CV folds. Default: 5. |
repeats |
Integer. Number of times to repeat CV. Default: 10 for stable estimates. |
seed |
Integer. Random seed for reproducibility. |
verbose |
Logical. Print progress messages. Default: TRUE. |
Details
This function provides the "smoking gun" evidence for reviewers. It runs cross-validation twice on the same data:
- Random CV: Standard k-fold CV ignoring data structure
- Blocked CV: Structure-aware CV based on BORG diagnosis
The difference in metrics demonstrates empirically how much random CV inflates performance estimates when data dependencies exist.
For stable estimates, the comparison is repeated multiple times (default: 10) and a paired t-test assesses whether the difference is statistically significant.
Value
A borg_comparison object (S3 class) containing:
- random_cv
Data frame of metrics from random CV (one row per repeat)
- blocked_cv
Data frame of metrics from blocked CV (one row per repeat)
- summary
Summary statistics comparing the two approaches
- inflation
Estimated metric inflation from using random CV
- diagnosis
The BorgDiagnosis object used
- p_value
P-value from paired t-test comparing approaches
See Also
borg_diagnose for dependency detection,
borg_cv for generating blocked CV folds.
Examples
# Spatial data example
set.seed(42)
n <- 200
spatial_data <- data.frame(
x = runif(n, 0, 100),
y = runif(n, 0, 100)
)
# Create spatially autocorrelated response
spatial_data$response <- spatial_data$x * 0.5 + rnorm(n, sd = 5)
# Compare CV approaches
comparison <- borg_compare_cv(
spatial_data,
formula = response ~ x + y,
coords = c("x", "y"),
repeats = 5 # Use more repeats in practice
)
print(comparison)
plot(comparison)
Compare Multiple Models with Spatial CV
Description
Evaluates multiple models (or formulas) using the same blocked CV folds, producing a side-by-side comparison table.
Usage
borg_compare_models(
data,
folds,
models,
metric = c("rmse", "mae", "rsq", "auc", "tss", "kappa"),
fit_fun = stats::lm
)
Arguments
data |
Data frame. |
folds |
A borg_cv object or a list of fold index vectors. |
models |
Named list of either model formulas (fitted with fit_fun) or fitted model objects. |
metric |
Character. Default: "rmse". |
fit_fun |
Function. Used when models entries are formulas. Default: stats::lm. |
Value
A data frame with class "borg_model_comparison" containing
model name, mean metric, SD, and rank. Has autoplot() method.
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100),
a = rnorm(100), b = rnorm(100))
d$z <- d$a * 2 + rnorm(100, sd = 0.5)
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
comp <- borg_compare_models(d, cv,
models = list(simple = z ~ a, full = z ~ a + b))
comp
Conformal Prediction with Spatial Dependence
Description
Constructs distribution-free prediction intervals with finite-sample coverage guarantees, adjusted for spatial autocorrelation. Standard conformal prediction assumes exchangeability, which spatial data violates. This function uses spatially-blocked calibration residuals as nonconformity scores, producing intervals that maintain coverage even under dependence.
Usage
borg_conformal(
model,
data,
new = NULL,
target,
coords = NULL,
alpha = 0.1,
method = c("split", "block_jackknife", "block_cv"),
folds = NULL,
n_blocks = 10,
type = c("regression", "classification"),
seed = 42
)
Arguments
model |
A fitted model with a predict() method. |
data |
Data frame used to compute nonconformity scores (calibration set). |
new |
Data frame of new locations for prediction. If NULL, predictions and intervals are returned for the calibration data. |
target |
Character. Target variable name in |
coords |
Character vector of length 2. Coordinate column names. If provided, uses spatial blocking for calibration. |
alpha |
Numeric. Miscoverage level. Default: 0.1 (90% intervals). |
method |
Character. Conformal method: "split", "block_jackknife", or "block_cv". See Details. |
folds |
A borg_cv object. Required when method = "block_cv". |
n_blocks |
Integer. Number of spatial blocks if |
type |
Character. "regression" (default) or "classification". |
seed |
Integer. Random seed. Default: 42. |
Details
Why spatial conformal?
Standard split conformal prediction computes residuals on a random calibration set. Under spatial autocorrelation, nearby calibration points produce correlated residuals, leading to underestimated interval widths and actual coverage below the nominal level.
By using spatially-blocked calibration (where residuals come from predictions on spatially separated test folds), the effective sample size of nonconformity scores is honest, and coverage guarantees hold approximately even under dependence.
Methods
"split"Splits data into training and calibration sets using spatial blocks. Fast, but uses only part of the data.
"block_jackknife"Leave-one-block-out: refit the model excluding each block, predict on the held-out block. More data-efficient but slower.
"block_cv"Use pre-computed blocked CV residuals from a
borg_cvobject. Requiresfolds.
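The quantile step at the heart of split conformal can be sketched in base R (a minimal illustration without the spatial blocking that borg_conformal() adds):

```r
# Minimal sketch of the "split" method's quantile step, without the spatial
# blocking (which borg_conformal() adds): nonconformity scores are absolute
# calibration residuals, and the interval half-width is their finite-sample
# (1 - alpha) quantile.
set.seed(1)
d <- data.frame(a = rnorm(200))
d$z <- 2 * d$a + rnorm(200)
tr <- 1:100; cal <- 101:200                        # train / calibration split
fit <- lm(z ~ a, data = d[tr, ])
scores <- abs(d$z[cal] - predict(fit, d[cal, ]))   # nonconformity scores
alpha <- 0.1
n_cal <- length(cal)
q <- quantile(scores, type = 1,                    # finite-sample correction
              probs = ceiling((1 - alpha) * (n_cal + 1)) / n_cal)
newd <- data.frame(a = rnorm(5))
cbind(lower = predict(fit, newd) - q, upper = predict(fit, newd) + q)
```

Under spatial dependence, borg_conformal() draws the calibration set by spatial blocks instead of at random, which is what keeps the scores approximately exchangeable.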
Value
A data frame with class "borg_conformal" containing:
- prediction
Point prediction
- lower
Lower bound of prediction interval
- upper
Upper bound of prediction interval
- width
Interval width (upper - lower)
- x and y
Coordinates (if
coordsprovided andnewhas them)
Attributes include alpha, method, coverage_estimate,
and quantile_score (the nonconformity threshold).
References
Mao, H., Martin, R., & Reich, B. J. (2024). Valid prediction inference with spatial conformal methods. arXiv preprint arXiv:2403.14058.
Johnstone, C., & Cox, D. (2023). Conformal prediction with spatial data.
Vovk, V., Gammerman, A., & Shafer, G. (2005). Algorithmic Learning in a Random World. Springer.
Examples
set.seed(42)
d <- data.frame(x = runif(200), y = runif(200), a = rnorm(200))
d$z <- 2 * d$a + sin(d$x * 10) + rnorm(200, sd = 0.5)
model <- lm(z ~ a + x + y, data = d)
# Split conformal with spatial blocking
conf <- borg_conformal(model, d, target = "z", coords = c("x", "y"))
conf
# Predict on new locations
new <- data.frame(x = runif(50), y = runif(50), a = rnorm(50))
pred <- borg_conformal(model, d, new = new, target = "z",
coords = c("x", "y"))
Generate Valid Cross-Validation Scheme
Description
Creates cross-validation folds that respect data dependency structure. When spatial, temporal, or clustered dependencies are detected, random CV is disabled and appropriate blocking strategies are enforced.
Usage
borg_cv(
data,
diagnosis = NULL,
v = 5,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
env = NULL,
dist_mat = NULL,
prediction_points = NULL,
block_size = NULL,
embargo = NULL,
buffer = NULL,
strategy = NULL,
repeats = 1L,
output = c("list", "rsample", "caret", "mlr3"),
allow_random = FALSE,
verbose = FALSE
)
Arguments
data |
A data frame to create CV folds for. |
diagnosis |
A BorgDiagnosis object from borg_diagnose(). If NULL, the diagnosis is run automatically. |
v |
Integer. Number of folds. Default: 5. |
coords |
Character vector of length 2 specifying coordinate column names. Required for spatial blocking if diagnosis is NULL. |
time |
Character string specifying the time column name. Required for temporal blocking if diagnosis is NULL. |
groups |
Character string specifying the grouping column name. Required for group CV if diagnosis is NULL. |
target |
Character string specifying the response variable column name. |
env |
Environmental covariates for environmental blocking.
A |
dist_mat |
A distance matrix or dist object. |
prediction_points |
Data frame, matrix, or sf object of prediction locations (used for KNNDM). |
block_size |
Numeric. For spatial blocking, the minimum block size. If NULL, automatically determined from diagnosis. Should be larger than the autocorrelation range. |
embargo |
Integer. For temporal blocking, minimum gap between train and test. If NULL, automatically determined from diagnosis. |
buffer |
Numeric. Spatial buffer distance (in coordinate units). Training points within this distance of any test-fold point are removed to reduce autocorrelation leakage. Default: NULL (no buffer). |
strategy |
Character. Override the auto-detected CV strategy (e.g., "group_fold"). Use with care: overriding can reintroduce leakage. |
repeats |
Integer. Number of times to repeat CV fold generation with different random seeds. Default: 1 (no repetition). |
output |
Character. Output format: "list" (default), "rsample", "caret", "mlr3". |
allow_random |
Logical. If TRUE, allows random CV even when dependencies detected. Default: FALSE. Setting to TRUE requires explicit acknowledgment. |
verbose |
Logical. If TRUE, print diagnostic messages. Default: FALSE. |
Details
The Enforcement Principle
Unlike traditional CV helpers, borg_cv enforces valid evaluation:
If spatial autocorrelation is detected, random CV is disabled
If temporal autocorrelation is detected, random CV is disabled
If clustered structure is detected, random CV is disabled
To use random CV on dependent data, you must set
allow_random = TRUE and provide justification (this is logged).
Spatial Blocking
When spatial dependencies are detected, data are partitioned into spatial blocks using k-means clustering on coordinates. Block size is set to exceed the estimated autocorrelation range. This ensures train and test sets are spatially separated.
Temporal Blocking
When temporal dependencies are detected, data are split chronologically with an embargo period between train and test sets. This prevents information from future observations leaking into training.
Group CV
When clustered structure is detected, entire groups (clusters) are held out together. No group appears in both train and test within a fold.
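The spatial blocking step described above can be sketched with base-R kmeans() (an illustration of the idea, not the exact borg_cv() internals):

```r
# Sketch of the spatial blocking step (an illustration, not the exact
# borg_cv() internals): each k-means cluster of coordinates becomes one
# fold, so train and test sets are spatially separated.
set.seed(1)
d <- data.frame(x = runif(150, 0, 100), y = runif(150, 0, 100))
v <- 5
fold <- kmeans(d[, c("x", "y")], centers = v)$cluster
folds <- lapply(seq_len(v), function(i)
  list(train = which(fold != i), test = which(fold == i)))
lengths(folds[[1]])   # disjoint train/test index sets
```

borg_cv() additionally checks that the block size exceeds the estimated autocorrelation range before accepting such a partition.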
Value
Depending on output:
- "list"
A list with elements: folds (list of train/test index vectors), diagnosis (the BorgDiagnosis used), strategy (CV strategy name), params (parameters used).
- "rsample"
An rsample rset object compatible with tidymodels.
- "caret"
A trainControl object for caret.
- "mlr3"
An mlr3 Resampling object.
Examples
# Spatial data with autocorrelation
set.seed(42)
spatial_data <- data.frame(
x = runif(200, 0, 100),
y = runif(200, 0, 100),
response = rnorm(200)
)
# Diagnose and create CV
cv <- borg_cv(spatial_data, coords = c("x", "y"), target = "response")
str(cv$folds) # List of train/test indices
# Clustered data
clustered_data <- data.frame(
site = rep(1:20, each = 10),
value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
cv <- borg_cv(clustered_data, groups = "site", target = "value")
cv$strategy # "group_fold"
# Get rsample-compatible output for tidymodels
cv_rsample <- borg_cv(spatial_data, coords = c("x", "y"), output = "rsample")
Spatial+ Debiasing for Spatial Confounding
Description
Implements the Spatial+ approach of Dupont et al. (2022) to address spatial confounding. Spatially structured predictors share variance with the spatial process, inflating their coefficients. Spatial+ removes the spatial component from each predictor by regressing it on smooth spatial coordinates, then uses the residuals as debiased predictors.
Usage
borg_debias(
data,
predictors = NULL,
coords,
target = NULL,
method = c("gam_approx", "tps"),
df = 6,
keep_original = FALSE
)
Arguments
data |
Data frame with predictors and coordinates. |
predictors |
Character vector. Predictor column names to debias.
If NULL, all numeric columns except coords and target are debiased. |
coords |
Character vector of length 2. Coordinate column names. |
target |
Character. Target variable name (excluded from debiasing). |
method |
Character. Spatial smoothing method:
"gam_approx" (default) or "tps". |
df |
Integer. Degrees of freedom for the spatial smooth. Default: 6. Higher values capture more spatial structure but risk removing real signal. |
keep_original |
Logical. If TRUE, returns both original and debiased
columns (debiased columns have suffix |
Details
When to use
Use when your predictors have spatial structure (e.g. climate variables
that vary smoothly over space). If borg_diagnose() detects
spatial autocorrelation in both residuals and predictors, spatial
confounding is likely. Debiasing is especially important for
inference (coefficient interpretation) rather than prediction.
How it works
For each predictor X_j:
Fit X_j \sim f(s_1, s_2), where f is a smooth function of the coordinates.
Replace X_j with the residuals X_j - \hat{f}(s_1, s_2).
The residuals contain only the non-spatial variation in X_j.
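The residualization step can be sketched in base R, here with a cubic polynomial surface standing in for the spline smooths that borg_debias() offers ("gam_approx", "tps"):

```r
# Sketch of the Spatial+ residualization step, using a cubic polynomial
# surface in place of a full spline smooth (a simplification of the
# "gam_approx"/"tps" methods described above).
set.seed(1)
d <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100))
d$temp <- sin(d$x / 20) + cos(d$y / 20) + rnorm(200, sd = 0.3)
sm <- lm(temp ~ poly(x, 3) + poly(y, 3), data = d)  # X_j ~ f(s1, s2)
d$temp_debiased <- resid(sm)                        # non-spatial variation
summary(sm)$r.squared                               # spatial R^2 removed
```

The R-squared of this smooth is the per-predictor spatial_r2 reported in the Value section: the fraction of the predictor's variance that was spatially structured.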
Value
A list with class "borg_debias" containing:
- data
Debiased data frame (original columns replaced or augmented)
- spatial_r2
Named numeric vector. R-squared of spatial smooth for each predictor (how much spatial structure was removed)
- predictors
Debiased predictor column names
- method
Smoothing method used
- df
Degrees of freedom
- assessment
Character: "minimal" (less than 10 percent spatial variance), "moderate" (10-40 percent), or "substantial" (over 40 percent)
Has print() and autoplot() methods.
References
Dupont, E., Wood, S. N., & Augustin, N. H. (2022). Spatial+: A novel approach to spatial confounding. Biometrics, 78(4), 1279-1290. doi:10.1111/biom.13656
Examples
set.seed(42)
d <- data.frame(
x = runif(200, 0, 100), y = runif(200, 0, 100),
temp = NA, elev = rnorm(200)
)
# temp has spatial structure
d$temp <- sin(d$x / 20) + cos(d$y / 20) + rnorm(200, sd = 0.3)
d$z <- 0.5 * d$temp + d$elev + rnorm(200, sd = 0.5)
db <- borg_debias(d, coords = c("x", "y"), target = "z")
db
Dissimilarity Index
Description
Computes the weighted Euclidean distance in feature space from each point to its nearest training observation. Points with high DI are dissimilar to the training data and predictions may be unreliable.
Usage
borg_di(train, new = NULL, train_idx = NULL, predictors = NULL, weights = NULL)
Arguments
train |
Data frame of training predictors (or the full data when train_idx is supplied). |
new |
Data frame of new/prediction locations with the same
predictor columns. If NULL, DI is computed for the training rows themselves. |
train_idx |
Integer vector. If provided, train is treated as the full dataset and these rows form the training set. |
predictors |
Character vector. Predictor column names. If NULL, all numeric columns are used. |
weights |
Numeric vector. Variable importance weights (length =
number of predictors). If NULL, all predictors are weighted equally. |
Value
A numeric vector of DI values (one per row of new, or
per training row if new is NULL). Has class
"borg_di" and attribute "threshold" (mean + sd of
training DI, used as default AOA cutoff).
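The DI computation can be sketched in base R, assuming equal weights (borg_di() additionally supports importance weighting):

```r
# Base-R sketch of the DI: scaled Euclidean distance to the nearest
# training point, with the mean + sd threshold noted in the Value section.
# Equal weights are assumed here.
set.seed(1)
train <- scale(as.matrix(data.frame(a = rnorm(80), b = rnorm(80))))
new <- scale(as.matrix(data.frame(a = rnorm(20, 3), b = rnorm(20))),
             center = attr(train, "scaled:center"),
             scale  = attr(train, "scaled:scale"))
di <- apply(new, 1, function(p)                 # distance to nearest
  min(sqrt(colSums((t(train) - p)^2))))         #   training observation
di_train <- apply(train, 1, function(p) {       # leave-one-out DI on train
  dd <- sqrt(colSums((t(train) - p)^2)); min(dd[dd > 0]) })
threshold <- mean(di_train) + sd(di_train)      # default AOA cutoff
summary(di); threshold
```

New points whose DI exceeds the threshold fall outside the area of applicability in the sense of Meyer & Pebesma (2021).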
References
Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12(9), 1620-1633. doi:10.1111/2041-210X.13650
Examples
set.seed(42)
train <- data.frame(a = rnorm(80), b = rnorm(80))
new <- data.frame(a = rnorm(20, mean = 3), b = rnorm(20))
di <- borg_di(train, new)
summary(di)
Diagnose Data Dependency Structure
Description
Automatically detects spatial autocorrelation, temporal autocorrelation, and clustered structure in data. Returns a diagnosis object that specifies appropriate cross-validation strategies.
Usage
borg_diagnose(
data,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
alpha = 0.05,
verbose = FALSE
)
Arguments
data |
A data frame or sf object. |
coords |
Character vector of length 2 specifying coordinate column names
(e.g., c("x", "y")). If NULL, spatial autocorrelation is not tested. |
time |
Character string specifying the time column name. Can be Date, POSIXct, or numeric. If NULL, temporal autocorrelation is not tested. |
groups |
Character string specifying the grouping column name (e.g., "site_id", "patient_id"). If NULL, clustered structure is not tested. |
target |
Character string specifying the response variable column name. Used for more accurate autocorrelation diagnostics on residuals. Optional. |
alpha |
Numeric. Significance level for autocorrelation tests. Default: 0.05. |
verbose |
Logical. If TRUE, print diagnostic progress. Default: FALSE. |
Details
Spatial Autocorrelation
Detected using Moran's I test on the target variable (or first numeric column).
The autocorrelation range is estimated from the empirical variogram.
Effective sample size is computed as n_{eff} = n / DEFF where
DEFF is the design effect.
Temporal Autocorrelation
Detected using the Ljung-Box test on the target variable. The decorrelation lag is the first lag where ACF drops below the significance threshold. Minimum embargo period is set to the decorrelation lag.
Clustered Structure
Detected by computing the intraclass correlation coefficient (ICC).
An ICC > 0.05 indicates meaningful clustering. The design effect
(DEFF) quantifies variance inflation: DEFF = 1 + (m-1) \times ICC
where m is the average cluster size.
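The design-effect arithmetic from the formulas above, worked through as a quick base-R check:

```r
# DEFF = 1 + (m - 1) * ICC, and n_eff = n / DEFF, as defined in the
# Details sections above.
n <- 200; m <- 20          # e.g. 10 clusters of 20 observations each
icc <- 0.3
deff <- 1 + (m - 1) * icc  # design effect (variance inflation): 6.7
n_eff <- n / deff          # effective sample size: ~29.9, not 200
```

Even a moderate ICC with large clusters shrinks the effective sample size dramatically, which is why random CV overstates precision on clustered data.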
Value
A BorgDiagnosis object containing:
Detected dependency type(s)
Severity assessment
Recommended CV strategy
Detailed diagnostics for each dependency type
Estimated metric inflation from using random CV
Examples
# Spatial data example
set.seed(42)
spatial_data <- data.frame(
x = runif(100, 0, 100),
y = runif(100, 0, 100),
response = rnorm(100)
)
# Add spatial autocorrelation (nearby points are similar)
for (i in 2:100) {
nearest <- which.min((spatial_data$x[1:(i-1)] - spatial_data$x[i])^2 +
(spatial_data$y[1:(i-1)] - spatial_data$y[i])^2)
spatial_data$response[i] <- 0.7 * spatial_data$response[nearest] +
0.3 * rnorm(1)
}
diagnosis <- borg_diagnose(spatial_data, coords = c("x", "y"),
target = "response")
print(diagnosis)
# Clustered data example
clustered_data <- data.frame(
site = rep(1:10, each = 20),
value = rep(rnorm(10, sd = 2), each = 20) + rnorm(200, sd = 0.5)
)
diagnosis <- borg_diagnose(clustered_data, groups = "site", target = "value")
print(diagnosis)
Leave-Disc-Out Cross-Validation
Description
Spatial cross-validation with circular exclusion buffers around test points. For each fold, all training points within a specified radius of any test point are removed, creating a spatial gap that eliminates autocorrelation leakage. More principled than rectangular or k-means blocking for point data.
Usage
borg_disc_cv(
data,
coords,
target = NULL,
radius = NULL,
v = 5,
min_train = 0.5,
seed = 42,
output = c("list", "rsample"),
verbose = FALSE
)
Arguments
data |
A data frame with coordinate and predictor columns. |
coords |
Character vector of length 2. Coordinate column names. |
target |
Character. Target variable name (for diagnostics). |
radius |
Numeric. Exclusion buffer radius in coordinate units.
If NULL, the autocorrelation range estimated by borg_diagnose() is used. |
v |
Integer. Number of folds. Default: 5. |
min_train |
Numeric. Minimum fraction of data that must remain in training after exclusion. Folds that violate this are dropped. Default: 0.5. |
seed |
Integer. Random seed. Default: 42. |
output |
Character. Output format: "list" (default) or "rsample". |
verbose |
Logical. Print diagnostic messages. Default: FALSE. |
Details
How it works
Partition data into v spatial folds using k-means on coordinates (same as borg_cv).
For each fold, identify the test points.
Remove from training all points within radius of any test point. These excluded points are neither train nor test — they form a buffer zone.
If the remaining training set is too small (< min_train * n), the fold is dropped with a warning.
Choosing the radius
The radius should match the autocorrelation range of the spatial
process. borg_diagnose() estimates this via variogram
analysis. Setting radius = NULL uses this estimate
automatically.
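The exclusion step for a single fold can be sketched in base R (illustrative only, not the borg_disc_cv() internals):

```r
# Sketch of the exclusion step for one fold: training candidates within
# `radius` of any test point fall into the buffer zone and are dropped.
set.seed(1)
d <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100))
fold <- kmeans(d[, c("x", "y")], centers = 5)$cluster
radius <- 15
test <- which(fold == 1)
cand <- setdiff(seq_len(nrow(d)), test)
dmat <- as.matrix(dist(d[, c("x", "y")]))[cand, test, drop = FALSE]
train <- cand[apply(dmat, 1, min) > radius]   # keep only points outside buffer
c(n_test = length(test), n_train = length(train),
  n_buffer = nrow(d) - length(test) - length(train))
```

The buffer count here corresponds to the n_excluded element of the returned object.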
Value
A list with class "borg_disc_cv" containing:
- folds
List of train/test index vectors
- radius
Exclusion radius used
- n_excluded
Number of training points excluded per fold (in the buffer zone)
- effective_training
Fraction of data available for training per fold (after exclusion)
- strategy
"leave_disc_out"
- params
List of parameters used
Compatible with other BORG functions that accept fold lists.
Examples
set.seed(42)
d <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100))
d$z <- sin(d$x / 10) + rnorm(200, sd = 0.5)
cv <- borg_disc_cv(d, coords = c("x", "y"), target = "z", radius = 15)
cv
Detect Distribution Shift Between Training and Deployment Data
Description
Goes beyond Area of Applicability (AOA) by quantifying how the distribution changed and which features drifted. Combines univariate tests (Kolmogorov-Smirnov per feature) with a multivariate classifier two-sample test (train a model to distinguish train from deployment; AUC > 0.5 indicates shift).
Usage
borg_drift(
train,
new,
predictors = NULL,
alpha = 0.05,
n_perm = 100,
seed = 42
)
Arguments
train |
Data frame of training data. |
new |
Data frame of deployment/prediction data. |
predictors |
Character vector. Predictor column names. If NULL, all numeric columns shared by train and new are used. |
alpha |
Numeric. Significance level for per-feature KS tests. Default: 0.05. |
n_perm |
Integer. Number of permutations for the classifier two-sample test p-value. Default: 100. Set to 0 to skip. |
seed |
Integer. Random seed. Default: 42. |
Details
Univariate tests
For each feature, a two-sample Kolmogorov-Smirnov test detects distributional differences. Effect size is measured by Cohen's d (standardized mean difference). P-values are Bonferroni-corrected.
Multivariate classifier test
A logistic regression is trained to distinguish training from deployment observations. If the data distributions are identical, the classifier achieves AUC ~ 0.5. Higher AUC indicates multivariate shift that may not be captured by univariate tests alone. Statistical significance is assessed via permutation.
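The two checks above can be sketched together in base R (a hedged illustration, not the borg_drift() implementation):

```r
# Per-feature KS tests with Bonferroni correction, plus a logistic-
# regression classifier two-sample test scored by a rank-based AUC.
set.seed(1)
train  <- data.frame(a = rnorm(200), b = rnorm(200))
deploy <- data.frame(a = rnorm(100, mean = 1), b = rnorm(100))
ks_p <- sapply(names(train), function(v)
  ks.test(train[[v]], deploy[[v]])$p.value)
shifted <- p.adjust(ks_p, "bonferroni") < 0.05   # which features drifted
both <- rbind(train, deploy)
both$is_deploy <- rep(c(0, 1), c(nrow(train), nrow(deploy)))
fit <- glm(is_deploy ~ a + b, data = both, family = binomial)
p1 <- fitted(fit)[both$is_deploy == 1]
p0 <- fitted(fit)[both$is_deploy == 0]
auc <- mean(outer(p1, p0, ">"))   # 0.5 = no shift, 1 = complete separation
c(n_shifted = sum(shifted), auc = round(auc, 2))
```

borg_drift() additionally assesses the AUC's significance by permuting the train/deploy labels n_perm times.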
Value
A list with class "borg_drift" containing:
- feature_drift
Data frame with per-feature KS statistic, p-value, effect_size (Cohen's d), shifted (logical), and direction ("higher", "lower", "similar")
- n_shifted
Number of features with significant drift
- classifier_auc
AUC of train-vs-deployment classifier (0.5 = no shift, 1.0 = complete separation)
- classifier_pvalue
Permutation p-value for AUC
- overall_severity
Character: "none", "mild", "moderate", "severe"
- summary
One-sentence summary
Has print() and autoplot() methods.
References
Ginsberg, T., Liang, Z., & Krishnan, R. G. (2023). A learning based hypothesis test for harmful covariate shift. ICLR.
Lopez-Paz, D., & Oquab, M. (2017). Revisiting classifier two-sample tests. ICLR.
Examples
set.seed(42)
train <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
# Deployment: feature 'a' has shifted
deploy <- data.frame(a = rnorm(100, mean = 1), b = rnorm(100),
c = rnorm(100))
drift <- borg_drift(train, deploy)
drift
Ensemble Predictions from CV Fold Models
Description
Combines predictions from models fitted on each CV fold into a weighted ensemble. Each fold's model predicts on the full dataset (or new data), and predictions are averaged with optional performance-based weighting.
Usage
borg_ensemble(
data,
folds,
formula,
newdata = NULL,
fit_fun = stats::lm,
weight_by = c("equal", "performance"),
metric = c("rmse", "mae")
)
Arguments
data |
Data frame used for training. |
folds |
A borg_cv folds object. |
formula |
Model formula. |
newdata |
Optional data frame for prediction. If NULL, predictions are made on data. |
fit_fun |
Function. Model-fitting function. Default: stats::lm. |
weight_by |
Character. Weighting scheme: "equal" or "performance". Default: "equal". |
metric |
Character. Metric for performance weighting: "rmse" or "mae". Default: "rmse". |
Value
A list with class "borg_ensemble" containing:
- prediction
Numeric vector of ensemble predictions
- uncertainty
Per-observation SD across fold predictions
- weights
Fold weights used
- n_models
Number of contributing models
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
ens <- borg_ensemble(d, cv, z ~ x + y)
cor(d$z, ens$prediction)
Calibrate Dissimilarity Index to Prediction Error
Description
Models the relationship between the Dissimilarity Index (DI) and prediction error using isotonic regression. Enables pixel-level uncertainty estimation: for any new prediction point, its DI can be mapped to an expected error.
Usage
borg_error_profile(
data,
folds,
formula,
predictors = NULL,
metric = c("rmse", "mae"),
fit_fun = stats::lm,
n_bins = 10
)
Arguments
data |
Data frame of training data. |
folds |
A borg_cv folds object. |
formula |
Model formula. |
predictors |
Character vector. Predictor columns for DI computation.
If |
metric |
Character. Error metric: "rmse" or "mae". Default: "rmse". |
fit_fun |
Function. Model-fitting function. Default: stats::lm. |
n_bins |
Integer. Number of DI bins. Default: 10. |
Value
A list with class "borg_error_profile" containing:
- profile
Data frame: di_bin, mean_di, mean_error, n_obs
- raw
Data frame: di, error (per observation)
- iso_fit
Isotonic regression fit for predicting error from DI
Has an autoplot() method and a predict() method.
References
Meyer, H., & Pebesma, E. (2021). Predicting into unknown space? Estimating the area of applicability of spatial prediction models. Methods in Ecology and Evolution, 12(9), 1620-1633. doi:10.1111/2041-210X.13650
Examples
set.seed(42)
d <- data.frame(x = runif(150, 0, 100), y = runif(150, 0, 100),
a = rnorm(150), b = rnorm(150))
d$z <- d$a * 2 + rnorm(150, sd = 0.5)
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
ep <- borg_error_profile(d, cv, z ~ a + b, predictors = c("a", "b"))
ep
Explain Risks in Plain Language with Actionable Recommendations
Description
Takes a BorgRisk or borg_result object and
returns a prioritized, human-readable action plan. Each risk is
translated into what went wrong, why it matters, how to fix it,
and the expected inflation magnitude.
Usage
borg_explain_risk(x, style = c("console", "markdown", "list"), verbose = FALSE)
Arguments
x |
A BorgRisk or borg_result object. |
style |
Character. Output style: "console", "markdown", or "list". Default: "console". |
verbose |
Logical. If TRUE, prints additional detail. Default: FALSE. |
Value
Depending on style:
- "console"
Invisible
x, prints explanation.- "markdown"
Character string with markdown-formatted explanation.
- "list"
A list of per-risk explanation objects, each with fields:
risk_type,severity,plain_english,why_it_matters,how_to_fix,inflation_est.
See Also
borg, BorgRisk,
borg_assimilate
Examples
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
result <- borg(d, coords = c("x", "y"), target = "z",
train_idx = 1:70, test_idx = 71:100)
borg_explain_risk(result)
Export Validation Certificate
Description
Write a BORG validation certificate to a YAML or JSON file for machine-readable documentation.
Usage
borg_export(diagnosis, data, file, comparison = NULL, cv = NULL)
Arguments
diagnosis |
A BorgDiagnosis object from borg_diagnose(). |
data |
The data frame that was analyzed. |
file |
Character. Output file path. Extension determines format (.yaml/.yml for YAML, .json for JSON). |
comparison |
Optional. A |
cv |
Optional. A borg_cv folds object. |
Value
Invisibly returns the certificate object.
See Also
borg_certificate for creating certificates.
Examples
spatial_data <- data.frame(
x = runif(100), y = runif(100), response = rnorm(100)
)
diagnosis <- borg_diagnose(spatial_data, coords = c("x", "y"), target = "response")
borg_export(diagnosis, spatial_data, file.path(tempdir(), "validation.yaml"))
borg_export(diagnosis, spatial_data, file.path(tempdir(), "validation.json"))
Extract Raster Values at Point Locations for BORG
Description
Convenience function that extracts environmental raster values at species occurrence (or other point) locations and returns a BORG-ready data frame with coordinates attached.
Usage
borg_extract(
raster,
points,
coords = NULL,
na.rm = TRUE,
coord_names = c("x", "y")
)
Arguments
raster |
A terra::SpatRaster of environmental layers. |
points |
Occurrence locations. One of: a terra::SpatVector of points, an sf point object, or a data frame with coordinate columns. |
coords |
Character vector of length 2 giving coordinate column names.
Required when points is a data frame. |
na.rm |
Logical. If TRUE, rows with missing raster values are dropped. Default: TRUE. |
coord_names |
Character vector of length 2. Column names for the
coordinates in the output. Default: c("x", "y"). |
Details
This bridges the standard SDM workflow (rasters + points) with BORG's
data.frame interface. The returned data frame can be passed directly to
borg(), borg_cv(), or borg_diagnose().
Requires the terra package. If points is an sf object,
the sf package is also required.
The function performs the equivalent of:
env_data <- terra::extract(raster, points, ID = FALSE)
coords <- terra::crds(points)
cbind(coords, env_data)
but handles edge cases (NA removal, coordinate injection, CRS checks) and attaches metadata so downstream BORG functions can auto-detect spatial structure.
Value
A data frame with columns for coordinates (named by
coord_names), one column per raster layer (environmental
variables), and any additional columns from the input points data.
The returned data frame has a "borg_coords" attribute storing the
coordinate column names, so that borg() can auto-detect them.
Examples
if (requireNamespace("terra", quietly = TRUE)) {
# Create example raster and points
r <- terra::rast(nrows = 50, ncols = 50,
xmin = 0, xmax = 100, ymin = 0, ymax = 100)
terra::values(r) <- runif(terra::ncell(r))
names(r) <- "bio1"
pts <- terra::vect(
cbind(x = runif(100, 0, 100), y = runif(100, 0, 100)),
crs = terra::crs(r)
)
# Extract and run BORG
d <- borg_extract(r, pts)
result <- borg(d, coords = c("x", "y"), target = "bio1")
}
Detect Model Extrapolation
Description
Checks whether prediction locations fall outside the environmental envelope of training data. Uses per-variable range checks and multivariate Mahalanobis distance.
Usage
borg_extrapolation(train, new, predictors = NULL, coords = NULL)
Arguments
train |
Data frame of training data. |
new |
Data frame of prediction data (same predictor columns). |
predictors |
Character vector. Predictor column names. If NULL, all numeric columns shared by train and new are used. |
coords |
Character vector of length 2. Coordinate columns in new, carried through to the output. |
Value
A data frame with class "borg_extrapolation" containing:
- mahal_dist
Mahalanobis distance to training centroid
- n_vars_outside
Number of variables outside training range
- vars_outside
Comma-separated names of out-of-range variables
- extrapolating
Logical. TRUE if any variable is out of range
- x, y
Coordinates (if provided)
Examples
set.seed(42)
train <- data.frame(a = rnorm(80), b = rnorm(80))
new <- data.frame(a = c(rnorm(10), rnorm(10, mean = 5)),
b = c(rnorm(10), rnorm(10, mean = 5)))
ext <- borg_extrapolation(train, new)
table(ext$extrapolating)
Detect Performance Disparities Across Subgroups Under Blocked CV
Description
Compares model performance across spatial, temporal, or categorical subgroups under both random and blocked CV. Leakage can mask disparate performance — a model that looks uniformly good under random CV may show large subgroup-level gaps when evaluated honestly.
Usage
borg_fairness(
data,
model,
target,
group = NULL,
coords = NULL,
n_groups = 4L,
folds_blocked = NULL,
folds_random = NULL,
metric = c("rmse", "mae", "rsq"),
v = 5L,
...
)
Arguments
data |
A data frame with predictors and response. |
model |
A fitted model object with a predict() method. |
target |
Character. Response variable name. |
group |
Character. Column name defining subgroups (e.g., region,
time period, site type). If NULL and coords is provided, spatial subgroups are created by clustering the coordinates. |
coords |
Character vector of length 2. Coordinate column names.
Used for spatial subgroup clustering if group is NULL. |
n_groups |
Integer. Number of spatial clusters when auto-grouping. Default: 4. |
folds_blocked |
A list of train/test folds from borg_cv(). If NULL, generated automatically. |
folds_random |
A list of random CV folds. If NULL, generated automatically. |
metric |
Character. Metric to compute: "rmse", "mae", or "rsq". Default: "rmse". |
v |
Integer. Number of folds if generating automatically. Default: 5. |
... |
Additional arguments passed to |
Details
This function evaluates whether leakage disproportionately affects certain data subgroups. A model may have uniform performance under random CV but show large disparities when spatial or temporal independence is enforced.
For example, in ecological modelling, a random-CV RMSE of 0.5 across all regions may hide the fact that spatially blocked CV yields RMSE of 0.3 in data-rich regions but 1.2 in undersampled regions.
Value
A list with class "borg_fairness" containing:
- subgroup_metrics
Data frame with columns: group, metric_random, metric_blocked, n_obs, disparity (blocked - random), and relative_disparity (percent change).
- overall
List with overall random and blocked metrics.
- max_disparity
The largest subgroup-level performance gap between random and blocked CV.
- worst_group
The subgroup with the worst blocked CV performance.
- hidden_by_leakage
Logical. TRUE if any subgroup's blocked performance is substantially worse than random CV suggested.
- metric
The metric used.
Examples
set.seed(42)
d <- data.frame(x = runif(200), y = runif(200))
d$z <- sin(d$x * 3) + rnorm(200, sd = 0.3)
model <- lm(z ~ x + y, data = d)
fair <- borg_fairness(d, model, target = "z", coords = c("x", "y"))
fair
Evaluate Per-Fold Model Performance
Description
Fits a model on each fold's training set and evaluates on its test set, returning per-fold metrics with spatial centroids for geographic performance mapping.
Usage
borg_fold_performance(
data,
folds,
formula,
coords = NULL,
metric = c("rmse", "mae", "rsq", "auc", "tss", "kappa", "sensitivity", "specificity",
"accuracy"),
fit_fun = stats::lm,
parallel = FALSE
)
Arguments
data |
Data frame with predictor and target columns. |
folds |
A borg_cv folds object. |
formula |
A model formula (e.g. z ~ x + y). |
coords |
Character vector of length 2. Coordinate column names for computing fold centroids. Optional. |
metric |
Character. Performance metric: "rmse", "mae", "rsq", "auc", "tss", "kappa", "sensitivity", "specificity", or "accuracy". Default: "rmse". |
fit_fun |
Function to fit a model. Default: stats::lm. |
parallel |
Logical. If TRUE, folds are evaluated in parallel. Default: FALSE. |
Value
A data frame with columns:
- fold
Fold index
- metric
Metric name
- value
Metric value
- n_train
Training set size
- n_test
Test set size
- centroid_x, centroid_y
Spatial centroid of test set (if coords provided)
Has class "borg_fold_perf" with an autoplot() method.
Examples
set.seed(42)
d <- data.frame(
x = runif(200, 0, 100), y = runif(200, 0, 100),
z = rnorm(200)
)
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
perf <- borg_fold_performance(d, cv, z ~ x + y, coords = c("x", "y"))
perf
Per-Fold Environmental Similarity (MESS)
Description
Computes the Multivariate Environmental Similarity Surface (MESS) for each CV fold, showing how representative each fold's test set is relative to the training set in environmental space.
Usage
borg_fold_similarity(data, folds, predictors = NULL)
Arguments
data |
Data frame. |
folds |
A borg_cv folds object. |
predictors |
Character vector. If |
Value
A data frame with per-fold similarity metrics.
References
Elith, J., Kearney, M., & Phillips, S. (2010). The art of modelling range-shifting species. Methods in Ecology and Evolution, 1(4), 330-342. doi:10.1111/j.2041-210X.2010.00036.x
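Examples

A minimal usage sketch, assuming the borg_cv() workflow used in the other examples in this manual:

```r
set.seed(42)
d <- data.frame(x = runif(100, 0, 100), y = runif(100, 0, 100),
                a = rnorm(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
sim <- borg_fold_similarity(d, cv, predictors = "a")
sim
```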
Forward Feature Selection with Spatial/Blocked CV
Description
Selects variables using blocked cross-validation instead of random CV, avoiding overfitting to spatial/temporal structure. At each step, adds the variable that most improves the CV metric.
Usage
borg_forward_selection(
data,
target,
predictors = NULL,
folds = NULL,
coords = NULL,
time = NULL,
groups = NULL,
metric = c("rmse", "mae", "rsq"),
fit_fun = stats::lm,
min_vars = 1L,
verbose = FALSE
)
Arguments
data |
Data frame. |
target |
Character. Response variable name. |
predictors |
Character vector. Candidate predictor names. If
|
folds |
A borg_cv folds object. If NULL, folds are generated automatically from coords, time, or groups. |
coords |
Character vector of coordinate columns (for auto CV). |
time |
Character. Time column (for auto CV). |
groups |
Character. Group column (for auto CV). |
metric |
Character. "rmse", "mae", or "rsq". Default: "rmse". |
fit_fun |
Function. Model fitting function. Default: stats::lm. |
min_vars |
Integer. Minimum variables to include. Default: 1. |
verbose |
Logical. Default: FALSE. |
Details
Similar to CAST::ffs() but uses BORG's CV infrastructure and supports any dependency structure (spatial, temporal, grouped).
Value
A list with class "borg_ffs" containing:
- selected
Character vector of selected variable names (in order)
- history
Data frame: step, variable_added, metric_value, n_vars
- best_metric
Best CV metric achieved
- all_vars
All candidate variables
Examples
set.seed(42)
d <- data.frame(
x = runif(100), y = runif(100),
important = rnorm(100), noise1 = rnorm(100), noise2 = rnorm(100)
)
d$z <- d$important * 2 + rnorm(100, sd = 0.5)
ffs <- borg_forward_selection(d, target = "z",
predictors = c("important", "noise1", "noise2"),
coords = c("x", "y"))
ffs$selected
Distance Distribution Diagnostics
Description
Computes nearest-neighbor distance distributions in geographic space and/or feature space between training data, CV test sets, and optional prediction locations. Diagnostic for whether CV folds are representative of the prediction task.
Usage
borg_geodist(
data,
folds,
prediction_points = NULL,
coords = NULL,
predictors = NULL,
type = c("both", "geo", "feature")
)
Arguments
data |
Data frame of training/modelling data. |
folds |
A borg_cv folds object. |
prediction_points |
Optional data frame of prediction locations. |
coords |
Character vector of length 2. Coordinate columns for
geographic distance. If |
predictors |
Character vector. Feature columns for feature-space
distance. If |
type |
Character. One of "both", "geo", or "feature". Default: "both". |
Details
If the CV test-to-train distance distribution differs strongly from the prediction-to-train distribution, the CV does not mimic the real prediction scenario and performance estimates may be misleading.
Value
A list with class "borg_geodist" containing:
- cv_distances
Data frame of NN distances: test-to-train per fold
- prediction_distances
NN distances: prediction-to-train (if provided)
- sample_distances
NN distances: within training data (reference)
- ks_statistic
KS test statistic comparing CV vs prediction distances
Has an autoplot() method showing overlaid density curves.
References
Meyer, H., & Pebesma, E. (2022). Machine learning-based global maps of ecological variables and the challenge of assessing them. Nature Communications, 13, 2208. doi:10.1038/s41467-022-29838-9
Examples
set.seed(42)
d <- data.frame(x = runif(100, 0, 50), y = runif(100, 0, 50),
a = rnorm(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
pred <- data.frame(x = runif(50, 0, 100), y = runif(50, 0, 100),
a = rnorm(50))
gd <- borg_geodist(d, cv, prediction_points = pred, coords = c("x", "y"))
gd
Global (Pooled) Cross-Validation Metrics
Description
Pools all held-out predictions across folds and computes a single performance metric, avoiding bias from unequal fold sizes.
Usage
borg_global_validation(
data,
folds,
formula,
metric = c("rmse", "mae", "rsq"),
fit_fun = stats::lm
)
Arguments
data |
Data frame. |
folds |
A borg_cv folds object. |
formula |
Model formula. |
metric |
Character. "rmse", "mae", or "rsq". Default: "rmse". |
fit_fun |
Function. Default: stats::lm. |
Value
A list with class "borg_global_validation" containing
the pooled metric and per-fold metrics for comparison.
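Examples

A minimal usage sketch, assuming the borg_cv() workflow used in the other examples in this manual:

```r
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
gv <- borg_global_validation(d, cv, z ~ x + y)
gv
```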
BORG-Guarded group_vfold_cv
Description
A guarded version of rsample::group_vfold_cv() that validates
group-based CV is appropriate for the data structure.
Usage
borg_group_vfold_cv(
data,
group,
v = NULL,
balance = c("groups", "observations"),
coords = NULL,
time = NULL,
target = NULL,
...
)
Arguments
data |
A data frame. |
group |
Character. Column name for grouping. |
v |
Integer. Number of folds. Default: number of groups. |
balance |
Character. How to balance folds: "groups" or "observations". |
coords |
Character vector. Coordinate columns for spatial check. |
time |
Character. Time column for temporal check. |
target |
Character. Target variable for dependency detection. |
... |
Additional arguments passed to rsample::group_vfold_cv(). |
Value
An rset object from rsample.
Examples
if (requireNamespace("rsample", quietly = TRUE)) {
# Clustered data - group CV is appropriate
data <- data.frame(
site = rep(1:20, each = 5),
x = rnorm(100),
y = rnorm(100)
)
folds <- borg_group_vfold_cv(data, group = "site", v = 5)
}
Block-Permutation Variable Importance
Description
Computes permutation importance that respects spatial structure. Instead of permuting individual rows (which breaks spatial autocorrelation and inflates importance), permutes values within spatial blocks.
Usage
borg_importance(
model,
data,
target,
coords = NULL,
predictors = NULL,
n_blocks = 10,
n_rep = 10,
metric = c("rmse", "mae"),
seed = 42
)
Arguments
model |
A fitted model with a predict() method. |
data |
Data frame with predictor columns. |
target |
Character. Target variable name. |
coords |
Character vector of length 2. Coordinate column names.
If |
predictors |
Character vector. Variables to assess. If |
n_blocks |
Integer. Number of spatial blocks for permutation. Default: 10. |
n_rep |
Integer. Number of permutation repeats per variable. Default: 10. |
metric |
Character. "rmse" or "mae". Default: "rmse". |
seed |
Integer. Random seed. Default: 42. |
Value
A data frame with class "borg_importance" containing:
- variable
Predictor name
- importance
Mean increase in error after permutation
- importance_sd
SD across repeats
- rank
Importance rank (1 = most important)
Has an autoplot() method.
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100),
a = rnorm(100), b = rnorm(100))
d$z <- d$a * 2 + rnorm(100, sd = 0.5)
model <- lm(z ~ a + b, data = d)
imp <- borg_importance(model, d, target = "z", coords = c("x", "y"))
imp
BORG-Guarded initial_split
Description
A guarded version of rsample::initial_split() that checks for
temporal ordering when time structure is specified.
Usage
borg_initial_split(
data,
prop = 3/4,
strata = NULL,
time = NULL,
coords = NULL,
groups = NULL,
target = NULL,
...
)
Arguments
data |
A data frame. |
prop |
Numeric. Proportion of data for training. Default: 0.75. |
strata |
Character. Column name for stratification. |
time |
Character. Time column - if provided, ensures chronological split. |
coords |
Character vector. Coordinate columns for spatial check. |
groups |
Character. Group column for clustered check. |
target |
Character. Target variable. |
... |
Additional arguments passed to rsample::initial_split(). |
Details
When time is specified, this function ensures the split respects
temporal ordering (training data comes before test data). For spatial data,
it warns if random splitting may cause issues.
Value
An rsplit object.
Examples
if (requireNamespace("rsample", quietly = TRUE)) {
# Temporal data - ensures chronological split
ts_data <- data.frame(
date = seq(as.Date("2020-01-01"), by = "day", length.out = 100),
value = cumsum(rnorm(100))
)
split <- borg_initial_split(ts_data, prop = 0.8, time = "date")
}
Inspect R Objects for Evaluation Risks
Description
borg_inspect() examines R objects for signals of information reuse that
would invalidate model evaluation. It returns a structured assessment of
detected risks.
Usage
borg_inspect(
object,
train_idx = NULL,
test_idx = NULL,
data = NULL,
target = NULL,
coords = NULL,
...
)
Arguments
object |
An R object to inspect. Supported types include:
preprocessing objects (scaling, PCA), CV fold objects, and feature-engineering objects. |
train_idx |
Integer vector of training row indices. Required for data-level inspection. |
test_idx |
Integer vector of test row indices. Required for data-level inspection. |
data |
Optional data frame. Required when inspecting preprocessing objects to compare parameters against train-only statistics. |
target |
Optional name of the target/outcome column. If provided, checks for target leakage (features highly correlated with target). |
coords |
Optional character vector of coordinate column names. If provided, checks spatial separation between train and test. |
... |
Additional arguments passed to type-specific inspectors. |
Details
borg_inspect() dispatches to type-specific inspectors based on the class
of the input object. Each inspector looks for specific leakage patterns:
- Preprocessing objects
Checks if parameters (mean, sd, loadings) were computed on data that includes test indices
- CV objects
Validates that train/test indices do not overlap and that grouping structure is respected
- Feature engineering
Checks if encodings, embeddings, or derived features used test data during computation
Value
A BorgRisk object containing:
- risks
List of detected risk objects
- n_hard
Count of hard violations
- n_soft
Count of soft inflation warnings
- is_valid
TRUE if no hard violations detected
See Also
borg_validate for complete workflow validation,
borg for automated enforcement during evaluation.
Examples
# Inspect a preprocessing object
data(mtcars)
train_idx <- 1:25
test_idx <- 26:32
# BAD: scaling parameters computed on the full data (will detect a leak)
pp_bad <- scale(mtcars[, -1])
borg_inspect(pp_bad, train_idx = train_idx, test_idx = test_idx, data = mtcars)
# GOOD: scaling parameters computed on the training rows only
pp_good <- scale(mtcars[train_idx, -1])
borg_inspect(pp_good, train_idx = train_idx, test_idx = test_idx, data = mtcars)
Interactive Leaflet Map of Spatial CV Folds
Description
Creates an interactive leaflet map showing train/test point assignments across CV folds. Requires geographic coordinates (lat/lon).
Usage
borg_leaflet(object, data, coords, fold = 1)
Arguments
object |
A borg_cv folds object or a borg() result. |
data |
Data frame with coordinate columns. Required. |
coords |
Character vector of length 2. Coordinate column names (longitude first, latitude second). |
fold |
Integer or "all". Fold to display. Default: 1. |
Details
Requires the leaflet package. Coordinates must be in WGS84 (longitude/latitude). Points are color-coded: blue = train, red = test. Click points for popups with index and fold assignment.
Value
A leaflet htmlwidget.
Examples
if (requireNamespace("leaflet", quietly = TRUE)) {
set.seed(42)
d <- data.frame(
lon = runif(100, 10, 20),
lat = runif(100, 45, 55),
z = rnorm(100)
)
result <- borg(d, coords = c("lon", "lat"), target = "z")
borg_leaflet(result, data = d, coords = c("lon", "lat"))
}
Scan Methods Text for Common Leakage Patterns
Description
Analyzes a methods section (as text) for descriptions of evaluation practices that commonly lead to data leakage. Useful for reviewing papers, teaching, or auditing your own methods description.
Usage
borg_literature_check(
text,
context = c("auto", "spatial", "temporal", "general"),
strict = FALSE
)
Arguments
text |
Character string. The methods section text to analyze. Can be a single string or character vector (lines). |
context |
Character. What kind of data the methods describe.
One of "auto", "spatial", "temporal", or "general". Default: "auto". |
strict |
Logical. If TRUE, flags borderline practices too. Default: FALSE. |
Details
The scanner looks for textual patterns that describe evaluation practices known to cause leakage. It does NOT parse actual code or data — it analyzes the description of methods.
Patterns checked
- Random CV on spatial data
Any mention of random/stratified k-fold CV when spatial coordinates are mentioned.
- Preprocessing before splitting
Normalization, PCA, or feature selection described before train/test splitting.
- No spatial blocking
Spatial data evaluated without spatial CV or buffer zones.
- Feature selection on full data
Variable selection or importance computed before CV.
- Temporal look-ahead
Random splits on time series or panel data.
- No embargo
Temporal CV without gap/embargo between train and test.
Value
A list with class "borg_lit_check" containing:
- flags
Data frame with columns: pattern, severity ("high", "medium", "low"), matched_text, explanation, and recommendation.
- n_flags
Total number of issues found.
- detected_context
The data context detected or provided.
- summary
One-line summary.
See Also
borg_explain_risk, borg_report
Examples
methods_text <- "We used 10-fold cross-validation to evaluate
species distribution models. Predictors were normalized and
reduced via PCA before model fitting. Occurrence records
were collected across 50 sites in Europe."
result <- borg_literature_check(methods_text)
result
Local Moran's I for Residuals
Description
Computes local Moran's I statistics to identify spatial clusters of high/low residuals (hotspots and coldspots).
Usage
borg_local_moran(residuals, x, y, k = 8L)
Arguments
residuals |
Numeric vector of model residuals. |
x |
Numeric. X-coordinates. |
y |
Numeric. Y-coordinates. |
k |
Integer. Number of nearest neighbors for spatial weights. Default: 8. |
Value
A data frame with columns: x, y, local_i, p_value, cluster_type.
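Examples

A minimal usage sketch with the default k = 8 nearest neighbors:

```r
set.seed(42)
x <- runif(100, 0, 50)
y <- runif(100, 0, 50)
res <- rnorm(100)  # residuals from some fitted model
li <- borg_local_moran(res, x, y, k = 8)
head(li)
```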
Available Classification Metrics
Description
Returns the valid classification metric names.
Usage
borg_metrics()
Value
Character vector of available classification metrics.
Examples
borg_metrics()
Multi-Scale Performance Assessment
Description
Evaluates model accuracy at increasing spatial aggregation scales. Predictions and observations are averaged within grid cells of increasing size, and performance is computed at each scale. Reveals scale-dependent error patterns.
Usage
borg_multiscale(
data,
predictions,
target,
coords,
scales = NULL,
metric = c("rmse", "mae", "rsq")
)
Arguments
data |
Data frame with coordinates and target variable. |
predictions |
Numeric vector of predicted values (same length as nrow(data)). |
target |
Character. Target variable name. |
coords |
Character vector of length 2. Coordinate columns. |
scales |
Numeric vector. Grid cell sizes to evaluate.
If NULL, a sequence of scales is derived from the coordinate extent. |
metric |
Character. "rmse", "mae", or "rsq". Default: "rmse". |
Value
A data frame with class "borg_multiscale" containing:
- scale
Grid cell size
- metric_value
Performance at this scale
- n_cells
Number of occupied grid cells
Has an autoplot() method.
References
Riemann, R., Wilson, B. T., Lister, A., & Parks, S. (2010). An effective assessment protocol for continuous geospatial datasets of forest characteristics using USFS Forest Inventory and Analysis (FIA) data. Remote Sensing of Environment, 114(10), 2337-2352. doi:10.1016/j.rse.2010.05.010
Examples
set.seed(42)
d <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100),
z = rnorm(200))
preds <- d$z + rnorm(200, sd = 0.5)
ms <- borg_multiscale(d, preds, target = "z", coords = c("x", "y"))
ms
Null Model Significance Test
Description
Tests whether model performance is significantly better than random expectation by permuting the response variable and re-evaluating with the same CV scheme. Computes z-scores and p-values.
Usage
borg_null_test(
data,
folds,
formula,
metric = c("rmse", "mae", "rsq"),
fit_fun = stats::lm,
n_null = 99L,
seed = 42L,
verbose = FALSE
)
Arguments
data |
Data frame. |
folds |
A borg_cv folds object. |
formula |
Model formula. |
metric |
Character. "rmse", "mae", or "rsq". Default: "rmse". |
fit_fun |
Function. Default: stats::lm. |
n_null |
Integer. Number of null permutations. Default: 99. |
seed |
Integer. Random seed. Default: 42. |
verbose |
Logical. Default: FALSE. |
Value
A list with class "borg_null_test" containing:
- empirical
Empirical CV metric value
- null_distribution
Numeric vector of null metric values
- z_score
Z-score of empirical vs null
- p_value
One-sided p-value
- significant
Logical. Whether model is significantly better
Has an autoplot() method showing null distribution with
empirical value.
References
Kass, J. M., Muscarella, R., Galante, P. J., Bohl, C. L., Pinilla-Buitrago, G. E., Boria, R. A., Soley-Guardia, M., & Anderson, R. P. (2021). ENMeval 2.0: Redesigned for customizable and reproducible modeling of species' niches and distributions. Methods in Ecology and Evolution, 12(9), 1602-1608. doi:10.1111/2041-210X.13628
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
d$z <- d$x * 3 + rnorm(100)
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
nt <- borg_null_test(d, cv, z ~ x + y, n_null = 19)
nt
Get Current BORG Options
Description
Returns the current state of BORG configuration options.
Usage
borg_options()
Value
A named list of current BORG options.
Examples
borg_options()
Validate an Entire Modeling Pipeline
Description
Walks a tidymodels workflow() or caret::train() object and
validates every step — preprocessing, feature selection, tuning, and
model fitting — for information leakage.
Usage
borg_pipeline(pipeline, train_idx, test_idx, data = NULL, ...)
Arguments
pipeline |
A modeling pipeline object. Supported types:
tidymodels workflows::workflow() objects and caret::train() objects. |
train_idx |
Integer vector of training row indices. |
test_idx |
Integer vector of test row indices. |
data |
Optional data frame. Required for parameter-level checks. |
... |
Additional arguments passed to inspectors. |
Details
borg_pipeline() decomposes a pipeline into stages and inspects each:
- Preprocessing: Recipe steps, preProcess, PCA, scaling
- Feature selection: Variable importance, filtering
- Hyperparameter tuning: Inner CV resamples
- Model fitting: Training data scope, row counts
- Post-processing: Threshold optimization, calibration
Each stage gets its own BorgRisk assessment. The overall result aggregates all risks across stages.
Value
An object of class "borg_pipeline" containing:
- stages
Named list of per-stage BorgRisk results
- overall
Aggregated BorgRisk for the full pipeline
- n_stages
Number of stages inspected
- leaking_stages
Character vector of stage names with hard violations
Examples
if (requireNamespace("caret", quietly = TRUE)) {
ctrl <- caret::trainControl(method = "cv", number = 5)
model <- caret::train(mpg ~ ., data = mtcars[1:25, ], method = "lm",
trControl = ctrl, preProcess = c("center", "scale"))
result <- borg_pipeline(model, train_idx = 1:25, test_idx = 26:32,
data = mtcars)
print(result)
}
Estimate Statistical Power After Blocking
Description
Computes how much statistical power is lost when switching from random to blocked cross-validation. Reports effective sample size, minimum detectable effect size, and whether the dataset is large enough.
Usage
borg_power(
data,
diagnosis = NULL,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
alpha = 0.05,
power = 0.8,
effect_size = NULL,
verbose = FALSE
)
Arguments
data |
A data frame. |
diagnosis |
A BorgDiagnosis object from borg_diagnose(). If NULL, the data are diagnosed automatically. |
coords |
Character vector of length 2 for spatial coordinates. |
time |
Character string for the time column. |
groups |
Character string for the grouping column. |
target |
Character string for the response variable. |
alpha |
Significance level. Default: 0.05. |
power |
Target power. Default: 0.80. |
effect_size |
Numeric. Expected effect size (Cohen's d for continuous, OR for binary). If NULL, reports minimum detectable effect size instead. |
verbose |
Logical. Print progress messages. Default: FALSE. |
Details
When data have spatial, temporal, or clustered dependencies, blocked CV reduces the effective sample size. This function quantifies that reduction using the design effect (DEFF):
n_{eff} = n / DEFF
The design effect is computed from:
- Spatial: Moran's I and the ratio of autocorrelation range to study extent (Griffith, 2005)
- Temporal: ACF lag-1 autocorrelation (DEFF ≈ (1 + ρ) / (1 − ρ))
- Clustered: ICC and mean cluster size (DEFF = 1 + (m − 1) × ICC)
For mixed dependencies, design effects are combined multiplicatively.
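For the clustered case the arithmetic is simple; with 20 sites of 10 observations each and an illustrative ICC of 0.3:

```r
n <- 200; m <- 10; icc <- 0.3
deff  <- 1 + (m - 1) * icc   # 1 + 9 * 0.3 = 3.7
n_eff <- n / deff            # about 54 effective observations
```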
Value
An object of class "borg_power" containing:
- n_actual
Total number of observations
- n_effective
Effective sample size after accounting for dependencies
- design_effect
Variance inflation factor from dependencies
- power_random
Statistical power under random CV
- power_blocked
Statistical power under blocked CV
- power_loss
Absolute power loss (power_random - power_blocked)
- min_detectable_effect
Minimum detectable effect at target power
- min_detectable_effect_random
Same, under random CV (for comparison)
- sufficient
Logical. Is the dataset large enough at target power?
- recommendation
Character. Human-readable recommendation.
- diagnosis
The BorgDiagnosis used
Examples
# Clustered data
clustered_data <- data.frame(
site = rep(1:20, each = 10),
value = rep(rnorm(20, sd = 2), each = 10) + rnorm(200, sd = 0.5)
)
pw <- borg_power(clustered_data, groups = "site", target = "value")
print(pw)
Predict onto a SpatRaster with AOA Masking
Description
Generates spatial predictions from a fitted model onto a
terra::SpatRaster, computes the dissimilarity index and area
of applicability, and returns a multi-layer raster with prediction,
DI, and AOA mask layers.
Usage
borg_predict_raster(
model,
raster,
train_data,
predictors = NULL,
weights = NULL,
threshold = NULL,
type = "response"
)
Arguments
model |
A fitted model with a predict() method. |
raster |
A terra::SpatRaster of predictor layers. |
train_data |
Data frame of training data used to fit the model. |
predictors |
Character vector. Predictor column names. If NULL, inferred from train_data. |
weights |
Numeric vector. Variable importance weights for DI. |
threshold |
Numeric. Manual AOA threshold. If NULL, the threshold is estimated automatically. |
type |
Character. Prediction type passed to predict(). |
Details
Requires the terra package.
Value
A terra::SpatRaster with three layers:
- prediction
Model predictions
- di
Dissimilarity index
- aoa
Area of applicability (1 = inside, 0 = outside)
Examples
if (requireNamespace("terra", quietly = TRUE)) {
set.seed(42)
r <- terra::rast(nrows = 20, ncols = 20, xmin = 0, xmax = 100,
ymin = 0, ymax = 100, nlyrs = 2)
terra::values(r) <- cbind(rnorm(400), rnorm(400))
names(r) <- c("bio1", "bio2")
train <- data.frame(bio1 = rnorm(50), bio2 = rnorm(50))
train$y <- train$bio1 * 2 + rnorm(50, sd = 0.5)
model <- lm(y ~ bio1 + bio2, data = train)
result <- borg_predict_raster(model, r, train, predictors = c("bio1", "bio2"))
names(result)
}
Spatial Prediction Uncertainty Map
Description
Collects predictions from all CV folds at each test location, computes per-location mean and standard deviation, and provides a spatial uncertainty map.
Usage
borg_prediction_map(data, folds, formula, coords, fit_fun = stats::lm)
Arguments
data |
Data frame with predictor and coordinate columns. |
folds |
A borg_cv object defining the folds. |
formula |
Model formula. |
coords |
Character vector of length 2. Coordinate column names. |
fit_fun |
Function. Model fitting function. Default: stats::lm. |
Value
A data frame with class "borg_pred_map" containing:
- x, y
Coordinates
- pred_mean
Mean prediction across folds where this obs was in test set
- pred_sd
SD of predictions (uncertainty)
- actual
Observed target value
- residual_mean
Mean residual
- n_folds_tested
Number of folds where this obs appeared in test set
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
pm <- borg_prediction_map(d, cv, z ~ x + y, coords = c("x", "y"))
head(pm)
Register BORG Hooks
Description
Registers BORG validation hooks that automatically check data dependencies when using common ML framework functions. This is an experimental feature.
Usage
borg_register_hooks(
frameworks = c("rsample", "caret", "mlr3"),
action = c("error", "warn", "message")
)
Arguments
frameworks |
Character vector. Which frameworks to hook into. Options: "rsample", "caret", "mlr3". Default: all available. |
action |
Character. What to do when dependencies detected: "error" (block), "warn" (warn but proceed), "message" (info only). |
Details
This function uses R's trace mechanism to add BORG checks to framework functions. The hooks are session-specific and do not persist.
To remove hooks, use borg_unregister_hooks().
Value
Invisible NULL. Called for side effect.
Examples
if (requireNamespace("rsample", quietly = TRUE)) {
# Register hooks for rsample
borg_register_hooks("rsample")
# Now vfold_cv() will check for dependencies
spatial_data <- data.frame(
lon = runif(50), lat = runif(50), response = rnorm(50)
)
options(borg.check_data = spatial_data)
options(borg.check_coords = c("lon", "lat"))
# Remove hooks
borg_unregister_hooks()
}
Repeated Blocked Cross-Validation
Description
Repeats borg_cv multiple times with different random seeds
and aggregates the results. Repeated CV provides better variance
estimates and more stable performance metrics than a single CV run,
particularly important for spatial/temporal blocking where fold
assignment depends on the random seed.
Usage
borg_repeated_cv(
data,
repeats = 10L,
v = 5L,
seeds = NULL,
aggregate = TRUE,
...
)
Arguments
data |
A data frame. |
repeats |
Integer. Number of repetitions. Default: 10. |
v |
Integer. Number of folds per repetition. Default: 5. |
seeds |
Integer vector of length repeats. If NULL, seeds are generated automatically. |
aggregate |
Logical. If TRUE (default), returns aggregated metrics. If FALSE, returns all individual fold results. |
... |
Additional arguments passed to borg_cv(). |
Details
Each repetition calls borg_cv() with a different seed, producing
different fold assignments. The diagnosis is computed once and reused
across repetitions for consistency.
For aggregation with a model, use borg_fold_performance()
on each repetition's folds and combine the results.
Value
A list with class "borg_repeated_cv" containing:
- folds
List of length repeats, each element a list of v train/test fold pairs.
- repeats
Number of repetitions.
- v
Number of folds.
- strategy
CV strategy used.
- seeds
Seeds used for each repetition.
- diagnosis
The BorgDiagnosis from the first run.
Has print() and autoplot() methods.
See Also
borg_cv, borg_fold_performance
Examples
set.seed(42)
d <- data.frame(
x = runif(200), y = runif(200), z = rnorm(200)
)
rcv <- borg_repeated_cv(d, repeats = 3, v = 5,
coords = c("x", "y"), target = "z")
rcv
length(rcv$folds) # 3 repetitions
length(rcv$folds[[1]]) # 5 folds each
Generate BORG HTML Diagnostic Report
Description
Creates a self-contained HTML report with embedded plots summarizing the full BORG analysis: diagnosis, variogram, CV folds, performance, and risk assessment.
Usage
borg_report(
object,
file = "borg_report.html",
title = "BORG Diagnostic Report",
open = interactive()
)
Arguments
object |
A borg_workflow object. |
file |
Character. Output file path. Default: "borg_report.html". |
title |
Character. Report title. |
open |
Logical. Open in browser after generation. Default: interactive(). |
Details
Plots are embedded as base64-encoded PNGs. Requires ggplot2. No rmarkdown or pandoc dependency.
Value
Invisible path to the generated HTML file.
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
wf <- borg_workflow(d, z ~ x + y, coords = c("x", "y"))
borg_report(wf, file = tempfile(fileext = ".html"), open = FALSE)
Convert BORG Folds to an rsample rset Object
Description
Creates a proper rset object from BORG cross-validation folds,
enabling direct use with tune::tune_grid(),
tune::fit_resamples(), and other tidymodels infrastructure
without manual conversion.
Usage
borg_rset(data = NULL, folds = NULL, cv_obj = NULL)
Arguments
data |
A data frame. Required in all cases because rsample rsplit objects embed the data. |
folds |
A list of lists, each with train and test integer index vectors. |
cv_obj |
A borg_cv object. |
Details
The returned object works directly with tune::tune_grid(),
tune::fit_resamples(), rsample::assessment(), and
rsample::analysis(). The folds preserve BORG's spatial/temporal
blocking structure.
Value
An rset object (inheriting from tbl_df) compatible
with the tidymodels ecosystem. Each row has an rsplit column
containing train/test index information and an id column.
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
rset <- borg_rset(data = d, cv_obj = cv)
class(rset) # "borg_rset" "rset" "tbl_df" ...
Suggest Sampling Locations to Improve AOA
Description
Identifies locations in the prediction domain where new training data would most reduce the dissimilarity index, thereby expanding the area of applicability.
Usage
borg_sample_design(
train,
prediction,
predictors = NULL,
coords = NULL,
n = 10L,
weights = NULL
)
Arguments
train |
Data frame of existing training data. |
prediction |
Data frame of prediction locations (with coordinates). |
predictors |
Character vector. Predictor columns for DI computation. |
coords |
Character vector of length 2. Coordinate columns in prediction. |
n |
Integer. Number of suggested sampling locations. Default: 10. |
weights |
Numeric vector. Variable importance weights for DI. |
Value
A data frame with class "borg_sample_design" containing
the top-n prediction locations ranked by DI (highest first), with
columns: x, y, di, rank.
Examples
set.seed(42)
train <- data.frame(x = runif(50, 0, 50), y = runif(50, 0, 50),
a = rnorm(50))
pred <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100),
a = rnorm(200))
design <- borg_sample_design(train, pred, predictors = "a",
coords = c("x", "y"), n = 5)
design
Spatial SHAP Values
Description
Approximates SHAP values using block-conditional expectations instead of naive row permutation. Standard SHAP marginalizes over all observations, which creates impossible feature combinations when the data are spatially structured. Spatial SHAP marginalizes within spatial blocks, preserving realistic covariate relationships.
Usage
borg_shap(
model,
data,
target,
coords = NULL,
predictors = NULL,
n_blocks = 10,
n_samples = 50,
explain_idx = NULL,
seed = 42
)
Arguments
model |
A fitted model with a predict() method. |
data |
Data frame with predictors. |
target |
Character. Target variable name (excluded from SHAP). |
coords |
Character vector of length 2. Coordinate column names. If provided, uses spatial blocking for marginal expectations. |
predictors |
Character vector. Variables to compute SHAP for.
If NULL, all columns except target and coords are used. |
n_blocks |
Integer. Number of spatial blocks for marginal expectations. Default: 10. |
n_samples |
Integer. Number of background samples per block for marginal expectations. Default: 50. |
explain_idx |
Integer vector. Row indices to explain. If NULL, all rows are explained. |
seed |
Integer. Random seed. Default: 42. |
Details
Algorithm
For each observation x_i and feature j:
1. Identify the spatial block containing x_i.
2. Sample background points from other blocks.
3. Compute the marginal contribution: replace feature j with values from the background points, keeping all other features fixed.
4. Average the change in prediction.
This approximates the Shapley value while respecting spatial structure.
Why not standard SHAP?
Standard kernel SHAP or marginal SHAP samples replacement values uniformly from the dataset. For spatial data, this creates combinations that never occur in reality (e.g., tropical temperature with arctic precipitation), biasing the SHAP values.
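One block-conditional marginal contribution can be sketched in plain R with lm(); the blocking, variable names, and sample sizes below are illustrative, not the borg_shap() internals:

```r
# Marginal contribution of feature "a" at one observation: swap "a" for
# values drawn from OTHER spatial blocks, hold everything else fixed,
# and average the change in prediction.
set.seed(1)
d <- data.frame(x = runif(100), a = rnorm(100), b = rnorm(100))
d$z <- 3 * d$a - d$b + rnorm(100, sd = 0.5)
fit <- lm(z ~ a + b, data = d)

block <- cut(d$x, breaks = 5, labels = FALSE)  # crude 1D spatial blocks
i  <- 1                                        # observation to explain
bg <- d[block != block[i], ]                   # background from other blocks
bg <- bg[sample(nrow(bg), 30), ]

perturbed   <- d[rep(i, nrow(bg)), ]           # copies of x_i ...
perturbed$a <- bg$a                            # ... with "a" replaced
phi_a <- predict(fit, d[i, ]) - mean(predict(fit, perturbed))
phi_a  # approximate contribution of "a" at x_i
```

Because the background values come only from other blocks, the replacement values are realistic for the covariate, avoiding the impossible combinations described above.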
Value
A list with class "borg_shap" containing:
- shap_values
Matrix (n_explain x n_predictors) of SHAP values
- baseline
Mean prediction (expected value)
- feature_importance
Named vector of mean |SHAP| per feature
- predictors
Feature names
- method
"spatial_block" or "standard"
Has print() and autoplot() methods.
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100),
a = rnorm(100), b = rnorm(100))
d$z <- 3 * d$a - d$b + rnorm(100, sd = 0.5)
model <- lm(z ~ a + b, data = d)
shap <- borg_shap(model, d, target = "z", coords = c("x", "y"),
explain_idx = 1:20)
shap
Generate Synthetic Data with Known Leakage
Description
Creates datasets with controlled spatial autocorrelation, temporal dependence, or target leakage. The true inflation magnitude is known, enabling benchmarking of leakage detection methods and quantifying how much performance metrics inflate under different CV strategies.
Usage
borg_simulate(
n = 500L,
type = c("spatial", "temporal", "target_leak", "preprocessing_leak", "combined",
"independent"),
n_predictors = 5L,
signal_strength = 0.3,
autocorrelation = 0.5,
leak_strength = 0.9,
grid_size = NULL,
seed = 42L
)
Arguments
n |
Integer. Number of observations. Default: 500. |
type |
Character. Type of dependency to simulate. One of "spatial", "temporal", "target_leak", "preprocessing_leak", "combined", or "independent". |
n_predictors |
Integer. Number of predictor variables. Default: 5. |
signal_strength |
Numeric in [0, 1]. Strength of the predictor-response signal. Default: 0.3. |
autocorrelation |
Numeric in [0, 1]. Strength of the spatial/temporal autocorrelation. Default: 0.5. |
leak_strength |
Numeric in [0, 1]. Strength of the induced leakage. Default: 0.9. |
grid_size |
Integer. For spatial data, side length of the
spatial grid (total points = grid_size^2). |
seed |
Integer. Random seed for reproducibility. Default: 42. |
Details
Simulation types
- "spatial": Predictors and response have Gaussian spatial autocorrelation on a 2D grid. Random CV inflates R^2 relative to spatial block CV.
- "temporal": Predictors and response are AR(1) time series. Random CV inflates metrics relative to temporal CV.
- "target_leak": One predictor is a noisy copy of the response (post-hoc information).
- "preprocessing_leak": Predictors are normalized on the full dataset, not within each fold. The data itself is independent, but leakage occurs through shared statistics.
- "combined": Spatial autocorrelation plus one target-leaked predictor.
- "independent": No dependencies. Control scenario where random and blocked CV should yield similar results.
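The "preprocessing_leak" scenario corresponds to a classic mistake that can be sketched in a few lines of base R (illustrative only, not the simulator's code):

```r
# Leaky: scaling statistics are computed on ALL rows before the split,
# so test-set information contaminates the training features.
set.seed(1)
x <- rnorm(100)
x_leaky <- scale(x)                  # uses full-data mean and sd

train <- 1:70; test <- 71:100

# Correct: compute scaling parameters on the training rows only,
# then apply those same parameters to the test rows.
mu   <- mean(x[train])
sdev <- sd(x[train])
x_train <- (x[train] - mu) / sdev
x_test  <- (x[test]  - mu) / sdev    # test rows use train statistics
```

The leaked version looks harmless because the values barely change, which is exactly why this form of leakage is easy to miss and worth simulating.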
Value
A list with class "borg_simulation" containing:
- data
The generated data frame with predictors, response, and (if spatial/temporal) coordinate/time columns.
- true_r2
The true R-squared if the model could perfectly recover the signal (upper bound for honest evaluation).
- type
The dependency type used.
- params
List of all generation parameters.
- leaked_vars
Character vector of intentionally leaked variable names (for target_leak/preprocessing_leak types).
- coords
Character vector of coordinate column names (if spatial).
- time_col
Name of time column (if temporal).
Examples
# Spatial leakage benchmark
sim <- borg_simulate(n = 300, type = "spatial", autocorrelation = 0.7)
str(sim$data)
sim$true_r2
# Compare random vs spatial CV
cv_spatial <- borg_cv(sim$data, coords = sim$coords, target = "y")
Spatial Block Cross-Validation (rsample-compatible)
Description
Creates spatial block CV folds that plug directly into
tune::tune_grid() and tune::fit_resamples().
Usage
borg_spatial_cv(
data,
coords,
target = NULL,
v = 5,
buffer = NULL,
repeats = 1L,
...
)
Arguments
data |
A data frame, sf object, or SpatVector. |
coords |
Character vector of length 2. Coordinate column names. |
target |
Character. Target variable for autocorrelation diagnosis. |
v |
Integer. Number of folds. Default: 5. |
buffer |
Numeric. Optional spatial buffer distance for exclusion. |
repeats |
Integer. Number of repeated fold sets. Default: 1. |
... |
Additional arguments passed to borg_cv(). |
Value
A borg_rset object (subclass of rset) compatible
with tidymodels.
Examples
if (requireNamespace("rsample", quietly = TRUE)) {
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
folds <- borg_spatial_cv(d, coords = c("x", "y"), target = "z")
}
Buffered Leave-One-Out Cross-Validation
Description
For each observation, holds it out as the test set and removes all training points within a buffer distance. This ensures spatial independence between train and test in each iteration, following the approach recommended by Roberts et al. (2017).
Usage
borg_spatial_loo(
data,
coords,
buffer = NULL,
target = NULL,
diagnosis = NULL,
max_iter = NULL,
seed = 42L,
verbose = FALSE
)
Arguments
data |
A data frame. |
coords |
Character vector of length 2. Coordinate column names. |
buffer |
Numeric. Exclusion buffer distance (in coordinate units).
Training points within this distance of the test point are removed.
If NULL, uses the estimated autocorrelation range from the diagnosis. |
target |
Character. Response variable name (used for diagnosis if diagnosis is NULL). |
diagnosis |
A BorgDiagnosis object. |
max_iter |
Integer or NULL. Maximum number of LOO iterations.
If NULL, uses all n observations. |
seed |
Integer. Random seed for subsampling. Default: 42. |
verbose |
Logical. Print progress. Default: FALSE. |
Value
A list with class "borg_spatial_loo" containing:
- folds
List of length n (or max_iter), each with train and test integer index vectors.
- buffer
Buffer distance used.
- n_excluded
Integer vector: number of training points excluded per iteration.
- effective_train_size
Integer vector: training set size per iteration after exclusion.
References
Roberts, D.R., et al. (2017). Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography, 40(8), 913-929. doi:10.1111/ecog.02881
Examples
set.seed(42)
d <- data.frame(x = runif(50), y = runif(50), z = rnorm(50))
loo <- borg_spatial_loo(d, coords = c("x", "y"), buffer = 0.2)
length(loo$folds) # 50 iterations
mean(loo$n_excluded) # avg points excluded per iteration
Analyze CV Fold Stability Across Repeats
Description
Quantifies how much CV results and fold assignments change across
repeated runs. Uses a borg_cv object with repeats > 1.
Usage
borg_stability(
object,
data,
formula,
metric = c("rmse", "mae", "rsq"),
fit_fun = stats::lm
)
Arguments
object |
A borg_cv object with repeats > 1. |
data |
Data frame for computing per-repeat performance. |
formula |
Model formula. |
metric |
Character. One of "rmse", "mae", or "rsq". |
fit_fun |
Function. Model fitting function. Default: stats::lm. |
Value
A list with class "borg_stability" containing:
- metric_stability
Data frame: repeat, mean_metric, sd_metric
- metric_cv
Coefficient of variation of mean metric across repeats
- assignment_stability
Mean Jaccard similarity of test sets between repeats
- summary
Character assessment: "stable", "moderate", or "unstable"
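The assignment_stability component is based on Jaccard similarity between test sets, which can be sketched as a tiny helper (illustrative, not the package internals):

```r
# Jaccard similarity of two test-set index vectors:
# |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical).
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(1:10, 1:10)   # identical test sets
jaccard(1:10, 6:15)   # half overlap: 5 shared of 15 total
```

Averaging this over all pairs of repeats gives a single number for how much fold membership depends on the random seed.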
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z", repeats = 5)
stab <- borg_stability(cv, d, z ~ x + y)
stab
Prediction Stability Map
Description
Creates a continuous spatial map of prediction variance across
cross-validation folds. Unlike borg_stability() which
returns summary statistics, this function produces per-location
stability scores showing where predictions are consistent versus
volatile across different train/test partitions.
Usage
borg_stability_map(
model,
data,
new = NULL,
target,
coords,
formula = NULL,
fit_fun = NULL,
folds = NULL,
v = 10,
seed = 42
)
Arguments
model |
A fitted model with a predict() method. |
data |
Data frame with predictors, target, and coordinates. |
new |
Data frame of prediction locations. If NULL, predictions are made at the data locations. |
target |
Character. Target variable name. |
coords |
Character vector of length 2. Coordinate column names. |
formula |
Model formula. Required if it cannot be extracted from model. |
fit_fun |
Function. Model fitting function. If NULL, inferred from model. |
folds |
A borg_cv object. If NULL, folds are generated automatically. |
v |
Integer. Number of folds if generating automatically. Default: 10. |
seed |
Integer. Random seed. Default: 42. |
Details
How it works
For each CV fold:
1. Refit the model on the training set.
2. Predict on all locations (not just the test set).
3. Store the per-location predictions.
Then compute per-location statistics across folds. Locations with
high pred_sd are unstable — the prediction changes substantially
depending on which data is included in training.
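The per-location computation can be sketched in plain R (hypothetical fold construction and variable names; not the borg_stability_map() internals):

```r
# Refit per fold, predict at ALL locations, then summarize per location.
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100))
d$z <- sin(d$x * 6) + rnorm(100, sd = 0.3)

# Five random folds of 20 test indices each (BORG would use blocked folds).
folds <- split(sample(nrow(d)), rep(1:5, length.out = nrow(d)))

preds <- sapply(folds, function(test_idx) {
  fit <- lm(z ~ x + y, data = d[-test_idx, ])  # train on the remainder
  predict(fit, d)                              # predict every location
})

pred_mean <- rowMeans(preds)        # per-location mean prediction
pred_sd   <- apply(preds, 1, sd)    # per-location instability score
```

Locations where pred_sd is large are the ones whose predictions depend heavily on which observations end up in training.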
Interpretation
High instability at a location can indicate:
- The location is in a data-sparse region (near AOA boundary)
- The model is overfitting to nearby training points
- The underlying relationship changes spatially (non-stationarity)
Value
A data frame with class "borg_stability_map" containing:
- pred_mean
Mean prediction across folds
- pred_sd
Standard deviation of predictions across folds
- pred_cv
Coefficient of variation (|sd/mean|)
- pred_range
Range of predictions (max - min)
- n_folds_predicted
Number of folds where this point was in the test set (or could be predicted)
- x, y
Coordinates
Has print() and autoplot() methods.
Examples
set.seed(42)
d <- data.frame(x = runif(100, 0, 50), y = runif(100, 0, 50))
d$z <- sin(d$x / 5) + rnorm(100, sd = 0.5)
model <- lm(z ~ x + y, data = d)
sm <- borg_stability_map(model, d, target = "z", coords = c("x", "y"),
v = 5)
sm
Temporal Block Cross-Validation (rsample-compatible)
Description
Creates temporal block CV folds that plug directly into
tune::tune_grid() and tune::fit_resamples().
Usage
borg_temporal_cv(data, time, target = NULL, v = 5, embargo = NULL, ...)
Arguments
data |
A data frame. |
time |
Character. Time column name. |
target |
Character. Target variable. |
v |
Integer. Number of folds. Default: 5. |
embargo |
Numeric. Time gap between train and test sets. |
... |
Additional arguments passed to borg_cv(). |
Value
A borg_rset object (subclass of rset) compatible
with tidymodels.
Examples
if (requireNamespace("rsample", quietly = TRUE)) {
d <- data.frame(time = 1:100, x = rnorm(100), y = cumsum(rnorm(100)))
folds <- borg_temporal_cv(d, time = "time", target = "y")
}
Spatially Thin Occurrence Data
Description
Reduces spatial clustering by ensuring a minimum distance between all
retained observations. Uses iterative nearest-neighbor removal: at each
step, the point with the smallest nearest-neighbor distance is removed
until all pairwise distances exceed min_dist.
Usage
borg_thin(data, coords = NULL, min_dist = NULL, verbose = FALSE)
Arguments
data |
A data frame, sf object, or SpatVector. |
coords |
Character vector of length 2. Coordinate column names. Required for data frames; ignored for sf/SpatVector. |
min_dist |
Numeric. Minimum distance between retained points.
Units match the coordinate system (degrees for lat/lon, meters for
projected CRS). If NULL, estimated from the data. |
verbose |
Logical. Print thinning progress. Default: FALSE. |
Details
Spatial thinning is standard practice in species distribution modeling to reduce sampling bias. Clustered occurrences inflate apparent model performance and bias spatial CV fold sizes.
For geographic coordinates (lat/lon), distances are computed using the Haversine formula (meters). For projected coordinates, Euclidean distance is used.
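The iterative nearest-neighbor removal can be sketched in plain R for projected (Euclidean) coordinates; borg_thin() additionally handles Haversine distances and sf/SpatVector input, so this is only a conceptual sketch:

```r
# Repeatedly drop the point with the smallest nearest-neighbor distance
# until all pairwise distances exceed min_dist.
thin_euclidean <- function(xy, min_dist) {
  keep <- seq_len(nrow(xy))
  repeat {
    dm <- as.matrix(dist(xy[keep, ]))
    diag(dm) <- Inf                    # ignore self-distances
    nn <- apply(dm, 1, min)            # nearest-neighbor distance per point
    if (min(nn) >= min_dist) break
    keep <- keep[-which.min(nn)]       # remove the most crowded point
  }
  keep                                 # indices of retained rows
}

set.seed(42)
xy   <- data.frame(x = runif(30), y = runif(30))
kept <- thin_euclidean(xy, min_dist = 0.2)
```

Removing the most crowded point first (rather than a random one) retains as many observations as possible for a given spacing.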
Value
The input data filtered to retained rows. Has attribute
"thinned_idx" with the original row indices of kept observations,
and "n_removed" with the count of removed points.
Examples
# Thin clustered points to 5-unit minimum spacing
set.seed(42)
d <- data.frame(
x = c(rnorm(50, 0, 1), rnorm(50, 10, 1)),
y = c(rnorm(50, 0, 1), rnorm(50, 10, 1)),
species = "A"
)
d_thin <- borg_thin(d, coords = c("x", "y"), min_dist = 1)
nrow(d_thin) # fewer rows
Convert BORG Folds to biomod2 Format
Description
Converts BORG CV folds to the data split table format expected by
biomod2's BIOMOD_Modeling() function.
Usage
borg_to_biomod2(borg_cv)
Arguments
borg_cv |
A borg_cv object. |
Value
A matrix where each column is a CV run and each row is an
observation. Values are TRUE (calibration) or FALSE
(validation), matching biomod2's DataSplitTable format.
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
split_table <- borg_to_biomod2(cv)
dim(split_table)
Convert BORG Folds to ENMeval Partition Format
Description
Converts BORG CV folds to the partition format expected by ENMeval's
ENMevaluate() function (a vector of fold assignments).
Usage
borg_to_enmeval(borg_cv)
Arguments
borg_cv |
A borg_cv object. |
Value
A named list with:
- occs.grp
Integer vector of fold assignments for occurrence points
- bg.grp
Integer vector for background points (all assigned to fold 0)
Examples
set.seed(42)
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
parts <- borg_to_enmeval(cv)
table(parts$occs.grp)
mlr3 Resampling for BORG Cross-Validation
Description
Creates a native mlr3::Resampling object from BORG
cross-validation folds, enabling direct use with mlr3 benchmarking
infrastructure.
Usage
borg_to_mlr3(data = NULL, folds = NULL, cv_obj = NULL, id = "borg")
Arguments
data |
A data frame. |
folds |
A list of lists, each with train and test integer index vectors. |
cv_obj |
A borg_cv object. |
id |
Character. Resampling identifier. Default: "borg". |
Details
The returned resampling is pre-instantiated — it contains fixed
train/test splits that respect BORG's spatial/temporal blocking.
Calling $instantiate() is not needed (and will not
overwrite the existing splits).
Value
An mlr3::Resampling R6 object that can be used
directly with mlr3::resample(), mlr3::benchmark(),
and other mlr3 infrastructure.
Examples
if (requireNamespace("mlr3", quietly = TRUE)) {
d <- data.frame(x = runif(100), y = runif(100), z = rnorm(100))
cv <- borg_cv(d, coords = c("x", "y"), target = "z")
resampling <- borg_to_mlr3(cv_obj = cv)
resampling$iters # number of folds
}
BORG-Guarded trainControl
Description
A guarded version of caret::trainControl() that validates CV settings
against data dependencies.
Usage
borg_trainControl(
data,
method = "cv",
number = 10,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
allow_override = FALSE,
...
)
Arguments
data |
A data frame. Required for dependency checking. |
method |
Character. Resampling method. |
number |
Integer. Number of folds or iterations. |
coords |
Character vector. Coordinate columns for spatial check. |
time |
Character. Time column for temporal check. |
groups |
Character. Group column for clustered check. |
target |
Character. Target variable. |
allow_override |
Logical. Allow random CV despite dependencies. |
... |
Additional arguments passed to caret::trainControl(). |
Value
A trainControl object, potentially modified for blocked CV.
Examples
if (requireNamespace("caret", quietly = TRUE)) {
spatial_data <- data.frame(
lon = runif(50), lat = runif(50), response = rnorm(50)
)
ctrl <- borg_trainControl(
data = spatial_data,
method = "cv",
number = 5,
coords = c("lon", "lat")
)
}
Assess Geographic Transferability
Description
Evaluates how well a model transfers across geographic regions by splitting data into spatial zones, training on each zone, and testing on all others. Quantifies performance decay with geographic distance.
Usage
borg_transferability(
data,
formula,
coords,
n_regions = 4L,
metric = c("rmse", "mae", "rsq"),
fit_fun = stats::lm
)
Arguments
data |
Data frame with coordinates and predictors. |
formula |
Model formula. |
coords |
Character vector of length 2. Coordinate columns. |
n_regions |
Integer. Number of geographic regions to create. Default: 4. |
metric |
Character. Performance metric. Default: "rmse". |
fit_fun |
Function. Model fitting function. Default: stats::lm. |
Value
A list with class "borg_transferability" containing:
- matrix
Performance matrix (n_regions x n_regions): entry (i, j) = metric when training on region i, testing on region j
- distance_matrix
Geographic distance between region centroids
- decay
Data frame: distance, metric_value for plotting decay
- mean_transfer
Mean cross-region performance
- mean_within
Mean within-region performance
Examples
set.seed(42)
d <- data.frame(x = runif(200, 0, 100), y = runif(200, 0, 100),
a = rnorm(200))
d$z <- d$a * 2 + sin(d$x / 20) + rnorm(200, sd = 0.5)
tf <- borg_transferability(d, z ~ a, coords = c("x", "y"), n_regions = 4)
tf
Unregister BORG Hooks
Description
Removes BORG validation hooks from framework functions.
Usage
borg_unregister_hooks()
Value
Invisible NULL.
Validate Complete Evaluation Workflow
Description
borg_validate() performs post-hoc validation of an entire evaluation
workflow, checking all components for information leakage.
Usage
borg_validate(workflow, strict = FALSE)
Arguments
workflow |
A list containing the evaluation workflow components:
data, train_idx, test_idx, and optionally preprocessing, feature, model, and threshold information. |
strict |
Logical. If TRUE, any hard violation causes an error. Default: FALSE (returns report only). |
Details
borg_validate() inspects each component of an evaluation workflow:
- Split validation: Checks train/test index isolation
- Preprocessing audit: Traces preprocessing parameters to verify train-only origin
- Feature audit: Checks for target leakage and proxy features
- Model audit: Validates that model used only training data
- Threshold audit: Checks if any thresholds were optimized on test data
Value
A BorgRisk object containing a comprehensive
assessment of the workflow.
See Also
borg for proactive enforcement,
borg_inspect for single-object inspection.
Examples
# Validate an existing workflow
data <- data.frame(x = rnorm(100), y = rnorm(100))
result <- borg_validate(list(
data = data,
train_idx = 1:70,
test_idx = 71:100
))
# Check validity
if (!result@is_valid) {
print(result) # Shows detailed risk report
}
BORG-Guarded vfold_cv
Description
A guarded version of rsample::vfold_cv() that checks for data
dependencies before creating folds. If spatial, temporal, or clustered
dependencies are detected, random CV is blocked.
Usage
borg_vfold_cv(
data,
v = 10,
repeats = 1,
strata = NULL,
coords = NULL,
time = NULL,
groups = NULL,
target = NULL,
allow_override = FALSE,
auto_block = FALSE,
...
)
Arguments
data |
A data frame. |
v |
Integer. Number of folds. Default: 10. |
repeats |
Integer. Number of repeats. Default: 1. |
strata |
Character. Column name for stratification. |
coords |
Character vector of length 2. Coordinate columns for spatial check. |
time |
Character. Time column for temporal check. |
groups |
Character. Group column for clustered check. |
target |
Character. Target variable for dependency detection. |
allow_override |
Logical. If TRUE, allow random CV with explicit confirmation. Default: FALSE. |
auto_block |
Logical. If TRUE, automatically switch to blocked CV when dependencies detected. If FALSE, throw error. Default: FALSE. |
... |
Additional arguments passed to rsample::vfold_cv(). |
Value
If no dependencies detected or allow_override = TRUE, returns
an rset object from rsample. If dependencies detected and
auto_block = TRUE, returns BORG-generated blocked CV folds.
See Also
borg_cv for direct blocked CV generation.
Examples
if (requireNamespace("rsample", quietly = TRUE)) {
# Safe: no dependencies
data <- data.frame(x = rnorm(100), y = rnorm(100))
folds <- borg_vfold_cv(data, v = 5)
# Use auto_block to automatically switch to spatial CV:
spatial_data <- data.frame(
lon = runif(100, -10, 10),
lat = runif(100, -10, 10),
response = rnorm(100)
)
folds <- borg_vfold_cv(spatial_data, coords = c("lon", "lat"),
target = "response", auto_block = TRUE)
}
Willmott's Index of Agreement
Description
Computes Willmott's d (original, refined d1, and modified dr) for spatial model assessment.
Usage
borg_willmott(actual, predicted)
Arguments
actual |
Numeric vector of observed values. |
predicted |
Numeric vector of predicted values. |
Value
A list with d (original), d1 (refined),
and dr (modified).
References
Willmott, C. J. (1981). On the validation of models. Physical Geography, 2(2), 184-194. doi:10.1080/02723646.1981.10642213
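The original index can be computed directly from its standard definition (shown for d only; the refined d1 and modified dr variants use different denominators):

```r
# Willmott's original index of agreement:
# d = 1 - sum((P - O)^2) / sum((|P - mean(O)| + |O - mean(O)|)^2)
willmott_d <- function(actual, predicted) {
  o_bar <- mean(actual)
  1 - sum((predicted - actual)^2) /
      sum((abs(predicted - o_bar) + abs(actual - o_bar))^2)
}

willmott_d(1:10, 1:10)   # perfect agreement gives d = 1
```

d is bounded in [0, 1], with 1 indicating perfect agreement and values near 0 indicating no agreement beyond the mean.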
Create a BORG Validation Workflow
Description
Wraps the full model evaluation pipeline into a single trackable object: diagnose data dependencies, generate appropriate CV folds, fit a model per fold, and validate each split for leakage.
Usage
borg_workflow(
data,
formula,
coords = NULL,
time = NULL,
groups = NULL,
v = 5,
fit_fun = stats::lm,
metric = c("rmse", "mae", "rsq"),
buffer = NULL,
parallel = FALSE,
verbose = FALSE,
...
)
Arguments
data |
Data frame with predictor and target columns. |
formula |
Model formula (e.g. z ~ x + y). |
coords |
Character vector of coordinate column names. |
time |
Character. Time column name. |
groups |
Character. Group column name. |
v |
Integer. Number of folds. Default: 5. |
fit_fun |
Function to fit a model. Default: stats::lm. |
metric |
Character. Performance metric. Default: "rmse". |
buffer |
Numeric. Optional spatial buffer. |
parallel |
Logical. If TRUE, folds are fitted in parallel. |
verbose |
Logical. Default: FALSE. |
... |
Additional arguments passed to borg_cv(). |
Value
A borg_workflow object (list) containing:
- diagnosis
BorgDiagnosis object
- cv
borg_cv object with folds
- models
List of fitted models (one per fold)
- predictions
List of prediction vectors (one per fold)
- performance
borg_fold_perf data frame
- risks
List of BorgRisk objects (one per fold)
- data
Original data
- formula
Model formula
- params
Workflow parameters
Examples
set.seed(42)
d <- data.frame(
x = runif(100, 0, 100), y = runif(100, 0, 100),
z = rnorm(100)
)
wf <- borg_workflow(d, z ~ x + y, coords = c("x", "y"))
wf
Generate CV Leakage Report
Description
Generates a detailed report of cross-validation leakage issues.
Usage
cv_leakage_report(cv_object, train_idx, test_idx)
Arguments
cv_object |
A cross-validation object (trainControl, vfold_cv, etc.). |
train_idx |
Integer vector of training indices. |
test_idx |
Integer vector of test indices. |
Value
A list with detailed CV leakage information.
Examples
# Using caret trainControl
if (requireNamespace("caret", quietly = TRUE)) {
folds <- list(Fold1 = 1:10, Fold2 = 11:20, Fold3 = 21:25)
ctrl <- caret::trainControl(method = "cv", index = folds)
report <- cv_leakage_report(ctrl, train_idx = 1:25, test_idx = 26:32)
print(report)
}
Plot BORG Objects
Description
S3 plot method for BORG risk assessment objects.
Usage
## S3 method for class 'BorgRisk'
plot(x, title = NULL, max_risks = 10, ...)
Arguments
x |
A BorgRisk object. |
title |
Optional custom plot title. |
max_risks |
Maximum number of risks to display. Default: 10. |
... |
Additional arguments (currently unused). |
Details
Displays a visual summary of detected risks:
- Hard violations shown in red
- Soft inflation risks shown in yellow/orange
- Green "OK" when no risks detected
Value
Invisibly returns NULL. Called for plotting side effect.
Examples
# No risks
data <- data.frame(x = 1:100, y = 101:200)
result <- borg_inspect(data, train_idx = 1:70, test_idx = 71:100)
plot(result)
# With overlap violation
result_bad <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
plot(result_bad)
Plot CV Comparison Results
Description
Creates a visualization comparing random vs blocked CV performance.
Usage
## S3 method for class 'borg_comparison'
plot(x, type = c("boxplot", "density", "paired"), ...)
Arguments
x |
A borg_comparison object. |
type |
Character. Plot type: "boxplot", "density", or "paired". Default: "boxplot". |
... |
Additional arguments passed to plotting functions. |
Value
The borg_comparison object x, returned invisibly.
Called for the side effect of producing a plot.
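Examples
This method's page has no example; the sketch below is illustrative only. compare_cv() is a stand-in name for whichever function in your workflow produces a borg_comparison object (the actual constructor is not shown on this page), so substitute the real one.

```r
# Illustrative sketch -- compare_cv() is a hypothetical placeholder for the
# function that returns a borg_comparison object; replace it with the
# actual constructor from the package.
set.seed(42)
d <- data.frame(
  x = runif(100, 0, 100), y = runif(100, 0, 100),
  response = rnorm(100)
)
cmp <- compare_cv(d, response ~ x + y, coords = c("x", "y"))  # hypothetical
plot(cmp, type = "boxplot")  # compare random vs blocked CV performance
```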
Plot BORG Result Objects
Description
S3 plot method for borg_result objects from borg().
Usage
## S3 method for class 'borg_result'
plot(
x,
type = c("split", "risk", "temporal", "groups"),
fold = 1,
time = NULL,
groups = NULL,
title = NULL,
...
)
Arguments
x |
A borg_result object from borg(). |
type |
Character. Plot type: "split", "risk", "temporal", or "groups". Default: "split". |
fold |
Integer. Which fold to plot (for split visualization). Default: 1. |
time |
Column name or values for temporal plots. |
groups |
Column name or values for group plots. |
title |
Optional custom plot title. |
... |
Additional arguments passed to internal plot functions. |
Value
Invisibly returns NULL. Called for plotting side effect.
Examples
set.seed(42)
data <- data.frame(
x = runif(100, 0, 100),
y = runif(100, 0, 100),
response = rnorm(100)
)
result <- borg(data, coords = c("x", "y"), target = "response")
plot(result) # Split visualization for first fold
Print CV Leakage Report
Description
Print CV Leakage Report
Usage
## S3 method for class 'borg_cv_report'
print(x, ...)
Arguments
x |
A borg_cv_report object. |
... |
Additional arguments (ignored). |
Value
The borg_cv_report object x, returned invisibly.
Called for the side effect of printing a human-readable leakage summary
to the console.
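Examples
This print method's page has no example; the sketch below reuses the cv_leakage_report() example from its own page, assuming (as the method's class suggests) that cv_leakage_report() returns a borg_cv_report object.

```r
# Assumes cv_leakage_report() returns a borg_cv_report object, so print()
# dispatches to this method.
if (requireNamespace("caret", quietly = TRUE)) {
  folds <- list(Fold1 = 1:10, Fold2 = 11:20, Fold3 = 21:25)
  ctrl <- caret::trainControl(method = "cv", index = folds)
  report <- cv_leakage_report(ctrl, train_idx = 1:25, test_idx = 26:32)
  print(report)  # human-readable leakage summary
}
```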
Summarize BORG Diagnosis
Description
Generate a methods section summary for publication from a BorgDiagnosis object.
Usage
## S3 method for class 'BorgDiagnosis'
summary(
object,
comparison = NULL,
v = 5,
style = c("apa", "nature", "ecology"),
include_citation = TRUE,
...
)
Arguments
object |
A BorgDiagnosis object. |
comparison |
Optional. A borg_comparison object. |
v |
Integer. Number of CV folds used. Default: 5. |
style |
Character. Citation style: "apa", "nature", or "ecology". Default: "apa". |
include_citation |
Logical. Include BORG package citation. Default: TRUE. |
... |
Additional arguments (currently unused). |
Value
Character string with methods section text (invisibly). Also prints the text to the console.
Examples
set.seed(42)
data <- data.frame(
x = runif(100, 0, 100),
y = runif(100, 0, 100),
response = rnorm(100)
)
diagnosis <- borg_diagnose(data, coords = c("x", "y"), target = "response",
verbose = FALSE)
summary(diagnosis)
Summarize BORG Risk Assessment
Description
Print a summary of detected risks.
Usage
## S3 method for class 'BorgRisk'
summary(object, ...)
Arguments
object |
A BorgRisk object. |
... |
Additional arguments (currently unused). |
Value
The object invisibly.
Examples
data <- data.frame(x = 1:100, y = 101:200)
risk <- borg_inspect(data, train_idx = 1:60, test_idx = 51:100)
summary(risk)
Summarize BORG Cross-Validation
Description
Summarize BORG Cross-Validation
Usage
## S3 method for class 'borg_cv'
summary(object, ...)
Arguments
object |
A borg_cv object. |
... |
Additional arguments (currently unused). |
Value
A list with strategy, fold count, and fold size statistics (invisibly).
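Examples
This page has no example; the sketch below obtains a borg_cv object via the cv component documented for borg_workflow() earlier in this manual, which is one documented way to get one.

```r
# borg_workflow() is documented above to return a list whose `cv` component
# is a borg_cv object with folds.
set.seed(42)
d <- data.frame(
  x = runif(100, 0, 100), y = runif(100, 0, 100),
  z = rnorm(100)
)
wf <- borg_workflow(d, z ~ x + y, coords = c("x", "y"))
summary(wf$cv)  # strategy, fold count, and fold size statistics
```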
Summarize BORG Pipeline Validation
Description
Summarize BORG Pipeline Validation
Usage
## S3 method for class 'borg_pipeline'
summary(object, ...)
Arguments
object |
A borg_pipeline object. |
... |
Additional arguments (currently unused). |
Value
A list with per-stage risk counts (invisibly).
Summarize BORG Power Analysis
Description
Summarize BORG Power Analysis
Usage
## S3 method for class 'borg_power'
summary(object, ...)
Arguments
object |
A borg_power object. |
... |
Additional arguments (currently unused). |
Value
A list with key power metrics (invisibly).
Summarize BORG Result
Description
Generate a methods section summary for publication from a borg_result object.
Usage
## S3 method for class 'borg_result'
summary(
object,
comparison = NULL,
v = 5,
style = c("apa", "nature", "ecology"),
include_citation = TRUE,
...
)
Arguments
object |
A borg_result object. |
comparison |
Optional. A borg_comparison object. |
v |
Integer. Number of CV folds. Default: 5. |
style |
Character. Citation style: "apa", "nature", or "ecology". Default: "apa". |
include_citation |
Logical. Include BORG citation. |
... |
Additional arguments (currently unused). |
Value
Character string with methods text (invisibly).
Examples
set.seed(42)
data <- data.frame(
x = runif(100, 0, 100),
y = runif(100, 0, 100),
response = rnorm(100)
)
result <- borg(data, coords = c("x", "y"), target = "response")
summary(result)