| Type: | Package |
| Title: | Data Leakage Detection Tools for Machine Learning |
| Version: | 0.1.0 |
| Description: | Provides utilities to detect common data leakage patterns including train/test contamination, temporal leakage, and data duplication, enhancing model reliability and reproducibility in machine learning workflows. Generates diagnostic reports and visual summaries to support data validation. Methods based on best practices from Hastie, Tibshirani, and Friedman (2009, ISBN:978-0387848570). |
| Imports: | ggplot2, arrow, data.table, digest, htmltools, openxlsx, readxl, stringr, workflows, jsonlite |
| Suggests: | testthat (≥ 3.0.0), caret, mlr3, tidymodels, knitr, rmarkdown |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2025-10-22 08:43:45 UTC; Isabella |
| Author: | Cheryl Isabella Lim [aut, cre] |
| Maintainer: | Cheryl Isabella Lim <cheryl.academic@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-10-26 18:50:02 UTC |
leakr: Data Leakage Detection for Machine Learning in R
Description
leakr: Data Leakage Detection for Machine Learning in R
Details
The leakr package provides tools to automatically detect common data leakage patterns in machine learning workflows for tabular data. It identifies train/test contamination, target leakage, and duplicate rows with clear diagnostic reports and visualisations.
Key Features
- Train/Test Contamination: Detects ID overlaps and distributional shifts between training and test sets
- Target Leakage: Identifies features with suspicious correlations to the target variable
- Duplication Detection: Finds exact and near-duplicate rows
- Clear Reports: Generates severity-ranked diagnostics with actionable recommendations
- Visualisations: Creates diagnostic plots to highlight issues
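As a rough illustration of what the contamination and duplication checks look for, the following base R sketch flags IDs shared between splits and exact duplicate rows. It does not call leakr, and the data frame and its column names (id, split, x) are invented for the example.

```r
# Toy data: hypothetical columns, not part of leakr
df <- data.frame(
  id    = c(1, 2, 3, 3, 4, 5),
  split = c("train", "train", "train", "test", "test", "test"),
  x     = c(0.1, 0.2, 0.3, 0.3, 0.4, 0.5)
)

# IDs that occur in both the training and the test partition
train_ids   <- df$id[df$split == "train"]
test_ids    <- df$id[df$split == "test"]
overlap_ids <- intersect(train_ids, test_ids)

# Exact duplicate rows once the split label is ignored
features <- df[, setdiff(names(df), "split")]
dup_rows <- which(duplicated(features) | duplicated(features, fromLast = TRUE))
```

Here ID 3 appears on both sides of the split, and the two rows carrying it are also exact duplicates of each other.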
Main Functions
- leakr_audit: Main function for comprehensive leakage detection
- leakr_summarise: Generate human-readable summaries
- leakr_plot: Create diagnostic visualisations
Built-in Detectors
- train_test_contamination: Checks for overlap between train/test sets
- target_leakage: Identifies suspicious feature-target relationships
- duplication_detection: Finds duplicate rows in datasets
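To make "suspicious feature-target relationships" concrete, here is a minimal base R sketch that ranks features by absolute correlation with the target. The simulated data, the column names, and the 0.95 cut-off are assumptions for illustration; the criteria leakr actually applies are not described in this manual.

```r
set.seed(42)
n <- 200
target <- rnorm(n)
df <- data.frame(
  honest = rnorm(n),                     # unrelated to the target
  leaky  = target + rnorm(n, sd = 0.01), # near-copy of the target
  target = target
)

# Absolute Pearson correlation of every feature with the target
feature_names <- setdiff(names(df), "target")
abs_cor <- vapply(
  feature_names,
  function(f) abs(cor(df[[f]], df$target)),
  numeric(1)
)

# Hypothetical cut-off, chosen only for this example
suspicious <- names(abs_cor)[abs_cor > 0.95]
```

A feature that is an almost exact copy of the target, like leaky here, is the typical pattern such a check is meant to surface.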
Data Compatibility
Accepts data.frame, tibble, and data.table objects.
Quick Start
# Audit a dataset for leakage
library(leakr)
report <- leakr_audit(my_data, target = "outcome")
# View summary of issues found
leakr_summarise(report)
# Create diagnostic plots
leakr_plot(report)
Author(s)
Maintainer: Cheryl Isabella Lim <cheryl.academic@gmail.com>
See Also
Report bugs at https://github.com/cherylisabella/leakr/issues
Initialise built-in detectors
Description
Initialise built-in detectors
Usage
.onLoad(libname, pkgname)
Enhanced column name cleaning with better robustness
Description
Enhanced column name cleaning with better robustness
Usage
clean_column_names(names)
Arguments
names |
Character vector of column names |
Value
Cleaned column names
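The manual does not list the cleaning rules, so the following is only a sketch of one common approach: trim, lower-case, replace non-alphanumeric runs with underscores, and de-duplicate. The function name clean_names_sketch and every rule in it are assumptions, not the behaviour of clean_column_names.

```r
# Hypothetical cleaning rules; leakr's actual rules may differ
clean_names_sketch <- function(names) {
  out <- trimws(names)
  out <- tolower(out)
  out <- gsub("[^a-z0-9]+", "_", out)  # collapse runs of other characters
  out <- gsub("^_+|_+$", "", out)      # drop leading/trailing underscores
  out[out == ""] <- "x"                # placeholder for empty names
  make.unique(out, sep = "_")          # disambiguate duplicates
}

cleaned <- clean_names_sketch(c(" Customer ID ", "customer.id", "Total ($)", ""))
```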
Enhanced report compilation with numeric severity scores
Description
This function compiles a report with enhanced sorting, severity scoring, and detailed metadata, including configuration information.
Usage
compile_report(
results,
audit_data,
config,
show_config = FALSE,
top_n = 10,
report = "default"
)
Arguments
results |
A list containing detection results. |
audit_data |
The audit data used for the report. |
config |
Configuration settings, including whether to use numeric severity scores. |
show_config |
Logical, whether to display the configuration used for report generation. Defaults to FALSE. |
top_n |
Numeric, the number of top results to display in the report. Defaults to 10. |
report |
A string indicating the type of report to generate. Defaults to "default". |
Value
A leakr_report object containing the summary, evidence, and metadata for the report.
Enhanced date detection handling multiple formats and data types
Description
Enhanced date detection handling multiple formats and data types
Usage
detect_and_convert_dates_enhanced(data, verbose)
Arguments
data |
Input data.frame |
verbose |
Whether to show messages |
Value
data.frame with converted dates
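One way to handle multiple date formats is to try each candidate format in turn and convert a column only when every non-missing value parses. The sketch below does this in base R; the helper name, the list of formats, and the all-values-must-parse rule are assumptions rather than the documented behaviour of detect_and_convert_dates_enhanced.

```r
# Hypothetical helper: convert character columns that parse fully as dates
convert_dates_sketch <- function(data,
                                 formats = c("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")) {
  for (col in names(data)) {
    values <- data[[col]]
    if (!is.character(values)) next
    non_missing <- !is.na(values)
    for (fmt in formats) {
      parsed <- as.Date(values, format = fmt)
      # Accept a format only if every non-missing value parses with it
      if (any(non_missing) && !anyNA(parsed[non_missing])) {
        data[[col]] <- parsed
        break
      }
    }
  }
  data
}

df <- data.frame(
  signup = c("2024-01-15", "2024-02-01"),
  visit  = c("15/01/2024", "01/02/2024"),
  note   = c("hello", "world"),
  stringsAsFactors = FALSE
)
converted <- convert_dates_sketch(df)
```

Columns that match none of the formats, such as note, are left untouched.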
Detect file format from extension and content
Description
Detect file format from extension and content
Usage
detect_file_format(file_path, verbose = TRUE)
Arguments
file_path |
Path to the file |
verbose |
Whether to show detection messages |
Value
Character string indicating detected format
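The extension half of this detection can be sketched with tools::file_ext and a lookup. The labels reuse the format names accepted by leakr_import; the helper name, the "unknown" fallback, and the omission of any content inspection are assumptions of this example.

```r
# Hypothetical mapping from file extension to format label
detect_format_sketch <- function(file_path) {
  ext <- tolower(tools::file_ext(file_path))
  switch(ext,
    csv     = "csv",
    tsv     = "tsv",
    xlsx    = ,          # falls through to the next non-missing value
    xls     = "excel",
    rds     = "rds",
    json    = "json",
    parquet = "parquet",
    "unknown"            # default when no extension matches
  )
}

detect_format_sketch("data/Sales.XLSX")
detect_format_sketch("notes.txt")
```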
Registry-based Detector System
Description
This section of the package manages a registry for various data leakage detectors. Detectors are stored in the .detector_registry environment and are accessible by name. The system allows for easy registration of detectors, providing their descriptions and registration times. Detectors can be queried by name or listed.
Usage
.detector_registry
Format
An object of class environment of length 2.
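An environment-backed registry of this kind can be sketched in a few lines of base R. The names below (registry, register_sketch, list_sketch, get_sketch) and the stored fields are illustrative assumptions; they mirror the description above but are not leakr's internals.

```r
# A private environment acting as the registry (names are illustrative)
registry <- new.env(parent = emptyenv())

register_sketch <- function(name, fun, description = "") {
  stopifnot(is.character(name), length(name) == 1, is.function(fun))
  assign(name, list(
    fun           = fun,
    description   = description,
    registered_at = Sys.time()
  ), envir = registry)
  invisible(TRUE)
}

# Names of everything registered so far
list_sketch <- function() sort(ls(envir = registry))

# Look up one entry, failing loudly if it is missing
get_sketch <- function(name) {
  if (!exists(name, envir = registry, inherits = FALSE)) {
    stop("No detector registered under '", name, "'")
  }
  get(name, envir = registry, inherits = FALSE)
}

register_sketch("row_count", function(data) nrow(data), "Counts rows")
list_sketch()
get_sketch("row_count")$fun(iris)
```

Using an environment rather than a list gives reference semantics, so registrations made inside a function persist after it returns.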
Determine risk level and CSS class from severity counts.
Description
Determine risk level and CSS class from severity counts.
Usage
determine_risk_level(severity_counts)
Arguments
severity_counts |
Named integer vector of severity frequencies. |
Value
List with 'level' and CSS 'class'.
Helper function to return an empty snapshot info dataframe
Description
Helper function to return an empty snapshot info dataframe
Usage
empty_snapshot_info()
Value
Empty data.frame with correct structure
Export data with consistent messaging
Description
Export data with consistent messaging
Usage
export_data_internal(data, file_path, format, verbose, ...)
Arguments
data |
Data.frame to export |
file_path |
Output file path |
format |
Output format |
verbose |
Whether to show messages |
... |
Additional arguments passed on to the format-specific export routine. |
Value
Path to exported file
Format detector names for display.
Description
Format detector names by converting them to title case and separating words by spaces.
Usage
format_detector_name(detector_name)
Arguments
detector_name |
A string to format, typically a detector name with underscores. |
Value
A title-cased, space-separated string.
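The transformation described here (underscores to spaces, each word capitalised) could look like the following base R sketch. The helper name format_name_sketch is made up for the example and is not the package function.

```r
# Split on underscores, capitalise each word, and rejoin with spaces
format_name_sketch <- function(detector_name) {
  words <- strsplit(detector_name, "_", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # ignore empty pieces from repeated underscores
  paste0(
    toupper(substring(words, 1, 1)),
    tolower(substring(words, 2)),
    collapse = " "
  )
}

format_name_sketch("train_test_contamination")
```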
Generate diagnostic plots for a leakr_report
Description
Generate diagnostic plots for a leakr_report
Usage
generate_diagnostic_plots(report)
Arguments
report |
A leakr_report object. |
Value
A named list of ggplot objects (currently empty stub)
Generate evidence section with format-specific handling and DRY logic.
Description
Generate evidence section with format-specific handling and DRY logic.
Usage
generate_evidence_section(report, format)
Arguments
report |
A leakr_report object. |
format |
Output format for the section. |
Value
Formatted evidence section string.
Report generator
Description
Generate an executive summary text for the leakage audit report.
Usage
generate_executive_summary_text(report)
Arguments
report |
A 'leakr_report' object containing summarized issues. |
Value
Formatted summary string (Markdown/HTML-friendly).
Generate detailed issues section with output formatting and truncation.
Description
Generate detailed issues section with output formatting and truncation.
Usage
generate_issues_section(report, format)
Arguments
report |
A leakr_report object. |
format |
Output format for the section. |
Value
Formatted issues section string.
Generate actionable recommendations based on report findings.
Description
This function generates actionable recommendations based on the findings in a leakr_report object.
Usage
generate_recommendations(report)
Arguments
report |
A leakr_report object. |
Value
A character vector of recommendations.
Examples
## Not run:
# Requires a leakr_report object
report <- leakr_audit(iris, target = "Species")
recommendations <- generate_recommendations(report)
## End(Not run)
Format recommendations for output.
Description
Format recommendations for output.
Usage
generate_recommendations_section(report, format)
Arguments
report |
A leakr_report object. |
format |
Output format for the section. |
Value
Formatted recommendation section string.
Get detector information
Description
Retrieves information about detectors, optionally filtering by the detector name.
Usage
get_detector_info(name = NULL)
Arguments
name |
Optional detector name. If NULL, returns info for all detectors. |
Value
A list with detector information, including description and registration date.
Examples
# Get information for all detectors
get_detector_info()
# Get information for specific detectors that actually exist
get_detector_info("file_format")
Null-coalescing operator for clean default value handling
Description
Null-coalescing operator for clean default value handling
Usage
x %||% y
Arguments
x |
First value to check |
y |
Fallback value if x is NULL |
Value
x if not NULL, otherwise y
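A null-coalescing operator with this contract is conventionally a one-liner. The sketch below shows that common definition and a typical use for configuration defaults; the config list is invented for the example.

```r
# Return x unless it is NULL, in which case fall back to y
`%||%` <- function(x, y) if (is.null(x)) y else x

config  <- list(top_n = 5)
top_n   <- config$top_n %||% 10       # element present, so it is kept
verbose <- config$verbose %||% FALSE  # element absent (NULL), so the fallback is used
```

Only NULL triggers the fallback: values such as NA, FALSE, or a zero-length vector are returned unchanged.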
Import CSV files with robust parsing
Description
Import CSV files with robust parsing
Usage
import_csv(file_path, encoding, verbose, ...)
Arguments
file_path |
Path to CSV file |
encoding |
Character encoding |
verbose |
Whether to show messages |
... |
Additional arguments passed to the underlying CSV reader. |
Value
data.frame
Import Excel files with enhanced sheet support
Description
Import Excel files with enhanced sheet support
Usage
import_excel(file_path, sheet, verbose, ...)
Arguments
file_path |
Path to Excel file |
sheet |
Sheet name or number |
verbose |
Whether to show messages |
... |
Additional arguments passed to the underlying Excel reader. |
Value
data.frame
Import JSON files with better structure handling
Description
Import and process JSON files, converting them into a standardized data.frame.
Usage
import_json(file_path, verbose = FALSE, ...)
Arguments
file_path |
Path to the JSON file. |
verbose |
Logical flag indicating whether to show progress messages (default is FALSE). |
... |
Additional arguments passed to the underlying JSON reader. |
Value
A data.frame with the content from the JSON file, flattened.
Import Parquet files
Description
Import and process Parquet files into a standardized data.frame.
Usage
import_parquet(file_path, verbose = FALSE, ...)
Arguments
file_path |
Path to the Parquet file. |
verbose |
Logical flag indicating whether to show progress messages (default is FALSE). |
... |
Additional arguments passed to the underlying Parquet reader. |
Value
A data.frame with the content from the Parquet file.
Import RDS files with validation
Description
Import RDS files with validation
Usage
import_rds(file_path, verbose, ...)
Arguments
file_path |
Path to RDS file |
verbose |
Whether to show messages |
... |
Additional arguments passed to the underlying RDS reader. |
Value
data.frame
Import TSV files with robust parsing
Description
Import TSV files with robust parsing
Usage
import_tsv(file_path, encoding, verbose, ...)
Arguments
file_path |
Path to TSV file |
encoding |
Character encoding |
verbose |
Whether to show messages |
... |
Additional arguments passed to the underlying TSV reader. |
Value
data.frame
Audit dataset for data leakage
Description
This function audits a dataset for potential data leakage, running a series of predefined detectors and generating a comprehensive report with detailed findings.
Usage
leakr_audit(
data,
target = NULL,
split = NULL,
id = NULL,
detectors = NULL,
config = list()
)
Arguments
data |
The dataset to be audited (data frame or tibble). |
target |
The target variable (optional). If NULL, no target variable is assumed. |
split |
The split variable used for training/test split (optional). If NULL, no split is assumed. |
id |
The unique identifier for each row (optional). If NULL, no id is used. |
detectors |
A vector of detector names to run (optional). If NULL, all available detectors will be used. |
config |
A list of configuration parameters for the audit. Defaults to an empty list. |
Value
A leakr_report object containing the audit results, including summary, evidence, and metadata.
Examples
# Basic audit on iris dataset
report <- leakr_audit(iris, target = "Species")
print(report)
Create data snapshots with improved metadata handling
Description
Save data and metadata for reproducible leakage analysis with optimised performance.
Usage
leakr_create_snapshot(
data,
output_dir = file.path(tempdir(), "leakr_snapshots"),
snapshot_name = NULL,
metadata = list(),
sample_for_hash = TRUE
)
Arguments
data |
Data.frame to snapshot |
output_dir |
Directory for snapshot files |
snapshot_name |
Name for this snapshot |
metadata |
Additional metadata to store |
sample_for_hash |
Whether to sample large datasets for faster hashing |
Value
Path to snapshot directory
Export data in various formats
Description
Save processed data to different file formats with consistent behaviour.
Usage
leakr_export_data(data, file_path, format = "csv", verbose = TRUE, ...)
Arguments
data |
Data.frame to export |
file_path |
Output file path |
format |
Output format: "csv", "excel", "rds", "json", "parquet" |
verbose |
Whether to show export messages |
... |
Additional arguments passed on to the format-specific export routine. |
Value
Path to exported file (invisibly)
Convert caret training objects to standard format
Description
Extract data from caret train objects for leakage analysis.
Usage
leakr_from_caret(train_obj, original_data = NULL, target_name = "target")
Arguments
train_obj |
caret train object |
original_data |
Original training data (if available) |
target_name |
Custom name for target variable (default: "target") |
Value
List with data and metadata
Convert mlr3 Task objects to standard format
Description
Extract data from mlr3 Task objects for leakage analysis.
Usage
leakr_from_mlr3(task, include_target = TRUE)
Arguments
task |
mlr3 Task object (TaskClassif, TaskRegr, etc.) |
include_target |
Whether to include target variable in output |
Value
List with data, target, and metadata
Convert tidymodels workflow to standard format
Description
Extract data from tidymodels workflows for leakage analysis.
Usage
leakr_from_tidymodels(workflow, data)
Arguments
workflow |
tidymodels workflow object |
data |
Original training data |
Value
List with data and metadata
Import data from various sources for leakage analysis
Description
Flexible data import function supporting multiple formats with automatic format detection and preprocessing for leakage analysis.
Usage
leakr_import(
source,
format = "auto",
preprocessing = list(),
encoding = "UTF-8",
sheet = NULL,
verbose = TRUE,
...
)
Arguments
source |
Path to data file, data.frame, or other supported object. |
format |
Data format: "auto", "csv", "excel", "rds", "json", "parquet", "tsv". If "auto", the format will be detected from the file extension. |
preprocessing |
List of preprocessing options to apply after import. |
encoding |
Character encoding for reading files. Default is "UTF-8". |
sheet |
Sheet name or index to read (for Excel files). Default is NULL. |
verbose |
Logical indicating whether to print progress messages. Default TRUE. |
... |
Additional arguments passed to specific import functions. |
Value
Standardised data.frame suitable for leakage analysis
List available snapshots with enhanced information
Description
Display comprehensive information about available data snapshots.
Usage
leakr_list_snapshots(
snapshots_dir = file.path(tempdir(), "leakr_snapshots"),
include_metadata = TRUE
)
Arguments
snapshots_dir |
Directory containing snapshots |
include_metadata |
Whether to load detailed metadata for each snapshot |
Value
Data.frame with snapshot information
Load data snapshot with enhanced validation
Description
Restore data from a previously created snapshot with integrity checking.
Usage
leakr_load_snapshot(snapshot_path, format = "rds", verify_integrity = TRUE)
Arguments
snapshot_path |
Path to snapshot directory |
format |
Format to load: "rds" (recommended), "csv" |
verify_integrity |
Whether to verify data integrity using hash |
Value
Data.frame from snapshot
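The integrity check amounts to storing a hash when the snapshot is written and comparing it when the snapshot is read back. The sketch below shows that idea with tools::md5sum on the saved file. The directory name, the file name, and the choice of MD5 over the file are assumptions for illustration and say nothing about how leakr computes or stores its hash.

```r
snapshot_dir <- file.path(tempdir(), "snapshot_sketch")
dir.create(snapshot_dir, showWarnings = FALSE, recursive = TRUE)
data_file <- file.path(snapshot_dir, "data.rds")

# Save the data and record a checksum of the file
saveRDS(iris, data_file)
stored_hash <- unname(tools::md5sum(data_file))

# Later: recompute the checksum before trusting the restored data
current_hash <- unname(tools::md5sum(data_file))
if (!identical(stored_hash, current_hash)) {
  stop("Snapshot file changed since it was created")
}
restored <- readRDS(data_file)
```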
Plot leakage detection results
Description
Plot leakage detection results
Usage
leakr_plot(x, ...)
Arguments
x |
Results from leakr_audit |
... |
Additional arguments passed to the plotting method. |
Value
A ggplot object
Fast import with default preprocessing
Description
Minimal quick import for typical user workflows. Uses leakr_import internally.
Usage
leakr_quick_import(source, ...)
Arguments
source |
File path or data.frame |
... |
Additional arguments passed to leakr_import. |
Value
Standardised data.frame
Enhanced summarise with better formatting
Description
This function provides a formatted summary of the leakage audit report. It displays a summary of the leakage issues, including the severity and top issues detected. Optionally, it can also display configuration details used for the audit.
Usage
leakr_summarise(
report,
top_n = 10,
show_config = FALSE,
config = NULL,
audit_data = NULL,
detectors = NULL,
libname = NULL,
pkgname = NULL
)
Arguments
report |
A leakr_report object, as returned by leakr_audit. |
top_n |
Maximum number of issues to display in the summary. Defaults to 10. |
show_config |
Whether to display the configuration details used for the audit. Defaults to FALSE. |
config |
(Optional) A configuration list. This argument is not used directly in the function, but is referenced in the report metadata. Defaults to NULL. |
audit_data |
(Optional) The data used for auditing. This argument is not used directly in the function, but is part of the report metadata. Defaults to NULL. |
detectors |
(Optional) A vector of detectors used for the audit. This argument is not used directly in the function, but is part of the report metadata. Defaults to NULL. |
libname |
(Optional) The name of the library. This is included for internal package functionality. |
pkgname |
(Optional) The name of the package. This is included for internal package functionality. |
Value
An invisible data.frame summarizing the top n issues detected.
Examples
# Create and summarise a report
report <- leakr_audit(iris, target = "Species")
leakr_summarise(report, top_n = 5)
List Registered Detectors
Description
Returns the names of all detectors currently registered in the system. This is useful for checking which detectors are available.
Usage
list_registered_detectors()
Value
A character vector containing the names of all registered detectors.
Examples
list_registered_detectors()
Create a new temporal detector
Description
Create a new temporal detector
Usage
new_temporal_detector(time_col, lookahead_window = 1)
Arguments
time_col |
Character. Name of the time column |
lookahead_window |
Numeric. Lookahead window size (default 1). |
Value
A temporal_detector object
Create a new train-test detector
Description
Create a new train-test detector
Usage
new_train_test_detector(threshold = 0.1)
Arguments
threshold |
Numeric. Threshold used by the detector (default 0.1). |
Value
A train_test_detector object
Plot a detector_result object
Description
Plot a detector_result object
Usage
## S3 method for class 'detector_result'
plot(x, palette = NULL, ...)
Arguments
x |
A detector_result object. |
palette |
Optional colour palette for the plot. Defaults to NULL. |
... |
Additional arguments passed to the plotting method. |
Value
A ggplot object, invisibly. Printed if interactive
Plot a udld_report object
Description
This function generates a bar plot of leakage issues detected by different detectors.
The plot displays the count of issues by severity level for each detector in a udld_report object.
Usage
## S3 method for class 'udld_report'
plot(x, palette = NULL, ...)
Arguments
x |
A udld_report object. |
palette |
Optional. A colour palette for the plot. Defaults to NULL. |
... |
Additional arguments passed to the underlying plotting code. |
Value
A ggplot object, invisibly. The plot is printed if the session is interactive.
Enhanced data preparation with robust preprocessing
Description
This function performs robust data preprocessing and prepares the data for leakage detection. It handles intelligent sampling, adjusts for the presence of a target variable, and structures the data for further audit and analysis.
Usage
prepare_audit_data(data, target, split, id, config)
Arguments
data |
A data frame containing the dataset to be audited. |
target |
The name of the target variable (optional). Used for stratified sampling if provided. |
split |
A vector or a column name specifying the data split (e.g., training/test split). |
id |
The unique identifier column for the dataset (optional). |
config |
A list of configuration settings, including sample size and other audit parameters. |
Value
A list of class audit_data containing preprocessed data along with metadata, such as:
- data: The processed data.
- target: The target variable name.
- split: The split vector or column name.
- n_rows: The number of rows in the data.
- n_cols: The number of columns in the data.
- was_sampled: A logical indicating whether sampling was performed.
Examples
## Not run:
audit_data <- prepare_audit_data(data, target = "target_column",
split = "train_test_split",
id = "id_column",
config = list(sample_size = 50000))
## End(Not run)
Enhanced preprocessing with better performance and robustness
Description
A preprocessing function to handle common data issues, such as removing empty rows/columns, handling dates, and converting character columns to factors. This function improves data quality before further analysis.
Usage
preprocess_imported_data(data, preprocessing = list(), verbose = FALSE)
Arguments
data |
Input data.frame to be preprocessed. |
preprocessing |
A list of preprocessing options, such as removing empty rows or handling dates. |
verbose |
Logical flag indicating whether to show progress messages (default is FALSE). |
Value
A preprocessed data.frame.
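Two of the steps named above, dropping empty rows and columns and converting character columns to factors, can be sketched in base R as follows. The helper name preprocess_sketch and the definition of "empty" (NA or an empty string) are assumptions of this example, not the package's rules.

```r
preprocess_sketch <- function(data) {
  # A cell counts as empty when it is NA or an empty string
  empty <- vapply(
    data,
    function(col) is.na(col) | as.character(col) == "",
    logical(nrow(data))
  )
  empty <- matrix(empty, nrow = nrow(data), dimnames = list(NULL, names(data)))

  keep_rows <- rowSums(empty) < ncol(data)  # at least one non-empty cell
  keep_cols <- colSums(empty) < nrow(data)
  out <- data[keep_rows, keep_cols, drop = FALSE]

  # Convert the remaining character columns to factors
  is_chr <- vapply(out, is.character, logical(1))
  if (any(is_chr)) out[is_chr] <- lapply(out[is_chr], factor)
  rownames(out) <- NULL
  out
}

raw <- data.frame(
  city  = c("Oslo", "", "Lima"),
  sales = c(10, NA, 30),
  notes = c(NA, NA, NA),
  stringsAsFactors = FALSE
)
clean <- preprocess_sketch(raw)
```

In this toy input the second row and the notes column contain no usable values, so both are dropped.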
Print method for leakr_report
Description
Print method for leakr_report
Usage
## S3 method for class 'leakr_report'
print(x, ...)
Arguments
x |
leakr_report object |
... |
Additional arguments, accepted for compatibility with the print generic. |
Register a new detector
Description
Register a new data leakage detector function
Usage
register_detector(name, fun, description = "")
Arguments
name |
Name of the detector |
fun |
Detector function to register. |
description |
Short description of the detector. Defaults to an empty string. |
Value
Invisibly returns registration status
Run a detector on data
Description
Run a detector on data
Usage
run_detector(detector, data, split = NULL, id = NULL, config = list())
Arguments
detector |
A detector object |
data |
Data frame to analyze |
split |
Split vector indicating train/test assignment (optional) |
id |
Optional ID column name |
config |
Optional configuration list |
Value
A detector result object
Run multiple detectors on audit data
Description
This function runs multiple leakage detectors on the provided audit data and returns the results for each detector.
Usage
run_detectors(detectors, audit_data, config)
Arguments
detectors |
A list of detector configurations. Each detector can be either a function or an object that contains a func element. |
audit_data |
A data.frame, tibble, or data.table to audit. |
config |
A list of configuration settings to be passed to each detector. |
Value
A list where each element contains the results of running a detector. If a detector fails, an error message is included in the result.
Examples
## Not run:
detectors <- list(
temporal = list(func = temporal_detector_func),
train_test = new_train_test_detector()
)
results <- run_detectors(detectors, audit_data = iris, config = list(sample_size = 50000))
## End(Not run)
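The behaviour described under Value, where a failing detector contributes an error message instead of aborting the run, is commonly implemented with tryCatch. The following base R sketch shows the pattern; the function run_detectors_sketch, the two toy detectors, and the shape of the result list are hypothetical and do not reproduce leakr's result objects.

```r
# Run each detector, capturing failures instead of stopping
run_detectors_sketch <- function(detectors, data) {
  lapply(detectors, function(detector) {
    tryCatch(
      list(status = "ok", result = detector(data)),
      error = function(e) list(status = "error", message = conditionMessage(e))
    )
  })
}

detectors <- list(
  n_duplicates = function(data) sum(duplicated(data)),
  always_fails = function(data) stop("column 'timestamp' not found")
)
results <- run_detectors_sketch(detectors, iris)
```

Because lapply preserves names, each result can be looked up by the detector name it was registered under.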
Stratified sampling helper
Description
This function performs stratified sampling based on the provided target vector. The sampling is done proportionally to the distribution of values in the target vector.
Usage
stratified_sample(target_vec, n_sample)
Arguments
target_vec |
A vector representing the target variable used for stratification. The function will sample from each class (level) proportionally. |
n_sample |
The total number of samples to draw. |
Value
A vector of indices representing the sampled observations.
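Proportional stratified sampling can be sketched in base R by splitting row indices by class and drawing a share of n_sample from each. The helper name, the rounding rule, and the minimum of one draw per class are assumptions for illustration; with rounding, the total can differ slightly from n_sample.

```r
stratified_sample_sketch <- function(target_vec, n_sample) {
  groups <- split(seq_along(target_vec), target_vec)
  # Proportional allocation, with at least one draw per class
  n_per_group <- pmax(1, round(n_sample * lengths(groups) / length(target_vec)))
  sampled <- mapply(
    function(idx, k) idx[sample.int(length(idx), min(k, length(idx)))],
    groups, n_per_group,
    SIMPLIFY = FALSE
  )
  sort(unlist(sampled, use.names = FALSE))
}

set.seed(1)
idx <- stratified_sample_sketch(iris$Species, n_sample = 30)
table(iris$Species[idx])
```

Indexing with sample.int avoids the surprise that sample(x, k) treats a single number x as the range 1:x.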
Robust data validation and preprocessing
Description
This function performs data validation and preprocessing for audit purposes. It checks the validity of the input data, ensures that the target and ID columns exist, and handles empty or problematic columns.
Usage
validate_and_preprocess_data(data, target, split, id)
Arguments
data |
A data frame, tibble, or data table to be validated and preprocessed. |
target |
The name of the target column, which should be present in the data. |
split |
A vector specifying the split column, which will be checked in the data. |
id |
The name of the ID column, which should be present in the data. |
Value
The validated and preprocessed data.
Examples
## Not run:
# Example data
data <- data.frame(target = rnorm(100), id = 1:100)
target <- "target"
id <- "id"
validated_data <- validate_and_preprocess_data(data, target, NULL, id)
## End(Not run)
Enhanced data validation with better error messages
Description
Enhanced data validation with better error messages
Usage
validate_imported_data(data, source)
Arguments
data |
Input data.frame |
source |
Source identifier for error messages |
Value
TRUE (invisibly) if validation passes