% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/bootHRT.R
\name{bootHRT}
\alias{bootHRT}
\title{Calculate Cellwise Flags for Anomaly Detection Using Bayesian Bootstrap}
\usage{
bootHRT(a, contamination = 0.08, boot_max_it = 1000L)
}
\arguments{
\item{a}{A long-format \code{data.frame} object with survey data. See the Details section for the required columns.}

\item{contamination}{A number between zero and one used as a threshold when identifying outliers from the fuzzy scores.
By default, the algorithm will identify approximately 8\% of the data entries as anomalies.}

\item{boot_max_it}{An integer giving the number of iterations performed by the Bayesian bootstrap algorithm. It is set to \code{1000} by default.}
}
\value{
A data frame with the same columns as the input data frame, plus the following additional columns:
  \describe{
    \item{score}{The raw anomaly score for each cell.}
    \item{outlier_1qt}{A boolean indicating if the cell is an outlier based on the first quartile of the threshold distribution.}
    \item{outlier_2qt}{A boolean indicating if the cell is an outlier based on the second quartile (median) of the threshold distribution.}
    \item{outlier_3qt}{A boolean indicating if the cell is an outlier based on the third quartile of the threshold distribution.}
    \item{outlier_1mn}{A boolean indicating if the cell is an outlier based on the mean threshold minus one standard deviation.}
    \item{outlier_2mn}{A boolean indicating if the cell is an outlier based on the mean threshold.}
    \item{outlier_3mn}{A boolean indicating if the cell is an outlier based on the mean threshold plus one standard deviation.}
    \item{anomaly_flag}{A character string indicating the type of anomaly detected, if any (e.g., "h", "t", "r").}
  }
  The returned object also includes an attribute \code{"thresholds"} which is a numeric vector of length \code{boot_max_it} containing samples from the posterior distribution of the contamination threshold.
}
\description{
The function uses the Bayesian bootstrap to determine whether each data entry is an outlier.
It takes a long-format \code{data.frame} object as input and returns it with additional columns appended.
The output includes flags based on different quantiles and means of the contamination threshold distribution.
}
\details{
The argument \code{a} must be an object of class \code{data.frame} in long format.
It must have at least five columns with the following names:
\describe{
  \item{\code{"strata"}}{a \code{character} or \code{factor} column containing the information on the stratification.}
  \item{\code{"unit_id"}}{a \code{character} or \code{factor} column containing the ID of the statistical unit in the survey sample.}
  \item{\code{"master_varname"}}{a \code{character} column containing the name of the observed variable.}
  \item{\code{"current_value_num"}}{a \code{numeric} column containing the observed value, i.e., a data entry.}
  \item{\code{"pred_value"}}{a \code{numeric} column containing a value observed in a previous survey for the same variable, if available. If not available, the value can be set to \code{NA} or \code{NaN}. When working with longitudinal data, the value can be set to a time-series forecast or a filtered value.}}
The input \code{data.frame} can have more columns, but the extra columns are ignored in the analyses.
These extra columns are nevertheless preserved and returned along with the results of the cellwise outlier-detection analysis.
The use of the R packages \code{dplyr}, \code{purrr}, and \code{tidyr} is highly recommended to simplify the conversion of datasets between long and wide formats.
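
As an illustration of the long format described above, the following is a minimal sketch of a wide-to-long conversion with \code{tidyr::pivot_longer}. The table \code{wide} and the variable names \code{corn} and \code{wheat} are hypothetical and are not part of the package; the sketch also assumes that no previous-survey values are available, so \code{pred_value} is set to \code{NA}.
\preformatted{
library(tidyr)
# Hypothetical wide table: one row per unit, one column per variable
wide <- data.frame(
  strata  = c("A", "A", "B"),
  unit_id = c("u1", "u2", "u3"),
  corn    = c(10, 12, 300),
  wheat   = c(5, 6, 7)
)
# Stack the variable columns into the two required long-format columns
long <- pivot_longer(wide, cols = c(corn, wheat),
                     names_to = "master_varname",
                     values_to = "current_value_num")
# Assumption: no previous-survey values exist in this sketch
long$pred_value <- NA_real_
}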
}
\examples{
# Load the package
library(HRTnomaly)
set.seed(2025L)
# Load the 'toy' data
data(toy)
# Detect cellwise outliers
res <- bootHRT(toy, boot_max_it = 10)
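# A minimal sketch of inspecting the result, assuming the returned object
# has the columns and the attribute listed in the Value section.
# Posterior samples of the contamination threshold
thr <- attr(res, "thresholds")
summary(thr)
# Count the cells flagged under the median-threshold rule
table(res$outlier_2qt)
# Tabulate the detected anomaly types, including missing flags if any
table(res$anomaly_flag, useNA = "ifany")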
}
\author{
Luca Sartore \email{drwolf85@gmail.com}
}
\keyword{distribution}
\keyword{outliers}
\keyword{probability}
