The SONO (Scores Of Nominal Outlyingness) package is used to compute scores of outlyingness for data sets consisting of nominal features. The scores are computed using the framework of Costa, E., & Papatsouma, I. (2025). SONO also includes several functions that compute evaluation metrics typically used for outlier identification algorithms that produce scores.
The main function of the package is sono, which computes
the scores. The score of nominal outlyingness for an observation \(\boldsymbol{x}_i\) is given by: \[s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq
\boldsymbol{x}_{i}: \\ \text{supp}(d) \notin (\sigma_d, n], \\ \lvert d
\rvert \leq \mathrm{MAXLEN}}} \frac{\sigma_d}{\text{supp}(d) \times
\lvert d \rvert^r}, \ r> 0, \ i=1,\dots,n,\] for highly
infrequent itemsets and: \[s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq
\boldsymbol{x}_{i}: \\ \text{supp}(d) \notin [0, \sigma_d), \\ \lvert d
\rvert \leq \mathrm{MAXLEN}}} \frac{\text{supp}(d)}{\sigma_d \times
\left( \text{MAXLEN} - \lvert d \rvert + 1 \right)^r}, \ r> 0, \
i=1,\dots,n,\] for highly frequent itemsets. In the above, \(\text{supp}(d)\) is the support of itemset \(d\), \(\sigma_d\) is the maximum support threshold (for infrequent itemsets) or the minimum support threshold (for frequent itemsets), \(\text{MAXLEN}\) is the maximum length of the itemsets considered, and \(r\) is an exponent chosen by the user.
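To make the two formulas concrete, here is a small worked sketch in base R. It is not the package's implementation: the itemsets, supports and thresholds are made-up values, and the column names len, supp, sigma_max and sigma_min are hypothetical, chosen only for this illustration.

```r
# Toy illustration of the two score formulas (NOT the SONO implementation).
# Hypothetical itemsets contained in a single observation, with made-up values.
itemsets <- data.frame(
  len       = c(1, 2, 2),     # |d|, the itemset length
  supp      = c(10, 4, 30),   # supp(d)
  sigma_max = c(20, 8, 12),   # maximum support threshold (infrequent case)
  sigma_min = c(40, 15, 25)   # minimum support threshold (frequent case)
)
r <- 2
MAXLEN <- 2

# Infrequent case: keep itemsets with supp(d) <= sigma_d and |d| <= MAXLEN
keep_inf <- itemsets$supp <= itemsets$sigma_max & itemsets$len <= MAXLEN
score_infrequent <- with(itemsets[keep_inf, ],
                         sum(sigma_max / (supp * len^r)))

# Frequent case: keep itemsets with supp(d) >= sigma_d and |d| <= MAXLEN
keep_freq <- itemsets$supp >= itemsets$sigma_min & itemsets$len <= MAXLEN
score_frequent <- with(itemsets[keep_freq, ],
                       sum(supp / (sigma_min * (MAXLEN - len + 1)^r)))

score_infrequent
#> [1] 2.5
score_frequent
#> [1] 1.2
```

In the infrequent case the first two itemsets contribute 20/(10 × 1) = 2 and 8/(4 × 4) = 0.5, while the third is excluded because its support exceeds the threshold.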
The sono function requires only two input arguments: a data frame data, which must contain factors only, and a list of probability vectors probs. Each element of probs must be a probability vector with as many elements as there are unique factor levels in the corresponding column of data. The length of probs must be equal to ncol(data).
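The requirements on data and probs can be checked before calling sono. The helper below is a minimal sketch and is not part of SONO; the name check_probs and the tolerance tol are hypothetical, and it assumes that "unique factors" refers to the factor levels of each column as returned by nlevels().

```r
# Hypothetical helper (not part of SONO): validate data and probs
check_probs <- function(data, probs, tol = 1e-8) {
  # data must be a data frame of factors; probs must be a list
  stopifnot(is.data.frame(data), is.list(probs))
  stopifnot(all(vapply(data, is.factor, logical(1))))
  # one probability vector per column
  stopifnot(length(probs) == ncol(data))
  for (j in seq_len(ncol(data))) {
    p <- probs[[j]]
    # one probability per level, non-negative, summing to one
    stopifnot(length(p) == nlevels(data[[j]]),
              all(p >= 0),
              abs(sum(p) - 1) < tol)
  }
  invisible(TRUE)
}

# Toy data frame with two factor columns (made-up values)
toy <- data.frame(a = factor(c("x", "y", "x")),
                  b = factor(c("u", "u", "v")))
check_probs(toy, list(c(0.5, 0.5), c(0.7, 0.3)))
```

The call returns silently when the inputs are compatible and stops with an error otherwise, for instance when probs has fewer elements than data has columns.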
Additional input arguments are alpha, r, MAXLEN, frequent and verbose. Their default values are 0.01, 2, 0, FALSE and TRUE, respectively, but they can be changed by the user. alpha is the significance level of the confidence interval constructed to determine the minimum/maximum support threshold values \(\sigma_d\). The exponent r and MAXLEN are as in the definition of the score. Setting MAXLEN = 0 (the default) triggers an automatic search for the value of MAXLEN that avoids redundant computations, while accounting for the sparsity introduced in the contingency tables by considering several nominal variables jointly. This value can be estimated beforehand using the MAXLEN_est function, so that the user can set MAXLEN to a lower (but not a larger) value if needed. Finally, frequent = FALSE treats highly infrequent itemsets as outlying, whereas frequent = TRUE treats highly frequent itemsets as more likely to be outlying. Progress messages are printed when verbose = TRUE.
Below, we generate an artificial data set and illustrate how sono works.
library(SONO)
# Generate data
set.seed(1)
X <- sample(c(1:3), 500, replace = TRUE, prob = c(0.2, 0.3, 0.5))
X <- cbind(X, sample(c(1:2), 500, replace = TRUE, prob = c(0.1, 0.9)))
X <- cbind(X, sample(c(1:5), 500, replace = TRUE, prob = rep(0.2, 5)))
X <- data.frame(X)
# Ensure every column is a factor
for (i in seq_len(ncol(X))) {
  X[, i] <- factor(X[, i])
}
# Probability vectors matching the data generating process
prob_vecs <- list(c(0.2, 0.3, 0.5),
                  c(0.1, 0.9),
                  rep(0.2, 5))
# Run SONO with true probabilities and r = 2
sono_res1 <- sono(data = X,
                  probs = prob_vecs,
                  alpha = 0.01,
                  r = 2,
                  MAXLEN = 0,
                  frequent = FALSE,
                  verbose = TRUE)
#> MAXLEN: 3
#> Power set object created.
#> Pre-processing done.
#> Outlyingness scores for discrete variables calculated.
# See summary of scores
summary(sono_res1[[2]][, 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>       0       0       0       0       0       0
As expected, since the probabilities match those used to generate the data, all scores are equal to 0. The remaining output elements of sono are MAXLEN, the matrix of variable contributions and the Nominal Outlyingness Depth of each observation. These can be used to assess which and how many variables contribute to the outlier score of each observation. To showcase the remaining functions of the SONO package, we now use misspecified probabilities for probs.
# Misspecified probability vectors
prob_vecs_mis <- list(c(0.4, 0.4, 0.2),
                      c(0.9, 0.1),
                      rep(0.2, 5))
# Run SONO with misspecified probabilities and r = 2
sono_res2 <- sono(data = X,
                  probs = prob_vecs_mis,
                  alpha = 0.01,
                  r = 2,
                  MAXLEN = 0,
                  frequent = FALSE,
                  verbose = TRUE)
#> MAXLEN: 3
#> Power set object created.
#> Pre-processing done.
#> Outlyingness scores for discrete variables calculated.
# See summary of scores
summary(sono_res2[[2]][, 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   0.000   0.000   1.281   1.566   1.821   9.356
Based on the misspecified probability vectors, we expect the first two levels of the first variable, as well as the first level of the second variable, to be flagged as infrequent, since each is assigned a probability larger than the one used to generate the data. The largest misspecification is for the first level of the second variable (0.9 against a true value of 0.1), so we expect the observations with this level to have the largest scores. The summaries below, computed for the observations possessing each of these levels, confirm this. Notice also that the minimum and mean scores for observations with the first level of the first variable are higher than the corresponding values for the second level of the same variable; this is because the misspecification is larger in the former case (0.4 against 0.2, compared with 0.4 against 0.3).
# See summary of scores for each case
summary(sono_res2[[2]][which(X[, 1] == 1), 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   1.821   1.821   1.821   2.773   1.821   9.356
summary(sono_res2[[2]][which(X[, 1] == 2), 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   1.281   1.281   1.281   2.174   1.281   8.816
summary(sono_res2[[2]][which(X[, 2] == 1), 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   7.534   7.534   7.534   8.265   8.816   9.356
We can also visualise the contribution matrix for some observations using the function vis_contribs. Its arguments are contribs_mat, the matrix of contributions; subset, which restricts the plot to the rows of specific observations (the default, NULL, plots all rows of the matrix); and scale, which controls how the values are scaled. The scaling options are "none" for no scaling (the default), "row" for row-wise scaling so that each row sums to one, and "max" so that each row has a maximum value of one. Here we apply max scaling and plot only the observations for which a non-zero score is obtained.
# Plot matrix of contributions
vis_contribs(contribs_mat = sono_res2[[3]],
             subset = which(sono_res2[[2]][, 2] > 0),
             scale = "max")
As can be seen, the third variable does not contribute at all to the outlyingness scores. The largest contribution comes from the second variable, which is the one with the largest misspecification.
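The "row" and "max" scaling options described above can be illustrated on a toy matrix in base R. This is only a sketch of the two scalings as they are described in the text, not the code used inside vis_contribs; the matrix values are made up.

```r
# Toy contribution matrix (made-up values): rows = observations, columns = variables
contribs <- matrix(c(2, 6, 0,
                     1, 1, 2),
                   nrow = 2, byrow = TRUE)

# "row"-style scaling: divide each row by its sum, so each row sums to one
row_scaled <- contribs / rowSums(contribs)

# "max"-style scaling: divide each row by its maximum, so each row has maximum one
max_scaled <- contribs / apply(contribs, 1, max)

row_scaled
max_scaled
```

A row consisting entirely of zeros would produce NaN values under either scaling in this sketch, which is one reason to restrict attention to observations with a non-zero score, as done in the call above.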
Finally, we illustrate how the evaluation functions included in the package work. The first one is avg_rank_outs, which computes the average rank of the outliers. We treat only the observations for which V2 = 1 as outlying. The function takes as input arguments the scores, a vector of outlier indices outs and a method for handling ties in the scores, ties. The default setting for ties is "min" (see the documentation for more options).
We also compute the proportion of outliers correctly detected among the top \(K\%\) of scores using the function recall_at_k, which takes as input arguments the scores, the outlier indices outs and a grid of values for \(K\). The ROC-AUC is computed using the roc_auc function, with exactly the same input arguments. In the output below, the ROC-AUC is equal to one for every value of \(K\), showing that the outliers are indeed ranked so that they have the largest outlier scores.
outliers <- which(X[, 2] == 1)
# Compute average rank of outliers
avg_rank <- avg_rank_outs(scores = sono_res2[[2]][, 2],
                          outs = outliers,
                          ties = "min")
cat('Average rank of outliers:', avg_rank, '\n')
#> Average rank of outliers: 18.7931
# Compute recall at the top K% of scores
grid_vals <- c(1, 2.5, seq(5, 100, by = 5))/100
recall <- recall_at_k(scores = sono_res2[[2]][, 2],
                      outs = outliers,
                      grid = grid_vals)
for (i in seq_along(grid_vals)) {
  cat('Recall at', grid_vals[i], ':', recall[i], '\n')
}
#> Recall at 0.01 : 0.0862069
#> Recall at 0.025 : 0.2068966
#> Recall at 0.05 : 0.4310345
#> Recall at 0.1 : 0.862069
#> Recall at 0.15 : 1
#> Recall at 0.2 : 1
#> Recall at 0.25 : 1
#> Recall at 0.3 : 1
#> Recall at 0.35 : 1
#> Recall at 0.4 : 1
#> Recall at 0.45 : 1
#> Recall at 0.5 : 1
#> Recall at 0.55 : 1
#> Recall at 0.6 : 1
#> Recall at 0.65 : 1
#> Recall at 0.7 : 1
#> Recall at 0.75 : 1
#> Recall at 0.8 : 1
#> Recall at 0.85 : 1
#> Recall at 0.9 : 1
#> Recall at 0.95 : 1
#> Recall at 1 : 1
# Compute ROC-AUC over the same grid
roc_auc_vals <- roc_auc(scores = sono_res2[[2]][, 2],
                        outs = outliers,
                        grid = grid_vals)
for (i in seq_along(grid_vals)) {
  cat('ROC AUC at', grid_vals[i], ':', roc_auc_vals[i], '\n')
}
#> ROC AUC at 0.01 : 1
#> ROC AUC at 0.025 : 1
#> ROC AUC at 0.05 : 1
#> ROC AUC at 0.1 : 1
#> ROC AUC at 0.15 : 1
#> ROC AUC at 0.2 : 1
#> ROC AUC at 0.25 : 1
#> ROC AUC at 0.3 : 1
#> ROC AUC at 0.35 : 1
#> ROC AUC at 0.4 : 1
#> ROC AUC at 0.45 : 1
#> ROC AUC at 0.5 : 1
#> ROC AUC at 0.55 : 1
#> ROC AUC at 0.6 : 1
#> ROC AUC at 0.65 : 1
#> ROC AUC at 0.7 : 1
#> ROC AUC at 0.75 : 1
#> ROC AUC at 0.8 : 1
#> ROC AUC at 0.85 : 1
#> ROC AUC at 0.9 : 1
#> ROC AUC at 0.95 : 1
#> ROC AUC at 1 : 1
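For intuition about what these metrics measure, the sketch below computes analogous quantities in base R on a toy score vector. It does not reproduce the package's internals: the names recall_toy and auc_toy are hypothetical, the use of floor() for the size of the top-K set is an assumption, and the AUC here is the standard Mann-Whitney form over the full ranking rather than over a grid of \(K\) values.

```r
# Toy scores and outlier indices (made-up values)
scores <- c(0.1, 3.2, 0.0, 2.5, 0.4, 0.0)
outs <- c(2, 4)

# Average rank of outliers: rank 1 is the largest score, ties get the minimum rank
ranks <- rank(-scores, ties.method = "min")
avg_rank_toy <- mean(ranks[outs])

# Recall at top K: share of outliers among the floor(k * n) largest scores
# (floor() is an assumption; ties are broken by position via order())
recall_toy <- function(scores, outs, k) {
  top <- order(scores, decreasing = TRUE)[seq_len(floor(k * length(scores)))]
  mean(outs %in% top)
}

# ROC-AUC over the full ranking via the Mann-Whitney statistic
auc_toy <- function(scores, outs) {
  r <- rank(scores)                 # average ranks for tied scores
  n1 <- length(outs)                # number of outliers
  n0 <- length(scores) - n1         # number of non-outliers
  (sum(r[outs]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

avg_rank_toy
#> [1] 1.5
recall_toy(scores, outs, k = 0.5)
#> [1] 1
auc_toy(scores, outs)
#> [1] 1
```

Both outliers hold the two largest scores in this toy example, so the average rank is (1 + 2)/2 = 1.5 and both the recall at the top 50% and the AUC equal one.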