Introduction to the SONO package

The SONO (Scores Of Nominal Outlyingness) package is used to compute scores of outlyingness for data sets consisting of nominal features. The scores are computed using the framework of Costa, E., & Papatsouma, I. (2025). SONO also includes several functions that compute evaluation metrics typically used for outlier identification algorithms that produce scores.

The main function of the package is sono which computes the scores. The score of nominal outlyingness for an observation \(\boldsymbol{x}_i\) is given by: \[s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin (\sigma_d, n], \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\sigma_d}{\text{supp}(d) \times \lvert d \rvert^r}, \ r> 0, \ i=1,\dots,n,\] for highly infrequent itemsets and: \[s(\boldsymbol{x}_i)=\sum_{\substack{d \subseteq \boldsymbol{x}_{i}: \\ \text{supp}(d) \notin [0, \sigma_d), \\ \lvert d \rvert \leq \mathrm{MAXLEN}}} \frac{\text{supp}(d)}{\sigma_d \times \left( \text{MAXLEN} - \lvert d \rvert + 1 \right)^r}, \ r> 0, \ i=1,\dots,n,\] for highly frequent itemsets. In the above, \(\text{supp}(d)\) is the support of itemset \(d\), \(\sigma_d\) is the the maximum/minimum support threshold and \(\text{MAXLEN}\) is the maximum length of sequences considered, while \(r\) is an exponent term to be determined by the user.

The sono function only requires two input arguments; a data frame data that needs to contain factors only and a list of probability vectors probs. Each element in probs must be a probability vector consisting of as many elements as the number of unique factors in the corresponding column of data. The length of probs must be equal to ncol(data).

Additional input arguments include alpha, r, MAXLEN, frequent and verbose. These are set by default equal to 0.01, 2, 0, FALSE and TRUE, respectively but they can be changed by the user. alpha is the significance level of the confidence interval constructed for determining the minimum/maximum support threshold values \(\sigma_d\). The exponent r is as in the definition of the score and same for MAXLEN. Setting MAXLEN = 0 (the default option) triggers an automatic search for the value of MAXLEN so as to ensure that no redundant computations are being done, while accounting for the sparsity in the contingency tables introduced by considering several nominal variables. This can be estimated beforehand using the MAXLEN_est function, so that the user can set it to a lower value (but not larger) if needed. Finally, frequent = FALSE considers highly infrequent itemsets as outlying, whereas frequent = TRUE treats highly frequent itemsets as more likely to be outliers. Progress messages are printed by setting verbose = TRUE.

Below, we generate an artificial data set and illustrate how sono works.

library(SONO)
# Generate data
set.seed(1)
X <- sample(c(1:3), 500, replace = TRUE, prob = c(0.2, 0.3, 0.5))
X <- cbind(X, sample(c(1:2), 500, replace = TRUE, prob = c(0.1, 0.9)))
X <- cbind(X, sample(c(1:5), 500, replace = TRUE, prob = rep(0.2, 5)))
X <- data.frame(X)
# Ensure every column is a factor
for (i in 1:ncol(X)){
  X[, i] <- factor(X[, i])
}

# Run SONO with probability vectors matching data generating process
prob_vecs <- list(c(0.2, 0.3, 0.5),
                  c(0.1, 0.9),
                  rep(0.2, 5))
# Run SONO with true probabilities and r = 2
sono_res1 <- sono(data = X,
                  probs = prob_vecs,
                  alpha = 0.01,
                  r = 2,
                  MAXLEN = 0,
                  frequent = FALSE,
                  verbose = TRUE)
#> MAXLEN: 3
#> Power set object created.
#> Pre-processing done.
#> Outlyingness scores for discrete variables calculated.
# See summary of scores
summary(sono_res1[[2]][, 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       0       0       0       0       0       0

As expected, since the probabilities match the ones used to generate the data, all scores are equal to 0. The rest of the output elements of sono are MAXLEN, the matrix of variable contributions and the Nominal Outlyingness Depth for each observation. These additional concepts can be used to assess which and how many variables contribute to the outlier score of each observation. In order to showcase the rest of the functions of the SONO package, we use misspecified probabilities for probs.

# Run SONO with misspecified probability vectors
prob_vecs_mis <- list(c(0.4, 0.4, 0.2),
                      c(0.9, 0.1),
                      rep(0.2, 5))
# Run SONO with true probabilities and r = 2
sono_res2 <- sono(data = X,
                  probs = prob_vecs_mis,
                  alpha = 0.01,
                  r = 2,
                  MAXLEN = 0,
                  frequent = FALSE,
                  verbose = TRUE)
#> MAXLEN: 3
#> Power set object created.
#> Pre-processing done.
#> Outlyingness scores for discrete variables calculated.
# See summary of scores
summary(sono_res2[[2]][, 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   0.000   0.000   1.281   1.566   1.821   9.356

Based on the misspecified probability vectors, we expect that the first 2 levels of the first variable are infrequent, as well as the first level of the second variable. The largest misspecification is for the latter, so we expect these observations to have the largest scores; we print summaries for the scores of each observation possessing these aforementioned levels, confirming our claim. Also notice how the minimum score and the mean score for observations possessing the first level of the first nominal variable are higher than the respective values for the second level of the same feature; this is because the misspecification is larger in the former case.

# See summary of scores for each case
summary(sono_res2[[2]][which(X[, 1] == 1), 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.821   1.821   1.821   2.773   1.821   9.356
summary(sono_res2[[2]][which(X[, 1] == 2), 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   1.281   1.281   1.281   2.174   1.281   8.816
summary(sono_res2[[2]][which(X[, 2] == 1), 2])
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   7.534   7.534   7.534   8.265   8.816   9.356

We can also visualise the contribution matrix for some observations. The function vis_contribs offers this possibility. The arguments include contribs_mat for the matrix of contributions, subset which only plots the rows for specific observations (default is NULL plots all rows of the matrix) and scale for scaling the values. The possible scaling options are "none" for no scaling (which is the default option), "row" for row-wise scaling so that each row sums to a unit and "max", so that each row has a maximum value of a unit. Here we apply max scaling and only plot the observations for which a non-zero score is obtained.

# Plot matrix of contributions
vis_contribs(contribs_mat = sono_res2[[3]],
             subset = which(sono_res2[[2]][, 2] > 0),
             scale = "max")

As can be seen, the third variable does not contribute at all to the clustering. The largest contribution is for the second variable, which is the one with the largest misspecification.

We finally illustrate how the evaluation functions included in the package work. The first one is avg_rank_outs, which computes the average rank of outliers. We will only treat the observations for which V2 = 1 as outlying. The function takes as input arguments the scores, a vector of outlier indices outs and a way to handle ties in scores. The default setting for ties is "min" (see the documentation for more options).

We also compute the proportion of outliers correctly detected at the Top \(K\%\) scores; this is the function recall_at_k that takes as input argument the scores, the outlier indices outs and a grid for the values of \(K\). The ROC-AUC is also computed using the roc_auc function, with the exact same input arguments. We see that the ROC-AUC is constantly equal to a unit, showing that the outliers are indeed ranked so that they have the largest outlier scores.

outliers <- which(X[, 2] == 1)
# Compute average rank of outliers
avg_rank <- avg_rank_outs(scores = sono_res2[[2]][, 2],
                          outs = outliers,
                          ties = "min")
cat('Average rank of outliers:', avg_rank, '\n')
#> Average rank of outliers: 18.7931
grid_vals <- c(1, 2.5, seq(5, 100, by = 5))/100
recall <- recall_at_k(scores = sono_res2[[2]][, 2],
                      outs = outliers,
                      grid = grid_vals)
for (i in 1:length(grid_vals)){
  cat('Recall at', grid_vals[i], ':', recall[i], '\n')
}
#> Recall at 0.01 : 0.0862069 
#> Recall at 0.025 : 0.2068966 
#> Recall at 0.05 : 0.4310345 
#> Recall at 0.1 : 0.862069 
#> Recall at 0.15 : 1 
#> Recall at 0.2 : 1 
#> Recall at 0.25 : 1 
#> Recall at 0.3 : 1 
#> Recall at 0.35 : 1 
#> Recall at 0.4 : 1 
#> Recall at 0.45 : 1 
#> Recall at 0.5 : 1 
#> Recall at 0.55 : 1 
#> Recall at 0.6 : 1 
#> Recall at 0.65 : 1 
#> Recall at 0.7 : 1 
#> Recall at 0.75 : 1 
#> Recall at 0.8 : 1 
#> Recall at 0.85 : 1 
#> Recall at 0.9 : 1 
#> Recall at 0.95 : 1 
#> Recall at 1 : 1
roc_auc_vals <- roc_auc(scores = sono_res2[[2]][, 2],
                        outs = outliers,
                        grid = grid_vals)
for (i in 1:length(grid_vals)){
  cat('ROC AUC at', grid_vals[i], ':', roc_auc_vals[i], '\n')
}
#> ROC AUC at 0.01 : 1 
#> ROC AUC at 0.025 : 1 
#> ROC AUC at 0.05 : 1 
#> ROC AUC at 0.1 : 1 
#> ROC AUC at 0.15 : 1 
#> ROC AUC at 0.2 : 1 
#> ROC AUC at 0.25 : 1 
#> ROC AUC at 0.3 : 1 
#> ROC AUC at 0.35 : 1 
#> ROC AUC at 0.4 : 1 
#> ROC AUC at 0.45 : 1 
#> ROC AUC at 0.5 : 1 
#> ROC AUC at 0.55 : 1 
#> ROC AUC at 0.6 : 1 
#> ROC AUC at 0.65 : 1 
#> ROC AUC at 0.7 : 1 
#> ROC AUC at 0.75 : 1 
#> ROC AUC at 0.8 : 1 
#> ROC AUC at 0.85 : 1 
#> ROC AUC at 0.9 : 1 
#> ROC AUC at 0.95 : 1 
#> ROC AUC at 1 : 1