Introduction to INCVCommunityDetection

Overview

INCVCommunityDetection implements Inductive Node-Splitting Cross-Validation (INCV) for selecting the number of communities in Stochastic Block Models (SBMs). The package also provides three competing methods for comparison: CROISSANT, Edge Cross-Validation (ECV), and Node Cross-Validation (NCV).

Simulating a network

We start by generating a network from a planted-partition SBM with 3 communities of 50 nodes each (150 nodes in total), within-community connection probability 0.5, and between-community connection probability 0.05.

library(INCVCommunityDetection)

set.seed(42)
net <- community.sim(k = 3, n = 150, n1 = 50, p = 0.5, q = 0.05)
table(net$membership)
#> 
#>  1  2  3 
#> 50 50 50
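
For intuition about what a planted-partition simulation involves, here is a minimal base R sketch. It is not the package's implementation of community.sim(); the helper name sim_planted_partition() and its arguments are hypothetical and only illustrate the model described above.

```r
# Hypothetical helper for illustration; not part of INCVCommunityDetection.
sim_planted_partition <- function(k, n1, p, q) {
  n <- k * n1
  membership <- rep(seq_len(k), each = n1)
  # Edge probability for every node pair: p within a block, q between blocks
  P <- ifelse(outer(membership, membership, "=="), p, q)
  A <- matrix(0L, n, n)
  upper <- upper.tri(A)
  # Draw each undirected edge once, then mirror it
  A[upper] <- rbinom(sum(upper), size = 1, prob = P[upper])
  A <- A + t(A)   # symmetric, zero diagonal (no self-loops)
  list(adjacency = A, membership = membership)
}

set.seed(1)
toy <- sim_planted_partition(k = 3, n1 = 50, p = 0.5, q = 0.05)
dim(toy$adjacency)
isSymmetric(toy$adjacency)
```

Drawing only the upper triangle and mirroring it guarantees a symmetric adjacency matrix with no self-loops.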

The adjacency matrix is a 150 × 150 binary symmetric matrix:

dim(net$adjacency)
#> [1] 150 150
ord <- order(net$membership)
image(net$adjacency[ord, ord],
      main = "Adjacency matrix (3-community SBM, reordered)",
      xlab = "Node", ylab = "Node")

Selecting K with INCV (f-fold)

The main function nscv.f.fold() partitions the nodes into f folds. For each fold and each candidate K in k.vec, it runs spectral clustering on the subgraph of training nodes, assigns the held-out nodes to communities based on their connections to training nodes, and computes the held-out negative log-likelihood and MSE.

result <- nscv.f.fold(net$adjacency, k.vec = 2:6, f = 5)
result$k.loss   # K selected by neg-log-likelihood
#> [1] 3
result$k.mse    # K selected by MSE
#> [1] 3
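
To make the node-splitting idea concrete, the following base R sketch walks through a single fold for a single K. It is an illustration under stated assumptions, not the code inside nscv.f.fold(): we assume spectral clustering means k-means on the leading eigenvectors of the training adjacency matrix, that held-out nodes are assigned by maximum likelihood given their edges to training nodes, and that the loss is evaluated on edges among held-out nodes.

```r
set.seed(1)
# Toy 3-block network with the same settings as above, built with base R only
n <- 150; z <- rep(1:3, each = 50)
P <- ifelse(outer(z, z, "=="), 0.5, 0.05)
A <- matrix(0, n, n); up <- upper.tri(A)
A[up] <- rbinom(sum(up), 1, P[up]); A <- A + t(A)

f <- 5; K <- 3
fold <- sample(rep(seq_len(f), length.out = n))   # random fold label per node
test <- which(fold == 1); train <- which(fold != 1)

# 1. Spectral clustering on the training subgraph
A11 <- A[train, train]
eig <- eigen(A11, symmetric = TRUE)
top <- order(abs(eig$values), decreasing = TRUE)[seq_len(K)]
cl_train <- kmeans(eig$vectors[, top], centers = K, nstart = 20)$cluster

# 2. Block probabilities estimated from training edges only
B <- matrix(0, K, K)
for (a in 1:K) for (b in 1:K) {
  blk <- A11[cl_train == a, cl_train == b, drop = FALSE]
  B[a, b] <- if (a == b) {
    sum(blk) / (nrow(blk) * (nrow(blk) - 1))   # exclude the zero diagonal
  } else {
    mean(blk)
  }
}
B <- pmin(pmax(B, 1e-6), 1 - 1e-6)   # keep the logarithms finite

# 3. Assign each held-out node to the block that best explains
#    its edges to the training nodes
A21 <- A[test, train, drop = FALSE]
loglik <- sapply(1:K, function(a) {
  pa <- B[a, cl_train]   # edge probability to each training node under block a
  A21 %*% log(pa) + (1 - A21) %*% log(1 - pa)
})
cl_test <- max.col(loglik)

# 4. Held-out negative log-likelihood on edges among held-out nodes
A22 <- A[test, test]; P22 <- B[cl_test, cl_test]
ut <- upper.tri(A22)
nll <- -sum(A22[ut] * log(P22[ut]) + (1 - A22[ut]) * log(1 - P22[ut]))
nll
table(cl_test, z[test])
```

Repeating these steps over all folds and all candidate K, then averaging the loss per K, gives a curve whose minimiser is the selected number of communities.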

We can inspect the full CV loss curve:

plot(2:6, result$cv.loss, type = "b", pch = 19,
     xlab = "Number of communities (K)",
     ylab = "CV Negative Log-Likelihood",
     main = "INCV f-fold: CV loss by K")
abline(v = result$k.loss, lty = 2, col = "red")

Selecting K with INCV (random split)

An alternative is to use repeated random node splits instead of fixed folds:

result2 <- nscv.random.split(net$adjacency, k.vec = 2:6,
                             split = 0.66, ite = 20)
result2$k.chosen
#> [1] 3
plot(2:6, result2$cv.loss, type = "b", pch = 19,
     xlab = "Number of communities (K)",
     ylab = "CV Negative Log-Likelihood",
     main = "INCV random-split: CV loss by K")
abline(v = result2$k.chosen, lty = 2, col = "red")
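
The difference between the two splitting schemes can be sketched in a few lines of base R. This is an illustration only: we assume that split is the fraction of nodes used for training and ite the number of repetitions, which the text above does not state explicitly.

```r
set.seed(1)
n <- 150; split <- 0.66; ite <- 20

# One random training set per iteration; the remaining nodes are held out
train_sets <- replicate(ite, sample(n, size = round(split * n)), simplify = FALSE)
test_sets  <- lapply(train_sets, function(tr) setdiff(seq_len(n), tr))

# Unlike f-fold CV, a node can be held out in several iterations or in none
held_out_count <- tabulate(unlist(test_sets), nbins = n)
range(held_out_count)
```

With f folds every node is held out exactly once, whereas repeated random splits trade that balance for the freedom to choose the training fraction and the number of repetitions independently.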

Comparing with ECV and NCV

Edge Cross-Validation

ECV holds out random edges and evaluates the predictive fit of a blockmodel reconstruction on them. It jointly selects the model type, SBM or degree-corrected block model (DCBM), and the number of communities.

ecv <- ECV.for.blockmodel(net$adjacency, max.K = 6, B = 3)
ecv$dev.model   # best by deviance
#> [1] "SBM-3"
ecv$l2.model    # best by L2
#> [1] "SBM-3"
ecv$auc.model   # best by AUC
#> [1] "SBM-6"
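
The edge-holdout idea can be sketched in base R as follows. This is a simplified stand-in, not what ECV.for.blockmodel() necessarily does: instead of refitting a blockmodel, we assume a plain rank-K truncated SVD of the masked, rescaled adjacency matrix as the reconstruction and score it by squared error on the held-out node pairs. The 10% holdout fraction is also an assumption.

```r
set.seed(1)
# Toy 3-block network with the same settings as above
n <- 150; z <- rep(1:3, each = 50)
P <- ifelse(outer(z, z, "=="), 0.5, 0.05)
A <- matrix(0, n, n); up <- upper.tri(A)
A[up] <- rbinom(sum(up), 1, P[up]); A <- A + t(A)

# Hold out 10% of node pairs from the upper triangle, mirrored for symmetry
pairs <- which(up)
held <- sample(pairs, size = round(0.1 * length(pairs)))
mask <- matrix(FALSE, n, n); mask[held] <- TRUE
mask <- mask | t(mask)

# Zero out held-out pairs and rescale to compensate for the missing entries
A_obs <- A; A_obs[mask] <- 0
A_obs <- A_obs / 0.9

# Held-out squared error of the rank-K reconstruction for K = 1, ..., 6
sv <- svd(A_obs)
l2 <- sapply(1:6, function(K) {
  A_hat <- sv$u[, 1:K, drop = FALSE] %*%
    (sv$d[1:K] * t(sv$v[, 1:K, drop = FALSE]))
  mean((A[mask] - A_hat[mask])^2)
})
round(l2, 4)
```

Because the held-out pairs are never seen by the reconstruction, the error curve penalises ranks that only fit noise.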

Node Cross-Validation

NCV holds out random nodes and evaluates predictions on the held-out sub-network:

ncv <- NCV.for.blockmodel(net$adjacency, max.K = 6, cv = 3)
ncv$dev.model
#> [1] "SBM-3"
ncv$l2.model
#> [1] "SBM-3"

Summary of methods

| Method      | Function               | Splits                 | Selects K | Selects model type |
|-------------|------------------------|------------------------|-----------|--------------------|
| INCV f-fold | nscv.f.fold()          | Nodes into f folds     | Yes       | No (SBM only)      |
| INCV random | nscv.random.split()    | Random node split      | Yes       | No (SBM only)      |
| ECV         | ECV.for.blockmodel()   | Random edge holdout    | Yes       | Yes (SBM vs DCBM)  |
| NCV         | NCV.for.blockmodel()   | Node folds             | Yes       | Yes (SBM vs DCBM)  |
| CROISSANT   | croissant.blockmodel() | Overlapping subsamples | Yes       | Yes (SBM vs DCBM)  |

Spectral clustering and probability estimation

The building blocks used by these methods, spectral clustering and block-probability estimation, can also be called directly:

cl <- SBM.spectral.clustering(net$adjacency, k = 3)
table(cl$cluster)
#> 
#>  1  2  3 
#> 50 50 50

prob <- SBM.prob(cl$cluster, k = 3, A = net$adjacency, restricted = TRUE)
round(prob$p.matrix, 3)
#>       [,1]  [,2]  [,3]
#> [1,] 0.502 0.050 0.050
#> [2,] 0.050 0.502 0.050
#> [3,] 0.050 0.050 0.502
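
The output above has one common value on the diagonal and one common value off the diagonal. A base R sketch of an estimator with that structure is shown below. It assumes that restricted = TRUE means pooling all within-community pairs into one estimate and all between-community pairs into another; this reading is inferred from the output, not from the package documentation.

```r
set.seed(1)
# Toy 3-block network with known labels z
z <- rep(1:3, each = 50); n <- length(z)
P <- ifelse(outer(z, z, "=="), 0.5, 0.05)
A <- matrix(0, n, n); up <- upper.tri(A)
A[up] <- rbinom(sum(up), 1, P[up]); A <- A + t(A)

same <- outer(z, z, "==")
p_hat <- mean(A[up & same])    # pooled within-community edge density
q_hat <- mean(A[up & !same])   # pooled between-community edge density

p_matrix <- matrix(q_hat, 3, 3)
diag(p_matrix) <- p_hat
round(p_matrix, 3)
```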

Distance-decaying SBM simulation

For more realistic simulations, community.sim.sbm() generates networks in which block connection probabilities decay with the distance between community indices:

net2 <- community.sim.sbm(n = 120, n1 = 40, eta = 0.3, rho = 0.2, K = 4)
round(net2$conn, 4)
#>        [,1]  [,2]  [,3]   [,4]
#> [1,] 0.2000 0.060 0.018 0.0054
#> [2,] 0.0600 0.200 0.060 0.0180
#> [3,] 0.0180 0.060 0.200 0.0600
#> [4,] 0.0054 0.018 0.060 0.2000
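
The printed matrix is consistent with a geometric decay of the form rho × eta^|i − j|. The snippet below reproduces those values in base R; the formula is inferred from the output above and is not taken from the package source.

```r
rho <- 0.2; eta <- 0.3; K <- 4

# Connection probability between communities i and j: rho * eta^|i - j|
conn <- rho * eta^abs(outer(seq_len(K), seq_len(K), "-"))
round(conn, 4)
```

Here rho sets the within-community probability on the diagonal, and each additional step between community indices multiplies the probability by eta.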

Session info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sonoma 14.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US/en_US/C/en_US/en_US
#> 
#> time zone: America/Los_Angeles
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] INCVCommunityDetection_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-4          mvnfast_0.2.8         gtable_0.3.6         
#>  [4] jsonlite_2.0.0        compiler_4.5.2        Rcpp_1.1.1           
#>  [7] slam_0.1-55           parallel_4.5.2        cluster_2.1.8.2      
#> [10] jquerylib_0.1.4       scales_1.4.0          yaml_2.3.12          
#> [13] fastmap_1.2.0         lattice_0.22-7        ggplot2_4.0.2        
#> [16] R6_2.6.1              knitr_1.51            zigg_0.0.2           
#> [19] bslib_0.10.0          RColorBrewer_1.1-3    rlang_1.1.7          
#> [22] cachem_1.1.0          ClusterR_1.3.6        xfun_0.56            
#> [25] sass_0.4.10           S7_0.2.1              RcppParallel_5.1.11-2
#> [28] otel_0.2.0            viridisLite_0.4.3     cli_3.6.5            
#> [31] digest_0.6.39         grid_4.5.2            irlba_2.3.7          
#> [34] gmp_0.7-5.1           mclust_6.1.2          lifecycle_1.0.5      
#> [37] vctrs_0.7.1           Rfast_2.1.5.2         data.table_1.18.2.1  
#> [40] IMIFA_2.2.0           RSpectra_0.16-2       evaluate_1.0.5       
#> [43] glue_1.8.0            farver_2.1.2          rmarkdown_2.30       
#> [46] matrixStats_1.5.0     tools_4.5.2           htmltools_0.5.9