Introduction to INCVCommunityDetection

Overview

INCVCommunityDetection implements Inductive Node-Splitting Cross-Validation (INCV) for selecting the number of communities K in Stochastic Block Models (SBM). The package also provides three competing methods, CROISSANT, Edge Cross-Validation (ECV), and Node Cross-Validation (NCV), so that their selections can be compared on the same network.

Simulating a network

We start by generating a network from a planted-partition SBM with 3 communities of 50 nodes each (150 nodes in total), within-community connection probability 0.5, and between-community connection probability 0.05.

library(INCVCommunityDetection)

set.seed(42)
net <- community.sim(k = 3, n = 150, n1 = 50, p = 0.5, q = 0.05)
table(net$membership)
#> 
#>  1  2  3 
#> 50 50 50

The adjacency matrix is a 150 × 150 binary symmetric matrix:

dim(net$adjacency)
#> [1] 150 150
ord <- order(net$membership)
image(net$adjacency[ord, ord],
      main = "Adjacency matrix (3-community SBM, reordered)",
      xlab = "Node", ylab = "Node")
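
To make the generative model concrete, here is a minimal base-R sketch of a planted-partition draw. It is not the package's implementation: sim_planted_partition() is a hypothetical helper, and it assumes undirected edges, no self-loops, and equal-sized communities.

```r
# Base-R sketch of a planted-partition SBM draw (not the package's code).
# sim_planted_partition() is a hypothetical helper used only for illustration.
sim_planted_partition <- function(k, n1, p, q) {
  n <- k * n1
  membership <- rep(seq_len(k), each = n1)
  # Edge probability for every node pair: p within a block, q across blocks
  P <- ifelse(outer(membership, membership, "=="), p, q)
  A <- matrix(0L, n, n)
  upper <- upper.tri(A)
  # Draw each undirected edge once, then mirror it below the diagonal
  A[upper] <- rbinom(sum(upper), size = 1, prob = P[upper])
  A <- A + t(A)
  list(adjacency = A, membership = membership)
}

set.seed(1)
toy <- sim_planted_partition(k = 3, n1 = 50, p = 0.5, q = 0.05)
dim(toy$adjacency)
isSymmetric(toy$adjacency)
```

Drawing only the upper triangle and mirroring it guarantees a symmetric matrix with a zero diagonal.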

Selecting K with INCV (f-fold)

The main function, nscv.f.fold(), partitions the nodes into f folds. For each fold and each candidate K in k.vec, it runs spectral clustering on the training subgraph, assigns the held-out nodes to communities based on their connections to training nodes, and computes the held-out negative log-likelihood and MSE. Each criterion yields its own choice of K.

result <- nscv.f.fold(net$adjacency, k.vec = 2:6, f = 5)
result$k.loss   # K selected by neg-log-likelihood
#> [1] 3
result$k.mse    # K selected by MSE
#> [1] 3

We can inspect the full CV loss curve:

plot(2:6, result$cv.loss, type = "b", pch = 19,
     xlab = "Number of communities (K)",
     ylab = "CV Negative Log-Likelihood",
     main = "INCV f-fold: CV loss by K")
abline(v = result$k.loss, lty = 2, col = "red")
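
The two ingredients described in this section, a node-fold split and a held-out negative log-likelihood, can be sketched in base R. This is an illustration under stated assumptions, not the package's code: heldout_nll() is a hypothetical helper, the community labels and block probabilities are taken as given rather than estimated, and only pairs of held-out nodes are scored (the vignette does not say which pairs nscv.f.fold() uses).

```r
# Hypothetical helper (not part of INCVCommunityDetection): Bernoulli negative
# log-likelihood of the node pairs among held-out nodes.
heldout_nll <- function(A, labels, B, test, eps = 1e-10) {
  P <- B[labels[test], labels[test], drop = FALSE]  # fitted edge probabilities
  P <- pmin(pmax(P, eps), 1 - eps)                  # guard against log(0)
  A_test <- A[test, test, drop = FALSE]
  up <- upper.tri(A_test)                           # count each pair once
  -sum(A_test[up] * log(P[up]) + (1 - A_test[up]) * log(1 - P[up]))
}

# Toy two-block network (assumed parameters, for illustration only)
set.seed(1)
n <- 60; f <- 5
labels <- rep(1:2, each = n / 2)
B_true <- matrix(c(0.6, 0.05, 0.05, 0.6), 2, 2)
A <- matrix(0, n, n)
up <- upper.tri(A)
A[up] <- rbinom(sum(up), 1, B_true[labels, labels][up])
A <- A + t(A)

# Balanced random assignment of nodes to f folds; fold 1 is held out
folds <- sample(rep(seq_len(f), length.out = n))
test <- which(folds == 1)

B_flat <- matrix(mean(A[up]), 2, 2)   # one-block alternative with equal density
c(two_blocks = heldout_nll(A, labels, B_true, test),
  one_block  = heldout_nll(A, labels, B_flat, test))
```

A model that matches the block structure should generally give the smaller held-out loss, which is the signal a cross-validated choice of K relies on.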

Selecting K with INCV (random split)

An alternative is to use repeated random node splits instead of fixed folds:

result2 <- nscv.random.split(net$adjacency, k.vec = 2:6,
                             split = 0.66, ite = 20)
result2$k.chosen
#> [1] 3

plot(2:6, result2$cv.loss, type = "b", pch = 19,
     xlab = "Number of communities (K)",
     ylab = "CV Negative Log-Likelihood",
     main = "INCV random-split: CV loss by K")
abline(v = result2$k.chosen, lty = 2, col = "red")

Comparing with ECV and NCV

Edge Cross-Validation

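ECV holds out random edges and evaluates how well a blockmodel reconstruction predicts them. It selects the model type, SBM or degree-corrected block model (DCBM), jointly with K. The three criteria it reports need not agree: in the run below, deviance and L2 both select SBM-3, while AUC selects SBM-6.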

ecv <- ECV.for.blockmodel(net$adjacency, max.K = 6, B = 3)
ecv$dev.model   # best by deviance
#> [1] "SBM-3"
ecv$l2.model    # best by L2
#> [1] "SBM-3"
ecv$auc.model   # best by AUC
#> [1] "SBM-6"
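
The edge-holdout step can be sketched in base R as follows. This only illustrates the idea of hiding a random set of node pairs; the 10% holdout fraction and all variable names are assumptions, and the sketch does not reproduce how ECV.for.blockmodel() reconstructs or scores the hidden entries.

```r
# Sketch of a random edge (node-pair) holdout on a toy graph.
set.seed(1)
n <- 20
A <- matrix(0L, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, 0.3)
A <- A + t(A)

# Enumerate all node pairs i < j and hold out an assumed 10% of them
pairs <- which(upper.tri(A), arr.ind = TRUE)
held <- pairs[sample(nrow(pairs), size = round(0.1 * nrow(pairs))), ,
              drop = FALSE]

A_train <- A
A_train[held] <- NA                        # hide the held-out pairs
A_train[held[, 2:1, drop = FALSE]] <- NA   # and their mirror entries

# Observed values of the held-out pairs, to compare against predictions
y_held <- A[held]
```

Unlike a node split, every node stays in the training graph; only individual entries of the adjacency matrix are hidden.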

Node Cross-Validation

NCV holds out random nodes and evaluates predictions on the held-out sub-network:

ncv <- NCV.for.blockmodel(net$adjacency, max.K = 6, cv = 3)
ncv$dev.model
#> [1] "SBM-3"
ncv$l2.model
#> [1] "SBM-3"

Summary of methods

| Method | Function | Splits | Selects K | Selects model type |
|---|---|---|---|---|
| INCV f-fold | `nscv.f.fold()` | Nodes into f folds | Yes | No (SBM only) |
| INCV random | `nscv.random.split()` | Random node split | Yes | No (SBM only) |
| ECV | `ECV.for.blockmodel()` | Random edge holdout | Yes | Yes (SBM vs DCBM) |
| NCV | `NCV.for.blockmodel()` | Node folds | Yes | Yes (SBM vs DCBM) |
| CROISSANT | `croissant.blockmodel()` | Overlapping subsamples | Yes | Yes (SBM vs DCBM) |

Spectral clustering and probability estimation

The building blocks, spectral clustering and block-probability estimation, can also be called directly:

cl <- SBM.spectral.clustering(net$adjacency, k = 3)
table(cl$cluster)
#> 
#>  1  2  3 
#> 50 50 50

prob <- SBM.prob(cl$cluster, k = 3, A = net$adjacency, restricted = TRUE)
round(prob$p.matrix, 3)
#>       [,1]  [,2]  [,3]
#> [1,] 0.502 0.050 0.050
#> [2,] 0.050 0.502 0.050
#> [3,] 0.050 0.050 0.502
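
The estimate above has one common value on the diagonal and one common value off it, which is what a pooled within-community and between-community density would produce. The sketch below computes such a pooled estimate in base R. It is an assumption about what restricted = TRUE means, inferred from the printed output, and pooled_block_probs() is a hypothetical helper rather than the package's code.

```r
# Hypothetical helper: pooled within- and between-community edge densities.
pooled_block_probs <- function(A, labels, k) {
  same <- outer(labels, labels, "==")   # TRUE where both nodes share a label
  up <- upper.tri(A)                    # count each node pair once
  p_in <- mean(A[up & same])            # density over within-community pairs
  p_out <- mean(A[up & !same])          # density over between-community pairs
  B <- matrix(p_out, k, k)
  diag(B) <- p_in
  B
}

# Toy graph: 4 nodes, labels (1, 1, 2, 2), edges 1-2, 3-4, and 1-3
A_toy <- matrix(0, 4, 4)
A_toy[1, 2] <- A_toy[3, 4] <- A_toy[1, 3] <- 1
A_toy <- A_toy + t(A_toy)
pooled_block_probs(A_toy, labels = c(1, 1, 2, 2), k = 2)
# Both within-community pairs are edges (p_in = 1);
# one of the four between-community pairs is an edge (p_out = 0.25).
```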

Distance-decaying SBM simulation

For more realistic simulations, community.sim.sbm() generates networks in which the connection probability between two communities decays with the distance between their indices:

net2 <- community.sim.sbm(n = 120, n1 = 40, eta = 0.3, rho = 0.2, K = 4)
round(net2$conn, 4)
#>        [,1]  [,2]  [,3]   [,4]
#> [1,] 0.2000 0.060 0.018 0.0054
#> [2,] 0.0600 0.200 0.060 0.0180
#> [3,] 0.0180 0.060 0.200 0.0600
#> [4,] 0.0054 0.018 0.060 0.2000
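
The printed matrix is consistent with the closed form conn[i, j] = rho * eta^|i - j|: rho is the within-community probability, and each step in index distance multiplies it by eta. This reading is inferred from the output above rather than taken from the package source; under that assumption, the matrix can be reproduced in base R:

```r
# Reproduce the block-probability matrix under the assumed form
# conn[i, j] = rho * eta^|i - j| (inferred from the printed output).
eta <- 0.3
rho <- 0.2
K <- 4
conn <- rho * eta^abs(outer(seq_len(K), seq_len(K), "-"))
round(conn, 4)
```

For example, communities 1 and 4 are three steps apart, so their connection probability is 0.2 × 0.3³ = 0.0054.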

Session info

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sonoma 14.6.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.5-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
#> 
#> locale:
#> [1] C/en_US/en_US/C/en_US/en_US
#> 
#> time zone: America/Los_Angeles
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] INCVCommunityDetection_0.1.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] Matrix_1.7-4          mvnfast_0.2.8         gtable_0.3.6         
#>  [4] jsonlite_2.0.0        compiler_4.5.2        Rcpp_1.1.1           
#>  [7] slam_0.1-55           parallel_4.5.2        cluster_2.1.8.2      
#> [10] jquerylib_0.1.4       scales_1.4.0          yaml_2.3.12          
#> [13] fastmap_1.2.0         lattice_0.22-7        ggplot2_4.0.2        
#> [16] R6_2.6.1              knitr_1.51            zigg_0.0.2           
#> [19] bslib_0.10.0          RColorBrewer_1.1-3    rlang_1.1.7          
#> [22] cachem_1.1.0          ClusterR_1.3.6        xfun_0.56            
#> [25] sass_0.4.10           S7_0.2.1              RcppParallel_5.1.11-2
#> [28] otel_0.2.0            viridisLite_0.4.3     cli_3.6.5            
#> [31] digest_0.6.39         grid_4.5.2            irlba_2.3.7          
#> [34] gmp_0.7-5.1           mclust_6.1.2          lifecycle_1.0.5      
#> [37] vctrs_0.7.1           Rfast_2.1.5.2         data.table_1.18.2.1  
#> [40] IMIFA_2.2.0           RSpectra_0.16-2       evaluate_1.0.5       
#> [43] glue_1.8.0            farver_2.1.2          rmarkdown_2.30       
#> [46] matrixStats_1.5.0     tools_4.5.2           htmltools_0.5.9