--- title: "Introduction to `ssp.glm`: Subsampling for Generalized Linear Models" output: rmarkdown::html_vignette bibliography: references.bib vignette: > %\VignetteIndexEntry{Introduction to `ssp.glm`: Subsampling for Generalized Linear Models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` This vignette introduces the usage of the `ssp.glm` using logistic regression as an example of generalized linear models (GLM). The statistical theory and algorithms in this implementation can be found in the relevant reference papers. The log-likelihood function for a GLM is $$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left\{y_i u(\beta^{\top} x_i) - \psi \left[ u(\beta^{\top} x_i) \right] \right\}, $$ where $u$ and $\psi$ are known functions depend on the distribution from the exponential family. For the binomial distribution, the log-likelihood function becomes $$ \max_{\beta} L(\beta) = \frac{1}{N} \sum_{i=1}^N \left[y_i \beta^{\top} x_i - \log\left(1 + e^{\beta^\top x_i}\right) \right]. $$ The idea of subsampling methods is as follows: instead of fitting the model on the size $N$ full dataset, a subsampling probability is assigned to each observation and a smaller, informative subsample is drawn. The model is then fitted on the subsample to obtain an estimator with reduced computational cost. ## Installation You can install the development version of subsampling from [GitHub](https://github.com/) with: ``` r # install.packages("devtools") devtools::install_github("dqksnow/Subsampling") ``` ```{r setup} library(subsampling) ``` ## Terminology - Full dataset: The whole dataset used as input. - Full data estimator: The estimator obtained by fitting the model on the full dataset. - Subsample: A subset of observations drawn from the full dataset. - Subsample estimator: The estimator obtained by fitting the model on the subsample. - Subsampling probability ($\pi$): The probability assigned to each observation for inclusion in the subsample. ## Example: Logistic Regression with Simulated Data We introduce the usage of `ssp.glm` with simulated data. $X$ contains $d=6$ covariates drawn from multinormal distribution, and $Y$ is the binary response variable. The full dataset size is $N = 1 \times 10^4$. ```{r} set.seed(1) N <- 1e4 beta0 <- rep(-0.5, 7) d <- length(beta0) - 1 corr <- 0.5 sigmax <- matrix(corr, d, d) + diag(1-corr, d) X <- MASS::mvrnorm(N, rep(0, d), sigmax) colnames(X) <- paste("V", 1:ncol(X), sep = "") P <- 1 - 1 / (1 + exp(beta0[1] + X %*% beta0[-1])) Y <- rbinom(N, 1, P) data <- as.data.frame(cbind(Y, X)) formula <- Y ~ . head(data) ``` ## Key Arguments The function usage is ```{r, eval = FALSE} ssp.glm( formula, data, subset = NULL, n.plt, n.ssp, family = "quasibinomial", criterion = "optL", sampling.method = "poisson", likelihood = "weighted", control = list(...), contrasts = NULL, ... ) ``` The core functionality of `ssp.glm` revolves around three key questions: - How are subsampling probabilities computed? (Controlled by the `criterion` argument) - How is the subsample drawn? (Controlled by the `sampling.method` argument) - How is the likelihood adjusted to correct for bias? (Controlled by the `likelihood` argument) ### `criterion` The choices of `criterion` include `optA`, `optL`(default), `LCC` and `uniform`. The optimal subsampling criterion `optA` and `optL` are derived by minimizing the asymptotic covariance of subsample estimator, proposed by @wang2018optimal. `LCC` and `uniform` are baseline methods. ### `sampling.method` The options for the `sampling.method` argument include `withReplacement` and `poisson` (default). `withReplacement` stands for drawing `n.ssp` subsamples from full dataset with replacement, using the specified subsampling probabilities. `poisson` stands for drawing subsamples one by one by comparing the subsampling probability with a realization of uniform random variable $U(0,1)$. The expected number of drawn samples are `n.ssp`. More details see @wang2019more. ### `likelihood` The available choices for `likelihood` include `weighted` (default) and `logOddsCorrection`. Both of these likelihood functions can derive an unbiased estimator. Theoretical results indicate that `logOddsCorrection` is more efficient than `weighted` in the context of logistic regression. See @wang2022maximum. ## Outputs After drawing subsample, `ssp.glm` utilizes `survey::svyglm` to fit the model on the subsample, which eventually uses `glm`. Arguments accepted by `svyglm` can be passed through `...` in `ssp.glm`. Below are two examples demonstrating the use of `ssp.glm` with different configurations. ```{r} n.plt <- 200 n.ssp <- 600 ssp.results <- ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "optL", sampling.method = "withReplacement", likelihood = "weighted" ) summary(ssp.results) ``` ```{r} ssp.results <- ssp.glm(formula = formula, data = data, n.plt = n.plt, n.ssp = n.ssp, family = "quasibinomial", criterion = "optA", sampling.method = "poisson", likelihood = "logOddsCorrection" ) summary(ssp.results) ``` As recommended by `survey::svyglm`, when working with binomial models, it is advisable to use use `family=quasibinomial()` to avoid a warning issued by `glm`. Refer to [svyglm() help documentation Details ](https://www.rdocumentation.org/packages/survey/versions/4.4-2/topics/svyglm). The 'quasi' version of the family objects provide the same point estimates. ### Returned object The object returned by `ssp.glm` contains estimation results and indices of the drawn subsample in the full dataset. ```{r} names(ssp.results) ``` Some key returned variables: - `index.plt` and `index` are the row indices of drawn pilot subsamples and optimal subsamples in the full data. - `coef.ssp` is the subsample estimator for $\beta$ and `coef` is the linear combination of `coef.plt` (pilot estimator) and `coef.ssp`. - `cov.ssp` and `cov` are estimated covariance matrices of `coef.ssp` and `coef`. The coefficients and standard errors printed by `summary()` are `coef` and the square root of `diag(cov)`. See the help documentation of `ssp.glm` for details. ## Other Families We also provide examples for poisson regression and gamma regression in the help documentation of `ssp.glm`. Note that `likelihood = logOddsCorrection` is currently implemented only for logistic regression (family = `binomial` or `quasibonomial`). ## References