library(GINAX)
This vignette explores two toy examples (binary data and count data) to illustrate how the functions provided in GINAX perform the GINA-X procedure. The data were simulated under a generalized linear mixed model from 9,000 SNPs of 328 A. thaliana ecotypes. The GINAX package includes the simulated data as R objects: the 9,000 SNPs, the simulated phenotypes (both binary and Poisson), and the kinship matrix used to simulate the data. Further, the GitHub repo that contains the GINAX package also contains the data for the A. thaliana case study.
The GINAX package implements the novel Genome-wide Iterative fiNe-mApping method for non-Gaussian data (GINA-X) proposed by Xu et al. (2025). As shown in that paper, traditional fine-mapping methods often fail to identify causal variants with smaller effect sizes, and do not properly correct for multiple comparisons. In contrast, GINA-X efficiently extracts information from GWAS data in a way that reduces FDR and increases recall.
The function implemented in GINAX is described below:
GINAX: Performs GINA-X using generalized linear mixed models, given a numeric phenotype vector Y (either binary or Poisson distributed), a numerically encoded SNP matrix SNPs, fixed covariates Fixed, and random effects with their covariance and projection matrices (Covariance and Z, respectively). The GINAX function returns the indices of the columns of the SNP matrix that were identified in the best model found by the GINA-X procedure.
The model used in the GINAX package is
\[\begin{aligned} \textbf{y} &\sim F(\cdot \mid \theta) \\ g(\theta) &= X \boldsymbol{\beta} + X_f \boldsymbol{\beta}_f + Z_1 \boldsymbol{\alpha}_1 + \ldots + Z_l \boldsymbol{\alpha}_l \end{aligned}\]
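where \(F\) is the distribution of the phenotype with parameter \(\theta\), \(g\) is a link function, \(X\) is the matrix of SNPs with effects \(\boldsymbol{\beta}\), \(X_f\) is the matrix of fixed covariates with effects \(\boldsymbol{\beta}_f\), and each \(Z_i\) is the design matrix of the random effect \(\boldsymbol{\alpha}_i\).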
Currently, GINAX can analyze binary responses (family = "bernoulli") and Poisson responses (family = "poisson").
The GINAX function requires a vector of observed phenotypes (either binary or assumed Poisson distributed), a matrix of SNPs, and the specification of the random effects. First, the vector of observed phenotypes must be a numeric vector or a numeric \(n \times 1\) matrix. The GINAX package contains two simulated phenotype vectors. The first simulated phenotype vector comes from a Poisson generalized linear mixed model with both a kinship random effect and an overdispersion random effect. The data are assumed to have 15 replicates for each A. thaliana ecotype. The first five elements of the Poisson simulated vector of phenotypes are
data("Y_poisson")
Y_poisson[1:5]
#> [1] 2387 179 299 139 59805
The second simulated phenotype vector comes from a binary generalized linear mixed model with only a kinship random effect. The first five elements of the binary simulated vector of phenotypes are
data("Y_binary")
Y_binary[1:5]
#> [1] 0 0 1 1 0
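To make the Poisson data-generating model described above concrete, here is a minimal simulation sketch in base R. It is not the authors' simulation code: the dimensions, variance scales (kappa1, kappa2), causal positions, and effect sizes are all made-up values, the kinship-like matrix is built from toy genotypes, and a log link is assumed.

```r
set.seed(1)

# Toy dimensions and parameters: all values are illustrative assumptions.
n_toy  <- 50    # observations
p_toy  <- 200   # SNPs
kappa1 <- 0.5   # variance scale of the kinship random effect
kappa2 <- 0.1   # variance scale of the overdispersion random effect
reps   <- 15    # replicates per ecotype, as stated in the text

# Toy 0/1 genotypes and a kinship-like positive semi-definite matrix.
G_toy <- matrix(rbinom(n_toy * p_toy, 1, 0.3), n_toy, p_toy)
G_centered <- scale(G_toy, center = TRUE, scale = FALSE)
K_toy <- tcrossprod(G_centered) / p_toy

# Draw alpha1 ~ N(0, kappa1 * K) through the eigendecomposition of K.
eig <- eigen(K_toy, symmetric = TRUE)
alpha1 <- drop(eig$vectors %*% (sqrt(kappa1 * pmax(eig$values, 0)) * rnorm(n_toy)))

# Draw alpha2 ~ N(0, kappa2 * I): one independent term per observation.
alpha2 <- rnorm(n_toy, sd = sqrt(kappa2))

# Two hypothetical causal SNPs with made-up effect sizes.
causal_toy <- c(20, 120)
beta_toy   <- c(0.5, -0.5)

# Log-link linear predictor with offset log(reps), then Poisson counts.
eta <- log(reps) + drop(G_toy[, causal_toy] %*% beta_toy) + alpha1 + alpha2
Y_toy <- rpois(n_toy, lambda = exp(eta))
head(Y_toy)
```

The per-observation term alpha2 is what lets the counts vary more than a plain Poisson model would allow, which is the role of the overdispersion random effect.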
Second, the SNP matrix must contain numeric values, where each column corresponds to a SNP of interest and the \(i\)th row corresponds to the \(i\)th observation. In this example, the SNPs are a subset of the A. thaliana TAIR9 genotype dataset and all SNPs have minor allele frequency greater than 0.01. Each simulated phenotype vector is simulated using this SNP matrix. Here are the first five rows and five columns of the SNP matrix:
data("SNPs")
SNPs[1:5,1:5]
#>      SNP2555 SNP2556 SNP2557 SNP2558 SNP2559
#> [1,]       1       1       1       0       0
#> [2,]       0       1       1       1       1
#> [3,]       0       0       1       1       1
#> [4,]       1       1       0       0       1
#> [5,]       1       1       1       1       1
Third, the kinship matrix is an \(n \times n\) positive semi-definite matrix containing only numeric values. The \(i\)th row or \(i\)th column quantifies how observation \(i\) is related to other observations. Since both simulated phenotype vectors are simulated from the same SNP matrix, they have the same kinship structure. The first five rows and five columns of the kinship matrix are
data("kinship")
kinship[1:5,1:5]
#>              V1         V2         V3         V4         V5
#> [1,] 0.78515873 0.15800700 0.04264546 0.02057071 0.05643574
#> [2,] 0.15800700 0.78146476 0.05135891 0.01476357 0.05482448
#> [3,] 0.04264546 0.05135891 0.80199976 0.10558970 0.04888596
#> [4,] 0.02057071 0.01476357 0.10558970 0.80030413 0.02935703
#> [5,] 0.05643574 0.05482448 0.04888596 0.02935703 0.78401489
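The three input requirements above (numeric phenotype vector, numeric SNP matrix with one row per observation, and an \(n \times n\) positive semi-definite kinship matrix) can be checked before a long run. The helper below is a hypothetical sketch in base R, not a function provided by GINAX, and the toy inputs are made up for illustration.

```r
# Hypothetical helper (not part of GINAX): basic sanity checks on the inputs.
check_ginax_inputs <- function(Y, SNPs, kinship, tol = 1e-8) {
  Y <- as.vector(Y)  # accepts a numeric vector or an n x 1 matrix
  n <- length(Y)
  stopifnot(
    is.numeric(Y),
    is.matrix(SNPs), is.numeric(SNPs), nrow(SNPs) == n,
    is.matrix(kinship), is.numeric(kinship),
    nrow(kinship) == n, ncol(kinship) == n,
    # unname() so that differing row/column names do not break the check
    isSymmetric(unname(kinship))
  )
  # Positive semi-definite: no eigenvalue meaningfully below zero.
  ev <- eigen(kinship, symmetric = TRUE, only.values = TRUE)$values
  stopifnot(min(ev) >= -tol * max(abs(ev)))
  invisible(TRUE)
}

# Toy inputs (made up) that satisfy the requirements.
set.seed(2)
n_chk <- 6
G_chk <- matrix(rbinom(n_chk * 10, 1, 0.5), n_chk, 10)
K_chk <- tcrossprod(G_chk) / ncol(G_chk)
Y_chk <- rbinom(n_chk, 1, 0.5)
check_ginax_inputs(Y_chk, G_chk, K_chk)
```

With the packaged data, the same call would be check_ginax_inputs(Y_poisson, SNPs, kinship), assuming the objects have been loaded with data().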
The function GINAX implements the GINA-X method for generalized linear mixed models with either Poisson or Bernoulli distributed responses. This function takes as inputs the observed phenotypes, the SNPs coded numerically, the distributional family of the phenotype, a matrix of fixed covariates, the covariance matrices of the random effects, the design matrices of the random effects, and an offset. The other inputs of GINAX are the nominal FDR level, the maximum number of iterations of the genetic algorithm in the model selection step, and the number of consecutive iterations of the genetic algorithm with the same best model required for convergence.
Here we illustrate the use of GINAX with a nominal FDR of 0.05 with Poisson count data. First we specify the covariance matrices for the random effects. The first random effect is assumed to be \(\boldsymbol{\alpha}_1 \sim N(0,\kappa_1 K)\), where \(K\) is the realized relationship matrix or kinship matrix. The second random effect is assumed to be \(\boldsymbol{\alpha}_2 \sim N(0,\kappa_2 I)\), where the covariance matrix is an identity matrix times a scalar. This second random effect accounts for overdispersion in the Poisson model. The Covariance argument takes a list of random effect covariance matrices. For this example, the list of covariance matrices is set as:
n <- length(Y_poisson)
covariance <- list()
covariance[[1]] <- kinship
covariance[[2]] <- diag(1, nrow = n, ncol = n)
The design matrices \(Z_i\) do not need to be specified in Z as the observations have no other structure such as a grouping structure. Z is set to NULL, implying that \(Z_i = I_{n \times n}\). Further, because the number of ecotype replications is 15, in this example we set the offset to log(15). The call to the GINAX function is
# This example is computationally intensive and is shown but not evaluated in this vignette.
# You can run it manually in your R session.
output_poisson <- GINAX(Y=Y_poisson, Covariance=covariance, SNPs=SNPs, family="poisson", Z=NULL, offset=log(15), FDR_Nominal = 0.05, maxiterations = 1000, runs_til_stop = 200)
output_poisson
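The choice of log(15) as the offset can be motivated with a short worked equation. This is a sketch under two assumptions that the text does not state explicitly: that the Poisson family uses the canonical log link, and that each count is a total over the 15 replicates. Writing \(\eta_i\) for the \(i\)th element of \(X \boldsymbol{\beta} + X_f \boldsymbol{\beta}_f + Z_1 \boldsymbol{\alpha}_1 + Z_2 \boldsymbol{\alpha}_2\), the offset enters the linear predictor with a fixed coefficient of one:

\[\log \mathrm{E}(y_i \mid \boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2) = \log(15) + \eta_i \quad\Longleftrightarrow\quad \mathrm{E}(y_i \mid \boldsymbol{\alpha}_1, \boldsymbol{\alpha}_2) = 15\, e^{\eta_i}\]

Under these assumptions, \(e^{\eta_i}\) is the expected count per replicate, so the SNP effects are interpreted on a per-replicate scale rather than on the scale of the 15-replicate total.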
The data were generated with causal SNPs at positions 450, 1350, 2250, 3150, 4050, 4950, 5850, 6750, 7650, and 8550. GINAX outputs the column indices of the SNPs matrix that are in the best model, or the column indices of SNPs perfectly correlated with SNPs in the best model.
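To illustrate what "perfectly correlated" SNPs look like in practice, the sketch below finds every column of a genotype matrix whose sample correlation with a chosen column is exactly plus or minus one. The helper name perfectly_correlated is hypothetical, the toy matrix is made up, and whether GINAX treats a correlation of minus one the same way as plus one is an assumption here.

```r
# Hypothetical helper (not part of GINAX): columns of G that are perfectly
# correlated (r = 1 or r = -1, up to tolerance) with column j.
perfectly_correlated <- function(G, j, tol = 1e-12) {
  # cor() warns and returns NA for constant columns; those are skipped.
  r <- as.vector(suppressWarnings(cor(G[, j], G)))
  which(!is.na(r) & abs(r) > 1 - tol)
}

# Toy 0/1 genotypes: column c duplicates a, column b is the complement of a.
G_cor <- cbind(a = c(0, 1, 0, 1, 1),
               b = c(1, 0, 1, 0, 0),
               c = c(0, 1, 0, 1, 1),
               d = c(0, 0, 1, 1, 0))
perfectly_correlated(G_cor, 1)
#> [1] 1 2 3
```

Such SNPs carry identical information for the model, so the data alone cannot distinguish between them.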
Here we illustrate the use of GINAX with a nominal FDR of 0.05 with binary data. First we specify the covariance matrices for the random effects. The only random effect is assumed to be \(\boldsymbol{\alpha} \sim N(0,\kappa_1 K)\), where \(K\) is the realized relationship matrix or kinship matrix. For this example, the list of covariance matrices is set as:
covariance <- list()
covariance[[1]] <- kinship
In this example, the design matrices \(Z_i\) do not need to be specified in Z as the observations have no other structure such as a grouping structure. Z is set to NULL, implying that \(Z_i = I_{n \times n}\). With binary data, setting the number of replicates provides no computational gain and is not required.
# This example is computationally intensive and is shown but not evaluated in this vignette.
# You can run it manually in your R session.
output_binary <- GINAX(Y=Y_binary, Covariance=covariance, SNPs = SNPs, family = "bernoulli", Z=NULL, offset=NULL, FDR_Nominal = 0.05, maxiterations = 2000, runs_til_stop = 400)
output_binary
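Both examples leave Z as NULL because each observation is its own unit. For contrast, the sketch below shows what a design (incidence) matrix could look like if observations were grouped, for example several observations per ecotype. The grouping is hypothetical, and the exact format in which GINAX expects Z for such data is not shown here, so treat this only as an illustration of the matrix \(Z_i\) in the model equation.

```r
# Hypothetical grouping: 6 observations from 3 ecotypes, 2 observations each.
ecotype <- factor(c("e1", "e1", "e2", "e2", "e3", "e3"))

# Incidence matrix: row i has a single 1 in the column of its ecotype.
Z1_toy <- model.matrix(~ 0 + ecotype)
dim(Z1_toy)       # 6 rows (observations) by 3 columns (ecotypes)
rowSums(Z1_toy)   # each observation belongs to exactly one ecotype

# When every observation is its own group, the incidence matrix is the
# identity, which matches the Z = NULL case described in the text.
Z_identity <- model.matrix(~ 0 + factor(1:6))
all(unname(Z_identity) == diag(6))
```

With such a \(Z_1\), the random effect \(\boldsymbol{\alpha}_1\) would have one element per ecotype rather than one per observation.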
As in the Poisson example, the data were generated with causal SNPs at positions 450, 1350, 2250, 3150, 4050, 4950, 5850, 6750, 7650, and 8550. GINAX identifies 1 false positive SNP and 4 true causal SNPs.
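Because the causal positions are known in these simulations, a selection can be scored against them. The sketch below is a hypothetical helper, not part of GINAX, and it assumes the selection can be reduced to a plain integer vector of column indices. The selected indices are made up purely to reproduce the counts reported above (4 true, 1 false); they are not actual GINAX output.

```r
# Causal positions stated in the text: 450, 1350, ..., 8550.
causal <- seq(450, 8550, by = 900)

# Hypothetical helper (not part of GINAX): score a set of selected indices.
score_selection <- function(selected, causal) {
  selected <- unique(selected)
  tp <- length(intersect(selected, causal))
  fp <- length(setdiff(selected, causal))
  c(true_pos  = tp,
    false_pos = fp,
    fdp       = if (length(selected) > 0) fp / length(selected) else 0,
    recall    = tp / length(causal))
}

# Made-up selection with 4 causal indices and 1 non-causal index.
selected_toy <- c(450, 1350, 2250, 3150, 7000)
score_selection(selected_toy, causal)
```

For this made-up selection the false discovery proportion is 1/5 = 0.2 and the recall is 4/10 = 0.4. Note that this naive scoring counts a SNP that is perfectly correlated with a causal SNP as a false positive, so a more careful evaluation would treat such SNPs as equivalent.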