The ceser package contains an implementation in R of the Cluster Estimated Standard Error (CESE) method proposed by Jackson (2019). The method estimates the covariance matrix of the estimated coefficients of linear models in grouped data sets with correlation among observations within groups. CESE is an alternative solution for the classical Cluster Robust Standard Error (CRSE) (Greene (2012),Eicker (1967),White (1980),Liang and Zeger (1986),MacKinnon and Webb (2017)), which underestimates the standard errors in most of the situations encountered in practice (Esarey and Menger (2018)).
Jackson (2019) proposes an approach labeled CESE to estimate the standard errors in grouped data with within-group correlation in the residuals. The approach is based on the estimated expectation of the product of the residuals. Assuming that the residuals have the same variance-covariance matrix within the groups, if we denote by σig=σ2g and ρig=ρg the variance and covariance, respectively, of the residuals within the group g, then the expectation of the product of the residuals is given by (see Jackson (2019) for details):
Σg=E[egeTg]=σ2g(Ig−Pg)+ρg[ιgιTg−(Ig−Pg)−(PgιgιTg+ιgιTgPg)+Xg(XTX)−1(G∑g=1XTgιgιTgXg)(XTX)−1Xg]
where ιg is a unitary column vector, Ig is a g×g identity matrix, and Pg=Xg(XTX)−1XTg. The equation above can be rewritten concisely as:
Σg=σ2gQ1g+ρgQ2g.
This equation explicitly shows that the expectation of the cross-product of the residuals is a function the data through Q1g, Q2g, the unknown variance σ2g, and correlation ρg of the residuals eg in each group g. The CESE solution is to explore the linear structure of that equation and estimate σ2g and ρg as if the estimated values of egeTg were random deviances from their expectations. Denote ξ that deviance. Then
egeTg=E[egeTg]+ξ=σ2gQ1g+ρgQ2g+ξ=Σg+ξ.
The estimates of σ2g and ρg are obtained using the OLS estimator. That is, if we denote Ωg=(σ2g,ρg)T, q1g (or q2g) the vectorized diagonal and lower triangle of Q1g (or Q2g) stacked into a ng(ng+1)/2 column vector, qg=[q1g,q2g], and seg the corresponding elements of egeTg stacked into a column vector as well, then the OLS CESE estimator ˆΩg=(ˆσ2g,ˆρg)T of the variance and correlation of the residuals in group g is given by
ˆΩg=arg minΩg(seg−qgΩg)T(seg−qgΩg).
If we assume that qTgqg is invertible, the first order condition gives:
ˆΩg=(qTgqg)−1qTgseg.
We can rewrite the equation above as:
[ˆσ2gˆρg]=[qT1gq1gqT1gq2gqT2gq1gqT2gq2g]−1[qT1gsegqT2gseg].
The estimators of σ2g and ρg do not require any assumption on ξ, unless we want to construct confidence intervals for the estimates of those parameters.
Jackson (2019) shows that CESE produces larger standard errors for the coefficients and much more conservative confidence intervals than the CRSE, which is known to be biased downward. CESE is also less sensitive to the number of clusters and to the heterogeneity of the clusters, which can be a problem for both CRSE and bootstrap methods.
The package CESER provides a function vcovCESE()
that takes the output of the function lm()
(or any other that produces compatible outputs) and computes the CESE. The basic structure of the function is:
The parameter mod
receives the output of the lm()
function. The parameter cluster
can receive a right-hand side R formula with the summation of the variables in the data that will be used to cluster the standard errors. For instance, if one wants to cluster the standard error by country, one can use:
To cluster by country and gender, simply use:
The parameter cluster
can also receive, instead of a formula, a string vector with the name of the variables that contain the groups to cluster the standard errors. If cluster = NULL
, each observation is considered its own group to cluster the standard errors.
The parameter type
receives the procedure to use for heterokedasticity correction. Heterokedasticity occurs when the diagonal elements of Σ are not constant across observations. The correction can also be used to deal with underestimation of the true variance of the residuals due to leverage produced by outliers. We include five types of correction. In particular, type
can be either “HC0”, “HC1”, “HC2”, “HC3”, and “HC4” (Hayes and Cai (2007)). Denote ec the corrected residuals. Each option produce the following corretion:
eic=ei eic=ei() eic=ei() eic=ei() eic=ei()
where k is the number of covariates, hii is the ith diagonal element of the matrix P=X(XTX)−1XT, and δi=min.
The estimation also corrects for cases in which \rho_{g} > \sigma^2{g}. Following Jackson (2019), we use \hat{\sigma}_{g}^{2} = (\hat{\rho}_g + 0.02 ) in those cases.
To ilustrate how to use the ceser package, we use the data set dcese
, which is provided with the package. The data set contains information of 310 (i=1,…, 310) observations across 51 countries (g=1,…,51).
The outcome variable is the number of effective legislative parties (enep). The explanatory variables are: the number of presidential candidates (enpc); a measure of presidential power (fapres); the proximity of presidential and legislative elections (proximity); the effective number of ethnic groups (eneg); the log of average district magnitudes (logmag); an interaction term between the number of presidential candidates and the presidential power (enpcfapres = enpc \times fapres), and another interaction term between the log of the district magnitude and the number of ethnic groups (logmag_eneg = logmag \times eneg). In the example below, we regress enpc on fapres, enpc, their interaction, and other controls.
library(ceser)
#>
#> ## ------------------------------------------
#> ## Cluster Estimated Standard Errors (ceser)
#> ## ------------------------------------------
#> Author(s): Diogo Ferrari and John Jackson
#>
data(dcese)
Let’s estimate a linear model using the lm()
function.
For all the examples below, we use the HC3 correction. To compute the variance covariance matrix of the estimated coefficients \hat{\beta} using clustered standard errors by country, just execute:
vcovCESE(mod, cluster = ~country, type="HC3")
#> (Intercept) enpc fapres enpcfapres proximity
#> (Intercept) 1.59804215 -0.3565889644 -0.326044522 0.0928614018 -0.086958688
#> enpc -0.35658896 0.1254735451 0.104834222 -0.0354704134 -0.003332760
#> fapres -0.32604452 0.1048342220 0.143205927 -0.0389793642 -0.017878997
#> enpcfapres 0.09286140 -0.0354704134 -0.038979364 0.0126978025 0.003217871
#> proximity -0.08695869 -0.0033327597 -0.017878997 0.0032178709 0.139694748
#> eneg -0.08737194 0.0028258190 -0.007080605 0.0010940285 -0.005680272
#> logmag -0.22422347 0.0009844983 0.006687954 0.0038080340 0.009776123
#> logmag_eneg 0.08380931 -0.0058250477 -0.011500450 0.0008568949 0.004471824
#> eneg logmag logmag_eneg
#> (Intercept) -0.087371945 -0.2242234723 0.0838093142
#> enpc 0.002825819 0.0009844983 -0.0058250477
#> fapres -0.007080605 0.0066879544 -0.0115004496
#> enpcfapres 0.001094029 0.0038080340 0.0008568949
#> proximity -0.005680272 0.0097761226 0.0044718243
#> eneg 0.039433115 0.0481561426 -0.0231003489
#> logmag 0.048156143 0.2244236813 -0.1048418029
#> logmag_eneg -0.023100349 -0.1048418029 0.0606625677
We can obtain the summary table with CESE by country by running:
library(lmtest)
#> Loading required package: zoo
#>
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>
#> as.Date, as.Date.numeric
coeftest(mod, vcov = vcovCESE, cluster = ~ country, type="HC3")
#>
#> t test of coefficients:
#>
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.704309 1.264137 2.1393 0.03325 *
#> enpc 0.303988 0.354222 0.8582 0.39150
#> fapres -0.611787 0.378426 -1.6167 0.10703
#> enpcfapres 0.207791 0.112685 1.8440 0.06620 .
#> proximity -0.022429 0.373758 -0.0600 0.95219
#> eneg -0.065662 0.198578 -0.3307 0.74114
#> logmag -0.181462 0.473734 -0.3830 0.70197
#> logmag_eneg 0.360513 0.246298 1.4637 0.14435
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Eicker, Friedhelm. 1967. “Limit Theorems for Regressions with Unequal and Dependent Errors.” In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1:59–82. 1.
Esarey, Justin, and Andrew Menger. 2018. “Practical and Effective Approaches to Dealing with Clustered Data.” Political Science Research and Methods, 1–19.
Greene, William H. 2012. Econometric Analysis. Upper Saddle River, NJ: Pearson Prentice Hall. http://mirlyn.lib.umich.edu/Record/009998038.
Hayes, Andrew F, and Li Cai. 2007. “Using Heteroskedasticity-Consistent Standard Error Estimators in Ols Regression: An Introduction and Software Implementation.” Behavior Research Methods 39 (4): 709–22.
Jackson, John. 2019. “Corrected Standard Errors with Clustered Data.” Political Analysis forthcoming.
Liang, Kung-Yee, and Scott L Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1): 13–22.
MacKinnon, James G, and Matthew D Webb. 2017. “Wild Bootstrap Inference for Wildly Different Cluster Sizes.” Journal of Applied Econometrics 32 (2): 233–54.
White, Halbert L. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica 48 (4): 817–38.