Several methods exist for selecting a subset of regressors
from a set of k candidate variables in multiple linear regression. One
possibility is to evaluate all possible regression models and compare
them using Mallows’s Cp statistic (Cp).
1. Gilmour, S.G. (1996). The interpretation of Mallows’s Cp-statistic. The Statistician, 45(1), 49–56. doi:10.2307/2348411. https://rss.onlinelibrary.wiley.com/doi/10.2307/2348411
2. Boisbunon, Aurélie; Canu, Stephane; Fourdrinier, Dominique; Strawderman, William; Wells, Martin T. (2013). “AIC, Cp and estimators of loss for elliptically symmetric distributions”. arXiv:1308.2766
Generally, the package is suitable when the number of
observations (n) is small and the number of regressors (k) is relatively
low. In all cases, the condition n > k + 3 must be satisfied.
Short theoretical background.
For a particular submodel with p parameters, the Cp statistic is defined as:
\(C_p = \frac{SSE_p}{MSE} - n + 2p\)
where \(SSE_p\) is the sum of squared errors for the submodel, and MSE is the mean square error from the full model with all candidate regressors, used as an estimate of the error variance \(\sigma^2\). Gilmour proposed an adjusted version of the Cp statistic to improve model comparison, given by:
\(\bar{C}_p = C_p - \frac{2(k - p + 1)}{n - k - 3}\)
where k is the number of regressors in the full model, and p − 1 is the number of regressors in the submodel (thus, p includes the intercept term). Further details and theoretical background can be found in Gilmour’s original paper. For example, in the case of the trivial model (i.e., a model with only an intercept, so p = 1), the adjusted statistic is:
\(\bar{C}_1 = C_1 - \frac{2k}{n - k - 3} = \frac{(n-1)\,var(y)}{MSE} - n + 2 - \frac{2k}{n - k - 3}\)
where var(y) denotes the sample variance of the response, so that \((n-1)\,var(y)\) equals the sum of squares of the trivial model, and MSE is the estimate of \(\sigma^2\).
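As a concrete illustration, the two formulas above can be coded directly. The helper below is a hypothetical sketch, not a function of the package; it assumes MSE is taken from the full model with all k regressors, exactly as described above.

```r
# Hypothetical helper (not part of the package): adjusted Cp of a submodel.
# d: data.frame with the response in column 1; vars: column indices of the
# regressors kept in the submodel (use integer(0) for the trivial model).
adj_cp <- function(d, vars) {
  n <- nrow(d); k <- ncol(d) - 1
  y <- d[[1]]
  full <- lm(y ~ ., data = d[, -1, drop = FALSE])
  mse  <- sum(residuals(full)^2) / (n - k - 1)   # estimate of sigma^2
  if (length(vars) == 0) {                       # trivial model: intercept only
    sse_p <- var(y) * (n - 1); p <- 1
  } else {
    sub   <- lm(y ~ ., data = d[, vars, drop = FALSE])
    sse_p <- sum(residuals(sub)^2)
    p     <- length(vars) + 1                    # regressors + intercept
  }
  cp <- sse_p / mse - n + 2 * p                  # Mallows' Cp
  cp - 2 * (k - p + 1) / (n - k - 3)             # Gilmour's adjustment
}
```

Note that for the full model (p = k + 1) the adjustment term vanishes and \(\bar{C}_p = C_p = p\).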
The package contains two primary functions: submodels(d) and
final_model(d). Both share the same input argument d, which must be a
table in the data.frame format. The first column of this data frame
should contain the numerical response (dependent) variable (y), while
the remaining columns should contain the numerical explanatory
(independent) variables (regressors). Each row represents a single
observation. The number of observations (n) must be higher than the
number of regressors (k) plus three (n > k + 3).
The functions ignore column names (colnames) and row names (rownames);
however, the position of the first column is critical: it must always
contain the response variable.
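For instance, a valid input can be prepared from “mtcars” like this (the selected regressors wt, hp and qsec are chosen only for illustration):

```r
# Build a valid input table: response first, regressors after.
d <- mtcars[, c("mpg", "wt", "hp", "qsec")]   # column 1 = response (mpg)
n <- nrow(d); k <- ncol(d) - 1
stopifnot(n > k + 3)                          # required condition
```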
The package includes eight illustrative datasets:
Gilmour9p, T1, T2, Trivial, Modified_Gilmour9p, Parks5p, Patents5p and EU2019.
Additionally, the standard R dataset “mtcars” may also be used.
The procedure consists of two main steps:
Initial Selection:
Adjusted \(\bar{C}_p\) values are calculated for all
possible combinations of regressors. For example, there are 1023 submodels for
the “mtcars” data; see the two following statements:
submodels(mtcars)$submodels_number
final_model(mtcars)$submodels_number
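The count of 1023 follows from the \(2^k - 1\) non-empty subsets of the k = 10 candidate regressors:

```r
k <- ncol(mtcars) - 1   # 10 candidate regressors; column 1 (mpg) is the response
2^k - 1                 # number of non-empty regressor subsets: 1023
```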
Subsequently, the
submodel with the minimum \(\bar{C}_p\)
value is selected and labeled as ModelMin.
See the two following
statements:
submodels(mtcars)$model_min
final_model(mtcars)$model_min
Reduction and Testing in
the function “final_model(d)”:
All submodels nested within
ModelMin are considered. Among them, the submodel with the lowest \(\bar{C}_p\) is selected as a new candidate.
A hypothesis test is then performed where the null hypothesis states
that ModelMin provides a better fit than the candidate submodel. If the
null hypothesis is not rejected, ModelMin is accepted as the final
model. If the null hypothesis is rejected, the candidate submodel
becomes the new ModelMin, and the process is repeated with submodels
nested within this new model. The procedure continues iteratively until
the null hypothesis is not rejected or until the so-called trivial model
(i.e., a model using only the arithmetic mean of y) is reached. See the two
diagrams below.
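The reduction loop can be sketched in plain R. This is a minimal illustration only: it assumes the comparison between ModelMin and the candidate is a standard partial F-test via anova(), and it picks the candidate as the best-fitting drop-one submodel; the package’s actual test statistic and candidate search (lowest \(\bar{C}_p\) among all nested submodels) may differ.

```r
# Minimal sketch of the reduction loop (NOT the package's exact test).
alpha <- 0.05
model_min <- lm(mpg ~ wt + qsec + am, data = mtcars)  # hypothetical start
repeat {
  terms_now <- attr(terms(model_min), "term.labels")
  if (length(terms_now) == 0) break           # trivial model reached
  # Candidate: the drop-one submodel with the smallest error sum of squares
  fits <- lapply(terms_now, function(t)
    update(model_min, as.formula(paste(". ~ . -", t))))
  sse <- vapply(fits, function(f) sum(residuals(f)^2), numeric(1))
  candidate <- fits[[which.min(sse)]]
  pv <- anova(candidate, model_min)[["Pr(>F)"]][2]
  if (pv < alpha) break                       # dropped term matters: keep model_min
  model_min <- candidate                      # otherwise descend and repeat
}
final <- model_min
```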
The function “final_model(d)” returns full
regression results for the Full Model, the selected ModelMin,
and the resulting Final Model.
Additionally, it outputs \(\bar{C}_p\) values for all submodels,
enabling users to easily generate an adjusted \(\bar{C}_p\) ~ p plot (where p is the
number of regressors + 1), as recommended by Gilmour.
The statements which draw the plots are shown below and in the help of the package.
“Initial Selection” is performed in both functions, submodels() and final_model().
“Reduction and Testing” is performed only in the function final_model().
Examples:
The package includes eight illustrative datasets. Additionally, the
standard R dataset “mtcars” may be used as a ninth example in the help
documentation (see: mtcars dataset). In this dataset, the first column
(mpg, the fuel economy of each car in miles per gallon) is used as the
response variable. Some example datasets have no substantive meaning and
are included purely to demonstrate the technical capabilities of the
functions. However, three of the datasets were used in previously
published studies and carry practical relevance.
References of the three data files with practical meaning:
Gilmour9p:
Gilmour, S.G. (1996). The interpretation of Mallows’s Cp-statistic. The Statistician, 45(1), 49–56. doi:10.2307/2348411
Parks5p:
Stemberk, Josef; Dolejs, Josef; Maresova, Petra; Kuca, Kamil (2018). Factors affecting the number of Visitors in National Parks in the Czech Republic, Germany and Austria. ISPRS Int. J. Geo-Inf., 7(3), 124. doi:10.3390/ijgi7030124. https://www.mdpi.com/2220-9964/7/3/124
Patents5p:
Perspective and Suitable Research Area in Public Research—Patent Analysis of the Czech Public Universities. Education and Urban Society, 54(7). https://doi.org/10.1177/00131245211027362
The First Example: mtcars

| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
| Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
| Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |

(first 12 of the 32 observations are shown)
Data “mtcars” may be used as a simple illustration; no other
treatment is necessary. A single statement may be used to obtain
all results.
MyResults=final_model(mtcars)
This is what the output looks like when applying the final_model(d) function:
[1] "Number of regressors in model_min 3"
[1] "model_min: y~x5+x6+x8"
[1] "Cq+1: -0.634206365802602"
[1] "The starting value of q for the test of submodel is: 3"
[1] "For q = 3: Ho is rejected, new submodel goes to the next test."
[1] "For q = 2: the End, Ho is not rejected, the resulting model is: y~x5+x6+const. F= 11.79722 Cq+1 = 0.98767"
[1] "Final value of q is : 2"
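The labels x1…x10 in the output index the regressor columns of “mtcars” in order, so under that assumption about the labelling (not confirmed by the package output itself), ModelMin y~x5+x6+x8 is the model mpg ~ wt + qsec + am:

```r
# Assuming x1..x10 label the regressor columns in order (column 1 is mpg):
colnames(mtcars)[c(5, 6, 8) + 1]   # "wt"   "qsec" "am"
fit <- lm(mpg ~ wt + qsec + am, data = mtcars)
summary(fit)$r.squared             # fit of the candidate ModelMin
```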
This is how you can display the basic graph of the Gilmour method, where
in the first step the model with the minimum value of the adjusted
statistic \(\bar{C}_p\) is searched.
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1*min(yCp), 0.9*min(yCp))
YRange= c(ymin, 1.5*max(xp)) ; FinalCp= MyResults$FinalCp
plot(yCp ~ xp, xlab="Number of Parameters in Submodel", ylab="", ylim=YRange,
     col=ifelse( (round(yCp,4)==round(min(yCp),4) | round(yCp,4)==round(FinalCp,4)), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste(bar(C), "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ", bar(C), "p ~ p")), side=3, line=3, padj=1, cex=1.2)
Relationship \(\bar{C}_p\) ~ p for all submodels calculated
for data “mtcars”.
Notes: red circles correspond to ModelMin and FinalModel.
The following commands show additional outputs for the full model and
for the final model.
MyResults$full_model
summary(MyResults$full_model_results)
MyResults$final_model
summary(MyResults$final_model_results)
Information about the model “ModelMin” with the minimum value of the
adjusted \(\bar{C}_p\) statistic can be obtained as follows:
MyResults$model_min
summary(MyResults$model_min_results)
This submodel can be identical to the final model and, in
principle, to the full model.
The following 8 illustrative tables are also available in the package:
Gilmour9p; T1; T2; Trivial; Modified_Gilmour9p; Parks5p; Patents5p; EU2019
Next, the individual datasets are evaluated in this illustrative manner
(the outputs of the functions submodels(d) and final_model(d) can also
be used in other ways).
Data “Gilmour9p” were taken from the original study by Gilmour.
They illustrate the entire method and are also of practical importance.
The data contain 9 predictors and 24 observations (house price data).
The outputs of the functions “submodels()” and “final_model()” are
identical to the results in the original work.
d=Gilmour9p
MyResults=final_model(d)
[1] "Number of regressors in model_min 2"
[1] "model_min: y~x1+x2"
[1] "Cq+1: -0.248045011348888"
[1] "The starting value of q for the test of submodel is: 2"
[1] "For q = 2: Ho is rejected, new submodel goes to the next test."
[1] "For q = 1: the End, Ho is not rejected, y~x1+constant, F = 72.17679 Cq+1 = 0.89705"
[1] "Final value of q is : 1"
This is how you can display the basic graph of the Gilmour method, where
in the first step the model with the minimum value of the adjusted
statistic \(\bar{C}_p\) is searched.
yCp= as.numeric(MyResults$submodels[,3]) ; xp= as.numeric(MyResults$submodels[,2])
ymin= ifelse(min(yCp)<0, 1.1*min(yCp), 0.9*min(yCp))
YRange= c(ymin, 1.5*max(xp)) ; FinalCp= MyResults$FinalCp
plot(yCp ~ xp, xlab="Number of Parameters in Submodel", ylab="", ylim=YRange,
     col=ifelse( (round(yCp,4)==round(min(yCp),4) | round(yCp,4)==round(FinalCp,4)), "red", "darkblue") )
lines(xp, xp, col="red")
mtext(bquote(paste(bar(C), "p")), side=2, line=3, padj=1, cex=1.2)
mtext(bquote(paste("All Submodels: ", bar(C), "p ~ p")), side=3, line=3, padj=1, cex=1.2)
Relationship \(\bar{C}_p\) ~ p for all
submodels calculated for data “Gilmour9p”.
Notes: red circles correspond to ModelMin and FinalModel.
Similar commands to those used for data mtcars may be used
for all other tables:
Gilmour9p; T1; T2; Trivial; Modified_Gilmour9p; Parks5p; Patents5p; EU2019
For example, if d=T1 is used, then table “d” and the other commands may
be applied in the same way.
All commands are also available in the section “Examples” in the help of
the function “final_model(d)”.