This vignette shows how to use the main functionalities of the
projpred-package, which implements the projective variable selection (Goutis and Robert, 1998; Dupuis and Robert, 2003) for generalized linear models. The package is compatible with
rstanarm but also other reference models could be used (see section Custom reference models). The methods implemented in the package are described in detail in Piironen et al. (2018) and evaluated in comparison to many other methods in Piironen and Vehtari (2017a).
First load the packages that are needed, and specify the number of available cores (this speeds up the sampling when multiple cores are available). Uncomment the last line in the following commands (this is commented out to avoid possible problems when building the vignette along the package installation in special environments such as computing clusters).
library(rstanarm) library(projpred) library(ggplot2) library(bayesplot) theme_set(theme_classic()) #options(mc.cores = parallel::detectCores())
The package contains a simple Gaussian example dataset accessible with the
data-command. This dataset is one of the example cases from the
glmnet-package. The following command loads a dataframe
df_gaussian with the predictor matrix
x and the corresponding targets
y into the workspace.
data('df_gaussian', package = 'projpred')
We then fit the model with all the variables and sparsifying horseshoe prior (Carvalho et al., 2010) on the regression coefficients. This gives us the full Bayesian solution to the problem. To specify the prior beliefs about the number of relevant variables, we use the framework discussed by Piironen and Vehtari (2017b,c), in which the prior for the global shrinkage parameter is defined based on our prior guess for the number of relevant variables.
n <- nrow(df_gaussian$x) # 100 D <- ncol(df_gaussian$x) # 20 p0 <- 5 # prior guess for the number of relevant variables tau0 <- p0/(D-p0) * 1/sqrt(n) # scale for tau (notice that stan_glm will automatically scale this by sigma) prior_coeff <- hs(global_scale = tau0, slab_scale = 1) # regularized horseshoe prior fit <- stan_glm(y ~ x, family=gaussian(), data=df_gaussian, prior=prior_coeff, # to make this vignette build fast, we use only 2 chains and # 500 draws. In practice, more conservative values, eg. 4 chains # and 2000 draws might be required for reliable inference. seed=1, chains=2, iter=500)
The variable selection can then be excecuted with the command
varsel. This return an object that contains the relevant information about the variable selection, such as the ordering of the variables. The search heuristic can be specified by the keyword
method. Currently the package implements forward search and Lasso type L1-penalization to find the variable ordering (Tran et al., 2012). The latter is much faster for large problems but the forward search can sometimes be more accurate.
vs <- varsel(fit, method='forward') vs$vind # variables ordered as they enter during the search
## x1 x14 x20 x5 x3 x6 x8 x11 x7 x10 x15 x12 x18 x4 x2 x13 x16 x17 ## 1 14 20 5 3 6 8 11 7 10 15 12 18 4 2 13 16 17 ## x9 x19 ## 9 19
We can then plot some statistics computed on the training data, such as the sum of log predictive densities (ELPD) and root mean squared error (RMSE) as the function of number of variables added. By default, the statistics are shown on absolute scale, but with
deltas=T the plot shows results relative to the full model.
# plot predictive performance on training data varsel_plot(vs, stats=c('elpd', 'rmse'))
The statistics computed on the training data typically give us a rough idea of how many variables are needed in order to capture all the predictive power of the full model. However, because these statistics are computed using the same data that was used to fit the models, the results can be biased. More reliable assessment can be obtained by cross-validating both the full model and the variable selection process.
cvs <- cv_varsel(fit, method='forward')
In this case the cross-validated results look quite similar to those computed on the training data, showing that after 6 or 7 variables the predictions do not change markedly. The model size suggested by the program is stored in the variable
ssize in the returned object, and it can be obtained also by calling function
# model size suggested by the program suggest_size(cvs)
##  7
# plot the validation results, this time relative to the full model varsel_plot(cvs, stats = c('elpd', 'rmse'), deltas=T)
We can now perform the projection for a submodel of desired size using the function
project. The projection can also be coerced to a matrix with draws of the selected variables and sigma. The draws can be visualized with e.g. the
mcmc_areas function in the
bayesplot package. Below we compare how the projection affects the three most relevant variables.
# Visualise the three most relevant variables in the full model mcmc_areas(as.matrix(fit), pars = c('(Intercept)', names(vs$vind[1:3]), 'sigma')) + coord_cartesian(xlim = c(-2, 2))
# Visualise the projected three most relevant variables proj <- project(vs, nv = 3, ns = 500) mcmc_areas(as.matrix(proj)) + coord_cartesian(xlim = c(-2, 2))
We can make predictions with the projected submodels. For point estimates we can use method
proj_linpred. Test inputs can be provided using the keyword
xnew. If also the test targets
ynew are provided, then the function evaluates the log predictive density at these points. For instance, the following computes the mean of the predictive distribution and evaluates the log density at the training points using the 6 most relevant variables.
pred <- proj_linpred(vs, xnew=df_gaussian$x, ynew=df_gaussian$y, nv = 6, integrated = TRUE)
Visualize the predictions
ggplot() + geom_point(aes(x=pred$pred,y=df_gaussian$y)) + geom_abline(slope = 1, color='red') + labs(x = 'prediction', y = 'y')
We can also draw samples from the projected predictive distribution. Here’s an example prediction for the first data point using 6 variables (the observed value is marked by the red line)
y1_rep <- proj_predict(vs, xnew=df_gaussian$x[1,,drop=F], nv=6, seed=7560) qplot(as.vector(y1_rep), bins=25) + geom_vline(xintercept = df_gaussian$y, color='red') + xlab('y1_rep')
This section shows an example of the variable selection for a logistic regression model (binary classification). Everything is very similar to the Gaussian case. First load the data (this dataset is also from the
data('df_binom', package = 'projpred')
Then fit the full model:
# fit the full model n <- nrow(df_binom$x) D <- ncol(df_binom$x) p0 <- 5 # prior guess for the number of relevant variables sigma <- 2 # approximate plug-in value for observation information (Piironen and Vehtari, 2017b) tau0 <- p0/(D-p0) * sigma/sqrt(n) prior_coeff <- hs(global_scale = tau0, slab_scale = 1) fit <- stan_glm(y ~ x, family=binomial(), data=df_binom, prior=prior_coeff, seed=1, chains=2, iter=500)
Run the variable selection and print out the variable ordering
vs <- varsel(fit, method='forward') vs$vind
## x4 x10 x2 x6 x9 x25 x3 x8 x16 x26 x28 x22 x23 x29 x5 x30 x11 x1 ## 4 10 2 6 9 25 3 8 16 26 28 22 23 29 5 30 11 1 ## x12 x18 ## 12 18
Plot the ELPD and classification accuracy on the training data:
varsel_plot(vs, stats=c('elpd', 'acc'), deltas=F)
Cross-validate the full model and the variable selection. Let’s use K-fold cross validation this time for illustration (here we use only K=3 folds to make this vignette build faster, but in practice it is customary to use more folds, e.g. K=5 or K=10):
cvs <- cv_varsel(fit, method='forward', cv_method='kfold', K=3, seed=823217)
## Fitting model 1 out of 3
## Fitting model 2 out of 3
## Fitting model 3 out of 3
Plot the cross-validated performance estimates for the submodels relative to the full model. In this case the cross-validated results differ from the training statistics substantially.
# model size suggested by the program suggest_size(cvs)
##  8
# plot the validation results varsel_plot(cvs, stats=c('elpd', 'acc'))
Finally, for illustration, we compute here the predictive class probabilities using only the two most relevant variables, and visualize the results.
# evaluate the predictive distribution in a 2d grid ng <- 20 x1g <- seq(-4,4,len=ng) x2g <- seq(-4,4,len=ng) xnew <- cbind( rep(x1g,each=ng), rep(x2g,ng) ) vind <- vs$vind[1:2] pr <- proj_linpred(vs, xnew, vind=vind, transform=T, integrated=T) # visualize the contours showing the predicted class probabilities pp <- ggplot() pp <- pp + geom_contour(aes(x=xnew[,1],y=xnew[,2], z=pr, colour=..level..)) pp <- pp + scale_colour_gradient(low = "red", high = "green") pp <- pp + geom_point(aes(x=df_binom$x[,vind],y=df_binom$x[,vind]), color=df_binom$y+2) pp <- pp + xlab(sprintf('Feature %d',vind)) + ylab(sprintf('Feature %d',vind)) pp
The package supports using virtually any reference model, the only requirement is that one can make predictions using the reference model. Custom reference models can be initialized using the function
Let’s first generate some data where there are lots of correlating features, all of which are about equally predictive about the target variable, but carry similar information.
set.seed(324092) n <- 100 # number of observations d <- 200 # number of features rho <- 0.5 # correlation between the features f <- rnorm(n) # the true underlying function x <- sqrt(1-rho)*matrix(rnorm(n*d), ncol=d) + sqrt(rho)*f # features are noisy observations from f y <- f + rnorm(n) # target variable
We now fit a model to these data, but instead of using the original features, we use only a few first principal components. The following example utilizes
rstanarm for fitting the model, but notice that any other tool could be used (this choice simply makes the example more concise).
# dimension reduction nc <- 3 # number of principal components to use dr <- prcomp(x, rank. = nc) z <- dr$x # fit the model fit <- stan_glm(y~., data=data.frame(z,y), prior=normal(scale=2)) draws <- as.matrix(fit) # posterior draws sigma <- draws[,'sigma'] # noise std beta <- draws[,1+c(1:nc)] # regression coefficients alpha <- draws[,'(Intercept)'] # intercept
Now we can construct the reference model. In the following,
predfun is a function that computes the expected value for the target variable separately for each posterior draw given the new predictor values (see documentation of
init_refmodel for more information). Notice also that the reference model uses different features (the principal components) than those from which we are selecting the submodel (the original features).
# prediction with the reference model predfun <- function(zt) t( beta %*% t(zt) + alpha ) # initialize the reference model object. notice here that the first argument z # denotes the features of the reference model, and x the features from which # we want to select from ref <- init_refmodel(z, y, gaussian(), x=x, predfun=predfun, dis=sigma)
Now we can perform everything similarly as in the previous examples. The following computes the LOO validation for the projective selection with L1-search.
# LOO validation for the feature selection cvs <- cv_varsel(ref) varsel_plot(cvs, stats=c('elpd','rmse'))
It is also possible to use
projpred for computing traditional point estimates, such as Lasso (Tibshirani, 1996) or forward stepwise regression. These are fast to compute (require no reference model) and are therefore useful especially as baseline methods.
These can be computed by passing in the original data as “reference”, and then using all the same functions as earlier. We illustrate by computing the Lasso solution for simulated data we used in the previous section
# initialize the data reference dref <- init_refmodel(x,y,family=gaussian()) lasso <- cv_varsel(dref, K=10, method='l1', nv_max=30) # tenfold CV for Lasso varsel_plot(lasso, stats='rmse')
Also Lasso gives a good fit, comparable to that of the Bayesian model in terms of RMSE, but the projection finds a submodel with similar performance with fewer features. Furthermore, estimation of the noise variance is less straightforward for Lasso, so if we were interested in computing well calibrated predictive distributions for future data, projection would be the preferred choice also in this respect.
Here’s the greedy forward search with small L2-penalty for the coefficients.
Forward search finds a sparser model with comparable performance to Lasso but starts overfitting when more features are added.
Carvalho, C.M., Polson, N.G., Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika 97(2):465–480.
Dupuis, J. A. and Robert, C. P. (2003). Variable selection in qualitative models via an entropic explanatory power. Journal of Statistical Planning and Inference, 111(1-2):77–94.
Goutis, C. and Robert, C. P. (1998). Model choice in generalised linear models: a Bayesian approach via Kullback–Leibler projections. Biometrika, 85(1):29–37.
Piironen, Juho and Vehtari, Aki (2017a). Comparison of Bayesian predictive methods for model selection. Statistics and Computing 27(3):711-735. DOI 10.1007/s11222-016-9649-y. Online
Piironen, Juho and Vehtari, Aki (2017b). On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 54:905-913, 2017. Online
Piironen, Juho and Vehtari, Aki (2017c). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2): 5018–5051. Online
Piironen, Juho, Paasiniemi, Markus and Vehtari, Aki (2018). Projective Inference in High-dimensional Problems: Prediction and Feature Selection. Preprint
Tibshirani, Robert (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
Tran, M.N., Nott, D.J., Leng, C. (2012): The predictive Lasso. Statistics and Computing 22(5):1069-1084.