Initializes a structure that can be used as a reference fit for the projective variable selection. This function is provided to allow construction of the reference fit from arbitrary fitted models, because only limited information is needed for the actual projection and variable selection.

init_refmodel(z, y, family, x = NULL, predfun = NULL, dis = NULL,
  offset = NULL, wobs = NULL, wsample = NULL, intercept = TRUE,
  cvfun = NULL, cvfits = NULL, ...)

Arguments

z

Predictor matrix of dimension n-by-dz containing the training features for the reference model. Rows denote the observations and columns the different features.

y

Vector of length n giving the target variable values.

family

family object giving the model family

x

Predictor matrix of dimension n-by-dx containing the candidate features for selection (i.e. variables from which to select the submodel). Rows denote the observations and columns the different features. Notice that this can different from z. If missing, same as z by default.

predfun

Function that takes a nt-by-dz test predictor matrix zt as an input (nt = # test points, dz = number of features in the reference model) and outputs a nt-by-S matrix of expected values for the target variable y, each column corresponding to one posterior draw for the parameters in the reference model (the number of draws S can also be 1). Notice that the output should be computed without any offsets, these are automatically taken into account internally, e.g. in cross-validation. If omitted, then the returned object will be 'data reference', that is, it can be used to compute penalized maximum likelihood solutions such as Lasso (see examples below and in the quickstart vignette.)

dis

Vector of length S giving the posterior draws for the dispersion parameter in the reference model if there is such a parameter in the model family. For Gaussian observation model this is the noise std sigma.

offset

Offset to be added to the linear predictor in the projection. (Same as in function glm.)

wobs

Observation weights. If omitted, equal weights are assumed.

wsample

vector of length S giving the weights for the posterior draws. If omitted, equal weights are assumed.

intercept

Whether to use intercept. Default is TRUE.

cvfun

Function for performing K-fold cross-validation. The input is an n-element vector where each value is an integer between 1 and K denoting the fold for each observation. Should return a list with K elements, each of which is a list with fields predfun and dis (if the model has a dispersion parameter) which are defined the same way as the arguments predfun and dis above but are computed using only the corresponding subset of the data. More precisely, if cvres denotes the list returned by cvfun, then cvres[[k]]$predfun and cvres[[k]]$dis must be computed using only data from indices folds != k, where folds is the n-element input for cvfun. Can be omitted but either cvfun or cvfits is needed for K-fold cross-validation for genuine reference models. See example below.

cvfits

A list with K elements, that has the same format as the value returned by cvind but each element of cvfits must also contain a field omitted which indicates the indices that were left out for the corresponding fold. Usually it is easier to specify cvfun but this can be useful if you have already computed the cross-validation for the reference model and would like to avoid recomputing it. Can be omitted but either cvfun or cvfits is needed for K-fold cross-validation for genuine reference models.

...

Currently ignored.

Value

An object that can be passed to all the functions that take the reference fit as the first argument, such as varsel, cv_varsel, proj_predict and proj_linpred.

Examples

# generate some toy data set.seed(1) n <- 100 d <- 10 x <- matrix(rnorm(n*d), nrow=n, ncol=d) b <- c(c(1,1),rep(0,d-2)) # first two variables are relevant y <- x %*% b + rnorm(n) # fit the model (this uses rstanarm for posterior inference, # but any other tool could also be used) fit <- stan_glm(y~x, family=gaussian(), data=data.frame(x=I(x),y=y))
#> Error in stan_glm(y ~ x, family = gaussian(), data = data.frame(x = I(x), y = y)): could not find function "stan_glm"
draws <- as.matrix(fit)
#> Error in as.matrix(fit): object 'fit' not found
a <- draws[,1] # intercept
#> Error in eval(expr, envir, enclos): object 'draws' not found
b <- draws[,2:(ncol(draws)-1)] # regression coefficients
#> Error in eval(expr, envir, enclos): object 'draws' not found
sigma <- draws[,ncol(draws)] # noise std
#> Error in eval(expr, envir, enclos): object 'draws' not found
# initialize the reference model structure predfun <- function(xt) t( b %*% t(xt) + a ) ref <- init_refmodel(x,y, gaussian(), predfun=predfun, dis=sigma)
#> Error in t(b %*% t(xt) + a): object 'a' not found
# variable selection based on the reference model vs <- cv_varsel(ref)
#> Error in get_refmodel(fit, ...): object 'ref' not found
#> Error in "vsel" %in% class(object): object 'vs' not found
# pass in the original data as 'reference'; this allows us to compute # traditional estimates like Lasso dref <- init_refmodel(x,y,gaussian()) lasso <- cv_varsel(dref, method='l1') # lasso
#> [1] "Performing selection for each fold.." #> | | | 0% | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100% #> [1] "Performing the selection using all the data.." #> [1] "Done."
varsel_plot(lasso, stat='rmse')