Function get_refmodel()
is a generic function whose methods usually call
init_refmodel()
which is the underlying workhorse (and may also be used
directly without a call to get_refmodel()
).
Both, get_refmodel()
and init_refmodel()
, create an object containing
information needed for the projection predictive variable selection, namely
about the reference model, the submodels, and how the projection should be
carried out. For the sake of simplicity, the documentation may refer to the
resulting object also as "reference model" or "reference model object", even
though it also contains information about the submodels and the projection.
A "typical" reference model object is created by get_refmodel.stanreg()
and
brms::get_refmodel.brmsfit()
, either implicitly by a call to a top-level
function such as project()
, varsel()
, and cv_varsel()
or explicitly by
a call to get_refmodel()
. All non-"typical" reference model objects will be
called "custom" reference model objects.
Some arguments are for \(K\)-fold cross-validation (\(K\)-fold CV) only;
see cv_varsel()
for the use of \(K\)-fold CV in projpred.
get_refmodel(object, ...)
# S3 method for refmodel
get_refmodel(object, ...)
# S3 method for vsel
get_refmodel(object, ...)
# S3 method for projection
get_refmodel(object, ...)
# S3 method for default
get_refmodel(object, family = NULL, ...)
# S3 method for stanreg
get_refmodel(object, latent = FALSE, dis = NULL, ...)
init_refmodel(
object,
data,
formula,
family,
ref_predfun = NULL,
div_minimizer = NULL,
proj_predfun = NULL,
extract_model_data = NULL,
cvfun = NULL,
cvfits = NULL,
dis = NULL,
cvrefbuilder = NULL,
called_from_cvrefbuilder = FALSE,
...
)
For init_refmodel()
, an object that the functions from
arguments extract_model_data
and ref_predfun
can be applied to, with a
NULL
object being treated specially (see section "Value" below). For
get_refmodel.default()
, an object that function family()
can be applied
to in order to retrieve the family (if argument family
is NULL
),
additionally to the properties required for init_refmodel()
. For
non-default methods of get_refmodel()
, an object of the corresponding
class.
For get_refmodel.default()
and get_refmodel.stanreg()
:
arguments passed to init_refmodel()
. For the get_refmodel()
generic:
arguments passed to the appropriate method. For init_refmodel()
:
arguments passed to extend_family()
(apart from family
).
An object of class family
representing the observation model
(i.e., the distributional family for the response) of the submodels.
(However, the link and the inverse-link function of this family
are also
used for quantities like predictions and fitted values related to the
reference model.) May be NULL
for get_refmodel.default()
in which
case the family is retrieved from object
. For custom reference models,
family
does not have to coincide with the family of the reference model
(if the reference model possesses a formal family
at all). In typical
reference models, however, these families do coincide. Furthermore, the
latent projection is an exception where family
is not the family of the
submodels (in that case, the family of the submodels is the gaussian()
family).
A single logical value indicating whether to use the latent
projection (TRUE
) or not (FALSE
). Note that setting latent = TRUE
causes all arguments starting with augdat_
to be ignored.
A vector of posterior draws for the reference model's dispersion
parameter or---more precisely---the posterior values for the reference
model's parameter-conditional predictive variance (assuming that this
variance is the same for all observations). May be NULL
if the submodels
have no dispersion parameter or if the submodels do have a dispersion
parameter, but object
is NULL
(in which case 0
is used for dis
).
Note that for the gaussian()
family
, dis
is the standard deviation,
not the variance.
A data.frame
containing the data to use for the projection
predictive variable selection. Any contrasts
attributes of the dataset's
columns are silently removed. For custom reference models, the columns of
data
do not necessarily have to coincide with those of the dataset used
for fitting the reference model, but keep in mind that a row-subset of
data
is used for argument newdata
of ref_predfun
during \(K\)-fold
CV.
The full formula to use for the search procedure. For custom
reference models, this does not necessarily coincide with the reference
model's formula. For general information about formulas in R, see
formula
. For information about possible right-hand side (i.e.,
predictor) terms in formula
here, see the main vignette and section
"Formula terms" below. For multilevel formulas, see also package lme4
(in particular, functions lme4::lmer()
and lme4::glmer()
). For additive
formulas, see also packages mgcv (in particular, function
mgcv::gam()
) and gamm4 (in particular, function gamm4::gamm4()
).
Prediction function for the linear predictor of the
reference model, including offsets (if existing). See also section
"Arguments ref_predfun
, proj_predfun
, and div_minimizer
" below. If
object
is NULL
, ref_predfun
is ignored and an internal default is
used instead.
A function for minimizing the Kullback-Leibler (KL)
divergence from the reference model to a submodel (i.e., for performing the
projection of the reference model onto a submodel). The output of
div_minimizer
is used, e.g., by proj_predfun
's argument fits
. See
also section "Arguments ref_predfun
, proj_predfun
, and div_minimizer
"
below.
Prediction function for the linear predictor of a
submodel onto which the reference model is projected. See also section
"Arguments ref_predfun
, proj_predfun
, and div_minimizer
" below.
A function for fetching some variables (response,
observation weights, offsets) from the original dataset (supplied to
argument data
) or from a new dataset. May be NULL
for using an internal
default that essentially corresponds to y_wobs_offs()
. See also section
"Argument extract_model_data
" below.
For \(K\)-fold CV only. A function that, given a fold indices
vector, fits the reference model separately for each fold and returns the
\(K\) model fits as a list
. If object
is NULL
, cvfun
may be
NULL
for using an internal default. Only one of cvfits
and cvfun
needs to be provided (for \(K\)-fold CV). Note that cvfits
takes
precedence over cvfun
, i.e., if both are provided, cvfits
is used.
For \(K\)-fold CV only. A list
containing the \(K\)
reference model refits from which reference model objects are created. This
list
needs to have an attribute called folds
, consisting of an integer
vector giving the fold indices (one fold index per observation). Only one
of cvfits
and cvfun
needs to be provided (for \(K\)-fold CV). Note
that cvfits
takes precedence over cvfun
, i.e., if both are provided,
cvfits
is used.
For \(K\)-fold CV only. A function that, given a
reference model fit for fold \(k \in \{1, ..., K\}\),
returns an object of the same type as init_refmodel()
does. The reference
model fit for fold \(k\) is the \(k\)-th element of the return value of
cvfun
or the \(k\)-th element of the list
supplied to cvfits
(either here in init_refmodel()
or in cv_varsel.refmodel()
), extended
by elements omitted
(containing the indices of the left-out observations
in that fold) and projpred_k
(containing the integer \(k\)) if that
\(k\)-th element is a list
itself (otherwise, omitted
and
projpred_k
are appended as attributes). Argument cvrefbuilder
may be
NULL
for using an internal default: get_refmodel()
if object
is not
NULL
and a function calling init_refmodel()
appropriately (with the
assumption dis = 0
) if object
is NULL
.
A single logical value indicating whether
init_refmodel()
is called from a cvrefbuilder
function (TRUE
) or not
(FALSE
). Currently, TRUE
only causes some warnings to be suppressed
(warnings which don't need to be thrown for each of the \(K\) reference
model objects because it is sufficient to throw them for the original
reference model object only). This argument is mainly for internal use, but
may also be helpful for users with a custom cvrefbuilder
function.
An object that can be passed to all the functions that take the
reference model fit as the first argument, such as varsel()
,
cv_varsel()
, project()
, proj_linpred()
, and proj_predict()
.
Usually, the returned object is of class refmodel
. However, if object
is NULL
, the returned object is of class datafit
as well as of class
refmodel
(with datafit
being first). Objects of class datafit
are
handled differently at several places throughout this package.
The elements of the returned object are not meant to be accessed directly
but instead via downstream functions (see the functions mentioned above as
well as predict.refmodel()
).
Although bad practice (in general), a reference model lacking an intercept can be used within projpred. However, it will always be projected onto submodels which include an intercept. The reason is that even if the true intercept in the reference model is zero, this does not need to hold for the submodels.
In multilevel (group-level) terms, function calls on the right-hand side of
the |
character (e.g., (1 | gr(group_variable))
, which is possible in
brms) are currently not allowed in projpred.
For additive models (still an experimental feature), only mgcv::s()
and
mgcv::t2()
are currently supported as smooth terms. Furthermore, these need
to be called without any arguments apart from the predictor names (symbols).
For example, for smoothing the effect of a predictor x
, only s(x)
or
t2(x)
are allowed. As another example, for smoothing the joint effect of
two predictors x
and z
, only s(x, z)
or t2(x, z)
are allowed (and
analogously for higher-order joint effects, e.g., of three predictors). Note
that all smooth terms need to be included in formula
(there is no random
argument as in rstanarm::stan_gamm4()
, for example).
ref_predfun
, proj_predfun
, and div_minimizer
Arguments ref_predfun
, proj_predfun
, and div_minimizer
may be NULL
for using an internal default (see projpred-package for the functions used
by the default divergence minimizers). Otherwise, let \(N\) denote the
number of observations (in case of CV, these may be reduced to each fold),
\(S_{\mathrm{ref}}\) the number of posterior draws for the reference
model's parameters, and \(S_{\mathrm{prj}}\) the number of draws for
the parameters of a submodel that the reference model has been projected onto
(short: the number of projected draws). For the augmented-data projection,
let \(C_{\mathrm{cat}}\) denote the number of response categories,
\(C_{\mathrm{lat}}\) the number of latent response categories (which
typically equals \(C_{\mathrm{cat}} - 1\)), and define
\(N_{\mathrm{augcat}} := N \cdot C_{\mathrm{cat}}\)
as well as \(N_{\mathrm{auglat}} := N \cdot C_{\mathrm{lat}}\). Then the functions supplied to these arguments need to have the
following prototypes:
ref_predfun
: ref_predfun(fit, newdata = NULL)
where:
fit
accepts the reference model fit as given in argument object
(but possibly refitted to a subset of the observations, as done in
\(K\)-fold CV).
newdata
accepts either NULL
(for using the original dataset,
typically stored in fit
) or data for new observations (at least in the
form of a data.frame
).
proj_predfun
: proj_predfun(fits, newdata)
where:
fits
accepts a list
of length \(S_{\mathrm{prj}}\)
containing this number of submodel fits. This list
is the same as that
returned by project()
in its output element outdmin
(which in turn is
the same as the return value of div_minimizer
, except if project()
was used with an object
of class vsel
based on an L1 search as well
as with refit_prj = FALSE
).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
div_minimizer
does not need to have a specific prototype, but it needs to
be able to be called with the following arguments:
formula
accepts either a standard formula
with a single response
(if \(S_{\mathrm{prj}} = 1\) or in case of the
augmented-data projection) or a formula
with \(S_{\mathrm{prj}} >
1\) response variables cbind()
-ed on the left-hand side in
which case the projection has to be performed for each of the response
variables separately.
data
accepts a data.frame
to be used for the projection. In case of
the traditional or the latent projection, this dataset has \(N\) rows.
In case of the augmented-data projection, this dataset has
\(N_{\mathrm{augcat}}\) rows.
family
accepts an object of class family
.
weights
accepts either observation weights (at least in the form of a
numeric vector) or NULL
(for using a vector of ones as weights).
projpred_var
accepts an \(N \times S_{\mathrm{prj}}\)
matrix of predictive variances (necessary for projpred's internal
GLM fitter) in case of the traditional or the latent projection and an
\(N_{\mathrm{augcat}} \times S_{\mathrm{prj}}\)
matrix (containing only NA
s) in case of the augmented-data projection.
projpred_ws_aug
accepts an \(N \times S_{\mathrm{prj}}\)
matrix of expected values for the response in case of the traditional or
the latent projection and an \(N_{\mathrm{augcat}} \times
S_{\mathrm{prj}}\) matrix of probabilities for the
response categories in case of the augmented-data projection.
...
accepts further arguments specified by the user (or by
projpred).
The return value of these functions needs to be:
ref_predfun
: for the traditional or the latent projection, an \(N
\times S_{\mathrm{ref}}\) matrix; for the augmented-data
projection, an \(S_{\mathrm{ref}} \times N \times C_{\mathrm{lat}}\) array (the only exception is the augmented-data projection for
the binomial()
family in which case ref_predfun
needs to return an \(N
\times S_{\mathrm{ref}}\) matrix just like for the traditional
projection because the array is constructed by an internal wrapper function).
proj_predfun
: for the traditional or the latent projection, an \(N
\times S_{\mathrm{prj}}\) matrix; for the augmented-data
projection, an \(N \times C_{\mathrm{lat}} \times S_{\mathrm{prj}}\) array.
div_minimizer
: a list
of length \(S_{\mathrm{prj}}\)
containing this number of submodel fits.
extract_model_data
The function supplied to argument extract_model_data
needs to have the
prototype
extract_model_data(object, newdata, wrhs = NULL, orhs = NULL,
extract_y = TRUE)
where:
object
accepts the reference model fit as given in argument object
(but
possibly refitted to a subset of the observations, as done in \(K\)-fold
CV).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
wrhs
accepts at least (i) a right-hand side formula consisting only of
the variable in newdata
containing the observation weights or (ii) NULL
for using the observation weights corresponding to newdata
(typically, the
observation weights are stored in a column of newdata
; if the model was
fitted without observation weights, a vector of ones should be used).
orhs
accepts at least (i) a right-hand side formula consisting only of
the variable in newdata
containing the offsets or (ii) NULL
for using the
offsets corresponding to newdata
(typically, the offsets are stored in a
column of newdata
; if the model was fitted without offsets, a vector of
zeros should be used).
extract_y
accepts a single logical value indicating whether output
element y
(see below) shall be NULL
(TRUE
) or not (FALSE
).
The return value of extract_model_data
needs to be a list
with elements
y
, weights
, and offset
, each being a numeric vector containing the data
for the response, the observation weights, and the offsets, respectively. An
exception is that y
may also be NULL
(depending on argument extract_y
),
a non-numeric vector, or a factor
.
The weights and offsets returned by extract_model_data
will be assumed to
hold for the reference model as well as for the submodels.
Above, arguments wrhs
and orhs
were assumed to have defaults of NULL
.
It should be possible to use defaults other than NULL
, but we strongly
recommend to use NULL
. If defaults other than NULL
are used, they need to
imply the behaviors described at items "(ii)" (see the descriptions of wrhs
and orhs
).
If a custom reference model for an augmented-data projection is needed, see
also extend_family()
.
For the augmented-data projection, the response vector resulting from
extract_model_data
is internally coerced to a factor
(using
as.factor()
). The levels of this factor
have to be identical to
family$cats
(after applying extend_family()
internally; see
extend_family()
's argument augdat_y_unqs
).
Note that response-specific offsets (i.e., one length-\(N\) offset vector
per response category) are not supported by projpred yet. So far, only
offsets which are the same across all response categories are supported. This
is why in case of the brms::categorical()
family, offsets are currently not
supported at all.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the augmented-data projection.
If a custom reference model for a latent projection is needed, see also
extend_family()
.
For the latent projection, family$cats
(after applying extend_family()
internally; see extend_family()
's argument latent_y_unqs
) currently must
not be NULL
if the original (i.e., non-latent) response is a factor
.
Conversely, if family$cats
(after applying extend_family()
) is
non-NULL
, the response vector resulting from extract_model_data
is
internally coerced to a factor
(using as.factor()
). The levels of this
factor
have to be identical to that non-NULL
element family$cats
.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the latent projection.
# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)
# The `stanreg` fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876
)
# Define the reference model object explicitly:
ref <- get_refmodel(fit)
print(class(ref)) # gives `"refmodel"`
#> [1] "refmodel"
# Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for
# possible post-processing functions. Most of the post-processing functions
# call get_refmodel() internally at the beginning, so you will rarely need
# to call get_refmodel() yourself.
# A custom reference model object which may be used in a variable selection
# where the candidate predictors are not a subset of those used for the
# reference model's predictions:
ref_cust <- init_refmodel(
fit,
data = dat_gauss,
formula = y ~ X6 + X7,
family = gaussian(),
cvfun = function(folds) {
kfold(
fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1
)$fits[, "fit"]
},
dis = as.matrix(fit)[, "sigma"],
cvrefbuilder = function(cvfit) {
init_refmodel(cvfit,
data = dat_gauss[-cvfit$omitted, , drop = FALSE],
formula = y ~ X6 + X7,
family = gaussian(),
dis = as.matrix(cvfit)[, "sigma"],
called_from_cvrefbuilder = TRUE)
}
)
# Now, the post-processing functions mentioned above (for example,
# varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.