Compute partial dependence for an ORSF model. Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable adjusted estimate of the model's prediction. You can compute partial dependence three ways using a random forest:
- using in-bag predictions for the training data
- using out-of-bag predictions for the training data
- using predictions for a new set of data
See examples for more details.
Usage
orsf_pd_oob(
object,
pred_spec,
pred_horizon = NULL,
pred_type = "risk",
expand_grid = TRUE,
prob_values = c(0.025, 0.5, 0.975),
prob_labels = c("lwr", "medn", "upr"),
boundary_checks = TRUE,
...
)
orsf_pd_inb(
object,
pred_spec,
pred_horizon = NULL,
pred_type = "risk",
expand_grid = TRUE,
prob_values = c(0.025, 0.5, 0.975),
prob_labels = c("lwr", "medn", "upr"),
boundary_checks = TRUE,
...
)
orsf_pd_new(
object,
pred_spec,
new_data,
pred_horizon = NULL,
pred_type = "risk",
na_action = "fail",
expand_grid = TRUE,
prob_values = c(0.025, 0.5, 0.975),
prob_labels = c("lwr", "medn", "upr"),
boundary_checks = TRUE,
...
)
Arguments
- object
(orsf_fit) a trained oblique random survival forest (see orsf).
- pred_spec
(named list or data.frame).
If pred_spec is a named list, each item in the list should be a vector of values that will be used as points in the partial dependence function. The name of each item in the list should indicate which variable will be modified to take the corresponding values.
If pred_spec is a data.frame, columns will indicate variable names, values will indicate variable values, and partial dependence will be computed using the inputs on each row.
- pred_horizon
(double) a value or vector indicating the time(s) that predictions will be calibrated to. E.g., if you were predicting risk of incident heart failure within the next 10 years, then pred_horizon = 10. pred_horizon can be NULL if pred_type is 'mort', since mortality predictions are aggregated over all event times.
- pred_type
(character) the type of predictions to compute. Valid options are:
'risk': probability of having an event at or before pred_horizon.
'surv': 1 - risk.
'chf': cumulative hazard function.
'mort': mortality prediction.
- expand_grid
(logical) if TRUE, partial dependence will be computed at all possible combinations of inputs in pred_spec. If FALSE, partial dependence will be computed for each variable in pred_spec, separately.
- prob_values
(numeric) a vector of values between 0 and 1, indicating what quantiles will be used to summarize the partial dependence values at each set of inputs. prob_values should have the same length as prob_labels. The quantiles are calculated based on predictions from object at each set of values indicated by pred_spec.
- prob_labels
(character) a vector of labels with the same length as prob_values, with each label indicating what the corresponding value in prob_values should be labelled as in summarized outputs.
- boundary_checks
(logical) if TRUE, pred_spec will be checked to make sure the requested values are between the 10th and 90th percentile in the object's training data. If FALSE, these checks are skipped.
- ...
Further arguments passed to or from other methods (not currently used).
- new_data
a data.frame, tibble, or data.table to compute predictions in.
- na_action
(character) what should happen when new_data contains missing values (i.e., NA values). Valid options are:
'fail': an error is thrown if new_data contains NA values.
'omit': rows in new_data with incomplete data will be dropped.
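The two forms of pred_spec and the effect of expand_grid can be pictured with base R alone. The sketch below only illustrates the input grids described above; it is not aorsf's internal code. The predictors bili and sex are assumed example variables, and grid_joint, grid_separate, and pred_spec_df are hypothetical names chosen for this illustration.

```r
# A named list: each element holds the values to evaluate for that variable.
pred_spec <- list(bili = c(1, 3, 5), sex = c("m", "f"))

# expand_grid = TRUE corresponds to every combination of the listed values
# (3 values of bili x 2 values of sex = 6 rows).
grid_joint <- expand.grid(pred_spec, stringsAsFactors = FALSE)

# expand_grid = FALSE corresponds to one grid per variable, each varied on
# its own (3 rows for bili, 2 rows for sex).
grid_separate <- lapply(names(pred_spec), function(v) {
  data.frame(variable = v, value = as.character(pred_spec[[v]]))
})
names(grid_separate) <- names(pred_spec)

# A data.frame pred_spec instead names the exact rows to evaluate,
# so nothing is expanded: here, two specific (bili, sex) pairs.
pred_spec_df <- data.frame(bili = c(1, 5), sex = c("f", "m"))
```

The joint grid grows multiplicatively with each added variable, which is one practical reason to prefer the separate form when many predictors are examined at once.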
Value
a data.table containing partial dependence values for the specified variable(s) at the specified prediction horizon(s).
Details
Partial dependence has a number of known limitations and assumptions that users should be aware of (see Hooker, 2021). In particular, partial dependence is less intuitive when more than two predictors are examined jointly, and it assumes that the feature(s) for which partial dependence is computed are not correlated with other features (this is likely not true in many cases). Accumulated local effect plots can be used when feature independence is not a valid assumption.
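To make the marginalization concrete, partial dependence can be computed by hand: set the variable of interest to a fixed value in every row, predict, then summarize the predictions. The sketch below uses a linear model on the built-in mtcars data as a stand-in for an ORSF model (an assumption for illustration only), and pd_by_hand is a hypothetical helper, not a function in aorsf. It also shows why correlated features are a problem: every row is assigned the same value of the variable regardless of its other covariates, which can create combinations that never occur in the data.

```r
# Stand-in model: a linear regression on a built-in dataset.
# (An assumption for illustration; an ORSF model would take its place.)
fit_lm <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Hypothetical helper: partial dependence of `model` on `variable`.
pd_by_hand <- function(model, data, variable, values,
                       prob_values = c(0.025, 0.5, 0.975),
                       prob_labels = c("lwr", "medn", "upr")) {
  rows <- lapply(values, function(v) {
    data_v <- data
    data_v[[variable]] <- v  # set the variable to v in every row
    preds <- predict(model, newdata = data_v)
    # summarize the per-row predictions with the requested quantiles
    qs <- quantile(preds, probs = prob_values, names = FALSE)
    names(qs) <- prob_labels
    data.frame(value = v, mean = mean(preds), as.list(qs))
  })
  do.call(rbind, rows)
}

pd_wt <- pd_by_hand(fit_lm, mtcars, "wt", values = c(2, 3, 4))
pd_wt
```

For a linear model the mean column changes by exactly the wt coefficient per unit of wt, so the result is easy to check; for a random forest the curve is generally nonlinear.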
Examples
Begin by fitting an ORSF ensemble:
library(aorsf)
set.seed(329730)
index_train <- sample(nrow(pbc_orsf), 150)
pbc_orsf_train <- pbc_orsf[index_train, ]
pbc_orsf_test <- pbc_orsf[-index_train, ]
fit <- orsf(data = pbc_orsf_train,
formula = Surv(time, status) ~ . - id,
oobag_pred_horizon = 365.25 * 5)
Three ways to compute PD and ICE
You can compute partial dependence and ICE three ways with aorsf:
using in-bag predictions for the training data
pd_train <- orsf_pd_inb(fit, pred_spec = list(bili = 1:5))
pd_train
##    pred_horizon bili      mean        lwr      medn       upr
## 1:      1826.25    1 0.2054232 0.01599366 0.0929227 0.8077278
## 2:      1826.25    2 0.2369077 0.02549869 0.1268457 0.8227315
## 3:      1826.25    3 0.2808514 0.05027265 0.1720280 0.8457834
## 4:      1826.25    4 0.3428065 0.09758988 0.2545869 0.8575243
## 5:      1826.25    5 0.3992909 0.16392752 0.3232681 0.8634269
using out-of-bag predictions for the training data
pd_train <- orsf_pd_oob(fit, pred_spec = list(bili = 1:5))
pd_train
##    pred_horizon bili      mean        lwr       medn       upr
## 1:      1826.25    1 0.2068300 0.01479443 0.08824123 0.8053317
## 2:      1826.25    2 0.2377046 0.02469718 0.12623031 0.8258154
## 3:      1826.25    3 0.2810546 0.04080813 0.18721220 0.8484846
## 4:      1826.25    4 0.3417839 0.09076851 0.24968438 0.8611884
## 5:      1826.25    5 0.3979925 0.16098228 0.32147532 0.8554402
using predictions for a new set of data
pd_test <- orsf_pd_new(fit, new_data = pbc_orsf_test, pred_spec = list(bili = 1:5))
pd_test
##    pred_horizon bili      mean        lwr      medn       upr
## 1:      1826.25    1 0.2510900 0.01631318 0.1872414 0.8162621
## 2:      1826.25    2 0.2807327 0.02903956 0.2269297 0.8332956
## 3:      1826.25    3 0.3247386 0.05860235 0.2841853 0.8481825
## 4:      1826.25    4 0.3850799 0.10741224 0.3405760 0.8588955
## 5:      1826.25    5 0.4394952 0.17572657 0.4050864 0.8657886
In-bag partial dependence indicates relationships that the model has learned during training. This is helpful if your goal is to interpret the model.
Out-of-bag partial dependence also indicates relationships that the model has learned during training, but using the out-of-bag data simulates application of the model to new data. This is helpful if you want to test your model’s reliability or fairness in new data but you don’t have access to a large testing set.
New data partial dependence shows how the model predicts outcomes for observations it has not seen. This is helpful if you want to test your model’s reliability or fairness.