Compute individual conditional expectations for an oblique random forest. Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor. You can compute individual conditional expectations three ways using a random forest:
using in-bag predictions for the training data
using out-of-bag predictions for the training data
using predictions for a new set of data
See examples for more details
Usage
orsf_ice_oob(
object,
pred_spec,
pred_horizon = NULL,
pred_type = NULL,
expand_grid = TRUE,
boundary_checks = TRUE,
n_thread = NULL,
verbose_progress = NULL,
...
)
orsf_ice_inb(
object,
pred_spec,
pred_horizon = NULL,
pred_type = NULL,
expand_grid = TRUE,
boundary_checks = TRUE,
n_thread = NULL,
verbose_progress = NULL,
...
)
orsf_ice_new(
object,
pred_spec,
new_data,
pred_horizon = NULL,
pred_type = NULL,
na_action = "fail",
expand_grid = TRUE,
boundary_checks = TRUE,
n_thread = NULL,
verbose_progress = NULL,
...
)
Arguments
- object
(ObliqueForest) a trained oblique random forest object (see orsf).
- pred_spec
(named list, pspec_auto, or data.frame).
If
pred_spec
is a named list, Each item in the list should be a vector of values that will be used as points in the partial dependence function. The name of each item in the list should indicate which variable will be modified to take the corresponding values.If
pred_spec
is created usingpred_spec_auto()
, all that is needed is the names of variables to use (see pred_spec_auto).If
pred_spec
is adata.frame
, columns will indicate variable names, values will indicate variable values, and partial dependence will be computed using the inputs on each row.
- pred_horizon
(double) Only relevent for survival forests. A value or vector indicating the time(s) that predictions will be calibrated to. E.g., if you were predicting risk of incident heart failure within the next 10 years, then
pred_horizon = 10
.pred_horizon
can beNULL
ifpred_type
is'mort'
, since mortality predictions are aggregated over all event times- pred_type
(character) the type of predictions to compute. Valid Valid options for survival are:
'risk' : probability of having an event at or before
pred_horizon
.'surv' : 1 - risk.
'chf': cumulative hazard function
'mort': mortality prediction
'time': survival time prediction
For classification:
'prob': probability for each class
For regression:
'mean': predicted mean, i.e., the expected value
- expand_grid
(logical) if
TRUE
, partial dependence will be computed at all possible combinations of inputs inpred_spec
. IfFALSE
, partial dependence will be computed for each variable inpred_spec
, separately.- boundary_checks
(logical) if
TRUE
,pred_spec
will be checked to make sure the requested values are between the 10th and 90th percentile in the object's training data. IfFALSE
, these checks are skipped.- n_thread
(integer) number of threads to use while computing predictions. Default is 0, which allows a suitable number of threads to be used based on availability.
- verbose_progress
(logical) if
TRUE
, progress will be printed to console. IfFALSE
(the default), nothing will be printed.- ...
Further arguments passed to or from other methods (not currently used).
- new_data
a data.frame, tibble, or data.table to compute predictions in.
- na_action
(character) what should happen when
new_data
contains missing values (i.e.,NA
values). Valid options are:'fail' : an error is thrown if
new_data
containsNA
values'omit' : rows in
new_data
with incomplete data will be dropped
Value
a data.table containing individual conditional expectations for the specified variable(s) and, if relevant, at the specified prediction horizon(s).
Examples
You can compute individual conditional expectation and individual conditional expectations in three ways:
using in-bag predictions for the training data. In-bag individual conditional expectation indicates relationships that the model has learned during training. This is helpful if your goal is to interpret the model.
using out-of-bag predictions for the training data. Out-of-bag individual conditional expectation indicates relationships that the model has learned during training but using the out-of-bag data simulates application of the model to new data. This is helpful if you want to test your model’s reliability or fairness in new data but you don’t have access to a large testing set.
using predictions for a new set of data. New data individual conditional expectation shows how the model predicts outcomes for observations it has not seen. This is helpful if you want to test your model’s reliability or fairness.
Classification
Begin by fitting an oblique classification random forest:
set.seed(329)
index_train <- sample(nrow(penguins_orsf), 150)
penguins_orsf_train <- penguins_orsf[index_train, ]
penguins_orsf_test <- penguins_orsf[-index_train, ]
fit_clsf <- orsf(data = penguins_orsf_train,
formula = species ~ .)
Compute individual conditional expectation using out-of-bag data for
flipper_length_mm = c(190, 210)
.
pred_spec <- list(flipper_length_mm = c(190, 210))
ice_oob <- orsf_ice_oob(fit_clsf, pred_spec = pred_spec)
ice_oob
## Key: <class>
## id_variable id_row class flipper_length_mm pred
## <int> <char> <fctr> <num> <num>
## 1: 1 1 Adelie 190 0.92169247
## 2: 1 2 Adelie 190 0.80944657
## 3: 1 3 Adelie 190 0.85172955
## 4: 1 4 Adelie 190 0.93559327
## 5: 1 5 Adelie 190 0.97708693
## ---
## 896: 2 146 Gentoo 210 0.26092984
## 897: 2 147 Gentoo 210 0.04798334
## 898: 2 148 Gentoo 210 0.07927359
## 899: 2 149 Gentoo 210 0.84779971
## 900: 2 150 Gentoo 210 0.11105143
There are two identifiers in the output:
id_variable
is an identifier for the current value of the variable(s) that are in the data. It is redundant if you only have one variable, but helpful if there are multiple variables.id_row
is an identifier for the observation in the original data.
Note that predicted probabilities are returned for each class and each observation in the data. Predicted probabilities for a given observation and given variable value sum to 1. For example,
Regression
Begin by fitting an oblique regression random forest:
set.seed(329)
index_train <- sample(nrow(penguins_orsf), 150)
penguins_orsf_train <- penguins_orsf[index_train, ]
penguins_orsf_test <- penguins_orsf[-index_train, ]
fit_regr <- orsf(data = penguins_orsf_train,
formula = bill_length_mm ~ .)
Compute individual conditional expectation using new data for
flipper_length_mm = c(190, 210)
.
pred_spec <- list(flipper_length_mm = c(190, 210))
ice_new <- orsf_ice_new(fit_regr,
pred_spec = pred_spec,
new_data = penguins_orsf_test)
ice_new
## id_variable id_row flipper_length_mm pred
## <int> <char> <num> <num>
## 1: 1 1 190 37.94483
## 2: 1 2 190 37.61595
## 3: 1 3 190 37.53681
## 4: 1 4 190 39.49476
## 5: 1 5 190 38.95635
## ---
## 362: 2 179 210 51.80471
## 363: 2 180 210 47.27183
## 364: 2 181 210 47.05031
## 365: 2 182 210 50.39028
## 366: 2 183 210 48.44774
You can also let pred_spec_auto
pick reasonable values like so:
pred_spec = pred_spec_auto(species, island, body_mass_g)
ice_new <- orsf_ice_new(fit_regr,
pred_spec = pred_spec,
new_data = penguins_orsf_test)
ice_new
## id_variable id_row species island body_mass_g pred
## <int> <char> <fctr> <fctr> <num> <num>
## 1: 1 1 Adelie Biscoe 3200 37.78339
## 2: 1 2 Adelie Biscoe 3200 37.73273
## 3: 1 3 Adelie Biscoe 3200 37.71248
## 4: 1 4 Adelie Biscoe 3200 40.25782
## 5: 1 5 Adelie Biscoe 3200 40.04074
## ---
## 8231: 45 179 Gentoo Torgersen 5300 46.14559
## 8232: 45 180 Gentoo Torgersen 5300 43.98050
## 8233: 45 181 Gentoo Torgersen 5300 44.59837
## 8234: 45 182 Gentoo Torgersen 5300 44.85146
## 8235: 45 183 Gentoo Torgersen 5300 44.23710
By default, all combinations of all variables are used. However, you can also look at the variables one by one, separately, like so:
ice_new <- orsf_ice_new(fit_regr,
expand_grid = FALSE,
pred_spec = pred_spec,
new_data = penguins_orsf_test)
ice_new
## id_variable id_row variable value level pred
## <int> <char> <char> <num> <char> <num>
## 1: 1 1 species NA Adelie 37.74136
## 2: 1 2 species NA Adelie 37.42367
## 3: 1 3 species NA Adelie 37.04598
## 4: 1 4 species NA Adelie 39.89602
## 5: 1 5 species NA Adelie 39.14848
## ---
## 2009: 5 179 body_mass_g 5300 <NA> 51.50196
## 2010: 5 180 body_mass_g 5300 <NA> 47.27055
## 2011: 5 181 body_mass_g 5300 <NA> 48.34064
## 2012: 5 182 body_mass_g 5300 <NA> 48.75828
## 2013: 5 183 body_mass_g 5300 <NA> 48.11020
And you can also bypass all the bells and whistles by using your own
data.frame
for a pred_spec
. (Just make sure you request values that
exist in the training data.)
custom_pred_spec <- data.frame(species = 'Adelie',
island = 'Biscoe')
ice_new <- orsf_ice_new(fit_regr,
pred_spec = custom_pred_spec,
new_data = penguins_orsf_test)
ice_new
## id_variable id_row species island pred
## <int> <char> <fctr> <fctr> <num>
## 1: 1 1 Adelie Biscoe 38.52327
## 2: 1 2 Adelie Biscoe 38.32073
## 3: 1 3 Adelie Biscoe 37.71248
## 4: 1 4 Adelie Biscoe 41.68380
## 5: 1 5 Adelie Biscoe 40.91140
## ---
## 179: 1 179 Adelie Biscoe 43.09493
## 180: 1 180 Adelie Biscoe 38.79455
## 181: 1 181 Adelie Biscoe 39.37734
## 182: 1 182 Adelie Biscoe 40.71952
## 183: 1 183 Adelie Biscoe 39.34501
Survival
Begin by fitting an oblique survival random forest:
set.seed(329)
index_train <- sample(nrow(pbc_orsf), 150)
pbc_orsf_train <- pbc_orsf[index_train, ]
pbc_orsf_test <- pbc_orsf[-index_train, ]
fit_surv <- orsf(data = pbc_orsf_train,
formula = Surv(time, status) ~ . - id,
oobag_pred_horizon = 365.25 * 5)
Compute individual conditional expectation using in-bag data for
bili = c(1,2,3,4,5)
:
ice_train <- orsf_ice_inb(fit_surv, pred_spec = list(bili = 1:5))
ice_train
## id_variable id_row pred_horizon bili pred
## <int> <char> <num> <num> <num>
## 1: 1 1 1826.25 1 0.1290317
## 2: 1 2 1826.25 1 0.1242352
## 3: 1 3 1826.25 1 0.0963452
## 4: 1 4 1826.25 1 0.1172367
## 5: 1 5 1826.25 1 0.2030256
## ---
## 746: 5 146 1826.25 5 0.7868537
## 747: 5 147 1826.25 5 0.2012954
## 748: 5 148 1826.25 5 0.4893605
## 749: 5 149 1826.25 5 0.4698220
## 750: 5 150 1826.25 5 0.9557285
If you don’t have specific values of a variable in mind, let
pred_spec_auto
pick for you:
ice_train <- orsf_ice_inb(fit_surv, pred_spec_auto(bili))
ice_train
## id_variable id_row pred_horizon bili pred
## <int> <char> <num> <num> <num>
## 1: 1 1 1826.25 0.55 0.11728559
## 2: 1 2 1826.25 0.55 0.11728839
## 3: 1 3 1826.25 0.55 0.08950739
## 4: 1 4 1826.25 0.55 0.10064959
## 5: 1 5 1826.25 0.55 0.18736417
## ---
## 746: 5 146 1826.25 7.25 0.82600898
## 747: 5 147 1826.25 7.25 0.29156437
## 748: 5 148 1826.25 7.25 0.58395919
## 749: 5 149 1826.25 7.25 0.54202021
## 750: 5 150 1826.25 7.25 0.96391985
Specify pred_horizon
to get individual conditional expectation at each
value:
ice_train <- orsf_ice_inb(fit_surv, pred_spec_auto(bili),
pred_horizon = seq(500, 3000, by = 500))
ice_train
## id_variable id_row pred_horizon bili pred
## <int> <char> <num> <num> <num>
## 1: 1 1 500 0.55 0.008276627
## 2: 1 1 1000 0.55 0.055724516
## 3: 1 1 1500 0.55 0.085091120
## 4: 1 1 2000 0.55 0.123423352
## 5: 1 1 2500 0.55 0.166380739
## ---
## 4496: 5 150 1000 7.25 0.837774757
## 4497: 5 150 1500 7.25 0.934536379
## 4498: 5 150 2000 7.25 0.967823286
## 4499: 5 150 2500 7.25 0.972059574
## 4500: 5 150 3000 7.25 0.980785643
Multi-prediction horizon ice comes with minimal extra computational cost. Use a fine grid of time values and assess whether predictors have time-varying effects.