This article covers core features of the aorsf package.
Background: ORSF
The oblique random survival forest (ORSF) is an extension of the axis-based RSF algorithm.
Accelerated ORSF
The purpose of aorsf (‘a’ is short for accelerated) is to provide routines to fit ORSFs that will scale adequately to large data sets. The fastest algorithm available in the package is the accelerated ORSF model, which is the default method used by orsf():
library(aorsf)
set.seed(329)
orsf_fit <- orsf(data = pbc_orsf,
                 formula = Surv(time, status) ~ . - id)
orsf_fit
#> ---------- Oblique random survival forest
#>
#> Linear combinations: Accelerated
#> N observations: 276
#> N events: 111
#> N trees: 500
#> N predictors total: 17
#> N predictors per node: 5
#> Average leaves per tree: 25
#> Min observations in leaf: 5
#> Min events in leaf: 1
#> OOB stat value: 0.84
#> OOB stat type: Harrell's C-statistic
#> Variable importance: anova
#>
#> -----------------------------------------
You may notice that the first input of orsf() is data. This is a design choice that makes it easier to use orsf() with pipes (i.e., %>% or |>).
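Because data comes first, a data frame can be piped straight into orsf(). The following is a minimal sketch, assuming R 4.1 or later for the native pipe; the object name orsf_fit_piped is introduced here only for illustration.

```r
library(aorsf)

set.seed(329)

# pbc_orsf is passed along as the first argument (data) of orsf()
orsf_fit_piped <- pbc_orsf |>
  orsf(formula = Surv(time, status) ~ . - id)

orsf_fit_piped
```

The same call should work with %>% from magrittr in place of |>.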
Interpretation
aorsf includes several functions dedicated to interpretation of ORSFs, both through estimation of partial dependence and variable importance.
Variable importance
aorsf provides multiple ways to compute variable importance.
- To compute negation importance for a variable, ORSF multiplies each coefficient of that variable by -1 and then re-computes the out-of-sample (sometimes referred to as out-of-bag) accuracy of the ORSF model. The more accuracy drops after negation, the more important the variable.
orsf_vi_negate(orsf_fit)
#>         bili          age       copper      albumin      protime          sex
#>  0.084340488  0.029849969  0.023129819  0.010627214  0.010002084  0.008699729
#>      ascites          ast         chol     platelet        edema      spiders
#>  0.006042926  0.004792665  0.004584288  0.003698687  0.002767194  0.002344238
#>       hepato        stage         trig     alk.phos          trt
#>  0.001771202  0.001510731 -0.001510731 -0.001823297 -0.002865180
- You can also compute variable importance using permutation, a more classical approach.
orsf_vi_permute(orsf_fit)
#>          bili           age       albumin         stage        copper
#>  0.0120337570  0.0107834966  0.0039591582  0.0036986872  0.0036465930
#>       ascites       protime           ast       spiders          chol
#>  0.0035944989  0.0032298395  0.0026568035  0.0015628256  0.0013544488
#>        hepato      platelet         edema           sex          trig
#>  0.0013023547  0.0011981663  0.0010009526 -0.0001562826 -0.0015107314
#>      alk.phos           trt
#> -0.0017191081 -0.0023963326
- A faster alternative to permutation and negation importance is ANOVA importance, which computes the proportion of times each variable obtains a low p-value (p < 0.01) while the forest is grown.
orsf_vi_anova(orsf_fit)
#>    ascites       bili      edema     copper    albumin        age    protime
#> 0.37716956 0.27668288 0.24659461 0.20188872 0.18106061 0.17233782 0.14552704
#>    spiders       chol      stage        ast        sex     hepato   alk.phos
#> 0.13985149 0.13853379 0.13743076 0.13238238 0.11798708 0.11449452 0.09347937
#>       trig   platelet        trt
#> 0.08996452 0.06906218 0.06234686
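To make the idea behind negation importance concrete, here is a rough sketch on a single linear model rather than a forest. It is not how aorsf implements it: aorsf negates coefficients in the linear combinations used to split nodes and re-computes out-of-bag accuracy, whereas this stand-in uses the built-in mtcars data, one lm() fit, and in-sample R-squared as the accuracy measure. The helper r_squared() and all object names are made up for illustration.

```r
# Stand-in model: one linear fit on a built-in data set (not a survival forest)
fit  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
X    <- model.matrix(fit)   # design matrix, including the intercept column
beta <- coef(fit)

# Accuracy measure for this sketch: R-squared of the linear predictor
r_squared <- function(b) {
  pred <- drop(X %*% b)
  1 - sum((mtcars$mpg - pred)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
}

baseline <- r_squared(beta)

# Flip the sign of one coefficient at a time and record the loss in accuracy
negation_vi <- sapply(c("wt", "hp", "qsec"), function(v) {
  b <- beta
  b[v] <- -b[v]
  baseline - r_squared(b)
})

sort(negation_vi, decreasing = TRUE)
```

A variable whose coefficient matters a lot loses more accuracy when its sign is flipped, so it ranks higher.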
Partial dependence (PD)
Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable adjusted estimate of the model’s prediction.
For more on PD, see the vignette
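The marginalization described above can be sketched by hand. This is a conceptual illustration only, assuming a plain linear model on the built-in mtcars data as a stand-in for an ORSF; it does not use the aorsf partial dependence functions, and the names wt_grid and pd are hypothetical.

```r
# Stand-in model on a built-in data set (not an ORSF)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Values of the predictor of interest at which to evaluate partial dependence
wt_grid <- c(2, 3, 4)

pd <- sapply(wt_grid, function(w) {
  data_w <- mtcars
  data_w$wt <- w                         # set wt to the same value in every row
  mean(predict(fit, newdata = data_w))   # average over the other predictors
})

data.frame(wt = wt_grid, mean_prediction = pd)
```

Each row of the result is the expected prediction when wt is fixed at the given value and all other predictors keep their observed values.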
Individual conditional expectations (ICE)
Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor.
For more on ICE, see the vignette
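As a conceptual illustration, ICE keeps one prediction per observation instead of averaging them. The sketch below again assumes a plain linear model on mtcars as a stand-in for an ORSF and does not use the aorsf ICE functions; the names wt_grid and ice are hypothetical.

```r
# Stand-in model on a built-in data set (not an ORSF)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

wt_grid <- c(2, 3, 4)

# One row per observation, one column per grid value
ice <- sapply(wt_grid, function(w) {
  data_w <- mtcars
  data_w$wt <- w
  predict(fit, newdata = data_w)   # keep every observation's prediction
})
colnames(ice) <- paste0("wt_", wt_grid)

head(ice, 3)

# Averaging each column recovers the partial dependence values
colMeans(ice)
```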
What about the original ORSF?
The original ORSF (i.e., obliqueRSF) used glmnet to find linear combinations of inputs. aorsf allows users to implement this approach using the orsf_control_net() function:
orsf_net <- orsf(data = pbc_orsf,
                 formula = Surv(time, status) ~ . - id,
                 control = orsf_control_net(),
                 n_tree = 50)
net forests fit a lot faster than the original ORSF function in obliqueRSF. However, net forests are still much slower than cph ones:
# tracking how long it takes to fit 50 glmnet trees
print(
  t1 <- system.time(
    orsf(data = pbc_orsf,
         formula = Surv(time, status) ~ . - id,
         control = orsf_control_net(),
         n_tree = 50)
  )
)
#> user system elapsed
#> 3.824 0.044 3.868
# and how long it takes to fit 50 cph trees
print(
  t2 <- system.time(
    orsf(data = pbc_orsf,
         formula = Surv(time, status) ~ . - id,
         control = orsf_control_cph(),
         n_tree = 50)
  )
)
#> user system elapsed
#> 0.051 0.000 0.052
t1['elapsed'] / t2['elapsed']
#> elapsed
#> 74.38462
aorsf and other machine learning software
The unique feature of aorsf is its fast algorithms to fit ORSF ensembles. RLT and obliqueRSF both fit oblique random survival forests, but aorsf does so faster. ranger and randomForestSRC fit survival forests, but neither package supports oblique splitting. obliqueRF fits oblique random forests for classification and regression, but not survival. PPforest fits oblique random forests for classification but not survival.
Note: The default prediction behavior for aorsf models is to produce predicted risk at a specific prediction horizon, which is not the default for ranger or randomForestSRC. I think this will change in the future, as computing time-independent predictions with aorsf could be helpful.
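As a hedged sketch of what a prediction horizon looks like in practice, the call below requests predicted risk at two time points. The horizons 1000 and 2000 are arbitrary values chosen for illustration, on the same scale as the time column of pbc_orsf. The forest is refit with 50 trees only to keep the example quick, and the names fit and risk are illustrative.

```r
library(aorsf)

set.seed(329)

fit <- orsf(data = pbc_orsf,
            formula = Surv(time, status) ~ . - id,
            n_tree = 50)

# Predicted risk for the first five rows at two prediction horizons
risk <- predict(fit,
                new_data = pbc_orsf[1:5, ],
                pred_horizon = c(1000, 2000))

risk
```

The result should have one row per observation in new_data and one column per requested horizon.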