This is the central, most important function
of the drake package. It runs all the steps of your
workflow in the correct order, skipping any work
that is already up to date. Because of how make()
tracks global functions and objects as dependencies of targets,
please restart your R session so the pipeline runs
in a clean reproducible environment.
Usage
make(
plan,
targets = NULL,
envir = parent.frame(),
verbose = 1L,
hook = NULL,
cache = drake::drake_cache(),
fetch_cache = NULL,
parallelism = "loop",
jobs = 1L,
jobs_preprocess = 1L,
packages = rev(.packages()),
lib_loc = NULL,
prework = character(0),
prepend = NULL,
command = NULL,
args = NULL,
recipe_command = NULL,
log_progress = TRUE,
skip_targets = FALSE,
timeout = NULL,
cpu = Inf,
elapsed = Inf,
retries = 0,
force = FALSE,
graph = NULL,
trigger = drake::trigger(),
skip_imports = FALSE,
skip_safety_checks = FALSE,
config = NULL,
lazy_load = "eager",
session_info = NULL,
cache_log_file = NULL,
seed = NULL,
caching = "main",
keep_going = FALSE,
session = NULL,
pruning_strategy = NULL,
makefile_path = NULL,
console_log_file = NULL,
ensure_workers = NULL,
garbage_collection = FALSE,
template = list(),
sleep = function(i) 0.01,
hasty_build = NULL,
memory_strategy = "speed",
layout = NULL,
spec = NULL,
lock_envir = NULL,
history = TRUE,
recover = FALSE,
recoverable = TRUE,
curl_handles = list(),
max_expand = NULL,
log_build_times = TRUE,
format = NULL,
lock_cache = TRUE,
log_make = NULL,
log_worker = FALSE
)
Arguments
- plan
Workflow plan data frame. A workflow plan data frame is a data frame with a target column and a command column. (See the details in the drake_plan() help file for descriptions of the optional columns.) Targets are the objects that drake generates, and commands are the pieces of R code that produce them. You can create and track custom files along the way (see file_in(), file_out(), and knitr_in()). Use the function drake_plan() to generate workflow plan data frames.
- targets
Character vector, names of targets to build. Dependencies are built too. You may supply static and/or whole dynamic targets, but no sub-targets.
- envir
Environment to use. Defaults to the current workspace, so you should not need to worry about this most of the time. A deep copy of envir is made, so you don't need to worry about your workspace being modified by make(). The deep copy inherits from the global environment. Wherever necessary, objects and functions are imported from envir and the global environment and then reproducibly tracked as dependencies.
- verbose
Integer, controls printing to the console/terminal.
0: print nothing.
1: print target-by-target messages as make() progresses.
2: show a progress bar to track how many targets are done so far.
- hook
Deprecated.
- cache
drake cache as created by new_cache(). See also drake_cache().
- fetch_cache
Deprecated.
- parallelism
Character scalar, type of parallelism to use. For detailed explanations, see https://books.ropensci.org/drake/hpc.html.
You could also supply your own scheduler function if you want to experiment or aggressively optimize. The function should take a single config argument (produced by drake_config()). Existing examples from drake's internals are the backend_*() functions:
backend_loop()
backend_clustermq()
backend_future()
However, this functionality is really a back door and should not be used for production purposes unless you really know what you are doing and you are willing to suffer setbacks whenever drake's unexported core functions are updated.
- jobs
Maximum number of parallel workers for processing the targets. You can experiment with predict_runtime() to help decide on an appropriate number of jobs. For details, visit https://books.ropensci.org/drake/time.html.
- jobs_preprocess
Number of parallel jobs for processing the imports and doing other preprocessing tasks.
- packages
Character vector of packages to load, in the order they should be loaded. Defaults to rev(.packages()), so you should not usually need to set this manually. Just call library() to load your packages before make(). However, sometimes packages need to be strictly forced to load in a certain order, especially if parallelism is "Makefile". To do this, do not use library() or require() or loadNamespace() or attachNamespace() to load any libraries beforehand. Just list your packages in the packages argument in the order you want them to be loaded.
- lib_loc
Character vector, optional. Same as in library() or require(). Applies to the packages argument (see above).
- prework
Expression (language object), list of expressions, or character vector. Code to run right before targets build. Called only once if parallelism is "loop" and once per target otherwise. This code can be used to set global options, etc.
- prepend
Deprecated.
- command
Deprecated.
- args
Deprecated.
- recipe_command
Deprecated.
- log_progress
Logical, whether to log the progress of individual targets as they are being built. Progress logging creates extra files in the cache (usually the .drake/ folder) and slows down make() a little. If you need to reduce or limit the number of files in the cache, call make(log_progress = FALSE, recover = FALSE).
- skip_targets
Logical, whether to skip building the targets in plan and just import objects and files.
- timeout
Deprecated. Use elapsed and cpu instead.
- cpu
Same as the cpu argument of setTimeLimit(). Seconds of cpu time before a target times out. Assign target-level cpu timeout times with an optional cpu column in plan.
- elapsed
Same as the elapsed argument of setTimeLimit(). Seconds of elapsed time before a target times out. Assign target-level elapsed timeout times with an optional elapsed column in plan.
- retries
Number of retries to execute if the target fails. Assign target-level retries with an optional retries column in plan.
- force
Logical. If FALSE (default), then drake imposes checks if the cache was created with an old and incompatible version of drake. If there is an incompatibility, make() stops to give you an opportunity to downgrade drake to a compatible version rather than rerun all your targets from scratch.
- graph
Deprecated.
- trigger
Name of the trigger to apply to all targets. Ignored if plan has a trigger column. See trigger() for details.
- skip_imports
Logical, whether to totally neglect to process the imports and jump straight to the targets. This can be useful if your imports are massive and you just want to test your project, but it is bad practice for reproducible data analysis. This argument is overridden if you supply your own graph argument.
- skip_safety_checks
Logical, whether to skip the safety checks on your workflow. Use at your own peril.
- config
Deprecated.
- lazy_load
An old feature, currently being questioned. For the current recommendations on memory management, see https://books.ropensci.org/drake/memory.html#memory-strategies. The lazy_load argument is either a character vector or a logical. For dynamic targets, the behavior is always "eager" (see below). So the lazy_load argument is for static targets only. Choices for lazy_load:
"eager": no lazy loading. The target is loaded right away with assign().
"promise": lazy loading with delayedAssign().
"bind": lazy loading with active bindings: bindr::populate_env().
TRUE: same as "promise".
FALSE: same as "eager".
If lazy_load is "eager", drake prunes the execution environment before each target/stage, removing all superfluous targets and then loading any dependencies it will need for building. In other words, drake prepares the environment in advance and tries to be memory efficient. If lazy_load is "bind" or "promise", drake assigns promises to load any dependencies at the last minute. Lazy loading may be more memory efficient in some use cases, but it may duplicate the loading of dependencies, costing time.
- session_info
Logical, whether to save the sessionInfo() to the cache. Defaults to TRUE. This behavior is recommended for serious make()s for the sake of reproducibility. This argument only exists to speed up tests. Apparently, sessionInfo() is a bottleneck for small make()s.
- cache_log_file
Name of the CSV cache log file to write. If TRUE, the default file name is used (drake_cache.csv). If NULL, no file is written. If activated, this option writes a flat text file to represent the state of the cache (fingerprints of all the targets and imports). If you put the log file under version control, your commit history will give you an easy representation of how your results change over time as the rest of your project changes. Hopefully, this is a step in the right direction for data reproducibility.
- seed
Integer, the root pseudo-random number generator seed to use for your project. In make(), drake generates a unique local seed for each target using the global seed and the target name. That way, different pseudo-random numbers are generated for different targets, and this pseudo-randomness is reproducible.
To ensure reproducibility across different R sessions, set.seed() and .Random.seed are ignored and have no effect on drake workflows. Conversely, make() does not usually change .Random.seed, even when pseudo-random numbers are generated. The exception to this last point is make(parallelism = "clustermq") because the clustermq package needs to generate random numbers to set up ports and sockets for ZeroMQ.
On the first call to make() or drake_config(), drake uses the random number generator seed from the seed argument. Here, if the seed is NULL (default), drake uses a seed of 0. On subsequent make()s for existing projects, the project's cached seed will be used in order to ensure reproducibility. Thus, the seed argument must either be NULL or the same seed from the project's cache (usually the .drake/ folder). To reset the random number generator seed for a project, use clean(destroy = TRUE).
- caching
Character string, either "main" or "worker".
"main": Targets are built by remote workers and sent back to the main process. Then, the main process saves them to the cache (config$cache, usually a file system storr). Appropriate if remote workers do not have access to the file system of the calling R session. Targets are cached one at a time, which may be slow in some situations.
"worker": Remote workers not only build the targets, but also save them to the cache. Here, caching happens in parallel. However, remote workers need to have access to the file system of the calling R session. Transferring target data across a network can be slow.
- keep_going
Logical, whether to still keep running make() if targets fail.
- session
Deprecated. Has no effect now.
- pruning_strategy
Deprecated. See memory_strategy.
- makefile_path
Deprecated.
- console_log_file
Deprecated in favor of log_make.
- ensure_workers
Deprecated.
- garbage_collection
Logical, whether to call gc() each time a target is built during make().
- template
A named list of values to fill in the {{ ... }} placeholders in template files (e.g. from drake_hpc_template_file()). Same as the template argument of clustermq::Q() and clustermq::workers. Enabled for clustermq only (make(parallelism = "clustermq")), not future or batchtools so far. For more information, see the clustermq package: https://github.com/mschubert/clustermq. Some template placeholders such as {{ job_name }} and {{ n_jobs }} cannot be set this way.
- sleep
Optional function on a single numeric argument i. Default: function(i) 0.01.
To conserve memory, drake assigns a brand new closure to sleep, so your custom function should not depend on in-memory data except from loaded packages.
For parallel processing, drake uses a central main process to check what the parallel workers are doing and, for the affected high-performance computing workflows, wait for data to arrive over a network. In between loop iterations, the main process sleeps to avoid throttling. The sleep argument to make() and drake_config() allows you to customize how much time the main process spends sleeping.
The sleep argument is a function that takes an argument i and returns a numeric scalar, the number of seconds to supply to Sys.sleep() after iteration i of checking. (Here, i starts at 1.) If the checking loop does something other than sleeping on iteration i, then i is reset back to 1.
To sleep for the same amount of time between checks, you might supply something like function(i) 0.01. But to avoid consuming too many resources during heavier and longer workflows, you might use an exponential back-off: say, function(i) { 0.1 + 120 * pexp(i - 1, rate = 0.01) }.
- hasty_build
Deprecated.
- memory_strategy
Character scalar, name of the strategy drake uses to load/unload a target's dependencies in memory. You can give each target its own memory strategy (e.g. drake_plan(x = 1, y = target(f(x), memory_strategy = "lookahead"))) to override the global memory strategy. Choices:
"speed": Once a target is newly built or loaded in memory, just keep it there. This choice maximizes speed and hogs memory.
"autoclean": Just before building each new target, unload everything from memory except the target's direct dependencies. After a target is built, discard it from memory. (Set garbage_collection = TRUE to make sure it is really gone.) This option conserves memory, but it sacrifices speed because each new target needs to reload any previously unloaded targets from storage.
"preclean": Just before building each new target, unload everything from memory except the target's direct dependencies. After a target is built, keep it in memory until drake determines it can be unloaded. This option conserves memory, but it sacrifices speed because each new target needs to reload any previously unloaded targets from storage.
"lookahead": Just before building each new target, search the dependency graph to find targets that will not be needed for the rest of the current make() session. After a target is built, keep it in memory until the next memory management stage. In this mode, targets are only in memory if they need to be loaded, and we avoid superfluous reads from the cache. However, searching the graph takes time, and it could even double the computational overhead for large projects.
"unload": Just before building each new target, unload all targets from memory. After a target is built, do not keep it in memory. This mode aggressively optimizes for both memory and speed, but in commands and triggers, you have to manually load any dependencies you need using readd().
"none": Do not manage memory at all. Do not load or unload anything before building targets. After a target is built, do not keep it in memory. This mode aggressively optimizes for both memory and speed, but in commands and triggers, you have to manually load any dependencies you need using readd().
For even more direct control over which targets drake keeps in memory, see the help file examples of drake_envir(). Also see the garbage_collection argument of make() and drake_config().
- layout
Deprecated.
- spec
Deprecated.
- lock_envir
Deprecated in drake >= 7.13.10. Environments are no longer locked.
- history
Logical, whether to record the build history of your targets. You can also supply a txtq, which is how drake records history. Must be TRUE for drake_history() to work later.
- recover
Logical, whether to activate automated data recovery. The default is FALSE because:
Automated data recovery is still experimental.
It has reproducibility issues. Targets recovered from the distant past may have been generated with earlier versions of R and earlier package environments that no longer exist.
It is not always possible, especially when dynamic files are combined with dynamic branching (e.g. dynamic = map(stuff) and format = "file", etc.) since behavior is harder to predict in advance.
How it works: if recover is TRUE, drake tries to salvage old target values from the cache instead of running commands from the plan. A target is recoverable if:
There is an old value somewhere in the cache that shares the command, dependencies, etc. of the target about to be built.
The old value was generated with make(recoverable = TRUE).
If both conditions are met, drake will:
Assign the most recently generated admissible data to the target, and
skip the target's command.
Functions recoverable() and r_recoverable() show the most upstream outdated targets that will be recovered in this way in the next make() or r_make().
- recoverable
Logical, whether to make target values recoverable with make(recover = TRUE). This requires writing extra files to the cache, and it prevents old metadata from being removed with garbage collection (clean(garbage_collection = TRUE), gc() in storrs). If you need to limit the cache size or the number of files in the cache, consider make(recoverable = FALSE, log_progress = FALSE). Recovery is not always possible, especially when dynamic files are combined with dynamic branching (e.g. dynamic = map(stuff) and format = "file", etc.) since behavior is harder to predict in advance.
- curl_handles
A named list of curl handles. Each value is an object from curl::new_handle(), and each name is a URL (and should start with "http", "https", or "ftp"). Example:
list(
  `http://httpbin.org/basic-auth` = curl::new_handle(
    username = "user",
    password = "passwd"
  )
)
Then, if your plan has file_in("http://httpbin.org/basic-auth/user/passwd"), drake will authenticate using the username and password of the handle for http://httpbin.org/basic-auth/. drake uses partial matching on text to find the right handle of the file_in() URL, so the name of the handle could be the complete URL ("http://httpbin.org/basic-auth/user/passwd") or a part of the URL (e.g. "http://httpbin.org/" or "http://httpbin.org/basic-auth/"). If you have multiple handles whose names match your URL, drake will choose the closest match.
- max_expand
Positive integer, optional. max_expand is the maximum number of targets to generate in each map(), cross(), or group() dynamic transform. Useful if you have a massive number of dynamic sub-targets and you want to work with only the first few sub-targets before scaling up. Note: the max_expand argument of make() and drake_config() is for dynamic branching only. The static branching max_expand is an argument of drake_plan() and transform_plan().
- log_build_times
Logical, whether to record build times for targets. Mac users may notice a 20% speedup in make() with log_build_times = FALSE.
- format
Character, an optional custom storage format for targets without an explicit target(format = ...) in the plan. Details about formats: https://books.ropensci.org/drake/plans.html#special-data-formats-for-targets
- lock_cache
Logical, whether to lock the cache before running make() etc. It is usually recommended to keep cache locking on. However, if you interrupt make() before it can clean itself up, then the cache will stay locked, and you will need to manually unlock it with drake::drake_cache("xyz")$unlock(). Repeatedly unlocking the cache by hand is annoying, and lock_cache = FALSE prevents the cache from locking in the first place.
- log_make
Optional character scalar of a file name or connection object (such as stdout()) to dump maximally verbose log information for make() and other functions (all functions that accept a config argument, plus drake_config()). If you choose to use a text file as the console log, it will persist over multiple function calls until you delete it manually. Fields in each row of the log file, from left to right:
- The node name (short host name) of the computer (from Sys.info()["nodename"]).
- The process ID (from Sys.getpid()).
- A timestamp with the date and time (in microseconds).
- A brief description of what drake was doing.
The fields are separated by pipe symbols ("|").
- log_worker
Logical, same as the log_worker argument of clustermq::workers() and clustermq::Q(). Only relevant if parallelism is "clustermq".
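As a quick illustration of how several of these arguments combine in practice, here is a hedged sketch of a small pipeline. The plan itself is made up for this example; the make() arguments are the ones documented above.

```r
library(drake)

# A tiny illustrative plan: each command is plain R code.
plan <- drake_plan(
  data = head(mtcars, 16),
  model = lm(mpg ~ wt, data = data),
  result = summary(model)
)

# Combine several documented arguments: target-by-target messages,
# a verbose plain-text log, one retry per failed target, and the
# memory-conserving "autoclean" strategy with garbage collection.
make(
  plan,
  verbose = 1L,
  log_make = "make.log",
  retries = 1,
  memory_strategy = "autoclean",
  garbage_collection = TRUE
)
```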
Interactive mode
In interactive sessions, consider r_make()
, r_outdated()
, etc.
rather than make()
, outdated()
, etc. The r_*()
drake
functions
are more reproducible when the session is interactive.
If you do run make()
interactively, please restart your R session
beforehand so your functions and global objects get loaded into
a clean reproducible environment. This prevents targets
from getting invalidated unexpectedly.
A serious drake workflow should be consistent and reliable,
ideally with the help of a main R script.
This script should begin in a fresh R session,
load your packages and functions in a dependable manner,
and then run make()
. Example:
https://github.com/wlandau/drake-examples/tree/main/gsp
.
Batch mode, especially within a container, is particularly helpful.
Interactive R sessions are still useful, but they easily grow stale. Targets can falsely invalidate if you accidentally change a function or data object in your environment.
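The batch-mode pattern described above usually amounts to a short top-level script run in a fresh R session. A minimal sketch (the file names and paths below are illustrative, not part of drake itself):

```r
# make.R: a top-level script to run in a fresh R session,
# e.g. from a terminal with `Rscript make.R`.
library(drake)
source("R/functions.R")  # load your custom functions dependably
source("R/plan.R")       # defines the workflow `plan`
make(plan)
```

Keeping everything in one script like this makes the load order of packages and functions explicit, which is exactly what stale interactive sessions tend to obscure.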
Self-invalidation
It is possible to construct a workflow that tries to invalidate itself. Example:
plan <- drake_plan(
x = {
data(mtcars)
mtcars$mpg
},
y = mean(x)
)
Here, because data()
loads mtcars
into the global environment,
the very act of building x
changes the dependencies of x
.
In other words, without safeguards, x
would not be up to date at
the end of make(plan)
.
Please try to avoid workflows that modify the global environment.
Functions such as data()
belong in your setup scripts
prior to make()
, not in any functions or commands that get called
during make()
itself.
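Following that advice, one safe rewrite of the plan above moves the data() call into setup code that runs before make(), so no command modifies the global environment mid-pipeline:

```r
library(drake)

# Setup: load the dataset before make(), not inside a command.
data(mtcars)

# Now `mtcars` is an ordinary reproducibly tracked import,
# and building `x` no longer changes the dependencies of `x`.
plan <- drake_plan(
  x = mtcars$mpg,
  y = mean(x)
)

make(plan)
```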
For each target that is still problematic (e.g.
https://github.com/rstudio/gt/issues/297
)
you can safely run the command in its own special callr::r()
process.
Example: https://github.com/rstudio/gt/issues/297#issuecomment-497778735.
Cache locking
When make()
runs, it locks the cache so other processes cannot modify it.
Same goes for outdated()
, vis_drake_graph()
, and similar functions
when make_imports = TRUE
. This is a safety measure to prevent simultaneous
processes from corrupting the cache. If you get an error saying that the
cache is locked, either set make_imports = FALSE
or manually force
unlock it with drake_cache()$unlock()
.
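If an interrupted make() leaves the cache locked, the manual recovery described above looks roughly like this (a sketch assuming the default cache in the .drake/ folder):

```r
library(drake)

# Locate the project's cache (usually the .drake/ folder)
# and force-unlock it after an interrupted make().
cache <- drake_cache()  # or drake_cache("path/to/.drake")
cache$unlock()

# Alternatively, disable locking up front, at the cost of less
# protection against simultaneous processes:
# make(plan, lock_cache = FALSE)
```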
Examples
if (FALSE) { # \dontrun{
isolate_example("Quarantine side effects.", {
if (suppressWarnings(require("knitr"))) {
load_mtcars_example() # Get the code with drake_example("mtcars").
config <- drake_config(my_plan)
outdated(my_plan) # Which targets need to be (re)built?
make(my_plan) # Build what needs to be built.
outdated(my_plan) # Everything is up to date.
# Change one of your imported function dependencies.
reg2 <- function(d) {
d$x3 <- d$x ^ 3
lm(y ~ x3, data = d)
}
outdated(my_plan) # Some targets depend on reg2().
make(my_plan) # Rebuild just the outdated targets.
outdated(my_plan) # Everything is up to date again.
if (requireNamespace("visNetwork", quietly = TRUE)) {
vis_drake_graph(my_plan) # See how they fit in an interactive graph.
make(my_plan, cache_log_file = TRUE) # Write a CSV log file this time.
vis_drake_graph(my_plan) # The colors changed in the graph.
# Run targets in parallel:
# options(clustermq.scheduler = "multicore") # nolint
# make(my_plan, parallelism = "clustermq", jobs = 2) # nolint
}
clean() # Start from scratch next time around.
}
# Dynamic branching
# Get the mean mpg for each cyl in the mtcars dataset.
plan <- drake_plan(
raw = mtcars,
group_index = raw$cyl,
munged = target(raw[, c("mpg", "cyl")], dynamic = map(raw)),
mean_mpg_by_cyl = target(
data.frame(mpg = mean(munged$mpg), cyl = munged$cyl[1]),
dynamic = group(munged, .by = group_index)
)
)
make(plan)
readd(mean_mpg_by_cyl)
})
} # }