Assumptions
Users can access multi-threaded central processing units and sufficient memory (i.e., at least 4GB/thread)
Basic workflow
Minimize errors
Consider using try()
or tryCatch()
to let
an error will not halt all the work without results, especially when
using higher-level functions.
Raster workflow: stacked vs file-based parallelization
\[!NOTE\] It’s up to the users’ system specification and the size of the data.
For faster computation, there are two strategies. One is
parallelization which is implemented in chopin
for example.
The other strategy is to stack rasters and extract values at once.
Adjustment of arguments in exactextractr::exact_extract
will benefit many people who are dealing with sizable data that are able
to be stacked. We compare the two strategies in terms of computation
time.
Before proceeding, users need to consider the hardware specification.
For example, memory and storage to leverage the maximal performance of
exact_extract
. Specifically speaking, memory capacity is
crucial to store the stacked rasters in memory rather than to read the
proxy of rasters from the disk as implemented in terra
.
max_cells_in_memory
is a key argument to control the memory
usage. The maximum possible value for this argument is \(2^{31} - 1 = 2,147,483,647\), roughly
2.147e9
, as applied in the example above. As memory
bandwidth is much faster than disk input/output specification, the
stacked rasters with high max_cells_in_memory
applied will
run faster than file-based parallelization or the extraction with lower
value of max_cells_in_memory
. The performance does not come
without a cost. The memory-intensive setting is not suitable for the
system with limited memory, for example, in consumer laptops with around
16 GB of RAM.
\[!NOTE\] Stacking rasters usually takes the large amount of memory. Users need to consider the memory capacity of the system before stacking rasters.
Raster-Vector overlay
Use chopin::par_pad_grid()
to make exhaustive (fully
filling the entire extent without any gaps) and padded grid objects in
SpatVector
class. chopin::par_grid()
filters
features that intersect with each grid. Post-processing
is necessary as polygons may be taken more than twice in parallelized
calculation. If users have a field containing divisible and hierarchical
information, dataset can be processed with
chopin::par_hierarchy()
.
Customization
chopin
way to parallelize is to iterate calculation over
a list then to use future.apply
function. If one wants to customize parallelization workflow like
chopin
, consider making list objects in two steps. First,
assign extent vectors to each element. In this case, users should be
aware of the coordinate system of both the input data and the
coordinates of the extent vectors. Second, use preprocessed vector
objects with respect to each extent vector in the first list. We
recommend to name each list element to debug easily. For
par_multirasters()
workflow, using file or database table
paths is recommended instead of SpatRaster
objects in
future
parallel processing workflow. This is because
SpatRaster
objects made from external files are Rcpp
pointers, which cannot be exported to processes in parallelization.
More tips to save time and memory
- Read vector data with
sf
package at first. It is a bit faster thanterra
and will save time for processing asexactextractr
is designed to work withsf
objects for vector inputs. - If your analysis does not require the high precision of vector data,
simplification of geometries (e.g., using
rmapshaper
, i.e.,ms_simplify
) will result in considerable time savings.
Save computing costs
If HPC systems are used, you will have CPU-memory usage quota per
user. Thus, it is always recommended to test your computation workflow
with subsets in manageable sizes. It can give you an estimate for the
total computational demand. profvis::profvis()
helps to
summarize runtime per function call and memory usage. For testing, use
future::plan(future::sequential)
to see single-thread. Keep
chopin::par_pad_grid()
arguments the same then only take
one or two grids using lapply
. Preferably the hardest case
will be tested to estimate the maximum peak memory usage per thread.
After prototyping, take the numbers as reference numbers to submit a job
to HPC with a proper amount of computational assets: each HPC management
system offers tools to see statistics and task summaries. Users are
advised to use these records to plan asset allocation for the next
time.