Introduction
-
chopin
automatically distributes geospatial data computation over multiple threads. - Function centered workflow will streamline the process of parallelizing geospatial computation with minimal effort.
chopin
workflow
- The simplest way of parallelizing generic geospatial computation is
to start from
par_pad_*
functions topar_grid
, or runningpar_hierarchy
, orpar_multirasters
functions at once.
Example data
- North Carolinian counties and a raster of elevation data are used as example data.
nccnty_path <- system.file("extdata", "nc_hierarchy.gpkg", package = "chopin")
ncelev_path <-
system.file("extdata/nc_srtm15_otm.tif", package = "chopin")
nccnty <- terra::vect(nccnty_path)
ncelev <- terra::rast(ncelev_path)
Generating random points in North Carolina
- To demonstrate chopin functions, we generate 10,000 random points in North Carolina
ncsamp <-
terra::spatSample(
nccnty,
1e4L
)
ncsamp$pid <- 1:nrow(ncsamp)
Creating grids
- The example below will generate a regular grid from the random point data.
ncgrid <- par_pad_grid(ncsamp, mode = "grid", nx = 4L, ny = 2L, padding = 10000)
plot(ncgrid$original)
Extracting values from raster
- Since all
par_*
functions operate onfuture
backends, users should define the future plan before running the functions.multicore
plan supportsterra
objects which may lead to faster computation, but it is not supported in Windows. An alternative isfuture.mirai
’smirai_multisession
plan, which is supported in many platforms and generally faster than plain future multisession plan. -
workers
argument should be defined with an integer value to specify the number of threads to be used.
future::plan(future.mirai::mirai_multisession, workers = 2L)
- Then we dispatch multiple
extract_at
runs on the grid polygons. - Before we proceed, the terra object should be converted to sf object.
Hierarchical processing
- Here we demonstrate hierarchical processing of the random points using census tract polygons.
nccnty <- sf::st_read(nccnty_path, layer = "county")
## Reading layer `county' from data source
## `/usr/local/lib/R/site-library/chopin/extdata/nc_hierarchy.gpkg'
## using driver `GPKG'
## Simple feature collection with 100 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 1054155 ymin: 1341756 xmax: 1838923 ymax: 1690176
## Projected CRS: NAD83 / Conus Albers
nctrct <- sf::st_read(nccnty_path, layer = "tracts")
## Reading layer `tracts' from data source
## `/usr/local/lib/R/site-library/chopin/extdata/nc_hierarchy.gpkg'
## using driver `GPKG'
## Simple feature collection with 2672 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1054155 ymin: 1341756 xmax: 1838923 ymax: 1690176
## Projected CRS: NAD83 / Conus Albers
- The example below will parallelize summarizing mean elevation at 10 kilometers circular buffers of random sample points by the first five characters of census tract unique identifiers, which are county codes.
- This example demonstrates the hierarchy can be defined from any given polygons if the unique identifiers are suitably formatted for defining the hierarchy.
px <-
par_hierarchy(
# from here the par_hierarchy-specific arguments
regions = nctrct,
regions_id = "GEOID",
length_left = 5,
pad = 10000,
pad_y = FALSE,
.debug = TRUE,
# from here are the dispatched function definition
# for parallel workers
fun_dist = extract_at,
# below should follow the arguments of the dispatched function
x = ncelev,
y = sf::st_as_sf(ncsamp),
id = "pid",
radius = 1e4,
func = "mean"
)
dim(px)
## [1] 10000 2
head(px)
## pid mean
## 1 79 13.664886
## 2 198 7.980108
## 3 337 12.189816
## 4 454 2.763836
## 5 467 19.170189
## 6 485 13.523435
tail(px)
## pid mean
## 9995 9542 -1.184563
## 9996 9703 5.548240
## 9997 9756 5.548804
## 9998 9759 6.540437
## 9999 9841 6.414054
## 10000 9963 4.990165
Multiraster processing
- Here we demonstrate multiraster processing of the random points using multiple rasters.
ncelev <-
system.file("extdata/nc_srtm15_otm.tif", package = "chopin")
ncelev <- terra::rast(ncelev)
tdir <- tempdir(check = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test1.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test2.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test3.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test4.tif"), overwrite = TRUE)
terra::writeRaster(ncelev, file.path(tdir, "test5.tif"), overwrite = TRUE)
rasts <- list.files(tdir, pattern = "tif$", full.names = TRUE)
pm <-
par_multirasters(
filenames = rasts,
fun_dist = extract_at,
x = NA,
y = sf::st_as_sf(ncsamp)[1:500, ],
id = "pid",
radius = 1e4,
func = "mean",
.debug = TRUE
)
dim(pm)
## [1] 2500 2
head(pm)
## mean base_raster
## 1 12.443567 /tmp/RtmpvyLHDS/test1.tif
## 2 74.934341 /tmp/RtmpvyLHDS/test1.tif
## 3 179.088806 /tmp/RtmpvyLHDS/test1.tif
## 4 8.542515 /tmp/RtmpvyLHDS/test1.tif
## 5 1087.089722 /tmp/RtmpvyLHDS/test1.tif
## 6 110.032043 /tmp/RtmpvyLHDS/test1.tif
tail(pm)
## mean base_raster
## 2495 101.370438 /tmp/RtmpvyLHDS/test5.tif
## 2496 -2.881788 /tmp/RtmpvyLHDS/test5.tif
## 2497 7.347102 /tmp/RtmpvyLHDS/test5.tif
## 2498 8.790645 /tmp/RtmpvyLHDS/test5.tif
## 2499 21.443277 /tmp/RtmpvyLHDS/test5.tif
## 2500 56.097313 /tmp/RtmpvyLHDS/test5.tif
Caveats
Why parallelization is slower than the ordinary function run?
- Parallelization may underperform when the datasets are too small to take advantage of divide-and-compute approach, where parallelization overhead is involved. Overhead here refers to the required amount of computational resources for transferring objects to multiple processes.
- Since the demonstrations above use quite small datasets, the advantage of parallelization was not as noticeable as it was expected. Should a large amount of data (spatial/temporal resolution or number of files, for example) be processed, users could find the efficiency of this package. A vignette in this package demonstrates use cases extracting various climate/weather datasets.
Notes on data restrictions
chopin
works best with two-dimensional
(planar) geometries. Users should disable
s2
spherical geometry mode in sf
by setting
sf::sf_use_s2(FALSE)
. Running any chopin
functions at spherical or three-dimensional (e.g., including M/Z
dimensions) geometries may produce incorrect or unexpected results.