An R package to do parallel processing on Amazon, (more) easily. Born in 2016 at the Brisbane rOpenSci Unconference. This is a work in progress and currently in development.
snowball automatically sets up and starts a cluster of AWS workers, does the parallel processing, and saves the output to an S3 bucket.
# Install

```r
devtools::install_github("ropenscilabs/snowball")
```
`snowball()` takes the location of data, a user-defined function, and some basic instructions; it sets up and runs virtual machines in parallel on Amazon and saves the results in an S3 bucket as `.rds` files.

```r
snowball(function, bucketName, ...)
```
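A hypothetical call might look like this (the worker function and bucket name here are placeholders, not part of the package):

```r
# A user-defined function to run on each worker (hypothetical example)
square <- function(x) x^2

# Run it on AWS workers; results are saved to the "my-snowball-results"
# bucket as .rds files (bucket name is a placeholder)
snowball(square, "my-snowball-results")
```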
Save a `.snowball` file into your current working directory with the following configuration:

```
AWS_ACCESS_KEY_ID: <YOURACCESSKEYID>
AWS_SECRET_ACCESS_KEY: <YOURSECRETACCESSKEY>
AWS_DEFAULT_REGION: <YOURDEFAULTREGION>
```
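For example, using Amazon's documented placeholder credentials (never commit real keys to a repository):

```
AWS_ACCESS_KEY_ID: AKIAIOSFODNN7EXAMPLE
AWS_SECRET_ACCESS_KEY: wJalrXUtnFXEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
AWS_DEFAULT_REGION: us-east-1
```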
Next, run `snowball_setup()` to set the global variables:

```r
snowball_setup(config_file, echo)
```
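For instance, assuming the `.snowball` file above sits in the working directory (the argument values here are illustrative):

```r
# Read credentials from the .snowball config file and set them as
# global variables; echo them back for a visual check (assumed behaviour)
snowball_setup(config_file = ".snowball", echo = TRUE)
```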
Start an AWS instance with buckets, while setting up the data/feature split:

```r
snowpack(fn, listItem, bucketNameString, rdsInputObjectString, rdsOutputString)
```
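A sketch of a call, with hypothetical data, worker function, object names, and bucket name:

```r
# Example data split into chunks, one per worker (hypothetical)
df <- data.frame(x = rnorm(100), y = rnorm(100))
chunks <- split(df, rep(1:4, each = 25))

# Hypothetical worker: fit a linear model to one chunk of the data
fit_chunk <- function(chunk) lm(y ~ x, data = chunk)

# Apply fit_chunk to each element of `chunks`, reading the input object
# "chunks.rds" from the "my-analysis" bucket and writing the results
# back to "fits.rds" (all names are placeholders)
snowpack(fn = fit_chunk,
         listItem = chunks,
         bucketNameString = "my-analysis",
         rdsInputObjectString = "chunks.rds",
         rdsOutputString = "fits.rds")
```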
Check out the documentation for the `snow` and `snowfall` packages.
We assume you have a (very) basic understanding of what an S3 bucket is (it's like Dropbox, for data); see Amazon's S3 documentation for more info. Creating a bucket is very easy: you just click "Create bucket".
Setting up the 'bucket policy allowing an IAM user full access' is harder:

1. Go to "Services", then "IAM", then click on the user you want to give access to (you, most likely).
2. Under "Properties", note the user's ARN; the policy refers to it.
3. Click "add policy", which opens a window called "AWS Policy Generator".
4. Select "All Actions" and enter your bucket's ARN: `arn:aws:s3:::bucketName`.
5. Click "Add Statement", generate the policy, and copy the contents to the clipboard.
6. Go back to the bucket page, click "Edit bucket policy", and paste the clipboard contents there.
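The generated policy should look roughly like this sketch (the account ID, user name, and bucket name are placeholders; substitute your own):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FullAccessForOneUser",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/your-user" },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::bucketName",
        "arn:aws:s3:::bucketName/*"
      ]
    }
  ]
}
```

Note the two `Resource` entries: the first covers actions on the bucket itself, the second covers the objects inside it.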