An R package for doing parallel processing on Amazon, (more) easily. Born in 2016 at the Brisbane rOpenSci Unconference. This package is a work in progress and under active development.
Automatically sets up and starts a cluster of AWS workers, runs your computation in parallel, and saves the output to an S3 bucket.
snowball takes the location of your data, a user-defined function, and some basic instructions; it then sets up and runs virtual machines in parallel on Amazon and saves the results to an S3 bucket.
snowball(fun, bucketName, ...)
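A hypothetical call might look like the sketch below. The argument names (`fun`, `bucketName`, `data`) and the `snowball_setup()` step are illustrative assumptions based on the signature above, not the package's confirmed API:

```r
library(snowball)

# Assumed setup step: registers AWS credentials/region as global variables.
snowball_setup()

# A user-defined function to apply to each chunk of the data.
summarise_chunk <- function(df) {
  data.frame(n = nrow(df), mean_x = mean(df$x))
}

# Hypothetical call: argument names beyond the README's signature are guesses.
results <- snowball(fun        = summarise_chunk,
                    bucketName = "my-results-bucket",
                    data       = "s3://my-data-bucket/input/")
```

Note that the first argument cannot be called `function`, since that is a reserved word in R.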
Save a .snowball file in your current working directory with the following configuration:
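The README does not show the file's contents, so the fields below are purely illustrative; a .snowball file might hold credentials and cluster settings along these lines:

```
# Hypothetical contents of a .snowball file; all field names are assumptions.
awsAccessKeyId: YOUR_ACCESS_KEY
awsSecretAccessKey: YOUR_SECRET_KEY
region: ap-southeast-2
instanceType: t2.micro
workerCount: 4
```

Keeping credentials in a local file like this means they never need to appear in your R scripts.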
Run snowball_setup to set global variables.
- Start an AWS instance with buckets, setting up the data/feature split
- Give the data location and the user function
- Combine all results into one file
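The final step, combining results, could also be done by hand with the aws.s3 package (a separate CRAN package, not part of snowball); the bucket name and `results/` prefix below are assumptions:

```r
library(aws.s3)

# List all result objects the workers wrote to the bucket (names assumed).
objs <- get_bucket(bucket = "my-results-bucket", prefix = "results/")

# Read each serialized result back into R and row-bind them into one table.
parts <- lapply(objs, function(o) {
  s3readRDS(object = o$Key, bucket = "my-results-bucket")
})
combined <- do.call(rbind, parts)

# Save the combined output as a single local file.
saveRDS(combined, "combined_results.rds")
```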
We assume you have a (very) basic understanding of what an S3 bucket is (it's like Dropbox, for data); see Amazon's S3 documentation for an introduction. Creating a bucket itself is easy: just a few clicks in the S3 console.
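If you prefer to stay inside R, the aws.s3 package can also create a bucket; the bucket name and region here are placeholders:

```r
library(aws.s3)

# Credentials are read from the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# environment variables (or an attached IAM role).
put_bucket("my-snowball-bucket", region = "ap-southeast-2")

# Confirm the bucket now exists.
bucket_exists("my-snowball-bucket")
```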
Setting up the bucket policy that allows an IAM user full access is harder:
1. Go to IAM, then click on the user you want to give access to (most likely yourself).
2. Click "Add policy", which opens a window called "AWS Policy Generator".
3. Click "Add Statement" and copy the generated policy to the clipboard. Go back to the bucket page, click "Edit bucket policy", and paste the policy there.
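For reference, a bucket policy granting a single IAM user full access to one bucket looks roughly like the following; the account ID, user name, and bucket name are placeholders you must replace with your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:user/your-iam-user" },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::your-bucket-name",
        "arn:aws:s3:::your-bucket-name/*"
      ]
    }
  ]
}
```

Both `Resource` entries are needed: the first covers operations on the bucket itself, the second covers the objects inside it.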