Working with medium-sized data
Over the past summer, I have written two small-ish R packages to address challenges I frequently run up against during the course of my research. Both are challenges with what I will refer to as medium-sized data – not the kind of petabyte scale “big data” which precludes analysis on standard hardware or existing methodology, but large enough that the size alone starts creating problems for certain bits of a typical workflow. More precisely, I will take medium-sized to refer to data that is too large to comfortably fit in memory on most laptops (e.g. on the order of several GB), or data that is merely too large to commit to GitHub. By typical workflow, I mean easily being able to share all parts of analysis publicly or privately with collaborators (or merely different machines, such as my laptop and cloud server) who should be able to reproduce the results with minimal fuss and configuration.
For data too large to fit into memory, there’s already a well-established solution of using an external database, to store the data. Thanks to
dplyr’s database backends, many R users can adapt their workflow relatively seamlessly to move from
dplyr commands that call in-memory data frames to identical or nearly identical commands that call a database. This all works pretty well when your data is already in a database, but getting it onto a database, and then moving the data around so that other people/machines can access it is not nearly so straight forward. So far, this part of the problem has received relatively little attention.
The reason for that is because the usual response to this problem is “you’re doing it wrong.” The standard practice in this context is simply not to move the data at all. A central database server, usually with access controlled by password or other credential, can allow multiple users to all query the same database directly. Thanks to the magical abstractions of SQL queries such as the
DBI package, the user (aka client), doesn’t need to care about the details of where the database is located, or even what particular backend is used. Moving all that data around can be slow and expensive. Arbitrarily large data can be housed in a central/cloud location and provisioned with enough resources to store everything and process complex queries. Consequently, just about every database backend not only to provides a mechanism for doing your
dplyr querying, filtering, joining etc on data that cannot fit into memory all at once, but also nearly every such backend provides server abilities to do so over a network connection, handling secure logins and so forth. Why would you want to do anything else?
The problem with the usual response is that it is often at odds with our original objectives and typical scientific workflows. Setting up a database server can be non-trivial; by which I mean: difficult to automate in a portable/cross-platform manner when working entirely from R. More importantly, it reflects a use-case more typical of industry context than scientific practice. Individual researchers need to make data available to a global community of scientists who can reproduce results years or decades later; not just to a handful of employees who can be granted authenticated access to a central database. Archiving data as static text files is far more scalable, more cost-effective (storing static files is much cheaper than keeping a database server running), more future-proof (rapid evolution in database technology is not always backwards compatible) and simplifies or avoids most security issues involved in maintaining a public server. In the scientific context, it almost always makes more sense to move the data after all.
Scientific data repositories are already built on precisely this model: providing long term storage of files that can be downloaded and analyzed locally. For smaller
.csv files, this works pretty well.
We just wanted to access data that was a bit larger than our active memory. There is in fact a widely-used solution to this case: the
Lite flavors of databases like
SQLite, or my new favorite,
MonetDBLite, which provide the disk-based storage but not the support for network connection server model. Using the corresponding R packages, these databases can be easily deployed to store & query data on our local disk.