Getting Started Storing Dataframes as Plain Text
Thierry Onkelinx
Source:vignettes/plain_text.Rmd
plain_text.Rmd
Introduction
This vignette motivates why we wrote git2rdata
and
illustrates how you can use it to store dataframes as plain text
files.
Maintaining Variable Classes
R has different options to store dataframes as plain text files from
R. Base R has write.table()
and its companions like
write.csv()
. Some other options are
data.table::fwrite()
, readr::write_delim()
,
readr::write_csv()
and readr::write_tsv()
.
Each of them writes a dataframe as a plain text file by converting all
variables into characters. After reading the file, they revert this
conversion. The distinction between character
and
factor
gets lost in translation. read.table()
converts by default all strings to factors,
readr::read_csv()
keeps by default all strings as
character. These functions cannot recover the factor levels. These
functions determine factor levels based on the observed levels in the
plain text file. Hence factor levels without observations will
disappear. The order of the factor levels is also determined by the
available levels in the plain text file, which can be different from the
original order.
The write_vc()
and read_vc()
functions from
git2rdata
keep track of the class of each variable and, in
case of a factor, also of the factor levels and their order. Hence this
function pair preserves the information content of the dataframe. The
vc
suffix stands for version
control as these functions
use their full capacity in combination with a version control
system.
Efficiency Relative to Storage and Time
Optimizing File Storage
Plain text files require more disk space than binary files. This is
the price we have to pay for a readable file format. The default option
of write_vc()
is to create file as compact as possible.
Since we use a tab delimited file format, we can omit quotes around
character variables. This saves 2 bytes per row for each character
variable. write_vc
add quotes automatically in the
exceptional cases when we needed them, e.g. to store a string that
contains tab or newline characters. We don’t add quotes to row-variable
combinations where we don’t need them.
Since we store the class of each variable, we can further reduce the file size by following rules:
- Store a
logical
as 0 (FALSE), 1 (TRUE) or NA to the data. - Store a
factor
as its indices in the data. Store the index, labels of levels and their order in the metadata. - Store a
POSIXct
as a numeric to the data. Store the class and the origin in the metadata. Store and return timestamps as UTC. - Store a
Date
as an integer to the data. Store the class and the origin in the metadata.
Storing the factors, POSIXct and Date as their index, makes them less user readable. The user can turn off this optimization when user readability is more important than file size.
Optimized for Version Control
Another main goal of git2rdata
is to optimise the
storage of the plain text files under version control.
write_vc()
and read_vc()
has methods for
interacting with git repositories
using the git2r
framework. Users who want to use git
without git2r
or use a different version control system
(e.g. Subversion, Mercurial), still can use
git2rdata
to write the files to disk and uses their
preferred workflow on version control.
Hence, write_vc()
will always perform checks to look for
changes which potentially lead to large diffs. More details on this in
vignette("version_control", package = "git2rdata")
. Some
problems will always yield a warning. Other problems will yield an error
by default. The user can turn these errors into warnings by setting the
strict = FALSE
argument.
As this vignette ignores the part on version control, we will always
use write_vc(strict = FALSE)
and hide the warnings to
improve the readability.
Basic Usage
Let’s start by setting up the environment. We need a directory to store the data and a dataframe to store.
# Create a directory in tempdir
path <- tempfile(pattern = "git2r-")
dir.create(path)
# Create dummy data
set.seed(20190222)
x <- data.frame(
x = sample(LETTERS),
y = factor(
sample(c("a", "b", NA), 26, replace = TRUE),
levels = c("a", "b", "c")
),
z = c(NA, 1:25),
abc = c(rnorm(25), NA),
def = sample(c(TRUE, FALSE, NA), 26, replace = TRUE),
timestamp = seq(
as.POSIXct("2018-01-01"),
as.POSIXct("2019-01-01"),
length = 26
),
stringsAsFactors = FALSE
)
str(x)
#> 'data.frame': 26 obs. of 6 variables:
#> $ x : chr "V" "U" "Z" "W" ...
#> $ y : Factor w/ 3 levels "a","b","c": 1 2 NA NA 1 NA 2 1 NA 1 ...
#> $ z : int NA 1 2 3 4 5 6 7 8 9 ...
#> $ abc : num -0.382 -0.42 -0.917 0.387 -0.992 ...
#> $ def : logi TRUE FALSE NA FALSE NA NA ...
#> $ timestamp: POSIXct, format: "2018-01-01 00:00:00" "2018-01-15 14:24:00" ...
Storing Optimized
Use write_vc()
to store the dataframe. The
root
argument refers to the base directory where we store
the data. The file
argument becomes the base name of the
files. The data file gets a .tsv
extension, the metadata
file a .yml
extension. file
can include a
relative path starting from root
.
library(git2rdata)
write_vc(x = x, file = "first_test", root = path, strict = FALSE)
#> 7fed986e0f5c218e3d0a4f0cc434bd87bb1e5f82 f8350dc218051af4bafcd8872d92b1a29cbb4f31
#> "first_test.tsv" "first_test.yml"
write_vc()
returns a vector of relative paths to the raw
data and metadata files. The names of this vector contains the hashes of
these files. We can have a look at both files. We’ll display the first
10 rows of the raw data. Notice that the YAML format of the metadata has
the benefit of being both human and machine readable.
print_file <- function(file, root, n = -1) {
fn <- file.path(root, file)
data <- readLines(fn, n = n)
cat(data, sep = "\n")
}
print_file("first_test.tsv", path, 10)
#> x y z abc def timestamp
#> V 1 NA -0.382010380419258 1 1514764800
#> U 2 1 -0.420347607856041 0 1516026240
#> Z NA 2 -0.916731402237418 NA 1517287680
#> W NA 3 0.387455128525654 0 1518549120
#> L 1 4 -0.992354993526956 NA 1519810560
#> C NA 5 0.0228713954429028 NA 1521072000
#> R 2 6 -0.947557467717088 1 1522333440
#> S 1 7 -0.16302914628615 NA 1523594880
#> O NA 8 0.523643352634392 1 1524856320
print_file("first_test.yml", path)
#> ..generic:
#> git2rdata: 0.4.1
#> optimize: yes
#> NA string: NA
#> hash: f8350dc218051af4bafcd8872d92b1a29cbb4f31
#> data_hash: 7fed986e0f5c218e3d0a4f0cc434bd87bb1e5f82
#> x:
#> class: character
#> 'y':
#> class: factor
#> labels:
#> - a
#> - b
#> - c
#> index:
#> - 1
#> - 2
#> - 3
#> ordered: no
#> z:
#> class: integer
#> abc:
#> class: numeric
#> def:
#> class: logical
#> timestamp:
#> class: POSIXct
#> origin: 1970-01-01 00:00:00
#> timezone: UTC
Storing Verbose
Adding optimize = FALSE
to write_vc()
will
keep the raw data in a human readable format. The metadata file is
slightly different. The most obvious is the optimize: no
tag and the different hash. Another difference is the metadata for
POSIXct and Date classes. They will no
longer have an origin tag but a format tag.
Another important difference is that we store the data file as comma
separated values instead of tab separated values. We noticed that the
csv
file format is more easily recognised by a larger
audience as a data file.
write_vc(x = x, file = "verbose", root = path, optimize = FALSE, strict = FALSE)
#> 8ff3daa81dc05dd007582884a1fd8d736d195184 9fc40476f0ba0c4cb4225d5313d56760a6d3b065
#> "verbose.csv" "verbose.yml"
print_file("verbose.csv", path, 10)
#> x,y,z,abc,def,timestamp
#> V,a,NA,-0.382010380419258,TRUE,2018-01-01T00:00:00Z
#> U,b,1,-0.420347607856041,FALSE,2018-01-15T14:24:00Z
#> Z,NA,2,-0.916731402237418,NA,2018-01-30T04:48:00Z
#> W,NA,3,0.387455128525654,FALSE,2018-02-13T19:12:00Z
#> L,a,4,-0.992354993526956,NA,2018-02-28T09:36:00Z
#> C,NA,5,0.0228713954429028,NA,2018-03-15T00:00:00Z
#> R,b,6,-0.947557467717088,TRUE,2018-03-29T14:24:00Z
#> S,a,7,-0.16302914628615,NA,2018-04-13T04:48:00Z
#> O,NA,8,0.523643352634392,TRUE,2018-04-27T19:12:00Z
print_file("verbose.yml", path)
#> ..generic:
#> git2rdata: 0.4.1
#> optimize: no
#> NA string: NA
#> hash: 9fc40476f0ba0c4cb4225d5313d56760a6d3b065
#> data_hash: 8ff3daa81dc05dd007582884a1fd8d736d195184
#> x:
#> class: character
#> 'y':
#> class: factor
#> labels:
#> - a
#> - b
#> - c
#> index:
#> - 1
#> - 2
#> - 3
#> ordered: no
#> z:
#> class: integer
#> abc:
#> class: numeric
#> def:
#> class: logical
#> timestamp:
#> class: POSIXct
#> format: '%Y-%m-%dT%H:%M:%SZ'
#> timezone: UTC
Efficiency Relative to File Storage
Storing dataframes optimized or verbose has an impact on the required file size. The efficiency vignette give a comparison.
Reading Data
You retrieve the data with read_vc()
. This function will
reinstate the variables to their original state.
y <- read_vc(file = "first_test", root = path)
all.equal(x, y, check.attributes = FALSE)
#> [1] "Component \"timestamp\": 'tzone' attributes are inconsistent ('' and 'UTC')"
y2 <- read_vc(file = "verbose", root = path)
all.equal(x, y2, check.attributes = FALSE)
#> [1] "Component \"timestamp\": 'tzone' attributes are inconsistent ('' and 'UTC')"
read_vc()
requires the meta data. It cannot handle
dataframe not stored by write_vc()
.
Missing Values
write_vc()
has an na
argument which
specifies the string which to use for missing values. Because we avoid
using quotes, this string must be different from any character value in
the data. This includes factor labels with verbose data storage.
write_vc()
checks this and will always return an error,
even with strict = FALSE
.
write_vc(x, "custom_na", path, strict = FALSE, na = "X", optimize = FALSE)
#> Error: one of the strings matches the NA string ('X')
#> Please use a different NA string or consider using a factor.
write_vc(x, "custom_na", path, strict = FALSE, na = "b", optimize = FALSE)
#> Error: one of the levels matches the NA string ('b').
#> Please use a different NA string or use optimize = TRUE
write_vc(x, "custom_na", path, strict = FALSE, na = "X")
#> Error: one of the strings matches the NA string ('X')
#> Please use a different NA string or consider using a factor.
write_vc(x, "custom_na", path, strict = FALSE, na = "b")
#> 5415c0176d1c25fc1ac08763ae9271f6306eccc5 3c028ddbb9efabc8c85a396b2db1f4f1c8d9461d
#> "custom_na.tsv" "custom_na.yml"
Please note that write_vc()
uses the same NA string for
the entire dataset, thus for every variable.
print_file("custom_na.tsv", path, 10)
#> x y z abc def timestamp
#> V 1 b -0.382010380419258 1 1514764800
#> U 2 1 -0.420347607856041 0 1516026240
#> Z b 2 -0.916731402237418 b 1517287680
#> W b 3 0.387455128525654 0 1518549120
#> L 1 4 -0.992354993526956 b 1519810560
#> C b 5 0.0228713954429028 b 1521072000
#> R 2 6 -0.947557467717088 1 1522333440
#> S 1 7 -0.16302914628615 b 1523594880
#> O b 8 0.523643352634392 1 1524856320
print_file("custom_na.yml", path, 4)
#> ..generic:
#> git2rdata: 0.4.1
#> optimize: yes
#> NA string: b
The default string for missing values is "NA"
. We
recommend to keep this default, as long as the dataset permits it. A
first good alternative is an empty string (""
). If that
won’t work either, you’ll have to use your imagination. Try to keep it
short, clear and robust1.
write_vc(x, "custom_na", path, strict = FALSE, na = "")
#> dfaf9f13491b30a88b7e37dabd9f9c87225fbba0 d2b145b74dbe6bd1d7883a62f2549c124988ec63
#> "custom_na.tsv" "custom_na.yml"