Thank you for your interest in contributing to suppdata! The most important contribution you can make to suppdata is to add code to download data from another publisher’s journals. There are five steps you have to go through to do that; I go through them in detail below, but briefly, they are:
suppdata so the user knows what you’ve done
If you want to make a small change to suppdata (i.e., changing <= 5 lines of code), fork the repo, make the change, and then make a pull request with the suggestion. If you want to make a more sweeping change (i.e., > 5 lines of code), then before writing any code make an issue and discuss it with @willpearse. The purpose of this is to make the best use of everyone’s time: small code changes are better off “just done” and then we can talk about them; larger changes require discussion before implementation. You’re quite welcome to do whatever you wish with the code (within the boundaries of the license, of course), but please be aware that the maintainers of the package are not obliged to accept all pull requests. Of course, the ROpenSci maintainer rules apply, so we’ll always be polite and we’ll always let you know why we make any decision! :D
When making version changes, please follow the standards set by CRAN, so the next version after “1.2-9” would be “1.2-10”. Package version numbers are not decimal, so something like “126.96.36.199” won’t pass CRAN’s checks (see the R extension guide).
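To see why version numbers cannot be treated as decimals, here is a small sketch using only base R (package_version() and utils::compareVersion()). The version strings are simply the ones from the example above.

```r
# Version strings are compared component by component, not as decimals.
old <- package_version("1.2-9")
new <- package_version("1.2-10")
new_is_later <- new > old    # the last component is compared as 10 vs 9

# compareVersion() returns -1 when its second argument is the later version.
cmp <- utils::compareVersion("1.2-9", "1.2-10")

# Reading similar digits as decimal numbers gets the order wrong:
# 1.10 is the number 1.1, which is smaller than 1.9.
decimal_says_later <- as.numeric("1.10") > as.numeric("1.9")
```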
suppdata download functions start off with, at a minimum, something like:
.suppdata.nameofpublisher <- function(doi, si, save.name=NA, dir=NA, cache=TRUE, ...)
nameofpublisher is replaced with the name of the publisher you’re loading data from. This should then be followed with something like:
#Argument handling
if(!is.numeric(si)) stop("nameofpublisher download requires numeric SI info")
dir <- .tmpdir(dir)
save.name <- .save.name(doi, save.name, si)
Note the name of the section (#Argument handling), and how we’re making sure the user gives us a supplement number if we need that, or using is.character if we’re expecting some character SI info. We have a few internal functions that should make your life easier (…well, they should…); .tmpdir will make a temporary directory to save files out to for you, and .save.name will generate a sensible name to save files out to as well. It’s quite important that you use those functions, since they make the cache behaviour of the package work.
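As a rough illustration of the argument check described above, the sketch below uses a hypothetical helper, check.si (not part of suppdata), to show how a non-numeric si is rejected before any downloading starts.

```r
# Hypothetical helper, for illustration only; the snippet above does this
# check inline at the top of the download function.
check.si <- function(si) {
    if (!is.numeric(si))
        stop("nameofpublisher download requires numeric SI info")
    invisible(TRUE)
}

# A supplement number passes the check
ok <- check.si(1)

# A file name fails fast; capture the error message instead of stopping
msg <- tryCatch(check.si("S1.csv"), error = function(e) conditionMessage(e))
```

A publisher that identifies supplements by file name would swap is.numeric for is.character and adjust the message.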
Next comes the hard work, where you get the data out of the publisher’s website. This is a lot of regular expression kung fu; you may find .url.redir, and all the other functions in utils.R, useful. Please do have a poke around in there, and feel free to add any functions you think are missing (using the .name.function convention for non-user-facing functions). If there’s something you think would be useful to have, and you don’t know how to write it, then just make an issue, tag me (@willpearse), and I’ll see what I can do.
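To give a flavour of the regular-expression work, here is a minimal sketch in base R. The HTML snippet, the URLs, and the pattern are all invented for illustration; a real publisher’s page will need its own pattern, and the helpers in utils.R may already do what you need.

```r
# Invented fragment of a publisher's article page (not from a real site)
html <- paste0(
    '<a href="/about">About</a>',
    '<a href="https://example.org/suppl/file_S1.csv">Supplement 1</a>',
    '<a href="https://example.org/suppl/file_S2.csv">Supplement 2</a>'
)

# Find every link that points into the (hypothetical) /suppl/ folder
pattern <- 'https://example\\.org/suppl/[^"]+'
links <- regmatches(html, gregexpr(pattern, html))[[1]]

# Pick the supplement the user asked for by number
si <- 2
if (si > length(links)) stop("SI number is greater than the number of links found")
url <- links[si]
```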
Finally, you need to return to the user a location of the file they want. Something like:
destination <- file.path(dir, save.name)
return(.download(url, dir, save.name, cache))
…should suffice. Notice how we’re using file.path to make a sensible path on all operating systems, and we’re using the internal .download to download the file, and so guaranteeing that we’ll obey all the cache instructions etc. We’re also ensuring that the user will get sensible filename information (“oooh, this looks like a .csv file”) as an attribute by using
Save your function in journals.R. There are plenty of examples in there if you get stuck. There is also a list of functions to be written sitting in the issues section on GitHub.
There is a ‘hit list’ of publishers that it would be great to write wrappers for on the issues page.
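Going back to the return step above, this sketch shows what file.path and a filename attribute look like in plain base R. The attribute name "suffix" and the file name are assumptions made for illustration; inside the package the internal helpers take care of this for you.

```r
dir <- tempdir()
save.name <- "example_supplement_1.csv"    # hypothetical file name

# file.path() builds the path in a platform-independent way
destination <- file.path(dir, save.name)

# Attach the file extension as an attribute
# (the attribute name "suffix" is an assumption, not confirmed by this guide)
attr(destination, "suffix") <- tools::file_ext(destination)
```

The path is still an ordinary character string, so the user can pass it straight to something like read.csv, while attr(destination, "suffix") tells them what kind of file to expect.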
We’re nearly there now, I promise!
suppdata uses rcrossref to look up articles’ publishers, so to hook your download function into the package you’re going to have to figure out your journal’s code. Take a paper’s DOI that you know works, and run the following on it (replacing the DOI below with the DOI you’re checking):
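A minimal sketch of such a lookup, assuming rcrossref’s cr_works() and that its result carries the publisher in a member field of $data (worth checking against your installed version of rcrossref). The helper names and the placeholder DOI are hypothetical, and the call needs an internet connection.

```r
# Keep only what follows the last "/", in case the member code comes back
# as a URL (e.g. ".../member/123") rather than a bare number.
member.code <- function(member) sub("^.*/", "", as.character(member))

# Hypothetical wrapper: look up the Crossref member code for one DOI.
# Assumption: cr_works(dois = ...)$data has a 'member' column.
publisher.code <- function(doi) {
    if (!requireNamespace("rcrossref", quietly = TRUE))
        stop("please install rcrossref first")
    member.code(rcrossref::cr_works(dois = doi)$data$member)
}

# Usage (not run here; replace with a DOI you know works):
# publisher.code("10.xxxx/your.doi.here")
```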
…that number is your publisher’s code. Modify the first switch statement in utils.R to add your journal’s number (as a character string) and then match it with the name of your download function. If that sounds complicated, once you open the function it will become obvious.
Next, modify the second (and last) switch in the same function to work if your publisher is known by name. This should match onto your function’s name. So, for example, if your publishing company were called Pearse Publishing, and you’d called your function .suppdata.pearse, then you would add an entry like:
"pearse" = .suppdata.pearse
Please remember that the entries in a switch statement are separated by commas; add a comma to the previous entry to keep the code syntactically correct!
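If switch is new to you, here is a small self-contained sketch of the pattern. For brevity it folds the code lookup and the name lookup into a single switch, whereas the package keeps them in two; the function bodies and the code "9999" are placeholders, not real entries from utils.R.

```r
# Placeholder download functions (stand-ins, not the real ones)
.suppdata.pearse <- function(...) "pearse downloader"
.suppdata.other  <- function(...) "other downloader"

# Hypothetical lookup: each entry maps a character key to a function,
# and entries are separated by commas.
lookup <- function(x) {
    switch(x,
           "9999" = .suppdata.pearse,    # publisher code, as a character string
           "pearse" = .suppdata.pearse,  # publisher name
           "other" = .suppdata.other     # last entry, so no comma after it
           )
}

by.code <- lookup("9999")
by.name <- lookup("pearse")
unknown <- lookup("nobody")    # no matching entry and no default gives NULL
```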
Update the roxygen documentation here and here to give the user information about what your function expects (character SI information). Re-build the documentation when you’re done by running something like:
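One possible way to do this, assuming roxygen2 is installed and you are working from the root of the package; the wrapper name rebuild.docs is hypothetical, and devtools::document() is a common alternative.

```r
# Hypothetical wrapper around roxygen2::roxygenise(); it does nothing unless
# roxygen2 is available and pkg.dir looks like a package (has a DESCRIPTION).
rebuild.docs <- function(pkg.dir = ".") {
    ready <- requireNamespace("roxygen2", quietly = TRUE) &&
        file.exists(file.path(pkg.dir, "DESCRIPTION"))
    if (ready) roxygen2::roxygenise(pkg.dir)
    invisible(ready)
}

# Usage, from the root of your fork (not run here):
# rebuild.docs()
```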
…if you’re an RStudio person there’s a button for this in the “Build” tab (“More” > “Document”). If you’re an emacs person like me, there are several and you probably have a strong opinion about which is best :p
Add tests to tests/testthat/ with a file of the format test-<name of publisher>.R (see existing tests to get started) to give the maintainers and the continuous integration services something to check that the publisher works.
Add the newly supported publisher to the alphabetically ordered list in
Commit your changes, then make a pull request to the master branch of suppdata. If you need help figuring out how to do that, take a look at this website.
I always think a little history is useful when contributing to a package, so let me tell you how the code here came to be. Originally, this package was called grabr (it is the same repo; only the name changed), which was then merged into fulltext. At that time, the structure was changed to match that of fulltext, but this code was then pulled back out of fulltext. So if you ever find yourself wondering “why does it do that?”, the answer is probably “because