awardFindR is built to be easily extended to additional data sources, and we welcome contributions adding support for further databases of research funding.
Adding a new source involves three steps:

1. Adding a `sourcename.R` file that parses the database using its API or web scraping, as discussed below.
2. Adding the source to the default `sources` argument of the `search_awards()` function in `main.R`.
3. Adding relevant tests.
New sources are entirely self-contained in a single `.R` file. Sources need a minimum of two functions:

- a `*_get` function that scrapes raw data from the source and returns it as-is, and
- a `.*_standardize` function (note the leading period) that harmonizes the data to fit the `data.frame` output from `search_awards()`.
The file containing the source should have an easily identifiable filename, which should be reflected in the names of the `*_get` and `.*_standardize` functions. For example, an import from the "Megafunds Foundation" would live in `megafunds.R` and define `megafunds_get()` and `.megafunds_standardize()`.

The `*_get` routine is exported, so end users can call it directly if they are interested in a specific data source. It should be as faithful a reproduction of the original data source as possible, though variables should still be R-friendly: for example, award amounts as strings like "$300,000" aren't useful, while an integer of `300000` is.
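For instance, a dollar string can be coerced to an integer with a small helper. This is an illustrative sketch; the helper name is not part of the package:

```r
# Convert a currency string like "$300,000" to an integer.
# Illustrative helper, not an awardFindR function.
parse_amount <- function(x) {
  as.integer(gsub("[^0-9]", "", x))
}

parse_amount("$300,000") # => 300000
```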
These routines will usually use HTML- or JSON-based HTTP resources. `utils.R` in this package has a `request()` function that handles HTTP POST and GET requests with `httr`, automatically returning either an `xml2` object (which can be processed with `xml2`) or JSON as a base R `list()` object. To centralize HTTP error handling and standardize message output to the user, please use this function for HTTP requests wherever possible.
This function can be largely tailored to the needs of an API, but should follow some basic style guidelines. Three arguments are required: at least one keyword and two date terms.

- `keyword` is a string, or, if the source can handle more than one keyword at a time, `keywords` is a vector of strings.
- Dates are typically exact days or years, depending on the capabilities of the search function available from the source. Exact dates should be `Date` objects named `from_date` and `to_date`, while years should be integers named `from_year` and `to_year`.
Additional source-specific variables can be added to the function, but should have default values specified.
`*_get` routines should function as expected with these three arguments alone.
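Using the "Megafunds Foundation" example, a minimal `*_get` routine might look like the following sketch. The `request()` signature shown here is an assumption for illustration; match the actual helper in `utils.R`:

```r
# Sketch of a source-specific scraper for the hypothetical "Megafunds" API.
# Assumes request() takes a URL and a method string and returns parsed JSON
# as a base R list; the endpoint URL is made up for illustration.
megafunds_get <- function(keyword, from_year = 2018, to_year = 2021) {
  url <- paste0(
    "https://api.example.org/grants?q=", utils::URLencode(keyword),
    "&start=", from_year, "&end=", to_year
  )
  response <- request(url, "get") # centralized HTTP handling from utils.R
  response # return the raw parsed payload as-is
}
```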
If at all possible, searching should be done server-side to reduce HTTP traffic. Downloading the whole grant database and searching with `grep()` or a similar function should be a last resort; server-side searching also minimizes CPU load.
JSON sources shouldn't need any additional dependencies, since the `request()` function handles encoding and decoding to and from lists automatically. HTML/XML sources, however, should use `xml2`. If the source does not use HTML, XML, or JSON, additional dependencies may be necessary; this is an extreme case and should be approached with care.
awardFindR calls the internal `.*_standardize` function, which should in turn call the `*_get` function described above. All `.*_standardize` functions need to accept the exact same input: the keyword and date arguments described above. The date objects should be translated into whatever the source-specific requirements are for search terms. If an API can only delineate searches by year, extract the year; if an API can only handle one keyword at a time, loop the function through multiple keywords, for example with `lapply()`.
This function needs to return a `data.frame` with the following columns:
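A skeleton for this step, assuming a hypothetical `megafunds_get()` scraper for the "Megafunds Foundation" example, might look like the sketch below. The column names shown are placeholders; use the column set the package documents:

```r
# Sketch: harmonize raw Megafunds results into the package's data.frame
# format. The API here (hypothetically) searches by year and one keyword
# at a time, so we extract years and loop over keywords with lapply().
.megafunds_standardize <- function(keywords, from_date, to_date) {
  results <- lapply(keywords, function(kw) {
    raw <- megafunds_get(kw,
                         as.integer(format(from_date, "%Y")),
                         as.integer(format(to_date, "%Y")))
    # Placeholder columns: replace with the documented output columns.
    data.frame(keyword = kw, title = raw$title, amount = raw$amount,
               source = "megafunds", stringsAsFactors = FALSE)
  })
  do.call(rbind, results)
}
```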
Source routines are included in the `search_awards()` function through the `sources` argument. Include the name of your new source (the section before `_standardize`) in the default value of `sources` in `main.R`. This will include your new source routine in the default functionality and expose the name in the documentation, where users are likely to discover it.
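Concretely, the change is a one-word addition to the default of the `sources` argument. The surrounding source names and signature in this sketch are illustrative, not the package's exact defaults:

```r
# In main.R: add the new source name to the default sources vector.
# Existing names and the signature shown here are illustrative only.
search_awards <- function(keywords, from_date, to_date,
                          sources = c("nsf", "nih", "megafunds")) {
  # ... dispatch to each .<source>_standardize routine ...
}
```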
The main tests in `test-full.R` attempt to exercise both the successful and no-results code branches. The latter always works. Unfortunately, the default search for the former (the terms "qualitative analysis" and "ethnography" in 2018) does not actually return results for some smaller sources. For this and other reasons, it's a good idea to create a source-specific test for the successful-results branch.
Tests should actually return results, but should also minimally tax the API; fewer than 10 results is ideal. There is one exception: when a source needs to loop through multiple pages, the smallest number of results that triggers the looping is ideal, to ensure maximum test coverage.
One test should verify reproducibility; i.e. we get the same results again for a query in a specified date range. When used with a real-world HTTP resource, this should alert us if there is a change in how the resource is provided.
Finally, HTTP requests in tests should be cached using the rOpenSci package `vcr`. Individual sources should have their own cassettes. This limits stress on the APIs from frequent testing of internal logic, especially by the continuous-integration system; failing to do this could result in service-level bans.
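A source-specific test combining the cassette and the reproducibility check might look like this sketch; the source name, query, and expected count are hypothetical:

```r
# tests/testthat/test-megafunds.R -- sketch of a source-specific test.
# Assumes a hypothetical megafunds_get() scraper; vcr records the HTTP
# traffic to a per-source cassette so reruns never hit the live API.
library(testthat)

test_that("megafunds returns small, reproducible results", {
  vcr::use_cassette("megafunds", {
    results <- megafunds_get("ethnography", 2018, 2018)
  })
  df <- as.data.frame(results)
  expect_gt(nrow(df), 0)   # the success branch is actually exercised
  expect_lte(nrow(df), 10) # keep the result set small to spare the API
})
```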