The tool Data for Research (DfR) by JSTOR is a valuable source for
citation analysis and text mining. jstor provides functions and suggests
workflows for importing datasets from DfR.
When using DfR, requests for datasets can be made for small excerpts
(max. 25,000 records) or large ones, which require an agreement between
the researcher and JSTOR. jstor was developed to deal with very large
datasets which require an agreement, but can be used with smaller ones
as well.
The most important set of functions is a group of jst_get_* functions:
- jst_get_article (for journal documents and pamphlets)
- jst_get_authors (for all sources)
- jst_get_references (for journal documents)
- jst_get_footnotes (for journal documents)
- jst_get_book (for books and research reports)
- jst_get_chapters (for books and possibly research reports)
I will demonstrate their usage using the sample dataset which is provided by JSTOR on their website.
General Concept
All functions from the jst_get_* family which are concerned with meta
data operate along the same lines:
- The file is read with xml2::read_xml().
- Content of the file is extracted via XPath or CSS expressions.
- The resulting data is returned in a tidy tibble.
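The three steps above can be sketched on a tiny, made-up XML string. This is only an illustration under stated assumptions, not jstor's actual implementation: the XML structure is invented, `extract_first()` is a hypothetical helper, and the column names are chosen to resemble the output shown later.

```r
library(xml2)
library(tibble)

# Step 1: read the XML. This is a tiny invented document;
# real DfR files contain far more elements.
doc <- read_xml(
  "<article>
     <front>
       <article-meta>
         <title-group><article-title>A Title</article-title></title-group>
         <volume>41</volume>
       </article-meta>
     </front>
   </article>"
)

# Step 2: extract content via XPath.
# extract_first() is a hypothetical helper, not part of jstor.
# xml_find_first() yields a missing node if nothing matches,
# and xml_text() turns that into NA.
extract_first <- function(doc, xpath) {
  xml_text(xml_find_first(doc, xpath))
}

# Step 3: return the result as a one-row tibble.
sketch <- tibble(
  article_title = extract_first(doc, ".//article-title"),
  volume = extract_first(doc, ".//volume"),
  issue = extract_first(doc, ".//issue") # not present, so NA
)

sketch
```

Elements that are absent from the file simply end up as NA, which is why missing values are common in the outputs below.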
The functions are similar in that all operate on single files
(article, book, research report or pamphlet). Depending on the content
of the file, the output of the functions might have one or multiple
rows. jst_get_article always returns a tibble with one row: the core
meta data (like title, id, or first page of the article) are single
items, and only one article is processed at a time. Running
jst_get_authors for the same article might give you a tibble with one or
multiple rows, depending on the number of authors the article has. The
same is true for jst_get_references and jst_get_footnotes. If a file has
no data on references (they might still exist, but JSTOR might not have
parsed them), the output is only one row, with missing references. If
there is data on references, each entry gets its own row. Note, however,
that the number of rows does not necessarily equal the number of
references. Reference sections usually start with a title like
“References”, which is obviously not a reference to another article. Be
sure to think carefully about your assumptions and to check the content
of your data before you make inferences.
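A few quick checks of this kind might look as follows. This is a minimal sketch using the example file shipped with the package and the column names ref_title and ref_unparsed described further down. Which checks make sense depends on your data.

```r
library(jstor)

refs <- jst_get_references(jst_example("article_with_references.xml"))

# Total number of rows returned for this file
nrow(refs)

# Rows per heading: shows how entries are grouped under section titles
table(refs$ref_title, useNA = "ifany")

# Number of rows without any reference text
sum(is.na(refs$ref_unparsed))
```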
Books work a bit differently. Searching for data on https://www.jstor.org/dfr/results lets you filter for
books, which are actually book chapters. If you receive data from DfR on
a book chapter, you always get one xml-file with the whole book,
including data on all chapters. Ngram or full-text data for the same
entry, however, is processed only from single chapters. Thus, the output
of jst_get_book for a single file is similar to the one from
jst_get_article: it is one row with general data about the book.
jst_get_chapters gives you data on all chapters, and the resulting
tibble therefore might have multiple rows.
The following sections showcase the different functions separately.
Application
Apart from jstor we only need to load dplyr for matching records and
knitr for printing nice tables.
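Loading the three packages could look like this (a sketch that assumes all of them are installed):

```r
library(jstor)  # import functions for DfR data
library(dplyr)  # joins and data manipulation
library(knitr)  # kable() for printing tables
```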
jst_get_article
The basic usage of the jst_get_* functions is very simple. In the
simplest case they need only one argument, the path to the file to import:
meta_data <- jst_get_article(file_path = jst_example("article_with_references.xml"))
The resulting object is a tibble with one row and 19 columns. The
columns correspond to most of the elements documented here:
https://www.jstor.org/dfr/about/technical-specifications.
The columns are:
- file_name (chr): The file name of the original .xml-file. Can be used for joining with other parts (authors, references, footnotes, full-texts).
- journal_doi (chr): A registered identifier for the journal.
- journal_jcode (chr): An identifier for the journal like “amerjsoci” for the “American Journal of Sociology”.
- journal_pub_id (chr): Similar to journal_jcode. Most of the time either one is present.
- journal_title (chr): The title of the journal.
- article_doi (chr): A registered unique identifier for the article.
- article_jcode (chr): A unique identifier for the article (not a DOI).
- article_pub_id (chr): Infrequent, either part of the DOI or the article_jcode.
- article_type (chr): The type of article (research-article, book-review, etc.).
- article_title (chr): The title of the article.
- volume (chr): The volume the article was published in.
- issue (chr): The issue the article was published in.
- language (chr): The language of the article.
- pub_day (chr): Publication day, if specified.
- pub_month (chr): Publication month, if specified.
- pub_year (int): Year of publication.
- first_page (int): Page number for the first page of the article.
- last_page (int): Page number for the last page of the article.
- page_range (chr): The range of pages of the article, e.g. 59-76.
Since the output of every function is a tibble, the result is nicely formatted:
file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
article_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |
jst_get_authors
Extracting the authors works in a similar fashion:
authors <- jst_get_authors(jst_example("article_with_references.xml"))
kable(authors)
file_name | prefix | given_name | surname | string_name | suffix | author_number |
---|---|---|---|---|---|---|
article_with_references | NA | R. | Kudo | NA | NA | 1 |
Here we have the following columns:
- file_name: The same as above, used for matching articles.
- prefix: A prefix to the name.
- given_name: The given name of the author (e.g. Albert or A.).
- surname: The surname of the author (e.g. Einstein).
- string_name: Sometimes instead of given_name and surname, only a full string is supplied, e.g. Albert Einstein, or Einstein, Albert.
- suffix: A suffix to the name, as in Albert Einstein, II.
- author_number: An integer representing the order in which the authors appeared in the data.
The number of rows matches the number of authors: each author gets their own row.
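Because a name can arrive either as given_name plus surname or as a single string_name, it can be convenient to collapse both into one display column. The following is a sketch on made-up rows. The column full_name and the rule "use string_name whenever surname is missing" are assumptions for illustration, not something jstor does for you.

```r
library(dplyr)

# Invented rows that mimic the two ways names can be supplied
toy_authors <- tibble(
  given_name = c("R.", NA),
  surname = c("Kudo", NA),
  string_name = c(NA, "Einstein, Albert")
)

named <- toy_authors %>%
  mutate(
    # fall back to string_name when no surname was supplied
    full_name = if_else(
      is.na(surname),
      string_name,
      paste(given_name, surname)
    )
  )

named
```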
jst_get_references
references <- jst_get_references(jst_example("article_with_references.xml"))
# we need to remove line breaks for knitr::kable() to work properly for printing
references <- references %>%
  mutate(ref_unparsed = stringr::str_remove_all(ref_unparsed, "\\\n"))
The output has the following columns:
- file_name: Identifier, can be used for matching.
- ref_title: The title of the references section.
- ref_authors: A string of authors. Several authors are separated with ;.
- ref_editors: A string of editors, if present.
- ref_collab: A field that may contain information on the authors, if authors are not available.
- ref_item_title: The title of the cited entry.
- ref_year: A year, often the article’s publication year, but not always.
- ref_source: The source of the cited entry. For books often the title of the book, for articles the publisher of the journal.
- ref_volume: The volume of the journal article.
- ref_first_page: The first page of the article/chapter.
- ref_last_page: The last page of the article/chapter.
- ref_publisher: For books the publisher, for articles often missing.
- ref_publication_type: Known types: book, journal, web, other.
- ref_unparsed: The full references entry in unparsed form.
Here I display 5 random entries:
file_name | ref_title | ref_authors | ref_collab | ref_item_title | ref_year | ref_source | ref_volume | ref_first_page | ref_last_page | ref_publisher | ref_publication_type | ref_unparsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|
article_with_references | References: Trypanosomes | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | LEBEDEFF, A.1910 Ueber Trypanosoma rotatorium Gruby. Festschr. 60sten Geburts. RichardHertwigs, 1:397-436, 2 pl., 9 textfig. |
article_with_references | References: Opalinae | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | METCALF, M. M.1909 Opalina. Its anatomy and reproduction, with a description of infection experi-ments and a chronological review of the literature. Arch. Protist., 13. 181pp., 15 pl. and 15 textfig. |
article_with_references | References: Leptotheca ohilmacheri | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1922 On the morphology and life history of a Myxosporidian, Leptotheca ohlmacheri,parasitic in Rana clamitans and Rana pipiens. Parasitology, 14, no. 2. |
article_with_references | References: Trypanosomes | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | BRUMPT, E.1906 Rôle pathoghne et mode de transmission du Trypanosoma inopinatum Ed. et Et.Sergent. Mode d’inoculation d’autres trypanosomes. C. R. soc. biol.,61:167-169. |
article_with_references | References: Trypanosomes | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | LAVERAN, A. and E. MESNIL (translated and revised by Nabarro).1907 Trypanosomes and trypanosomiases. Chicago. 538 pp., 1 pl. and 8 textfig. |
This example shows several things: file_name is identical among rows,
since it identifies the article and all references came from one
article. The sample file doesn’t follow a typical convention (it was
published in 1922), therefore there are several different headings
(ref_title). Usually, this is only “Bibliography” or “References”.
Since the references were not parsed by JSTOR, we only get an
unparsed version. In general, the content of references (ref_unparsed)
is in quite a raw state, quite often the result of digitising scans via
OCR. For example, the last entry reads like this:
MACHADO, A.1911 Zytologische Untersuchungen fiber Trypanosoma rotatorium ...
There is an error here: fiber should be über. The language of the source
is German, but the OCR software assumed English. Therefore, it didn’t
recognize the umlaut. Similar errors are common for text read via OCR.
For other files, we can set parse_refs = TRUE, so references will be
imported in their parsed form, whenever they are available.
jst_get_references(
jst_example("parsed_references.xml"),
parse_refs = TRUE
) %>%
kable()
file_name | ref_title | ref_authors | ref_editors | ref_collab | ref_item_title | ref_year | ref_source | ref_volume | ref_first_page | ref_last_page | ref_publisher | ref_publication_type | ref_unparsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
parsed_references | Notes | NA | NA | NA | NA | 2005 | NA | NA | NA | NA | NA | other | 1. The USA PATRIOT Act expanded the government’s surveillance power in numerous other ways (see, e.g. Keenan 2005 ). |
parsed_references | References | Acohido, B.; Eisler, P. | NA | NA | “Snowden Case: How Low-Level Insider Could Steal from NSA” | 2013 | USA Today | NA | NA | NA | NA | other | Acohido, B. and Eisler, P. ( 2013 ) “Snowden Case: How Low-Level Insider Could Steal from NSA” , USA Today , 12 June. Available online at http://www.google.com (accessed 15 June 2013). |
parsed_references | References | NA | NA | Amnesty International | NA | 2013 | “USA: Revelations about Government Surveillance ‘raise red flags’” | NA | NA | NA | NA | other | Amnesty International ( 2013 ) “USA: Revelations about Government Surveillance ‘raise red flags’” , 7 June. Available online at http://www.google.com (accessed 14 June 2013). |
parsed_references | References | Jacobson, D. | D. E. Davis; J. Go | NA | Chapter title | 2009 | Book title | NA | 281 | 286 | Routledge | book | Jacobson, D. , 2009 . Chapter title . In: D. E. Davis & J. Go , eds. Book title .: Routledge , pp. 281 - 286 . |
parsed_references | References | Costall, Alan | NA | NA | “Some article title” | 1980 | Theory and Psychology | 1 | 123 | 145 | NA | journal | Costall, Alan ( 1980 ). “Some article title” Theory and Psychology 1 : 123 – 145 . |
parsed_references | References | Hudson, W. | NA | NA | Another article title | 2000 | Australian Journal of Cats & Dogs | 40 | 134 | 150 | NA | journal | Hudson, W. , 2000 . Another article title . Australian Journal of Cats & Dogs , September , 40 ( 3 ), p. 134 – 150 . |
parsed_references | References | Fries-Britt, S.; Griffin, K.A. | NA | NA | Some article about race | 2000 | Journal of College Student Fun | 20 | 60 | 120 | NA | journal | Fries-Britt S. , & Griffin K.A. ( 2000 ). Some article about race . Journal of College Student Fun , 20 , 60 – 120 . |
Note that other content such as endnotes might be present if the article used endnotes rather than footnotes.
jst_get_footnotes
jst_get_footnotes(jst_example("article_with_references.xml")) %>%
kable()
file_name | footnotes |
---|---|
article_with_references | NA |
Very commonly, articles either have footnotes or references. The
sample file used here does not have footnotes, so a simple tibble with
missing footnotes is returned.
I will use another file to demonstrate footnotes.
footnotes <- jst_get_footnotes(jst_example("article_with_footnotes.xml"))
footnotes %>%
mutate(footnotes = stringr::str_remove_all(footnotes, "\\\n")) %>%
kable()
file_name | footnotes |
---|---|
article_with_footnotes | [Footnotes] |
article_with_footnotes | 9Quarterly, vol. XIII, no. 1,entries for April 19 and 21. |
article_with_footnotes | 10Quarterly, vol. XIII,no. 1, p. 8. |
article_with_footnotes | 14Quarterly, vol. VIII, no. 1.Olympia Columbian, Sept. 11, 1852, |
article_with_footnotes | 26Quarterly, vol. XII,no. 2, p. 141. |
article_with_footnotes | 32Dr. David S. Maynard, later (March 31, 1852) |
article_with_footnotes | 34Thomas Linklater, Shepherd, since October 6, 1849, |
In general, you might need to combine jst_get_footnotes() with
jst_get_references() to get all available information on citation data.
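One possible way to combine both outputs is to stack them into a single long table. This is a sketch under assumptions: the column names text and source are invented here, and it relies on both functions returning character columns named ref_unparsed and footnotes, as in the tables above.

```r
library(jstor)
library(dplyr)

file <- jst_example("article_with_footnotes.xml")

# Reduce both outputs to the same two columns and label their origin
refs <- jst_get_references(file) %>%
  select(file_name, text = ref_unparsed) %>%
  mutate(source = "references")

notes <- jst_get_footnotes(file) %>%
  select(file_name, text = footnotes) %>%
  mutate(source = "footnotes")

# Stack them and drop rows that only signal "nothing found"
citations <- bind_rows(refs, notes) %>%
  filter(!is.na(text))

count(citations, source)
```

Keeping a source column makes it possible to treat the two kinds of entries differently later, since footnotes often contain commentary in addition to citations.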
jst_get_full_text
The function to extract full texts can’t be demonstrated with proper
data, since the full texts are only supplied upon special request with
DfR. The function guesses the encoding of the specified file via
readr::guess_encoding(), reads the whole file and returns a tibble with
file_name, full_text and encoding.
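The idea of guessing the encoding first and reading the file second can be sketched with readr directly. This is an illustration on a temporary, made-up file, not the package's internal code. The file content and the variable names are assumptions.

```r
library(readr)

# Write a small, invented text file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines("Some sample text, similar in spirit to a DfR full text.", tmp)

# guess_encoding() returns candidate encodings with a confidence score
guess <- guess_encoding(tmp)
guess

# Read the whole file using the most likely candidate
text <- read_file(tmp, locale = locale(encoding = guess$encoding[1]))
```

A guess is only a guess: for files with many special characters it can be worth inspecting the returned candidates before relying on the first one.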
I created a file that looks similar to files supplied by DfR with sample text:
full_text <- jst_get_full_text(jst_example("full_text.txt"))
full_text %>%
mutate(full_text = stringr::str_remove_all(full_text, "\\\n")) %>%
kable()
file_name | full_text | encoding |
---|---|---|
full_text | ASCII |
Combining results
Different parts of the metadata can be combined by using dplyr::left_join().
Matching with authors
meta_data %>%
left_join(authors) %>%
select(file_name, article_title, pub_year, given_name, surname) %>%
kable()
#> Joining with `by = join_by(file_name)`
file_name | article_title | pub_year | given_name | surname |
---|---|---|---|---|
article_with_references | On the Protozoa Parasitic in Frogs | 1922 | R. | Kudo |
Matching with references
meta_data %>%
left_join(references) %>%
select(file_name, article_title, volume, pub_year, ref_unparsed) %>%
head(5) %>%
kable()
#> Joining with `by = join_by(file_name)`
file_name | article_title | volume | pub_year | ref_unparsed |
---|---|---|---|---|
article_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | DOBELL, C.C.1909 Researches on the intestinal Protozoa of frogs and toads. Quart. Jour. Micros.Sc., 53:201-276, 4 pl. and 1 textfig. |
article_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | 1918 Are Entamoeba histolytica and Entamoeba ranarum the same species? An experi-mental study. Parasit., 10:294-310. |
article_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | KUDO, R.1920 Studies on Myxosporidia. A Synopsis of Genera and Species of Myxosporidia.ill. Biol. Monogr., 5:243-503, 25 pl. and 2 textfig. |
article_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | 1921 On the nature of structures characteristic of Cnidosporidian spores. Trans.Micro. Soc., 40:60-74. |
article_with_references | On the Protozoa Parasitic in Frogs | 41 | 1922 | 1922 On the morphology and life history of a Myxosporidian, Leptotheca ohlmacheri,parasitic in Rana clamitans and Rana pipiens. Parasitology, 14, no. 2. |
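The message "Joining with `by = join_by(file_name)`" appears because no join column was specified. Naming it explicitly makes the intent clear and silences the message. The following sketch uses small invented tibbles instead of real DfR data.

```r
library(dplyr)

# Invented stand-ins for the outputs of jst_get_article() and jst_get_authors()
toy_meta <- tibble(
  file_name = "some_article",
  article_title = "An Example Title"
)

toy_authors <- tibble(
  file_name = c("some_article", "some_article"),
  surname = c("First", "Second")
)

# One row of meta data is repeated for every matching author
joined <- left_join(toy_meta, toy_authors, by = "file_name")
joined
```

The result has one row per author, so article-level values such as the title are duplicated. Keep this in mind when counting articles after a join.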
Books
Quite recently DfR added book chapters to their stack. To import
metadata about the books and chapters, jstor supplies jst_get_book and
jst_get_chapters.
jst_get_book is very similar to jst_get_article. We obtain general
information about the complete book:
jst_get_book(jst_example("book.xml")) %>% knitr::kable()
book_id | file_name | discipline | book_title | book_subtitle | pub_day | pub_month | pub_year | isbn | publisher_name | publisher_location | n_pages | language |
---|---|---|---|---|---|---|---|---|---|---|---|---|
j.ctt24hdz7 | book | Political Science | The 2006 Military Takeover in Fiji | A Coup to End All Coups? | 30 | 4 | 2009 | 9781921536502; 9781921536519 | ANU E Press | Canberra | NA | eng |
A single book might contain many chapters. jst_get_chapters extracts all
of them. Due to this, the function is a bit slower than most of jstor’s
other functions.
chapters <- jst_get_chapters(jst_example("book.xml"))
str(chapters)
#> tibble [36 × 9] (S3: tbl_df/tbl/data.frame)
#> $ book_id : chr [1:36] "j.ctt24hdz7" "j.ctt24hdz7" "j.ctt24hdz7" "j.ctt24hdz7" ...
#> $ file_name : chr [1:36] "book" "book" "book" "book" ...
#> $ part_id : chr [1:36] "j.ctt24hdz7.1" "j.ctt24hdz7.2" "j.ctt24hdz7.3" "j.ctt24hdz7.4" ...
#> $ part_label : chr [1:36] NA NA NA NA ...
#> $ part_title : chr [1:36] "Front Matter" "Table of Contents" "Acronyms and abbreviations" "Authors’ biographies" ...
#> $ part_subtitle : chr [1:36] NA NA NA NA ...
#> $ authors : chr [1:36] NA NA NA NA ...
#> $ abstract : chr [1:36] NA NA NA NA ...
#> $ part_first_page: chr [1:36] "i" "v" "vii" "xi" ...
Without the abstracts (they are rather long) the first 10 chapters look like this:
book_id | file_name | part_id | part_label | part_title | part_subtitle | authors | part_first_page |
---|---|---|---|---|---|---|---|
j.ctt24hdz7 | book | j.ctt24hdz7.1 | NA | Front Matter | NA | NA | i |
j.ctt24hdz7 | book | j.ctt24hdz7.2 | NA | Table of Contents | NA | NA | v |
j.ctt24hdz7 | book | j.ctt24hdz7.3 | NA | Acronyms and abbreviations | NA | NA | vii |
j.ctt24hdz7 | book | j.ctt24hdz7.4 | NA | Authors’ biographies | NA | NA | xi |
j.ctt24hdz7 | book | j.ctt24hdz7.5 | 1. | The enigmas of Fiji’s good governance coup | NA | NA | 3 |
j.ctt24hdz7 | book | j.ctt24hdz7.6 | 2. | ‘Anxiety, uncertainty and fear in our land’: | Fiji’s road to military coup, 2006 | NA | 21 |
j.ctt24hdz7 | book | j.ctt24hdz7.7 | 3. | Fiji’s December 2006 coup: | Who, what, where and why? | NA | 43 |
j.ctt24hdz7 | book | j.ctt24hdz7.8 | 4. | ‘This process of political readjustment’: | The aftermath of the 2006 Fiji Coup | NA | 67 |
j.ctt24hdz7 | book | j.ctt24hdz7.9 | 5. | The changing role of the Great Council of Chiefs | NA | NA | 97 |
j.ctt24hdz7 | book | j.ctt24hdz7.10 | 6. | The Fiji military and ethno-nationalism: | Analyzing the paradox | NA | 117 |
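Book-level and chapter-level data share the column book_id, so they can be combined in the same way as articles and authors. This is a sketch that assumes the example file from above. The selection of columns is arbitrary.

```r
library(jstor)
library(dplyr)

book <- jst_get_book(jst_example("book.xml"))
chapters <- jst_get_chapters(jst_example("book.xml"))

# Keep only a few book-level columns, then attach all chapters
book_chapters <- book %>%
  select(book_id, book_title, pub_year) %>%
  left_join(chapters, by = "book_id")

book_chapters %>%
  select(book_title, pub_year, part_id, part_title) %>%
  head(5)
```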
Since extracting all authors for all chapters needs considerably more time, by default authors are not extracted. You can import them like so:
author_chap <- jst_get_chapters(jst_example("book.xml"), authors = TRUE)
The authors are supplied in a list column:
class(author_chap$authors)
#> [1] "list"
You can expand this list with tidyr::unnest:
author_chap %>%
tidyr::unnest(authors) %>%
select(part_id, given_name, surname) %>%
head(10) %>%
kable()
part_id | given_name | surname |
---|---|---|
j.ctt24hdz7.1 | NA | NA |
j.ctt24hdz7.2 | NA | NA |
j.ctt24hdz7.3 | NA | NA |
j.ctt24hdz7.4 | NA | NA |
j.ctt24hdz7.5 | Jon | Fraenkel |
j.ctt24hdz7.5 | Stewart | Firth |
j.ctt24hdz7.6 | Brij V. | Lal |
j.ctt24hdz7.7 | Jon | Fraenkel |
j.ctt24hdz7.8 | Brij V. | Lal |
j.ctt24hdz7.9 | Robert | Norton |
You can learn more about the concept of list-columns in Hadley Wickham’s book R for Data Science.
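To see what a list-column is without loading any DfR data, here is a small sketch with invented values. Each cell of authors holds its own tibble, and tidyr::unnest() expands these into regular rows.

```r
library(tibble)
library(tidyr)

# Invented data: one chapter without authors, one with two authors
toy_chapters <- tibble(
  part_id = c("part.1", "part.2"),
  authors = list(
    tibble(given_name = NA_character_, surname = NA_character_),
    tibble(given_name = c("Jon", "Stewart"), surname = c("Fraenkel", "Firth"))
  )
)

# Each element of the list-column is a tibble of its own
class(toy_chapters$authors)

# unnest() repeats part_id once per author row
flat <- unnest(toy_chapters, authors)
flat
```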