Known Quirks of JSTOR/DfR Data
Thomas Klebel
2024-12-14
Source:vignettes/known-quirks.Rmd
Collecting all the quirks
Data from JSTOR/DfR is unlike most other data you encounter when
doing text analysis. First and foremost, the data about articles and
books come from a wide variety of journals and publishers. The level of
detail and certain formats vary because of this. jstor
tries to deal with this situation with two strategies:
- try to recognise the format and read data accordingly
- if this is not possible, read data as “raw” as possible, i.e. without any conversions
References are an example of the first case. Four different ways of
specifying references are known at this time, and each is imported in a
specific way to deal with this variation. There might, however, be other
formats, which should lead to an informative error when you try to import
them via jst_get_references().
Page numbers are an example of the second case. Most of the time, the
entries for page numbers are simply 42 or 61. This is as expected, and
such entries could be parsed as integers. Sometimes there are characters
present, as in M77. This would pose no problem either: we could simply
extract all digits via regex and convert the result to an integer.
Unfortunately, sometimes the page is specified like this: v75i2p84.
Extracting all digits would result in 75284, which is wrong by a long
shot. Since there might be other ways of specifying pages, jstor does not
attempt to parse the pages to integers when importing. However, it offers
a set of convenience functions which deal with a few common cases (see
jst_augment() and below).
There are many other problems or peculiarities like this. This vignette tries to list as many as possible and to offer solutions for dealing with them. Unfortunately, I have neither the time nor the interest to wade through all the data you could get from DfR in order to find all possible quirks. The following list is thus inevitably incomplete. If you encounter new quirks or peculiarities, it would be greatly appreciated if you sent me an email or opened an issue on GitHub. I will then include your findings in a future version of this vignette, so that it can be a starting point for everybody who conducts text analysis with data from JSTOR/DfR.
Data augmentation
After importing data via jst_get_article(), there are at least two tasks
you might typically want to undertake:
- Merge different identifiers for journals into one, so you can filter journals.
- Convert pages from character into integers and calculate the total number of pages per article.
There are four functions which help you to streamline this process:
- jst_clean_page() attempts to turn a character vector with pages into an integer vector.
- jst_add_total_pages() adds a column with the total number of pages per article.
- jst_unify_journal_id() merges different identifiers for journals into one.
- jst_augment() wraps the above functions for convenience.
Known quirks
In the following sections, known issues with data from DfR are described in greater detail.
Page numbers
Page numbers are a mess. Besides the issues mentioned above, page
numbers might sometimes be specified as “pp. 1234-83” as in this article
from the American Journal of Sociology. Of course, this results in
first_page = 1234 and last_page = 83, and the number of total pages
computed by jst_get_total_pages() will be negative. There is currently
no general solution for this issue.
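For the specific “1234-83” pattern, one possible workaround can be sketched in base R. The helper `expand_last_page()` below is hypothetical and not part of jstor. It assumes that a last page which is smaller than the first page and has fewer digits is an abbreviation, and borrows the missing leading digits from the first page. It will give wrong results for ranges that cross a digit boundary (for example “1299-03”), so results should be checked by hand.

```r
# Hypothetical helper, not part of jstor: expand abbreviated last pages
# such as first_page = 1234, last_page = 83 into 1283.
expand_last_page <- function(first_page, last_page) {
  first_chr <- as.character(first_page)
  last_chr <- as.character(last_page)
  n_missing <- nchar(first_chr) - nchar(last_chr)

  # Only treat a last page as abbreviated if it is smaller than the
  # first page and has fewer digits.
  abbreviated <- !is.na(first_page) & !is.na(last_page) &
    last_page < first_page & n_missing > 0

  result <- as.integer(last_page)
  result[abbreviated] <- as.integer(paste0(
    substr(first_chr[abbreviated], 1, n_missing[abbreviated]),
    last_chr[abbreviated]
  ))
  result
}

expand_last_page(first_page = c(1234, 59, NA), last_page = c(83, 76, NA))
#> [1] 1283   76   NA
```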
Calculating total pages
As outlined above, page numbers come in very different forms. Besides
this problem, there is another issue. Imagine you would like to quantify
the lengths of articles. Obviously, you will need information on the
first and the last page of the articles. Furthermore, the pages need to
be parsed properly: you will run into trouble if you calculate page
counts like 75284 - 42 + 1 because a number was parsed badly.
jst_clean_page() tries to do this properly, based on a few known
possibilities:
- “2” -> 2
- “A2” -> 2
- “v75i2p84” -> 84
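The difference between naive digit extraction and the three cases above can be illustrated with a minimal base R sketch. `clean_page_sketch()` is a hypothetical stand-in that keeps only the trailing run of digits. It is not the implementation of jst_clean_page(), which may handle further cases.

```r
pages <- c("2", "A2", "v75i2p84")

# Naive approach: drop everything that is not a digit.
gsub("[^0-9]", "", pages)
#> [1] "2"     "2"     "75284"

# Hypothetical sketch: keep only the digits at the end of the string.
clean_page_sketch <- function(page) {
  digits <- sub("^.*?([0-9]+)$", "\\1", page, perl = TRUE)
  # Entries that do not end in digits cannot be handled by this sketch.
  digits[!grepl("[0-9]+$", page)] <- NA_character_
  as.integer(digits)
}

clean_page_sketch(pages)
#> [1]  2  2 84
```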
Parsing correctly is unfortunately not enough. Things like “Errata”
might come to haunt you. For example, there might be an article with
first_page = 42 and last_page = 362, which would leave you puzzled as to
whether this can be true. There could be a simple explanation: the
article might start on page 42 and end on page 65, and there is
furthermore an erratum on page 362. Technically, last_page = 362 is true
then, but it will cause problems when calculating the total number of
pages. Quite often, there is information in another column which could
resolve this: page_range, which in this case would look like
42 - 65, 362.
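The arithmetic behind such a page range can be sketched in base R. `count_pages_sketch()` is a hypothetical helper for a single range string, written only to make the calculation explicit. It is not how jstor computes the total internally.

```r
# Hypothetical sketch: count pages in one page_range string.
count_pages_sketch <- function(page_range) {
  # Split "42 - 65, 362" into its parts: "42 - 65" and "362".
  parts <- trimws(strsplit(page_range, ",", fixed = TRUE)[[1]])

  pages_per_part <- vapply(parts, function(part) {
    # Strip everything but digits from each bound, e.g. "C5" becomes 5.
    bounds <- as.integer(
      gsub("[^0-9]", "", strsplit(part, "-", fixed = TRUE)[[1]])
    )
    # A single page counts as one page, a span as last - first + 1.
    if (length(bounds) == 1) 1L else bounds[2] - bounds[1] + 1L
  }, integer(1))

  sum(pages_per_part)
}

count_pages_sketch("42 - 65, 362")
#> [1] 25
```

Using first_page and last_page alone would instead give 362 - 42 + 1 = 321 pages.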
A small helper to deal with those situations is jst_get_total_pages().
It works for page ranges, but also for first and last pages:
library(jstor)
library(dplyr)
input <- tibble::tribble(
  ~first_page, ~last_page, ~page_range,
  NA_real_,    NA_real_,   NA_character_,
  1,           10,         "1 - 10",
  1,           10,         NA_character_,
  1,           NA_real_,   NA_character_,
  1,           NA_real_,   "1-10",
  NA_real_,    NA_real_,   "1, 5-10",
  NA_real_,    NA_real_,   "1-4, 5-10",
  NA_real_,    NA_real_,   "1-4, C5-C10"
)
input %>%
  mutate(n_pages = jst_get_total_pages(first_page, last_page, page_range))
#> # A tibble: 8 × 4
#>   first_page last_page page_range  n_pages
#>        <dbl>     <dbl> <chr>         <dbl>
#> 1         NA        NA NA               NA
#> 2          1        10 1 - 10           10
#> 3          1        10 NA               10
#> 4          1        NA NA               NA
#> 5          1        NA 1-10             10
#> 6         NA        NA 1, 5-10           7
#> 7         NA        NA 1-4, 5-10        10
#> 8         NA        NA 1-4, C5-C10      10
This is actually identical to using jst_add_total_pages():
input %>%
  jst_add_total_pages()
#> # A tibble: 8 × 4
#>   first_page last_page page_range  n_pages
#>        <dbl>     <dbl> <chr>         <dbl>
#> 1         NA        NA NA               NA
#> 2          1        10 1 - 10           10
#> 3          1        10 NA               10
#> 4          1        NA NA               NA
#> 5          1        NA 1-10             10
#> 6         NA        NA 1, 5-10           7
#> 7         NA        NA 1-4, 5-10        10
#> 8         NA        NA 1-4, C5-C10      10
Journal identifiers
Identifiers for the journal usually appear in three columns:
- journal_doi
- journal_jcode
- journal_pub_id
sample_article <- jst_get_article(jst_example("article_with_references.xml"))
knitr::kable(sample_article)
file_name | journal_doi | journal_jcode | journal_pub_id | journal_title | article_doi | article_pub_id | article_jcode | article_type | article_title | volume | issue | language | pub_day | pub_month | pub_year | first_page | last_page | page_range |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
article_with_references | NA | tranamermicrsoci | NA | Transactions of the American Microscopical Society | 10.2307/3221896 | NA | NA | research-article | On the Protozoa Parasitic in Frogs | 41 | 2 | eng | 1 | 4 | 1922 | 59 | 76 | 59-76 |
From my samples, it seems that the information in journal_pub_id is
often missing, as is journal_doi. The most important identifier is thus
journal_jcode. In cases where both journal_jcode and journal_pub_id are
present, at least in my samples, the format of journal_jcode was
different. For consistency, jst_unify_journal_id() thus takes the content
of journal_pub_id if it is present, and that of journal_jcode otherwise.
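The rule described here amounts to a simple fallback, which can be sketched in base R as follows. The data frame and its identifier values are made up for illustration, and the snippet only mirrors the rule as stated. It is not the source code of jst_unify_journal_id().

```r
# Made-up identifiers to illustrate the fallback rule.
articles <- data.frame(
  journal_jcode = c("jcode_a", "jcode_b", NA),
  journal_pub_id = c(NA, "pub_id_b", "pub_id_c"),
  stringsAsFactors = FALSE
)

# Prefer journal_pub_id, fall back to journal_jcode when it is missing.
articles$journal_id <- ifelse(
  is.na(articles$journal_pub_id),
  articles$journal_jcode,
  articles$journal_pub_id
)

articles$journal_id
#> [1] "jcode_a"  "pub_id_b" "pub_id_c"
```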
With this algorithm, it should be possible to reliably match the
identifiers to general information about the respective journals, which
is available from jst_get_journal_overview():
sample_article %>%
  jst_unify_journal_id() %>%
  left_join(jst_get_journal_overview()) %>%
  tidyr::gather(variable, value) %>%
  knitr::kable()
#> Joining with `by = join_by(journal_id)`
variable | value |
---|---|
file_name | article_with_references |
journal_title | Transactions of the American Microscopical Society |
article_doi | 10.2307/3221896 |
article_pub_id | NA |
article_jcode | NA |
article_type | research-article |
article_title | On the Protozoa Parasitic in Frogs |
volume | 41 |
issue | 2 |
language | eng |
pub_day | 1 |
pub_month | 4 |
pub_year | 1922 |
first_page | 59 |
last_page | 76 |
page_range | 59-76 |
journal_id | tranamermicrsoci |
title | Transactions of the American Microscopical Society |
issn | 00030023 |
eissn | NA |
doi | 10.2307/j100072 |
url | https://www.jstor.org/journal/tranamermicrsoci |
discipline | Biological Sciences ; Science & Mathematics ; Zoology |
publisher | American Microscopical Society ; Wiley |
coverage_range | 1878-1994 |
oclc_catalog_identifier | 61241470 |
lccn_catalog_identifier | 2005 237209 |
archive_release_date | 2005-08-10 |
collections | Biological Sciences Collection ; Corporate & For-Profit Access Initiative Collection ; JSTOR Archival Journal & Primary Source Collection ; Life Sciences Collection |
Duplicated ngrams
Source | time span | Part |
---|---|---|
American Journal of Sociology | Unknown | Book Reviews |
For the AJS, ngrams for book reviews are calculated per issue. There are numerous reviews per issue, and each of them has an identical file of ngrams, containing ngrams for all book reviews of this issue.
One possible strategy for dealing with this is not to use those ngrams at all, since they are calculated on all reviews in the issue, irrespective of whether all reviews of the given issue are actually in the sample. Alternatively, one could group by issue and keep only one set of ngrams per issue.
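The second strategy can be sketched in base R. The data frame below is made up and assumes that all rows are book reviews from a single journal, identified by volume and issue. With real data you would apply the same filter to the imported metadata before reading the corresponding ngram files.

```r
# Made-up metadata for book reviews from a single journal.
reviews <- data.frame(
  file_name = c("review_1", "review_2", "review_3", "review_4"),
  volume = c(100, 100, 100, 101),
  issue = c(1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only the first review of every volume/issue combination, so that
# each issue-level ngram file is used once.
one_per_issue <- reviews[!duplicated(reviews[c("volume", "issue")]), ]

one_per_issue$file_name
#> [1] "review_1" "review_3" "review_4"
```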
Language codes
Information on languages is not consistent. For the sample article,
language is eng. In other cases it might be en. It is thus advisable to
take a quick look at the different variants via
distinct(meta_data, language) or count(meta_data, language).
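Once the variants are known, they can be harmonised with a lookup vector. The following base R sketch uses made-up data, and the mapping in `lookup` is an assumption that needs to be adapted to the codes you actually find in your sample.

```r
# Made-up metadata with inconsistent language codes.
meta_data <- data.frame(
  language = c("eng", "en", "eng", "fre", NA),
  stringsAsFactors = FALSE
)

# Inspect the variants, including missing values.
table(meta_data$language, useNA = "ifany")

# Assumed mapping from variant codes to the preferred code.
lookup <- c(en = "eng", fr = "fre")

needs_recode <- meta_data$language %in% names(lookup)
meta_data$language[needs_recode] <-
  unname(lookup[meta_data$language[needs_recode]])

meta_data$language
#> [1] "eng" "eng" "eng" "fre" NA
```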
Incorrect/odd references
When analysing data about references and footnotes, you will encounter many inconsistencies and errors. Most of them are not due to errors from DfR, but stem simply from the fact that humans make mistakes when creating manuscripts, and not all errors in references are caught before printing.
Problems with non-English characters
A common problem is names with non-English characters, like German umlauts (Ferdinand Tönnies) or Nordic names (Gøsta Esping-Andersen). These will appear in many different variations: Tonnies, Tönnies, Gosta, Gösta, etc.
OCR issues
For older articles, you might encounter issues that stem from
digitising text with OCR software. A common problem is distinguishing
I from l, as in the phrase “In love”. Depending on which names appear in
your data, this might lead to inconsistencies.
Errors by article authors
There are many examples where authors make mistakes and your summary statistics end up being skewed. This article about “Ethics Education in the Workplace” cites the same items multiple times, which is possibly an artifact. The advantage of using JSTOR/DfR data is that you can inspect all sources and check whether a specific pattern you see is an artifact or genuine.