The Data for Research (DfR) service by JSTOR offers several options for the text analysis of scientific articles. In this vignette I demonstrate how to analyse the n-grams which DfR delivers.

Let’s suppose we are interested in the topic of “inequality” within the discipline of sociology. Social inequality can be considered a prime subject of sociological inquiry. To gain some context on the subject, we might want to analyse frequently occurring terms. DfR offers different grades of tokenization: n-grams of one to three words, i.e. unigrams, bigrams and trigrams. In case you are unfamiliar with the analysis of tokenized text, you could read the first few paragraphs of chapters 1 and 4 in as an introduction.

Our analysis starts at the main page of DfR. We create a dataset1 by searching for “inequality” and selecting “sociology” as our subject. To trim down the number of articles, we only select articles from 1997 to 2017. After logging in/creating an account, we select unigrams and bigrams. The resulting .zip-file can be downloaded from the email which DfR sends.

Up-front, we need to load some packages. jstor is available from CRAN, so it can be installed via install.packages("jstor").


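library(tidyverse)
library(jstor)
library(future)

# set a lighter theme for plots (theme_bw() is an assumption)
theme_set(theme_bw())

# import files in parallel (multisession is an assumption)
plan(multisession)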

With the latest version of jstor we can import files directly from a zip archive. We only need to specify where the zip archive is located, which parts we want to extract, and where the resulting files should be saved.

Before importing data, we can take a quick look at the contents of each archive with jst_preview_zip():
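A minimal sketch of such a preview. The file name part1.zip is hypothetical, since the actual location of the archive is not given here, and the call is skipped if no such file exists:

```r
# Hypothetical location of the archive downloaded from DfR; adjust as needed.
zip_path <- "part1.zip"

if (file.exists(zip_path)) {
  # Gives an overview of the archive's contents without importing anything
  jstor::jst_preview_zip(zip_path)
}
```

The preview helps to decide which parts of the archive (metadata, n-grams) to import in the following steps.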


Although we could import all n-grams at this point, this would be extremely inefficient. It is best to decide first which articles we want to analyze, and to import the corresponding n-gram files afterwards.

The following code assumes that you follow a workflow organised around projects within RStudio (refer to for further information).

import_spec <- jst_define_import(article = jst_get_article)

jst_import_zip("", import_spec = import_spec, out_file = "part1")
jst_import_zip("", import_spec = import_spec, out_file = "part2")

#> Processing files for journal_article with functions jst_get_article
#> Processing chunk 1/1
#>  Progress: ───────────────────────────────────────────────────────────── 100%

Since jst_import_zip writes the results to disk, we need to read the metadata from the newly created files. This is made easy by jst_re_import, which ensures that the data are read with the right column types.

imported_metadata <- c("part1_journal_article_jst_get_article-1.csv",
                       "part2_journal_article_jst_get_article-1.csv") %>% 
  purrr::map_df(jst_re_import)

Cleaning the data

Data from DfR is inherently messy. To fix a few common issues, we can use jst_augment():

imported_metadata <- jst_augment(imported_metadata, quietly = TRUE)

For more information on common quirks with data from DfR and how to deal with them, take a look at the vignette("known-quirks").


Before diving into the analysis of n-grams, we might wish to take an exploratory look at our metadata. The first thing to look at are the types of articles.

ggplot(imported_metadata, aes(article_type)) +
  geom_bar()

[Figure: bar chart of the number of articles per article type]

We can see that the majority of articles are proper “research-articles”, which together with book reviews and miscellaneous articles amount to ~99% of all articles.

imported_metadata %>% 
  count(article_type, sort = TRUE) %>% 
  mutate(perc = scales::percent(n/sum(n)))

We must be cautious, however, when using this variable to distinguish articles into categories. In this instance, we have “research-articles” which are actually book-reviews:

imported_metadata %>% 
  filter(article_type == "research-article" & str_detect(article_title, "Book")) %>% 
  select(file_name, article_title, pub_year)

For the current demonstration, we want to restrict the sample to research articles. We therefore need to remove book reviews and other miscellaneous articles: first, we filter by article_type, then we remove articles whose title starts with “Book Review”.

research_articles <- imported_metadata %>% 
  filter(article_type == "research-article") %>% 
  filter(!str_detect(article_title, "^Book Review"))

The moving wall - filtering articles by time

Since JSTOR has a moving wall, we should take a look at the number of articles per year in our dataset.

research_articles %>% 
  ggplot(aes(pub_year)) +
  geom_bar()

[Figure: number of research articles per publication year]

From this graph we can see an increase in research articles until 2010, after which the number of articles first tapers off, and then drops off sharply. For this reason we should exclude articles at least from 2015 onward, since the sample might get quite biased toward specific journals.

without_wall <- research_articles %>% 
  filter(pub_year < 2015)

Flagship journals - filtering articles by journal

Since the number of articles is still rather large for this demonstration, we select only a few journals. Here, we will look at articles from two leading journals within the discipline, “American Journal of Sociology” and “American Sociological Review”.

Since we cleaned the identifiers for journals with jst_augment earlier, we can select our two flagship-journals very easily.

flagship_journals <- without_wall %>%
  filter(journal_id %in% c("amerjsoci", "amersocirevi"))

Importing bigrams

Disclaimer: Much of the following analysis was inspired by the book “Text Mining with R” by Julia Silge and David Robinson.

For this demonstration we will look at bigrams to find the most common pairs of words. Until now, we were only dealing with the metadata, therefore we need a way to link our reduced dataset to the bigram files from DfR. The directory structure for deliveries from DfR looks something like this:

  -- metadata
    -- journal_article_foo.xml
  -- ngram2
    -- journal_article_foo.txt
  -- metadata
    -- journal_article_bar.xml
  -- ngram2
    -- journal_article_bar.txt

From this structure we can see that the file name can serve as an identifier to match articles and n-grams, since it is similar between metadata and n-grams.
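The matching idea can be sketched in base R. The paths below simply mirror the simplified names from the listing above and are hypothetical; file names in real deliveries may differ in more than the extension, which is why jst_subset_ngrams is the more convenient route.

```r
# Hypothetical paths mirroring the directory structure shown above
metadata_files <- c("metadata/journal_article_foo.xml",
                    "metadata/journal_article_bar.xml")
ngram_files <- c("ngram2/journal_article_bar.txt",
                 "ngram2/journal_article_foo.txt")

# Strip directory and extension to obtain the shared identifier
file_id <- function(path) tools::file_path_sans_ext(basename(path))

# For each metadata file, look up the n-gram file with the same identifier
matched <- ngram_files[match(file_id(metadata_files), file_id(ngram_files))]
matched
```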

To make importing a subset of ngrams more convenient, we can use jst_subset_ngrams. This function returns a list of “zip-locations”, which jst_get_ngram can read.

ngram_selection <- jst_subset_ngrams(c("", ""), "ngram2",
                                     flagship_journals)

head(ngram_selection, 2)

imported_bigrams <- ngram_selection %>% 
  furrr::future_map_dfr(jst_get_ngram)

From the 872 articles in our two flagship journals we now have 6,729,813 bigrams. The bigrams are calculated by JSTOR for each article independently. To reduce the sample to the most common bigrams, we have two choices: either we include only terms which occur a given number of times within each article, or we include terms which occur a given number of times across all articles. By only including terms which occur more than 5 times in each article, for example, we can drastically reduce the number of terms. However, we might miss some important ones: there might be terms which do not occur repeatedly within articles, but are present in all of them.

For demonstration purposes we are a bit restrictive and include only those terms which occur at least three times per article.

top_bigrams <- imported_bigrams %>%
  filter(n >= 3)
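To make the difference between the two choices concrete, here is a small sketch on invented toy data (the object toy_bigrams and the thresholds are illustrative assumptions, not part of the DfR data):

```r
library(dplyr)

# Toy data in the shape of the imported bigrams (invented for illustration)
toy_bigrams <- tibble(
  file_name = c("a", "a", "b", "b", "c"),
  bigrams   = c("labor market", "social capital",
                "labor market", "social capital", "social capital"),
  n         = c(4, 1, 3, 2, 2)
)

# Option 1: keep bigrams that occur at least three times within an article
per_article <- toy_bigrams %>%
  filter(n >= 3)

# Option 2: keep bigrams that occur at least five times across all articles
across_articles <- toy_bigrams %>%
  group_by(bigrams) %>%
  filter(sum(n) >= 5) %>%
  ungroup()
```

In this toy example “social capital” never occurs three times within a single article and is dropped by the first option, but it is kept by the second because it occurs five times overall.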

Cleaning up bigrams

When constructing n-grams, DfR uses a stop-word list, which is quite limited 2. If we would like to restrict the terms a bit further, we could use stopwords from tidytext:

library(tidytext) # provides the stop_words data set

bigrams_separated <- top_bigrams %>%
  separate(bigrams, c("word1", "word2"), sep = " ")

bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

After removing the stopwords we need to consider the fact that our bigrams were created for each article on its own. In order to analyse them together, we need to count the terms for all articles in combination.

bigram_counts <- bigrams_filtered %>%
  group_by(word1, word2) %>%
  summarise(n = sum(n)) %>%
  arrange(desc(n))

From the first few terms we can see that there are still many terms which are not very interesting for our analysis. The terms “american” and “sociological” are simply part of the title of a journal we selected (American Sociological Review). To clean the terms up, we can employ different approaches. One is to simply filter the terms we wish to exclude:

bigram_counts_clean <- bigram_counts %>%
  unite(bigram, word1, word2, sep = " ") %>%
  filter(!bigram %in% c("american sociological", "sociological review",
                        "university press", "american journal",
                        "journal sociology")) %>%
  separate(bigram, c("word1", "word2"))

We will look at another approach after plotting our bigrams.

Visualize relationships

When analyzing bigrams, we might want to look at the relationships between common terms. For this we can leverage the power of igraph and ggraph.

First, we only keep the most common terms and then convert our data.frame to an igraph-object. 3

bigram_graph <- bigram_counts_clean %>%
  filter(n > 500) %>%
  igraph::graph_from_data_frame()

For plotting, we will use a simple plotting function, adapted from

library(igraph)
library(ggraph)

plot_bigrams <- function(igraph_df, seed = 2016) {
  # fix the random seed so that the layout is reproducible
  set.seed(seed)

  a <- grid::arrow(type = "closed", length = unit(.15, "inches"))

  ggraph(igraph_df, layout = "fr") +
    geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                   arrow = a, end_cap = circle(.07, 'inches')) +
    geom_node_point(color = "lightblue", size = 4) +
    geom_node_text(aes(label = name), repel = TRUE) +
    theme_void()
}

plot_bigrams(bigram_graph)

[Figure: network graph of the most frequent bigrams]

A group of nodes which are not relevant to the topic of inequality stands out. They come from LaTeX documents and somehow made their way into the original dataset. However, since they are more common than most of the other terms, they are quite easy to remove. We can look at the nodes/vertices of our graph with V(bigram_graph).

The first node, “labor”, is relevant to us, but all other nodes from 2 to at least 40 are clearly irrelevant. We can remove them by simple subtraction:

bigram_graph_clean <- bigram_graph - 2:40

Another apparent group is a combination of “table” or “figure” with digits. This evidently comes from tables or figures in the papers and might suggest that the articles in our sample quite frequently employ quantitative methods, where figures and tables are very common. For the analysis at hand, however, we might remove them, along with a few other irrelevant terms.

bigram_graph_clean <- bigram_graph_clean - c("table", "model",
                                             "xd", "rh", "landscape", "00",
                                             "figure", "review", "79",
                                             "http", "www", "000", "01")
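As an alternative to removing vertices by name, purely numeric terms could also be filtered out before the graph is built. The following is only a sketch on invented toy data (toy_counts is hypothetical), not the approach used in this vignette:

```r
library(dplyr)

# Toy counts in the shape of bigram_counts_clean (invented for illustration)
toy_counts <- tibble(
  word1 = c("table", "figure", "labor", "income"),
  word2 = c("2", "1", "market", "inequality"),
  n     = c(900, 700, 1200, 800)
)

# Drop bigrams in which either word consists of digits only
no_digits <- toy_counts %>%
  filter(!grepl("^[0-9]+$", word1), !grepl("^[0-9]+$", word2))
```

This removes combinations such as “table 2”, but it leaves terms like “table” paired with a non-numeric word untouched, so some manual cleaning would still be needed.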

After cleaning up a bit, we can take a fresh look at our bigrams.

plot_bigrams(bigram_graph_clean, 234)

[Figure: network graph of the most frequent bigrams after cleaning]

The figure is still far from perfect (“eco” -> “nomic” should clearly be one term), but we can begin to analyse our network.

The most frequent bigrams are now “labor market”, “labor force”, and “income inequality”, which is not very surprising given that most individuals in capitalist societies need to supply their work in exchange for income. For this reason, the labor market and its stratification are a prime subject of the sociological inquiry into inequality. A few further key dimensions of sociological analysis are apparent from the graph: gender, race/ethnicity, occupational and socioeconomic status. That we find many terms to be associated with the term “social” seems quite likely given the discipline’s subject.

At least two surprising results should be pointed out. First, it is not evident how the terms “ethnic” and “racial” are connected. They do not form a typical term like “social capital”, “middle class” or similar, nor could they be considered a dichotomy like “black” and “white”, which are often included in regression tables. From a theoretical point of view, they have slightly different meanings but are frequently used as synonyms. Second, there is a group of nodes around the term “university”: university -> chicago, university -> california, harvard -> university, etc. At least two explanations seem plausible: either many books are cited which are in some way associated with those universities (“The University of Chicago Press” is the largest university press in the United States), or many researchers who publish in the two flagship journals we selected are affiliated with those four universities: Harvard, Chicago, Cambridge and California. The prominence of university -> chicago -> press might at least partly be due to the fact that it is the publisher of the American Journal of Sociology, and is therefore included in each article from this journal.

Altogether, these findings will not be surprising to well-educated sociologists. Almost all bigrams in the graph are common concepts or terms within the discipline. Most importantly, those concepts and terms very often comprise two single words, as in “social science”, “social structure”, “social networks”. Two approaches might be useful to examine how these concepts are being used:

  1. Analyzing trigrams. It could be the case that many of the above concepts would show up in combination with more interesting terms, if we were to analyze combinations of three words.
  2. Another approach would be to first analyze data for bigrams as above, determining the core concepts of a field. In a second step, one could search for those core concepts in the raw data files and extract adjacent terms. This would not be possible with the default deliveries from DfR however, since full-text content is only available through a dedicated agreement.
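The second approach can be sketched in base R. The text below is an invented sentence standing in for full-text content, which is not part of the default DfR deliveries; all object names are hypothetical.

```r
# Invented sentence standing in for the full text of an article
text <- "The labor market rewards skills, but labor market segmentation persists in the urban labor market today."

# Tokenise into lower-case words, dropping punctuation
words <- strsplit(tolower(gsub("[^A-Za-z ]", "", text)), " +")[[1]]

# Positions at which the core concept "labor market" starts
starts <- which(head(words, -1) == "labor" & tail(words, -1) == "market")

# Pad with NA so that matches at the very start or end have no neighbour
padded <- c(NA, words, NA)

# Word directly before and directly after each occurrence
context <- data.frame(
  before = padded[starts],
  after  = padded[starts + 3],
  stringsAsFactors = FALSE
)
context
```

Counting the terms in such a context table across many articles would show which words typically surround a core concept.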

Comparison over time

Besides looking at the overall relationship of bigrams, we could be interested in the development over time of specific terms. Here, we want to look at how often “labor market” and “income inequality” appear from year to year.

For this, we need to join our bigrams with the metadata.

time_bigrams <- top_bigrams %>% 
  left_join(flagship_journals, by = "file_name") %>% 
  select(bigrams, n, pub_year)

Again, we need to sum up the counts, but this time grouped by year:

time_bigrams <- time_bigrams %>%
  group_by(bigrams, pub_year) %>%
  summarise(n = sum(n))

We now only keep the two terms of interest and plot them in a simple chart.

# filter the terms of interest
time_comparison <- time_bigrams %>% 
  filter(bigrams == "labor market" | bigrams == "income inequality")

ggplot(time_comparison, aes(pub_year, n, colour = bigrams)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = scales::pretty_breaks(7))

[Figure: yearly counts of “labor market” and “income inequality”]

In this instance, the plot does not reveal trends over time: the frequency of the terms fluctuates considerably but stays at a similar level. Single spikes of term frequency in specific years (for example income inequality in 2011) could stem from special issues explicitly concerned with income inequality, although a quick glance at the corresponding issues does not support this hypothesis.