Following the template in OpenAlex’s oa-percentage tutorial, this vignette uses openalexR to answer:

How many recent journal articles from the University of Pennsylvania are open access? And how many aren’t?

library(openalexR)
library(dplyr)
library(tidyr)
library(ggplot2)

We first need to find an identifier for the University of Pennsylvania (here, its ROR ID). We can do this by fetching the institutions entity and putting “University of Pennsylvania” in display_name or display_name.search:

oa_fetch(
  entity = "inst", # same as "institutions"
  display_name.search = "\"University of Pennsylvania\""
) %>%
  select(display_name, ror) %>% 
  knitr::kable()
#> Warning: Note: `oa_fetch` and `oa2df` now return new names for some columns in openalexR v2.0.0.
#>     See NEWS.md for the list of changes.
#>     Call `get_coverage()` to view the all updated columns and their original names in OpenAlex.
#> This warning is displayed once every 8 hours.
|display_name                               |ror                       |
|:------------------------------------------|:-------------------------|
|University of Pennsylvania                 |https://ror.org/00b30xv10 |
|California University of Pennsylvania      |https://ror.org/01spssf70 |
|Hospital of the University of Pennsylvania |https://ror.org/02917wp91 |
|University of Pennsylvania Health System   |https://ror.org/04h81rw26 |
|Indiana University of Pennsylvania         |https://ror.org/0511cmw96 |
|Cheyney University of Pennsylvania         |https://ror.org/02nckwn80 |
|University of Pennsylvania Press           |https://ror.org/03xwa9562 |

We will use the first ror, 00b30xv10, as one of the filters for our query.

Alternatively, we could go to the autocomplete endpoint at https://explore.openalex.org/ to search for “University of Pennsylvania” and find the ror there!

All other filters are straightforward and explained in detail in the original Jupyter notebook tutorial. The only difference here is that, instead of grouping by is_oa, we’re interested in the “trend” over the years, so we’re going to group by publication_year and perform the query twice: once for is_oa = "true" and once for is_oa = "false".

open_access <- oa_fetch(
  entity = "works",
  institutions.ror = "00b30xv10",
  type = "article",
  from_publication_date = "2012-08-24",
  is_paratext = "false",
  is_oa = "true",
  group_by = "publication_year"
)

closed_access <- oa_fetch(
  entity = "works",
  institutions.ror = "00b30xv10",
  type = "article",
  from_publication_date = "2012-08-24",
  is_paratext = "false",
  is_oa = "false",
  group_by = "publication_year"
)

uf_df <- closed_access %>%
  select(- key_display_name) %>%
  full_join(open_access, by = "key", suffix = c("_ca", "_oa")) 

uf_df
#>     key count_ca key_display_name count_oa
#> 1  2012     1104             2012     1423
#> 2  2013     4069             2013     5007
#> 3  2014     4150             2014     5123
#> 4  2015     4202             2015     5270
#> 5  2016     3778             2016     5382
#> 6  2017     3661             2017     5749
#> 7  2018     3937             2018     6378
#> 8  2019     4013             2019     6809
#> 9  2020     4230             2020     8147
#> 10 2021     4034             2021     8223
#> 11 2022     3822             2022     7751
#> 12 2023     4349             2023     7286
#> 13 2024     5739             2024     5069
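
To turn these counts into the percentage that the opening question asks about, one option is to divide the open access count by the yearly total. The following is only a minimal sketch, not part of the original tutorial: it copies three rows from the output above into a small data frame so that it runs without querying OpenAlex, and the names `counts`, `oa_share`, `total` and `pct_oa` are made up for illustration. It also assumes `key` is stored as character.

```r
library(dplyr)

# Three rows copied from the uf_df output above, so the sketch runs offline.
# Assumption: key is a character column.
counts <- data.frame(
  key = c("2019", "2020", "2021"),
  count_ca = c(4013, 4230, 4034),
  count_oa = c(6809, 8147, 8223)
)

oa_share <- counts %>%
  mutate(
    total = count_ca + count_oa,               # all articles found for that year
    pct_oa = round(100 * count_oa / total, 1)  # open access share, in percent
  )

oa_share
```

The same `mutate()` call should work on `uf_df` directly, since it only relies on the `count_ca` and `count_oa` columns. With the numbers above, roughly 67% of the 2021 articles are open access.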

Finally, we compare the number of open vs. closed access articles over the years:

uf_df %>%
  filter(key <= 2021) %>% # we do not yet have complete data for 2022 and after
  pivot_longer(cols = starts_with("count")) %>%
  mutate(
    year = as.integer(key),
    is_oa = recode(
      name,
      "count_ca" = "Closed Access",
      "count_oa" = "Open Access"
    ),
    label = if_else(key < 2021, NA_character_, is_oa)
  ) %>% 
  select(year, value, is_oa, label) %>%
  ggplot(aes(x = year, y = value, group = is_oa, color = is_oa)) +
  geom_line(linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(
    title = "University of Pennsylvania's progress towards Open Access",
    x = NULL, y = "Number of journal articles") +
  scale_color_brewer(palette = "Dark2", direction = -1) +
  scale_x_continuous(breaks = seq(2010, 2024, 2)) +
  geom_text(aes(label = label), nudge_x = 0.1, hjust = 0) +
  coord_cartesian(xlim = c(NA, 2022.5)) +
  guides(color = "none")