The Science of Science (SciSci) is an emerging, trans-disciplinary approach for using large and disparate data-sets to study the emergence, dissemination, and impact of scientific research (Fortunato et al. 2018). Bibliometric databases such as the Web of Science are rich sources of data for SciSci studies (Sugimoto and Larivière 2018). In recent years the type and scope of questions addressed with data gathered from these databases have expanded tremendously (Fortunato et al. 2018). This is due in part to their expanding coverage and greater accessibility, but also because advances in computational power make it possible to analyze data-sets comprising millions of bibliographic records (e.g., Larivière et al. 2013, Smith et al. 2014).
The rapidly increasing size of bibliometric data-sets available to researchers has exacerbated two major and persistent challenges in SciSci research. The first of these is Author Name Disambiguation. Correctly identifying the authors of a research product is fundamental to bibliometric research, as is the ability to correctly attribute to a given author all of their scholarly output. However, this seemingly straightforward task is often extremely complicated, even when using the nominally high-quality data extracted from bibliometric databases (reviewed in Smalheiser and Torvik 2009). The most obvious case is when different authors have identical names, which can be quite common in some countries (Strotmann and Zhao 2012). However, confusion might also arise as a result of journal conventions or individual preferences for abbreviating names. For instance, one might conclude “J. C. Smith”, “Jennifer C. Smith”, and “J. Smith” are different authors, when in fact they are the same person. Conversely, papers by “E. Martinez” could have been written by different authors with the same last name whose first names start with the same letter (e.g., “Enrique”, “Eduardo”). Failure to disambiguate author names can seriously undermine the conclusions of some SciSci studies, but manually verifying author identity quickly becomes impractical as the number of authors or papers in a dataset increases.
The second challenge to working with large bibliometric data-sets is correctly parsing author addresses. The structure of author affiliations is complex and idiosyncratic, and journals differ in the information they require authors to provide and the way in which they present it. Authors may also represent affiliations in different ways on different articles. For instance, the affiliation might be written in different ways in different journals (e.g., “Dept. of Biology”, “Department of Biology”, “Departamento de Biologia”). The same is true of the institution’s name (“UC Davis”, “University of California-Davis”, “University of California”) or the country in which it is based (“USA”, “United States”, “United States of America”). Researchers at academic institutions might include one or more Centers, Institutes, Colleges, Departments, or Programs in their address, and researchers working for the same institution could be based at units in geographically disparate locations (e.g., University of Florida researchers could be based at the main campus in Gainesville or one of dozens of facilities across the state, including 12 Research and Education Centers, 5 field stations, and 67 County Extension Offices). Finally, affiliations are recorded in a single field of a reference’s bibliographic record, despite comprising very different types of information (e.g., city, postal code, institution). In concert, these factors can make it challenging to conduct analyses for which author affiliation or location is of particular interest.
refsplitr helps users of the R statistical computing environment (R Core Team 2019) address these challenges. It imports and organizes the output from Web of Science searches, disambiguates author names and suggests which might need additional scrutiny, parses author addresses, and georeferences authors’ institutions. It also maps author locations and coauthorship networks. Finally, the processed data-sets can be exported in tidy formats for analysis with user-written code or, after some additional formatting, packages such as
revtools (Westgate 2018) or
bibliometrix (Aria & Cuccurullo 2017).
Appendix 1 provides guidance on downloading records from the Web of Science in the proper format for use in
refsplitr. Once bibliographic records have been downloaded, the
refsplitr package’s tools are applied in four steps:
Learning to use
refsplitr with the examples below: The examples use the dataset ‘example_data.txt’ included with the
refsplitr package. To use them, (1) create two folders in the working directory or RStudio project: one named “data” and one named “output”. (2) Save the sample ‘example_data.txt’ file in the “data” folder. This is the same folder structure we recommend for saving and processing your own Web of Science output files.
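The folder setup in step (1) can also be done from within R; a minimal base-R sketch (no refsplitr functions required):

```r
# Create the recommended folder structure in the working directory;
# showWarnings = FALSE makes this safe to re-run if the folders already exist.
dir.create("data", showWarnings = FALSE)
dir.create("output", showWarnings = FALSE)
```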
The refsplitr package can either import a single Web of Science search result file or combine and import multiple files located in the same directory. The acceptable file formats are ‘.txt’ and ‘.ciw’ (Appendix 1). Importing reference records is done with the
references_read() function, which has three arguments:
data: The location of the directory in which the Web of Science file(s) are located. If left blank, the files are assumed to be in the working directory. If they are in a different directory (e.g., the ‘data’ folder in the working directory), absolute or relative file paths can be used.
dir: set dir=FALSE when loading a single file and dir=TRUE when loading multiple files. If multiple files are processed,
refsplitr will identify and remove any duplicate reference records.
include_all: Setting ‘include_all=TRUE’ will import all fields from the Web of Science record (see Appendix 2). The default is ‘include_all=FALSE’.
The output of
references_read() is an object in the R workspace. Each row of the output is a reference; the columns are the name of the .txt file from which the data were extracted, a unique id number assigned by
refsplitr to each article, and the data from each field of the reference record (see Appendix 2 for a list of these data fields and their Web of Science and RIS codes). This object is used by
refsplitr in Step 2.2; we recommend also saving it as a .csv file in the “output” folder.
# import a single Web of Science file
example_refs <- references_read(data = "./data/example_data.txt", dir = FALSE, include_all = FALSE)
# import and combine all files in a directory
example_refs <- references_read(data = "./data/UF_data", dir = TRUE, include_all = FALSE)
# import the example dataset bundled with the refsplitr package
example_refs <- references_read(data = system.file("extdata", package = "refsplitr"), dir = TRUE, include_all = FALSE)
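The recommendation above to save the imported records as a .csv file in the “output” folder can be sketched as follows; here a toy data frame stands in for the object returned by references_read(), and the file name is illustrative:

```r
# Write the imported records to the recommended "output" folder as a .csv file.
# A toy data frame stands in for the real output of references_read().
example_refs <- data.frame(refID    = 1:2,
                           filename = "example_data.txt",
                           AF       = c("Smith, Jennifer C.", "Martinez, E."))
dir.create("output", showWarnings = FALSE)
write.csv(example_refs, file = "./output/example_refs.csv", row.names = FALSE)
```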
Figure 1: An image of the .csv file showing a subset of the rows and columns from the output of references_read().
refsplitr can generate five visualizations of scientific productivity and coauthorship. The functions that generate these visualizations use the packages
rworldmap (No. 1),
ggplot2 (Nos. 2, 4, and 5), and
igraph (No. 3). Advanced users of these packages can customize the visualizations to suit their needs. WARNING: The time required to render these plots is highly dependent on the number of authors in the dataset and the processing power of the computer on which analyses are being carried out.
plot_addresses_country <- plot_addresses_country(example_georef$addresses)
Figure 8: Plot of the countries in which the authors in the dataset are based, with shading indicating the number of authors based in each country.
plot_addresses_points <- plot_addresses_points(example_georef$addresses)
plot_addresses_points
Figure 9: Figure indicating the georeferenced locations of all authors in the dataset.
plot_addresses_points <- plot_addresses_points(example_georef$addresses, mapCountry = "Brazil")
plot_addresses_points
Figure 10: Figure indicating the georeferenced locations of authors in the dataset with institutional addresses in Brazil.
plot_net_coauthor <- plot_net_coauthor(example_georef$addresses)
plot_net_coauthor
#> IGRAPH 1891651 UNW- 14 25 --
#> + attr: name (v/c), label (v/c), label.color (v/c), label.cex (v/n),
#> | size (v/n), frame.color (v/l), color (v/c), weight (e/n)
#> + edges from 1891651 (vertex names):
#> argentina --mexico      argentina --usa       australia --brazil
#> australia --germany     australia --mexico    australia --usa
#> belgium   --brazil      belgium   --usa       brazil    --germany
#> brazil    --mexico      brazil    --usa       canada    --england
#> canada    --france      canada    --scotland  canada    --usa
#> costa rica--netherlands costa rica--usa       england   --france
#> england   --usa         france    --usa       germany   --usa
#> + ... omitted several edges
Figure 11: Plot of the coauthorship network for authors of articles in the dataset.
plot_net_country <- plot_net_country(example_georef$addresses)
#> Regions defined for each Polygons
plot_net_country$plot
Figure 12: Map showing the coauthorship connections between countries.
plot_net_address <- plot_net_address(example_georef$addresses)
#> Regions defined for each Polygons
plot_net_address$plot
Figure 13: Plot showing the network between individual author locations.
Aria, M. & Cuccurullo, C. (2017) bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4): 959-975. DOI: 10.1016/j.joi.2017.08.007
Fortunato, S., C. T. Bergstrom, K. Barner, J. A. Evans, D. Helbing, S. Milojevic, A. M. Petersen, F. Radicchi, R. Sinatra, B. Uzzi, A. Vespignani, L. Waltman, D. Wang, & A.-L. Barabasi (2018). Science of science. Science, 359: eaao0185. DOI: 10.1126/science.aao0185
Larivière, V., Ni, C., Gingras, Y., Cronin, B., & Sugimoto, C. R. (2013). Bibliometrics: Global gender disparities in science. Nature News, 504(7479): 211-213. DOI: 10.1038/504211a
R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Smalheiser, N. R., & Torvik, V. I. (2009). Author name disambiguation. Annual Review of Information Science and Technology, 43(1): 1-43. DOI: 10.1002/aris.2009.1440430113
Smith, M. J., Weinberger, C., Bruna, E. M., & Allesina, S. (2014). The scientific impact of nations: Journal placement and citation performance, PLOS One 9(10): e109195. DOI: 10.1371/journal.pone.0109195
Strotmann, A., & Zhao, D. (2012). Author name disambiguation: What difference does it make in author-based citation analysis? Journal of the American Society for Information Science and Technology, 63(9): 1820-1833. DOI: 10.1002/asi.22695
Sugimoto CR, Larivière V. (2018). Measuring Research: What Everyone Needs to Know?. Oxford University Press, Oxford, UK. 149 pp. ISBN-10: 9780190640125
Westgate, M. J. (2018). revtools: bibliographic data visualization for evidence synthesis in R. bioRxiv:262881. DOI: 10.1101/262881
Appendix 1, Figure 1: Web of Science Download Instructions
| Field | Description |
|---|---|
| filename | file from which records were imported |
| AF | Author Full Name |
| DI | Digital Object Identifier (DOI) |
| FU | Funding Agency and Grant Number |
| PT | Publication Type (J=Journal; B=Book; S=Series; P=Patent) |
| OI | Open Researcher and Contributor ID Number (ORCID ID) |
| SN | International Standard Serial Number (ISSN) |
| TC | Web of Science Core Collection Times Cited Count |
| WC | Web of Science Categories |
| Z9 | Total Times Cited Count² |
| refID | a unique identifier for each article in the dataset assigned by refsplitr |
¹ The following Web of Science data fields are only included if users select the include_all=TRUE option in references_read(): CC, CH, CL, CT, CY, DT, FX, GA, GE, ID, IS, J9, JI, LA, LT, MC, MI, NR, PA, PI, PN, PS, RID, SU, TA, VR.
² Includes citations in the Web of Science Core Collection, BIOSIS Citation Index, Chinese Science Citation Database, Data Citation Index, Russian Science Citation Index, and SciELO Citation Index.
To help users identify potential false-positive and false-negative author matches, we calculate a confidence score. This score is calculated for names that could not be matched automatically by the algorithm described above, but that were matched with a ‘best guess’ approach using the available information and Jaro-Winkler string matching. Names matched with this method have higher false-positive rates because we lacked the information needed to match them directly. The score is a 0-10 rating, where 0 means we have no information supporting the match and 10 means that, based on the available information, we are very confident the match is correct (but not confident enough to have matched it automatically). Names are given a score using the following criteria:
4 - The postal code matches
2 - The country matches
2 - The last name is longer than 10 characters
2 - The last name contains a dash (like when there are two surnames)
1 - The last name is longer than 6 characters but less than 11.
1 - Either name has a middle initial
1 - Either name has a full first name
1 - For each instance in which a name or match name contains a university or an email address (and is thus easier to verify with a Google search)
The maximum score is 10, regardless of the sum of the individual scores.
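The scoring rules above can be sketched in base R as follows; the function name and input fields are illustrative assumptions, not the package’s internal implementation:

```r
# Illustrative sketch of the confidence score; not refsplitr's internal code.
match_confidence <- function(postal_match,        # TRUE if postal codes match
                             country_match,       # TRUE if countries match
                             last_name,           # surname as a string
                             has_middle_initial,  # TRUE if either name has one
                             has_full_first_name, # TRUE if either name has one
                             n_univ_or_email = 0) # count of university/email instances
{
  score <- 0
  if (postal_match)  score <- score + 4
  if (country_match) score <- score + 2
  if (nchar(last_name) > 10) {
    score <- score + 2          # last name longer than 10 characters
  } else if (nchar(last_name) > 6) {
    score <- score + 1          # last name of 7-10 characters
  }
  if (grepl("-", last_name)) score <- score + 2  # hyphenated (double) surname
  if (has_middle_initial)    score <- score + 1
  if (has_full_first_name)   score <- score + 1
  score <- score + n_univ_or_email               # +1 per university/email instance
  min(score, 10)                                 # capped at 10
}
```

For example, a match with the same postal code and country and a hyphenated 14-character surname already exceeds the cap and scores 10.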
Using these scoring criteria, we find that scores >7 are very likely to be correct, scores >=5 but <7 are nearly always correct, and scores <5 are variable in their accuracy.
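The accuracy bands just described can be expressed as a small helper (the function name is illustrative; a score of exactly 7 is grouped with the >=5 band here):

```r
# Illustrative mapping from a confidence score to the accuracy bands above.
score_band <- function(score) {
  if (score > 7) {
    "very likely correct"
  } else if (score >= 5) {
    "nearly always correct"
  } else {
    "variable accuracy; verify manually"
  }
}
```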
This can help guide users when manually checking reviewed names, but it can also speed up the process when running analyses: users can set a confidence threshold in
authors_refine() depending on their tolerance for false positives, without needing to manually check all reviewed names.