library(refimpact)

Introduction

This package is an API wrapper around the REF Impact Case Studies database API. Chances are that if you’re looking at this package, you already know what this dataset is, and you probably know roughly what you’re looking for.

If you have stumbled upon this package, however, and want to know more about the dataset, the REF Impact Case Studies website describes it in detail. If you are looking for a toy dataset for learning, you might find it useful for text mining, amongst other things.

Core functions

The core function for this package is ref_get(), which takes an API method as the first argument, and some optional arguments depending on the method.

The API methods available are detailed below, but presented here for quick reference:

  • SearchCaseStudies
  • ListUnitsOfAssessment
  • ListTagTypes
  • ListTagValues
  • ListInstitutions

SearchCaseStudies

This is the core method of the API, and the most important for users of this package. The search method requires one additional argument to the ref_get() function: query. This argument takes a list of query parameters, which can be as simple as a single Case Study ID, in which case a single record is returned. A query returning a single record is shown below to demonstrate the syntax and the returned data structure; more complex queries are shown later in the vignette.

results <- ref_get("SearchCaseStudies", query=list(ID=941))
print(results)
## # A tibble: 1 x 19
##   CaseStudyId Continent Country Funders ImpactDetails ImpactSummary ImpactType
##   <chr>       <list>    <list>  <list>  <chr>         <chr>         <chr>     
## 1 941         <df[,2] … <df[,2… <chr [… "\r\nImpact … "\r\nDrs Pep… Technolog…
## # … with 12 more variables: Institution <chr>, Institutions <list>,
## #   Panel <chr>, PlaceName <list>, References <chr>,
## #   ResearchSubjectAreas <list>, Sources <chr>, Title <chr>, UKLocation <list>,
## #   UKRegion <list>, UOA <chr>, UnderpinningResearch <chr>

You will note that the function returns a nested tibble, that is, a tibble with other data frames inside it. You can interrogate the top-level columns of the tibble as usual:

cat(results[[1, "CaseStudyId"]])
## 941
cat(results[[1, "Title"]])
## 
## Novel models for advanced imaging of urinary system function in healthy and diseased tissue.
cat(strtrim(results[[1, "ImpactSummary"]], width = 200), "<truncated>")
## 
## Drs Peppiatt-Wildman &amp; Wildman have developed novel models to investigate kidney and bladder
## function and drug action, through visualisation of cellular events in live tissue. This has had an
##  <truncated>
cat(strtrim(results[[1, "ImpactDetails"]], width = 200), "<truncated>")
## 
## Impact on Commerce
## Peppiatt-Wildman and Wildman's unique approaches to investigate kidney and bladder function in
## a range of experimental models, including tissue slices, isolated and perfused org <truncated>
cat(results[[1, "Institution"]])
## 
## University of Kent and University of Greenwich

You can also interrogate the nested fields the same way, and even subset them:

print(results[[1, "Country"]])
##   GeoNamesId           Name
## 1    2635167 United Kingdom
## 2    6252001  United States
print(results[[1, "Institutions"]])
##             AlternativeName         InstitutionName PeerGroup     Region
## 1 Greenwich (University of) University of Greenwich         D     London
## 2      Kent (University of)      University of Kent         B South East
##      UKPRN
## 1 10007146
## 2 10007150
print(results[[1, "Institutions"]][,c("UKPRN", "InstitutionName")])
##      UKPRN         InstitutionName
## 1 10007146 University of Greenwich
## 2 10007150      University of Kent

In the opinion of the package author, the nested tibble offers many advantages over other data representations: it is a relatively straightforward exercise to transform the data into a set of wide or narrow tables if required.
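As a minimal sketch of that kind of transformation, the code below builds a narrow table with one row per case study and country. It uses a hand-built mock of the nested structure (results_mock is a hypothetical stand-in, populated with the Country values printed above) so that it runs without calling the API, and it sticks to base R; other approaches would work equally well.

```r
# Mock of the nested structure, using the Country values printed above for ID 941.
# results_mock is a hypothetical stand-in for a real ref_get() result.
results_mock <- data.frame(CaseStudyId = "941", stringsAsFactors = FALSE)
results_mock$Country <- list(
  data.frame(GeoNamesId = c(2635167, 6252001),
             Name = c("United Kingdom", "United States"),
             stringsAsFactors = FALSE)
)

# Build one small data frame per case study, then stack them.
pieces <- lapply(seq_len(nrow(results_mock)), function(i) {
  nested <- results_mock$Country[[i]]
  if (nrow(nested) == 0) return(NULL)  # a case study may have no countries
  data.frame(CaseStudyId = results_mock$CaseStudyId[i], nested,
             stringsAsFactors = FALSE)
})

# Drop empty pieces before binding so rbind() only sees data frames.
country_long <- do.call(rbind, Filter(Negate(is.null), pieces))
print(country_long)
```

The same pattern should apply to the other list columns that hold data frames, such as Institutions.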

Returning a single case study based on the ID is obviously a niche use-case, so there are some other ways to search the database. But before getting to those, it is worth pointing out that you can select multiple case studies in a single query:

results <- ref_get("SearchCaseStudies", query=list(ID=c(941, 942, 1014)))
print(results)
## # A tibble: 3 x 19
##   CaseStudyId Continent Country Funders ImpactDetails ImpactSummary ImpactType
##   <chr>       <list>    <list>  <list>  <chr>         <chr>         <chr>     
## 1 941         <df[,2] … <df[,2… <chr [… "\r\nImpact … "\r\nDrs Pep… Technolog…
## 2 942         <df[,2] … <df[,2… <chr [… "\r\n    GRA… "\r\n    The… Societal  
## 3 1014        <df[,0] … <df[,0… <chr [… "\r\n    The… "\r\n    Res… Health    
## # … with 12 more variables: Institution <chr>, Institutions <list>,
## #   Panel <chr>, PlaceName <list>, References <chr>,
## #   ResearchSubjectAreas <list>, Sources <chr>, Title <chr>, UKLocation <list>,
## #   UKRegion <list>, UOA <chr>, UnderpinningResearch <chr>

The ID parameter above is an exclusive parameter - if you provide one or more IDs then the function will print a warning to the console, and remove all parameters except for the IDs. This is based on the API’s documented limitations.
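The behaviour described above can be pictured with the sketch below. It is an illustration only and not the package's actual implementation: drop_non_id_params() is a hypothetical name, the warning text is made up, and the sketch assumes the warning is only needed when other parameters accompany ID.

```r
# Illustrative sketch only: NOT the code used inside refimpact.
# If ID is present alongside other parameters, warn and keep only ID.
drop_non_id_params <- function(query) {
  if ("ID" %in% names(query) && length(query) > 1) {
    warning("ID supplied: all other query parameters will be ignored")
    query <- query["ID"]
  }
  query
}

# ID plus another parameter: only ID survives (warning suppressed here).
q1 <- suppressWarnings(drop_non_id_params(list(ID = c(941, 942), UoA = 5)))

# No ID: the query is returned unchanged.
q2 <- drop_non_id_params(list(UKPRN = 10007146, UoA = 3))
```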

The other parameters can all be combined for searching. Those parameters are:

  • UKPRN - This is a code referencing an institution, and comes from the ListInstitutions method below. Takes a single UKPRN.
  • UoA - This is a code referencing a Unit of Assessment, and comes from the ListUnitsOfAssessment method below. Takes a single ID.
  • tags - This is one or more codes referencing tags from the ListTagValues method. The tags are separated into 13 different TagTypes, which are detailed below. When multiple tags are provided to the search method, it will only return rows which contain all of the supplied tags.
  • phrase - You can search the database using a text query. The query must conform to Lucene search query syntax.

Some examples are shown below.

results <- ref_get("SearchCaseStudies", query=list(UKPRN = 10007777))
dim(results)
## [1]  7 19
results <- ref_get("SearchCaseStudies", query=list(UoA = 5))
dim(results)
## [1] 257  19
results <- ref_get("SearchCaseStudies", query=list(tags = c(11280, 5085)))
dim(results)
## [1] 24 19
results <- ref_get("SearchCaseStudies", query=list(phrase = "hello"))
dim(results)
## [1]  7 19
results <- ref_get("SearchCaseStudies", query=list(UKPRN = 10007146,
                                                   UoA   = 3))
dim(results)
## [1]  2 19

Unfortunately, the API method requires at least one search parameter, which makes it more difficult to download the entire dataset. A short script for this purpose is included at the end of this vignette.

Useful values for the UKPRN, UoA and tags parameters can be found by querying the other four API methods; phrase is the only parameter that can be used without looking up a code first. Each of the four other API methods is outlined below.

ListInstitutions

This method lists all of the institutions which are included in the REF Impact Case Studies database. The UKPRN column in the resulting tibble can be used as a query parameter.

institutions <- ref_get("ListInstitutions")
print(institutions)
## # A tibble: 155 x 5
##    AlternativeName          InstitutionName         PeerGroup Region       UKPRN
##    <chr>                    <chr>                   <chr>     <chr>        <int>
##  1 Open University          Open University         D         South East  1.00e7
##  2 Cranfield University     Cranfield University    B         East        1.00e7
##  3 Royal College of Art     Royal College of Art    G         London      1.00e7
##  4 Bishop Grosseteste Univ… Bishop Grosseteste Uni… F         East Midla… 1.00e7
##  5 Buckinghamshire New Uni… Buckinghamshire New Un… E         South East  1.00e7
##  6 Royal Central School of… Royal Central School o… G         London      1.00e7
##  7 Chester (University of)  University of Chester   E         North West  1.00e7
##  8 Canterbury Christ Churc… Canterbury Christ Chur… E         South East  1.00e7
##  9 York St John University  York St John University F         Yorkshire … 1.00e7
## 10 Edge Hill University     Edge Hill University    E         North West  1.00e7
## # … with 145 more rows
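As a small sketch of how the table might be used, the code below looks up a UKPRN by institution name and builds a query list from it. institutions_mock is a hypothetical two-row stand-in (values taken from the Institutions output shown earlier) so that the example runs offline; with the real table you would use the institutions object instead.

```r
# Hypothetical two-row stand-in for the ListInstitutions result,
# using values printed earlier in this vignette.
institutions_mock <- data.frame(
  InstitutionName = c("University of Greenwich", "University of Kent"),
  UKPRN = c(10007146L, 10007150L),
  stringsAsFactors = FALSE
)

# Find the UKPRN whose institution name contains "Kent".
is_kent <- grepl("Kent", institutions_mock$InstitutionName, fixed = TRUE)
kent_ukprn <- institutions_mock$UKPRN[is_kent]

# The result can be passed as the UKPRN search parameter, which takes a single value.
query <- list(UKPRN = kent_ukprn)
```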

ListTagTypes and ListTagValues

These methods provide tags which can be used as search parameters in the SearchCaseStudies method. The ListTagTypes method returns the types of tags available:

tag_types <- ref_get("ListTagTypes")
print(tag_types)
## # A tibble: 13 x 2
##       ID TagType             
##    <int> <chr>               
##  1     1 ImpactType          
##  2     3 Subject             
##  3     4 PlaceName           
##  4     5 Country             
##  5     6 Continent           
##  6     7 Interdisciplinary   
##  7     8 Similar             
##  8     9 Funder              
##  9    10 Panel               
## 10    11 InstitutionRegion   
## 11    12 InstitutionPeerGroup
## 12    13 UK Region           
## 13    15 Joint Submission

These tag types can then be used as an argument to the ListTagValues method, to get all tags for each type:

tag_values_5 <- ref_get("ListTagValues", tag_type = 5)
print(tag_values_5)
## # A tibble: 252 x 2
##       ID Name                        
##    <int> <chr>                       
##  1 11280 Afghanistan                 
##  2 11310 Aland Islands               
##  3 11116 Albania                     
##  4 11106 Algeria                     
##  5 25129 American Samoa, Territory of
##  6 11221 Andorra                     
##  7 11185 Angola                      
##  8 11301 Anguilla                    
##  9 11187 Antigua and Barbuda         
## 10 11328 Argentina                   
## # … with 242 more rows

This can take some time to iterate through, so the full table is bundled with this package. You can access it via ref_tags:

print(ref_tags)
## # A tibble: 9,400 x 4
##       ID Name                                    TypeID TagType   
##  * <int> <chr>                                    <int> <chr>     
##  1  5083 Cultural                                     1 ImpactType
##  2  5086 Economic                                     1 ImpactType
##  3  5087 Environmental                                1 ImpactType
##  4  5082 Health                                       1 ImpactType
##  5  5081 Legal                                        1 ImpactType
##  6  5080 Political                                    1 ImpactType
##  7  5085 Societal                                     1 ImpactType
##  8  5084 Technological                                1 ImpactType
##  9   911 Accounting, Auditing and Accountability      3 Subject   
## 10  1022 Aerospace Engineering                        3 Subject   
## # … with 9,390 more rows
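A table of this shape makes it easy to turn tag names into the IDs that the tags parameter expects. The sketch below assumes a small mock, tags_mock, with the same columns as ref_tags and rows taken from the outputs above; tag_id() is a hypothetical helper and not part of the package.

```r
# Hypothetical mock with the same columns as ref_tags (rows taken from the
# ListTagValues and ref_tags outputs shown in this vignette).
tags_mock <- data.frame(
  ID = c(5085L, 5082L, 11280L, 11116L),
  Name = c("Societal", "Health", "Afghanistan", "Albania"),
  TypeID = c(1L, 1L, 5L, 5L),
  TagType = c("ImpactType", "ImpactType", "Country", "Country"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the tag ID for a given tag type and name.
tag_id <- function(tags, type, name) {
  tags$ID[tags$TagType == type & tags$Name == name]
}

# Build a tags query for case studies tagged with both Afghanistan and Societal.
query <- list(tags = c(tag_id(tags_mock, "Country", "Afghanistan"),
                       tag_id(tags_mock, "ImpactType", "Societal")))
```

Matching on both TagType and Name guards against the same name appearing under more than one tag type.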

ListUnitsOfAssessment

This method lists all of the units of assessment which the Impact Case Studies can be assessed against. The tibble also includes an ID column which can be used when querying the SearchCaseStudies method.

UoAs <- ref_get("ListUnitsOfAssessment")
print(UoAs)
## # A tibble: 36 x 3
##       ID Panel        Subject                                                   
##    <int> <chr>        <chr>                                                     
##  1     1 "A         " Clinical Medicine                                         
##  2     2 "A         " Public Health, Health Services and Primary Care           
##  3     3 "A         " Allied Health Professions, Dentistry, Nursing and Pharmacy
##  4     4 "A         " Psychology, Psychiatry and Neuroscience                   
##  5     5 "A         " Biological Sciences                                       
##  6     6 "A         " Agriculture, Veterinary and Food Science                  
##  7     7 "B         " Earth Systems and Environmental Sciences                  
##  8     8 "B         " Chemistry                                                 
##  9     9 "B         " Physics                                                   
## 10    10 "B         " Mathematical Sciences                                     
## # … with 26 more rows
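Note that the Panel values in the output above are padded with trailing spaces. A minimal sketch of cleaning them before filtering is shown below; uoa_mock is a hypothetical two-row stand-in for the real table so that the example runs offline.

```r
# Hypothetical two-row stand-in for the ListUnitsOfAssessment result,
# including the trailing spaces seen in the Panel column.
uoa_mock <- data.frame(
  ID = c(1L, 7L),
  Panel = c("A         ", "B         "),
  Subject = c("Clinical Medicine", "Earth Systems and Environmental Sciences"),
  stringsAsFactors = FALSE
)

# Strip the padding so that exact comparisons behave as expected.
uoa_mock$Panel <- trimws(uoa_mock$Panel)

# IDs of all units of assessment in panel B.
panel_b_ids <- uoa_mock$ID[uoa_mock$Panel == "B"]
```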

Extracting the entire dataset

As alluded to above, the API cannot be searched without parameters, which means that downloading the entire dataset is not a simple task. The code below can be used to extract all records from the database.

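# Every case study belongs to a Unit of Assessment, so looping over all
# UoA IDs covers the whole database.
uoa_table <- ref_get("ListUnitsOfAssessment")
uoa_list <- uoa_table$ID

# Pre-allocate one list slot per UoA
ref_corpus <- vector(mode = "list", length = length(uoa_list))

for (i in seq_along(uoa_list)) {
  message("Retrieving data for UoA ", uoa_list[i])
  ref_corpus[[i]] <- ref_get("SearchCaseStudies", query = list(UoA = uoa_list[i]))
}

# Stack the per-UoA tibbles into a single table
output <- do.call(rbind, ref_corpus)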
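A long-running loop like this can be interrupted by a single failed request. One possible variation, sketched below under stated assumptions, wraps each request in tryCatch() and records which IDs failed so they can be retried. fetch_all() and stub_fetch() are hypothetical names; the stub stands in for the API so that the sketch runs offline.

```r
# Hypothetical helper: call `fetch` once per ID, skipping IDs whose request fails.
# `fetch` is any function that takes one UoA ID and returns a data frame. With
# refimpact loaded it could be:
#   function(id) ref_get("SearchCaseStudies", query = list(UoA = id))
fetch_all <- function(ids, fetch) {
  pieces <- lapply(ids, function(id) {
    tryCatch(fetch(id), error = function(e) {
      message("Skipping UoA ", id, ": ", conditionMessage(e))
      NULL
    })
  })
  failed <- ids[vapply(pieces, is.null, logical(1))]
  # Bind only the successful pieces
  list(data = do.call(rbind, Filter(Negate(is.null), pieces)),
       failed = failed)
}

# Stub standing in for the API: fails for ID 2, succeeds otherwise.
stub_fetch <- function(id) {
  if (id == 2) stop("simulated network error")
  data.frame(UoA = id, CaseStudyId = as.character(100 + id),
             stringsAsFactors = FALSE)
}

out <- fetch_all(1:3, stub_fetch)
```

The failed element can then be passed back to fetch_all() to retry only the requests that did not succeed.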