This document provides information, code and exercises to test and improve your R skills with an emphasis on road safety research. It was initially developed to support a 2 day course. The course is based on open road crash records from the stats19 package (Lovelace et al. 2019). However, the content should be of use for anyone working with road crash data that has (at a minimum):
You should type, run and ensure you understand each line of code in this document.
Code and data supporting the content can be found in the package’s GitHub repo at github.com/ropensci/stats19. The ‘issue tracker’ associated with that repo is a good place to ask questions about the course.
If you are not experienced with R, it is strongly advised that you read-up on and more importantly test out R and RStudio before attempting analyse road crash data with R. See the
stats19-training-setup vignette at https://docs.ropensci.org/stats19/articles/stats19-training-setup.html for guidance on getting started with R, RStudio and installing R packages.
The completing the course requires that the following packages, which can be installed with
install.packages(), can be loaded as follows:
You should type, run and ensure you understand each line of code in this document.
#> Warning in utils::citation(..., lib.loc = lib.loc): no date field in DESCRIPTION #> file of package 'stats19'
The learning outcomes of this first session are to learn: RStudio main features and scripts, R objects and functions, subsetting, basic plotting, and getting help.
The first exercise is to open up RStudio and take a look around and identify the main components, shown in the figure below. Explore each of the main components of RStudio. Try changing the Global Settings (in the Tools menu) and see RStudio’s short cuts by pressing
Option+Shift+K on Mac).
Projects are a way to organise related work together. Each project has its own folder and Rproj file. Advice: always working from projects will make your life easier! Start a new project with:
File > New Project You can choose to create a new directory (folder) or associate a project with an existing directory. Make a new project called stats1-course and save it in a sensible place on your computer. Notice that stats1-course now appears in the top right of RStudio.
Scripts are the files where R code is stored. Keeping your code in sensibly named, well organised and reproducible scripts will make your life easier: you could simply type all our code into the console, but that require retyping commands each time you run it. Instead, code that you want to keep and share should be saved script files, plain text files that have the
Make a new script with Flie > New File > Rscript or Ctrl+Shift+N
Save the script and give it a sensible name like
stats19-lesson-1.R with File > Save, the save button on the toolbar, or Ctrl+S.
Pro tip: You can also create new R scripts by typing and running this command in the R console:
Keeping scripts and other files associated with a project in a single folder per project (in an RStudio project) will help you find things you need and develop an efficient workflow.
Let’s start with some basic R operations. Write this code into your new
stats19-lesson-1.R R script and execute the result line-by-line by pressing Ctrl+Enter
This code creates two objects, both are vectors of 5 elements, and then plots them (bonus: check their length using the
length() function). Save the script by pressing Ctrl+S.
There are several ways to run code within a script and it is worth becoming familiar with each. Try running the code you saved in the previous section using each of these methods:
Ctrl+Enterto run that line of code.
Ctrl+Enterto run the highlighted code.
Ctrl+Shift+Enterto run all the code in a script.
source()to run all the code in a script e.g.
Pro tip: Try jumping between the console and the source editor by pressing Ctl+1 and Ctl+2.
Create new objects by typing and running the following code chunk in a new script, e.g. called
vehicle_type = c("car", "bus", "tank") casualty_type = c("pedestrian", "cyclist", "cat") casualty_age = seq(from = 20, to = 60, by = 20) set.seed(1) dark = sample(x = c(TRUE, FALSE), size = 3, replace = TRUE) small_matrix = matrix(1:24, nrow = 12) crashes = data.frame(vehicle_type, casualty_type, casualty_age, dark)
We can view the objects in a range of ways:
small_matrix, and run that code. Scroll up to see the numbers that didn’t fit on the screen.
head()function to view just the first 6 rows e.g.
nargument in the previous function call to show only the first 2 rows of
crashesobject in the environment tab to View it in a spreadsheet.
View(vehicle_type). What just happened?
You can, for example, view a summary of the
casualty_age variable by running the following line of code:
summary(casualty_age) #> Min. 1st Qu. Median Mean 3rd Qu. Max. #> 20 30 40 40 50 60
Exercise try these functions on each of the objects, what results do they give?
Bonus: Find out the class of the column
vehicle_type in the data frame
crashes with the command
class(crashes$vehicle_type). Why has it changed? Create a new object called
crashes_char that keeps the class of the character vectors intact by using the function
tibble::tibble() (see tibble.tidyverse.org and Section 4 for details).
RStudio can help you write code by autocompleting it. RStudio will look for similar objects and functions after typing the first three letters of a name.
When there is more than one option you can select from the list using the mouse or arrow keys. Within a function, you can get a list of arguments by pressing Tab.
Every function in R has a help page. You can view the help using
? for example
?sum. Many packages also contain vignettes, these are long form help documents containing examples and guides.
vignette() will show a list of all the vignettes available, or you can show a specific vignette for example
vignette(topic = "sf1", package = "sf").
The Environment tab shows all the objects in your environment, this includes datasets, parameters, and any functions you have created. By default, new objects appear in the Global Environment but you can see other environments with the drop-down menu. For example, each package has its own environment.
Sometimes you wish to remove things from your environment, perhaps because you no longer need them or things are getting cluttered.
xthat was created in a previous section.
xby entering it into the console?
save.image(); rm(list = ls()); load(".RData"). What happened?
.RDatafile in your project’s folder?
All the code shown so far is reproducible. To test RStudio’s debugging features, let’s write some code that fails, as illustrated in the figure below.
Always address debugging prompts to ensure your code is reproducible
We have already seen that you can save R scripts. You can also save individual R objects in the RDS format.
We can also read back in our data.
crashes are not identical, what has changed? Hint: read the help page associated with
Subsetting returns part of an R object. It can be done by providing numbers representing the positions of the elements we want (e.g. the 2nd element) or with a logical vector, with values associated with
TRUE returned. Two dimension object such as matrices and data frames can be subset by rows and columns. Subsetting in base R is done with square brackets
 after the name of an object. Run the following commands to practice subsetting.
$operator to print the
[,]syntax so that only the first and third columns of
class()of the objects created by each of the previous exercises?
It is also possible to subset objects by the values of their elements. This works because the
[ operator accepts logical vectors returned by queries such as ‘is it less than 3?’ (
x < 3 in R) and ‘was it light?’ (
crashes$dark == FALSE), as demonstrated below:
x[c(TRUE, FALSE, TRUE, FALSE, TRUE)] # 1st, 3rd, and 5th element in x x[x == 5] # only when x == 5 (notice the use of double equals) x[x < 3] # less than 3 x[x < 3] = 0 # assign specific elements casualty_age[casualty_age %% 6 == 0] # just the ages that are a multiple of 6 crashes[crashes$dark == FALSE, ]
R objects can have a value of NA. This is how R represents missing data.
z = c(4, 5, NA, 7)
NA values are common in real-world data but can cause trouble, for example
sum(z) # result is NA
Some functions can be told to ignore NA values.
sum(z, na.rm = TRUE) # result is equal to 4 + 5 + 7
You can find NAs using the
is.na() function, and then remove them
If you remove records with NAs be warned: the average of a value excluding NAs may not be representative.
crashesto the class
crashesobject into a matrix. What happened to the values?
Often it is useful to ‘recode’ values. In the raw STATS19 files, for example, -1 means NA. There are many ways to recode values in R, the simplest and most mature of which is the use of factors, as shown below:
z = c(1, 2, -1, 1, 3) l = c(NA, "a", "b", "c") # labels in ascending order z_factor = factor(z, labels = l) z_charcter = as.character(z_factor) z_charcter #>  "a" "b" NA "a" "c"
zto Slight, Serious and Fatal for 1:3 respectively.
?dplyr::case_whenand try to recode the values using this function.
Bonus: reproduce the following plot
R has over 15,000 packages (effectively plugins for base R), extending it in almost every direction of statistics and computing. Packages provide additional functions, data and documentation. They are very often written by subject-matter experts and therefore tend to fit well with the workflow of the analyst in that particular specialism. There are two main stages to using a package: installing it and loading it. A third stage is updating it, this is also important.
Install new packages from The Comprehensive R Archive Network with the command
remotes::install_github() to install from GitHub). Update packages with the command
update.package() or in Tools > Check for Package Updates in RStudio. You only need to install a package once.
install.packages("sf") # remotes::install_github("r-spatial/sf")
Installed packages are loaded with the command
library(). Usually, the package will load silently. In some cases the package will provide a message, as illustrated below.
To use a function in a package without first loading the package, use double colons, as shown below (this calls the
tibble() function from the
crashes_tibble = tibble::tibble( vehicle_type, casualty_type, casualty_age, dark )
stats19package is installed on your computer?
update.packages(). What happens? Why?
Let’s take a look at a particular package.
ggplot2 is a generic plotting package that is part of the ‘tidyverse’ meta-package, which is an “opinionated collection of R packages designed for data science”. All packages in the tidyverse “share an underlying design philosophy, grammar, and data structures”.
ggplot2 is flexible, popular, and has dozens of add-on packages which build on it, such as
gganimate. To plot non-spatial data, it works as follows (see figure below, left for result):
Note that the
+ operator adds layers onto one another.
ggplot2that begins with with
gg. Hint: enter
install.packages(gg)and hit Tab when your cursor is between the
ggthemespackage, try other themes).
Another useful package in the tidyverse is
dplyr. It provides functions for manipulating data frames and using the pipe operator
%>%. The pipe puts the output of one command into the first argument of the next, as shown below (note the results are the same):
dplyr functions are demonstrated below.
dplyrto filter row in which
casualty_ageis less than 18, and then 28.
arrangefunction to sort the
crashesobject in descending order of age (hint: see the
dplyr::mutate(). What does the function do?
birth_year, in the
crashesdata.frame which is defined as the current year minus their age.
%>%operator to filter the output from the previous exercise so that only observations with
birth_yearafter 1969 are returned.
For the analysis and manipulation of temporal data we will first load the R package
The simplest example of a Date object that we can analyze is just the current date, i.e.
today() #> Warning in system("timedatectl", intern = TRUE): running command 'timedatectl' #> had status 1 #> Warning in with_tz(Sys.time(), tzone): Unrecognized time zone '' #>  "2020-11-19"
We can manipulate this object using several
lubridate functions to extract the current day, month, year, weekday and so on…
monthto see how it is possible to extract the current month as character vector
Date variables are often stored simply as a character vectors. This is a problem, since R is not always smart enough to distinguish between character vectors representing Dates.
lubridate provides functions that can translate a wide range of date encodings such as
ymd(), which extracts the Year Month and Day from a character string, as demonstrated below.
Import function such as
read_csv try to recognize the Date variables. Sometimes this fails. You can manually create Date objects, as shown below.
"09/09/93"into a date object and extract its weekday.
strptimefor further details on base R functions for dates.
We can use Dates also for subsetting events in a dataframe. For example, if we define
x_date as before and add it to the
crash dataset, i.e.
crashes$casualty_day = x_date
then we can subset events using Dates. For example
filter(crashes, day(casualty_day) < 7) # the events that ocurred in the first week of the month #> vehicle_type casualty_type casualty_age dark casualty_day #> 1 car pedestrian 20 TRUE 2009-01-01 #> 2 bus cyclist 40 FALSE 2009-02-02 #> 3 tank cat 60 TRUE 2009-03-03 filter(crashes, weekdays(casualty_day) == "Monday") # the events occurred on monday #> vehicle_type casualty_type casualty_age dark casualty_day #> 1 bus cyclist 40 FALSE 2009-02-02
crashes) that occurred in January
Now we’ll take a look at the time components of a Date. Using the function
hms (acronym for Hour Minutes Seconds) and its subfunctions such as
ms, we can parse a character vector representing several times as an Hour object (which is tecnically called a Period object).
We can manipulate these objects using several
lubridate functions to extract the hour component, the minutes and so on:
If the Hour data do not specify the seconds, then we just have to use a subfunction of
hm, and everything works as before.
We can use Hour data also for subsetting events, like we did for Dates. Let’s add a new column to crashes data,
Extract the weekday from the variable called
date. How many crashes happened on Monday?
Advanced challenge: calculate how many crashes occurred for each day of the week. Then plot it with ggplot2. Repeat the same exercises extracting the hour of the car accident from the variable called time. How would you combine the two informations in a single plot?
All road crashes happen somewhere and, in the UK at least, all collisions recorded by the police are given geographic coordinates, something that can help prioritise interventions to save lives by intervening in and around ‘crash hotspots’. R has strong geographic data capabilities, with the
sf package provides a generic class for spatial vector data: points, lines and polygons, are represented in
sf objects as a special ‘geometry column’, typically called ‘geom’ or ‘geometry’, extending the data frame class we’ve already seen in
sf data frame called
crashes_sf as follows:
library(sf) # load the sf package for working with spatial data crashes_sf = crashes # create copy of crashes dataset crashes_sf$longitude = c(-1.3, -1.2, -1.1) crashes_sf$latitude = c(50.7, 50.7, 50.68) crashes_sf = st_as_sf(crashes_sf, coords = c("longitude", "latitude"), crs = 4326) # plot(crashes_sf[1:4]) # basic plot # mapview::mapview(crashes_sf) # for interactive map
crashes_sf(hint: the solution may contain
$geometry). If the result is like the figure below, congratulations, it worked!).
crashes_sf, only showing the age variable.
sffunctions begin with
See Chapter 6 of Geocomputation with R for further information.
sf objects can also represent administrative zones. This is illustrated below with reference to
zones, a spatial object representing the Isle of Wight, that we will download using the
pct package (note: the
[1:9] appended to the function selects only the first 9 columns).
zones = pct::get_pct_zones("isle-of-wight")[1:9]
#> Simple feature collection with 2 features and 5 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -1.301131 ymin: 50.69052 xmax: -1.28837 ymax: 50.70547 #> geographic CRS: WGS 84 #> # A tibble: 2 x 6 #> geo_code all bicycle foot car_driver geometry #> <chr> <dbl> <dbl> <dbl> <dbl> <MULTIPOLYGON [°]> #> 1 E01017326 698 23 285 286 (((-1.289993 50.69766, -1.290177 50.… #> 2 E01017327 720 25 225 374 (((-1.295712 50.69383, -1.29873 50.6…
Like index and value subsetting, spatial subsetting can be done with the
[ notation. Subset the
zones that contain features in
crashes_sf as follows:
zones_containing_crashes = zones[crashes_sf, ]
To plot a new layer on top of an existing
sf plot, use the
add = TRUE argument. Remember to plot only the
geometry column of objects to avoid multiple maps. Colours can be set with the
Geographic joins involve assigning values from one object to a new column in another, based on the geographic relationship between them. With
sf objects it works as follows:
zones_joined = st_join(zones, crashes_sf)
casualty_agevariable of the new
zones_joinedobject (see the figure below to verify the result).
geo_codecolumn from the
crashes_sfand use the
left = FALSEargument to return only zones in which crashes occured. Plot the result.
See Chapter 4 of Geocomputation with R (Lovelace, Nowosad, and Muenchow 2019) for further information on geographic joins.
Get and set Coordinate Reference Systems (CRSs) with the command
st_crs(). Transform CRSs with the command
st_transform(), as demonstrated in the code chunk below, which converts the ‘lon/lat’ geographic CRS of
crashes_sf into the projected CRS of the British National Grid:
crashes_osgb = st_transform(crashes_sf, 27700)
crashes_osgb. What does the error message say?
zones_osgbby transforming the
st_crs()to find out the units measurement of the British National Grid?
For more information on CRSs see Chapter 6 of Geocompuation with R.
Buffers are polygons surrounding geometries of a (usually) fixed distance. Currently buffer operations in R only work on objects with projected CRSs.
sf’s buffer function.
crashes_1km_bufferrepresenting the area within 1 km of the crashes.
crashes_sfobject. What happens?
sf objects are
data.frames, we can do non-spatial operations on them. Try the following attribute operations on the
# load example dataset if it doesn't already exist zones = pct::get_pct_zones("isle-of-wight") sel = zones$all > 3000 # create a subsetting object zones_large = zones[sel, ] # subset areas with a popualtion over 100,000 zones_2 = zones[zones$geo_name == "Isle of Wight 002",] # subset based on 'equality' query zones_first_and_third_column = zones[c(1, 3)] zones_just_all = zones["all"]
zones_smallwhich contains only regions with less than 3000 people in the
TRUEfor regions with above median numbers of people who travel by car and
zones_footwhich contains only the foot attribute from
zones_footusing the function
plotto show where walking is a popular mode of travel to work
dplyrpackage to subset small regions where car use is high.
as.numeric()and use the ‘all’ column)?
So far we have used the
plot() function to make maps. That’s fine for basic visualisation, but for publication-quality maps, we recommend using
tmap (see Chapter 8 of Geocomputation with R for reasons and alternatives). Load the package as follows:
tm_shape() + tm_polygons()functions (note: the third figure relies on setting
#> tmap mode set to interactive viewing
Based on the saying “don’t run before you can walk”, we’ve learned the vital foundations of R before tackling a real dataset. Temporal and spatial attributes are key to road crash data, hence the emphasis on
sf. Visualisation is key to understanding and policy influence, which is where
tmap comes in. With these solid foundations, plus knowledge of how to ask for help (read R’s internal help functions, ask colleagues, create new comments on online forums/GitHub, generally in that order of priority), you are ready to test the methods on some real data.
Before doing so, take a read of the
stats19 vignette, which can be launched as follows:
This should now be sufficient to tackle the following exercises:
stats19package that converts a
data.frameobject into an
sfdata frame. Use this function to convert the road crashes into an
crashes_sf, for example.
monthin the crash data using the function
a_zones_mayrepresenting all the crashes that happened in the Isle of Wight in the month of May
mean) speed limit associated with each crash that happened in May across the zones of the Isle of Wight (the result is shown in the map)
Road network data can be accessed from a range of sources, including OpenStreetMap (OSM) and Ordnance Survey. We will use some OSM data from the Ilse of Wight, which can be loaded as follows:
u = "https://github.com/ropensci/stats19/releases/download/1.1.0/roads_key.Rds" roads_wgs = readRDS(url(u)) roads = roads_wgs %>% st_transform(crs = 27700)
You should already have road crashes for the Isle of Wight from the previous stage. If not, load crash data as follows:
aggregate()function to identify how many crashes happened per segment and plot the result (hint: see
?aggregate.sfand take a read of Section 4.2.5 of Geocomputation with R) with
tmapand plot the crashes that happened outside the road buffers on top.
Identify a region and zonal units of interest from http://geoportal.statistics.gov.uk/ or from the object
police_boundaries in the
Lovelace, Robin, Malcolm Morgan, Layik Hama, Mark Padgham, and M Padgham. 2019. “Stats19 A Package for Working with Open Road Crash Data.” Journal of Open Source Software 4 (33): 1181. https://doi.org/10.21105/joss.01181.
Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. CRC Press.