Skip to contents

Source material

This vignette is adapted very heavily from Hadley Wickham’s incredible R for Data Science book. You should support Hadley and the work he does by buying it.

Packages

In addition to weathercan, you’ll need several packages from the tidyverse to complete the following analysis.

Using weathercan to load in data

Your first decision that you need to make when analyzing data from weather stations across canada is to determine for which stations you’d like to query from Environment and Climate Change Canada. In this example, to keep processing time low, we will query two stations with very long records that happen to be far apart. To make that choice we can use (tidyverse)[http://tidyverse.org/] tools and the included stations() function to access the data frame of stations in this package:

stations() %>%
  filter(station_id %in% c(707, 4859, 6693,5397, 2315),
         interval == "day") %>%
  select(prov, station_name, station_id, start, end)
## # A tibble: 5 × 5
##   prov  station_name     station_id start   end
##   <chr> <chr>                 <dbl> <dbl> <dbl>
## 1 AB    TABER                  2315  1907  2021
## 2 BC    AGASSIZ CDA             707  1889  2021
## 3 NL    PORT AUX BASQUES       6693  1909  2017
## 4 ON    BELLEVILLE             4859  1866  2021
## 5 QC    LENNOXVILLE            5397  1888  2021

These two weather stations will be our test data for this vignette. You can broaden or expand your analysis by choosing different or more station. Our next step is to use the weather_dl() function to load in the data.

The following will take quite some time to download as it is downloading over 100 years of daily data for 5 stations.

pancan_df <- weather_dl(station_ids = c(707, 4859, 6693,5397, 2315), 
                        interval = "day") %>%
  filter(year >= 1920) %>%
  select(station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id, TC_id, mean_temp, date)

Plot the data

ggplot(pancan_df, aes(x = date, y = mean_temp, colour = station_name)) +
  geom_point() +
  geom_line()
## Warning: Removed 29266 rows containing missing values (geom_point).
## Warning: Removed 8085 row(s) containing missing values (geom_path).

This is quite a large dataset.

Creating list-columns

pancan_df_nest <- pancan_df %>%
  group_by(station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id, TC_id) %>%
  nest()
pancan_df_nest
## # A tibble: 5 × 10
## # Groups:   station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## #   TC_id [5]
##   station_name     station_id prov    lat    lon  elev climate_id WMO_id TC_id
##   <chr>                 <dbl> <chr> <dbl>  <dbl> <dbl> <chr>      <chr>  <chr>
## 1 AGASSIZ CDA             707 BC     49.2 -122.   15   1100120    NA     NA   
## 2 BELLEVILLE             4859 ON     44.2  -77.4  76.2 6150689    NA     NA   
## 3 PORT AUX BASQUES       6693 NL     47.6  -59.2  39.7 8402975    71197  WZB  
## 4 LENNOXVILLE            5397 QC     45.4  -71.8 181   7024280    71611  WQH  
## 5 TABER                  2315 AB     49.8 -112.  811   3036360    NA     NA   
## # … with 1 more variable: data <list>

Fit some models

Define the model

clim_model <- function(df) {
  lm(mean_temp ~ date, data = df)
}

Run the model with the existing data

pancan_df_nest <- pancan_df_nest %>% 
  mutate(model = map(data, clim_model))
pancan_df_nest
## # A tibble: 5 × 11
## # Groups:   station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## #   TC_id [5]
##   station_name     station_id prov    lat    lon  elev climate_id WMO_id TC_id
##   <chr>                 <dbl> <chr> <dbl>  <dbl> <dbl> <chr>      <chr>  <chr>
## 1 AGASSIZ CDA             707 BC     49.2 -122.   15   1100120    NA     NA   
## 2 BELLEVILLE             4859 ON     44.2  -77.4  76.2 6150689    NA     NA   
## 3 PORT AUX BASQUES       6693 NL     47.6  -59.2  39.7 8402975    71197  WZB  
## 4 LENNOXVILLE            5397 QC     45.4  -71.8 181   7024280    71611  WQH  
## 5 TABER                  2315 AB     49.8 -112.  811   3036360    NA     NA   
## # … with 2 more variables: data <list>, model <list>

Then add the residuals to the model

pancan_df_nest <- pancan_df_nest %>% 
  mutate(model = map(data, clim_model),
         resids = map2(data, model, add_residuals)) 
pancan_df_nest
## # A tibble: 5 × 12
## # Groups:   station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## #   TC_id [5]
##   station_name     station_id prov    lat    lon  elev climate_id WMO_id TC_id
##   <chr>                 <dbl> <chr> <dbl>  <dbl> <dbl> <chr>      <chr>  <chr>
## 1 AGASSIZ CDA             707 BC     49.2 -122.   15   1100120    NA     NA   
## 2 BELLEVILLE             4859 ON     44.2  -77.4  76.2 6150689    NA     NA   
## 3 PORT AUX BASQUES       6693 NL     47.6  -59.2  39.7 8402975    71197  WZB  
## 4 LENNOXVILLE            5397 QC     45.4  -71.8 181   7024280    71611  WQH  
## 5 TABER                  2315 AB     49.8 -112.  811   3036360    NA     NA   
## # … with 3 more variables: data <list>, model <list>, resids <list>

Working with list-columns

We can unnest the results then plot them

unnest()

resids <- unnest(pancan_df_nest, resids)
resids
## # A tibble: 187,125 × 14
## # Groups:   station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## #   TC_id [5]
##    station_name station_id prov    lat   lon  elev climate_id WMO_id TC_id
##    <chr>             <dbl> <chr> <dbl> <dbl> <dbl> <chr>      <chr>  <chr>
##  1 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  2 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  3 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  4 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  5 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  6 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  7 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  8 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  9 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
## 10 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
## # … with 187,115 more rows, and 5 more variables: data <list>, model <list>,
## #   mean_temp <dbl>, date <date>, resid <dbl>
ggplot(data = resids, aes(date, resid)) +
  geom_line(aes(group = station_name), alpha = 1 / 3) + 
  geom_point() +
  geom_hline(yintercept = 0) +
  facet_wrap(~ station_name, ncol = 1)
## Warning: Removed 8085 row(s) containing missing values (geom_path).
## Warning: Removed 29266 rows containing missing values (geom_point).

Using broom

glance_df <- pancan_df_nest %>% 
  mutate(glance = map(model, broom::glance)) %>% 
  unnest(glance, .drop = TRUE) %>%
  select(station_name, prov, r.squared, p.value, AIC)
## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
station_id lat lon elev climate_id WMO_id TC_id station_name prov r.squared p.value AIC
707 49.24 -121.76 15.0 1100120 NA NA AGASSIZ CDA BC 0.0025882 0.0000000 235016.0
4859 44.15 -77.39 76.2 6150689 NA NA BELLEVILLE ON 0.0027581 0.0000000 272877.5
6693 47.57 -59.15 39.7 8402975 71197 WZB PORT AUX BASQUES NL 0.0041296 0.0000000 178541.2
5397 45.37 -71.82 181.0 7024280 71611 WQH LENNOXVILLE QC 0.0010639 0.0000000 286717.8
2315 49.79 -112.12 811.0 3036360 NA NA TABER AB 0.0003020 0.0072753 184412.7

Looking at the predictions

preds <- pancan_df_nest %>% 
  mutate(model = map(data, clim_model),
         preds = map2(data, model, add_predictions)) %>%
  unnest(preds)
preds
## # A tibble: 187,125 × 15
## # Groups:   station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## #   TC_id [5]
##    station_name station_id prov    lat   lon  elev climate_id WMO_id TC_id
##    <chr>             <dbl> <chr> <dbl> <dbl> <dbl> <chr>      <chr>  <chr>
##  1 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  2 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  3 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  4 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  5 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  6 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  7 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  8 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
##  9 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
## 10 AGASSIZ CDA         707 BC     49.2 -122.    15 1100120    NA     NA   
## # … with 187,115 more rows, and 6 more variables: data <list>, model <list>,
## #   resids <list>, mean_temp <dbl>, date <date>, pred <dbl>
ggplot(data = preds, aes(x = date, y = mean_temp, colour = station_name)) +
  geom_point() +
  geom_line(aes(y = pred)) +
  facet_wrap(~ station_name, scales = "free_y", ncol = 1)
## Warning: Removed 29266 rows containing missing values (geom_point).