
Using with the tidyverse
Sam Albers
2022-06-20
Source:vignettes/articles/use_with_tidyverse.Rmd
use_with_tidyverse.Rmd
Source material
This vignette is adapted very heavily from Hadley Wickham’s incredible R for Data Science book. You should support Hadley and the work he does by buying it.
Packages
In addition to weathercan, you’ll need several packages from the tidyverse to complete the following analysis.
Using weathercan to load in data
Your first decision that you need to make when analyzing data from weather stations across canada is to determine for which stations you’d like to query from Environment and Climate Change Canada. In this example, to keep processing time low, we will query two stations with very long records that happen to be far apart. To make that choice we can use (tidyverse)[http://tidyverse.org/] tools and the included stations()
function to access the data frame of stations in this package:
stations() %>%
filter(station_id %in% c(707, 4859, 6693,5397, 2315),
interval == "day") %>%
select(prov, station_name, station_id, start, end)
## # A tibble: 5 × 5
## prov station_name station_id start end
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AB TABER 2315 1907 2021
## 2 BC AGASSIZ CDA 707 1889 2021
## 3 NL PORT AUX BASQUES 6693 1909 2017
## 4 ON BELLEVILLE 4859 1866 2021
## 5 QC LENNOXVILLE 5397 1888 2021
These two weather stations will be our test data for this vignette. You can broaden or expand your analysis by choosing different or more station. Our next step is to use the weather_dl()
function to load in the data.
The following will take quite some time to download as it is downloading over 100 years of daily data for 5 stations.
Plot the data
ggplot(pancan_df, aes(x = date, y = mean_temp, colour = station_name)) +
geom_point() +
geom_line()
## Warning: Removed 29266 rows containing missing values (geom_point).
## Warning: Removed 8085 row(s) containing missing values (geom_path).
This is quite a large dataset.
Creating list-columns
pancan_df_nest <- pancan_df %>%
group_by(station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id, TC_id) %>%
nest()
pancan_df_nest
## # A tibble: 5 × 10
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 2 BELLEVILLE 4859 ON 44.2 -77.4 76.2 6150689 NA NA
## 3 PORT AUX BASQUES 6693 NL 47.6 -59.2 39.7 8402975 71197 WZB
## 4 LENNOXVILLE 5397 QC 45.4 -71.8 181 7024280 71611 WQH
## 5 TABER 2315 AB 49.8 -112. 811 3036360 NA NA
## # … with 1 more variable: data <list>
Fit some models
Define the model
clim_model <- function(df) {
lm(mean_temp ~ date, data = df)
}
Run the model with the existing data
## # A tibble: 5 × 11
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 2 BELLEVILLE 4859 ON 44.2 -77.4 76.2 6150689 NA NA
## 3 PORT AUX BASQUES 6693 NL 47.6 -59.2 39.7 8402975 71197 WZB
## 4 LENNOXVILLE 5397 QC 45.4 -71.8 181 7024280 71611 WQH
## 5 TABER 2315 AB 49.8 -112. 811 3036360 NA NA
## # … with 2 more variables: data <list>, model <list>
Then add the residuals to the model
pancan_df_nest <- pancan_df_nest %>%
mutate(model = map(data, clim_model),
resids = map2(data, model, add_residuals))
pancan_df_nest
## # A tibble: 5 × 12
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 2 BELLEVILLE 4859 ON 44.2 -77.4 76.2 6150689 NA NA
## 3 PORT AUX BASQUES 6693 NL 47.6 -59.2 39.7 8402975 71197 WZB
## 4 LENNOXVILLE 5397 QC 45.4 -71.8 181 7024280 71611 WQH
## 5 TABER 2315 AB 49.8 -112. 811 3036360 NA NA
## # … with 3 more variables: data <list>, model <list>, resids <list>
Working with list-columns
We can unnest the results then plot them
unnest()
resids <- unnest(pancan_df_nest, resids)
resids
## # A tibble: 187,125 × 14
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 2 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 3 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 4 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 5 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 6 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 7 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 8 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 9 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 10 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## # … with 187,115 more rows, and 5 more variables: data <list>, model <list>,
## # mean_temp <dbl>, date <date>, resid <dbl>
ggplot(data = resids, aes(date, resid)) +
geom_line(aes(group = station_name), alpha = 1 / 3) +
geom_point() +
geom_hline(yintercept = 0) +
facet_wrap(~ station_name, ncol = 1)
## Warning: Removed 8085 row(s) containing missing values (geom_path).
## Warning: Removed 29266 rows containing missing values (geom_point).
Using broom
glance_df <- pancan_df_nest %>%
mutate(glance = map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE) %>%
select(station_name, prov, r.squared, p.value, AIC)
## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
station_id | lat | lon | elev | climate_id | WMO_id | TC_id | station_name | prov | r.squared | p.value | AIC |
---|---|---|---|---|---|---|---|---|---|---|---|
707 | 49.24 | -121.76 | 15.0 | 1100120 | NA | NA | AGASSIZ CDA | BC | 0.0025882 | 0.0000000 | 235016.0 |
4859 | 44.15 | -77.39 | 76.2 | 6150689 | NA | NA | BELLEVILLE | ON | 0.0027581 | 0.0000000 | 272877.5 |
6693 | 47.57 | -59.15 | 39.7 | 8402975 | 71197 | WZB | PORT AUX BASQUES | NL | 0.0041296 | 0.0000000 | 178541.2 |
5397 | 45.37 | -71.82 | 181.0 | 7024280 | 71611 | WQH | LENNOXVILLE | QC | 0.0010639 | 0.0000000 | 286717.8 |
2315 | 49.79 | -112.12 | 811.0 | 3036360 | NA | NA | TABER | AB | 0.0003020 | 0.0072753 | 184412.7 |
Looking at the predictions
preds <- pancan_df_nest %>%
mutate(model = map(data, clim_model),
preds = map2(data, model, add_predictions)) %>%
unnest(preds)
preds
## # A tibble: 187,125 × 15
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 2 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 3 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 4 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 5 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 6 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 7 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 8 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 9 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## 10 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 NA NA
## # … with 187,115 more rows, and 6 more variables: data <list>, model <list>,
## # resids <list>, mean_temp <dbl>, date <date>, pred <dbl>
ggplot(data = preds, aes(x = date, y = mean_temp, colour = station_name)) +
geom_point() +
geom_line(aes(y = pred)) +
facet_wrap(~ station_name, scales = "free_y", ncol = 1)
## Warning: Removed 29266 rows containing missing values (geom_point).