Essi Parent in the fediverse: @essi@write.as, @essi@blahaj.zone, @essicolo@pixel.social

Using R and weathercan to compute weather indices in Canada

When a colleague told me he assigned an undergrad student full-time to copy-paste historical weather data from Environment Canada, I searched into my files and found some code which I could format and share. The weathercan package, a gem written by Steffi LaZerte, allowed me to automate fetching tasks and save countless hours. All I had to provide was the spatial coordinates of my sites and, since my work was about agricultural data, time periods expressed as beginning of the season to harvest.

library("tidyverse")
library("weathercan")

My csv table contained coordinates and dates.

weather <- read_csv("data/sites.csv")

## Parsed with column specification:
## cols(
## siteid = coldouble(),
## latitude = coldouble(),
## longitude = col
double(),
## start = coldate(format = “”),
## end = col
date(format = “”)
## )

weather %>%
  sample_n(6)
site_id latitude longitude start end
5 45.80300 -74.96984 2016-06-01 2016-09-26
19 45.22333 -73.53528 2017-05-18 2017-09-01
26 45.24454 -73.48610 2014-06-03 2014-09-08
27 46.09112 -72.34862 2014-05-30 2014-09-30
24 45.24550 -73.48829 2015-05-13 2015-09-15
3 46.73023 -72.26801 2016-05-21 2016-09-13

The first thing to do is to look for the closest stations to each site, which I did within a loop. For each row of my weather table, I ran the stations_search() function from the weathercan package. stations_search() takes the coordinates, a maximum distance and an interval on which weather data are archived (hourly, daily, weekly, etc.). If I wanted hourly data, stations with a daily resolution would have been discarded.

I store each output of stations_search() in a list, which is a very useful object type where each element can be anything, from strings to machine learning models.

stations <- list()
for (i in 1:nrow(weather)) {
  stations[[i]] <- stations_search(coords = c(weather$latitude[i], weather$longitude[i]),
                                   dist = 50, interval = "day")
}

I verified each element of the list and wrote down the station identifier I wanted. This process could have been done with if statements within a loop, but since I had not so many sites, this was done quickly by hand. For example, the third site:

stations[[3]]
prov station_name station_id climate_id WMO_id TC_id lat lon elev tz interval start end distance
QC ST UBALD 5288 7017767 NA NA 46.68 -72.32 NA Etc/GMT+5 day 1963 1964 6.854702
QC ST ALBAN 5255 7016800 NA NA 46.72 -72.08 76.2 Etc/GMT+5 day 1949 2018 14.417353
QC STE ANNE DE LA PERADE 5257 7016840 NA NA 46.58 -72.23 16.0 Etc/GMT+5 day 1949 2018 16.951883
QC LAC AUX SABLES 5203 701LEEH NA NA 46.87 -72.40 160.0 Etc/GMT+5 day 1964 2018 18.518982
QC HERVEY JONCTION 5226 7013102 NA NA 46.85 -72.47 177.4 Etc/GMT+5 day 1926 1978 20.374795
QC ST TITE 5287 7017760 NA NA 46.73 -72.57 141.7 Etc/GMT+5 day 1920 1985 23.083509
QC DESCHAMBAULT 27325 7011983 71389 WHQ 46.69 -71.97 61.0 Etc/GMT+5 day 1997 2018 23.222388
QC ST NARCISSE 5281 7017585 NA NA 46.53 -72.43 46.0 Etc/GMT+5 day 1973 2018 25.481852
QC HEROUXVILLE 5225 7013100 NA NA 46.67 -72.60 145.0 Etc/GMT+5 day 1966 2018 26.258655
QC DESCHAMBAULT 5220 7011982 NA NA 46.67 -71.92 15.2 Etc/GMT+5 day 1971 2018 27.445147
QC STE CHRISTINE 5266 7017000 NA NA 46.82 -71.92 152.1 Etc/GMT+5 day 1950 2018 28.390738
QC CHAMPLAIN 5211 7011290 NA NA 46.47 -72.33 12.0 Etc/GMT+5 day 1980 2011 29.315576
QC RIVIERE A PIERRE 5253 7016560 NA NA 47.00 -72.17 221.0 Etc/GMT+5 day 1949 1994 30.907166
QC FORTIERVILLE 5362 7022494 NA NA 46.48 -72.05 53.3 Etc/GMT+5 day 1973 2018 32.445967
QC CHUTE PANET 5214 7011600 NA NA 46.87 -71.87 152.4 Etc/GMT+5 day 1949 1982 34.126112
QC GENTILLY 5367 7022700 NA NA 46.40 -72.37 6.1 Etc/GMT+5 day 1973 1975 37.533097
QC ST JOSEPH DE MEKINAC 5276 7017422 NA NA 46.92 -72.68 121.9 Etc/GMT+5 day 1973 1994 37.858884
QC STE FRANCOISE ROMAINE 5482 7027267 NA NA 46.48 -71.93 91.4 Etc/GMT+5 day 1963 1985 38.005007
QC STE CROIX 5459 7027088 NA NA 46.62 -71.78 70.0 Etc/GMT+5 day 1973 1993 39.299534
QC SHAWINIGAN 27646 7018001 71370 XSH 46.56 -72.73 110.0 Etc/GMT+5 day 1998 2018 40.113102
QC DONNACONA 5221 7012070 NA NA 46.67 -71.75 10.7 Etc/GMT+5 day 1918 1964 40.179312
QC SHAWINIGAN 5290 7018000 NA NA 46.57 -72.75 121.9 Etc/GMT+5 day 1902 2004 40.971074
QC DONNACONA 2 5222 7012071 NA NA 46.68 -71.73 45.7 Etc/GMT+5 day 1952 2008 41.520499
QC RIVIERE VERTE OUEST 5254 7016675 NA NA 46.98 -71.83 213.4 Etc/GMT+5 day 1966 2018 43.436700
QC CAP DE LA MADELEINE 5209 7011045 NA NA 46.37 -72.53 17.1 Etc/GMT+5 day 1920 1932 44.802297
QC BECANCOUR 5316 7020570 NA NA 46.33 -72.43 14.9 Etc/GMT+5 day 1966 1994 46.193809
QC TROIS-RIVIERES 10764 7018562 71724 WTY 46.35 -72.52 6.0 Etc/GMT+5 day 1993 2018 46.477383
QC LAC MINOGAMI 5235 7013678 NA NA 46.67 -72.87 259.1 Etc/GMT+5 day 1964 1978 46.524598
QC TROIS RIVIERES AQUEDUC 5201 701HE63 NA NA 46.38 -72.62 54.9 Etc/GMT+5 day 1974 2009 47.374297
QC TROIS RIVIERES 5292 7018564 NA NA 46.37 -72.60 53.3 Etc/GMT+5 day 1934 1986 47.453040
QC CLUB TOURILLI 5216 7011800 NA NA 47.08 -71.90 249.9 Etc/GMT+5 day 1949 1953 47.938720
QC DUCHESNAY 5224 7012240 NA NA 46.87 -71.65 166.1 Etc/GMT+5 day 1963 1994 49.670924
QC MANSEAU 5402 7024615 NA NA 46.33 -71.98 96.0 Etc/GMT+5 day 1978 1994 49.675412

I wrote down each station_id in a vector merged to my original table, so that each site has its station identifier.

weather$station_ids <- c(5266, 5255, 5255, 5619, 5619, 5619, 5257, 5255, 5203, 5619,
                         5619, 8321, 8321, 5266, 5393, 10872, 5940, 10869, 10762, 5237,
                         10843, 5522, 5522, 10843, 5237, 10843, 5522, 5237, 5532)

Now let's fetch our weather data. This is done with the weather_dl() function. Within a loop for each site (each row of the weather data table), I used the station_id, the start time and the end time to fetch daily weather data from Environment Canada and store the extracted data in a list. Of course, you must be connected to the web to do this (and expect some warnings).

weather_tables <- list()
for (i in 1:nrow(weather)) {
  weather_tables[[i]] <- weather_dl(station_ids = weather$station_ids[i],
                                    start = weather$start[i], end = weather$end[i],
                                    interval="day")
}

Daily weather data are stored in each element of the list. For example, the first site with selected columns.

weather_tables[[1]] %>% 
  select(year, month, day, mean_temp, total_precip) %>% 
  head(10)
year month day mean_temp total_precip
2016 05 09 NA 0.8
2016 05 10 9.5 0.0
2016 05 11 NA 0.0
2016 05 12 NA 0.0
2016 05 13 14.8 6.2
2016 05 14 NA NA
2016 05 15 NA NA
2016 05 16 2.3 1.4
2016 05 17 NA 0.0
2016 05 18 NA 0.0

You will likely inspect your data tables further. For my part, I’ll jump right away to the computation of weather indices: I need daily precipitations, mean temperature, a Shannon diversity index on precipitations (SDI, 0 means all precipitations occurred the same day, and 1 means it fell uniformly through the season) and growing degree days (GDD, the sum of Celsius degrees higher than a threshold). Total precipitations and mean temperature can be computed with sum and a mean. SDI and GDD will need custom functions.

SDI_f <- function(x) {
  p <- x/sum(x, na.rm = TRUE)
  SDI <- -sum(p * log(p), na.rm = TRUE) / log(length(x))
  return(SDI)
}

GDD_f <- function(x, delim = 5) {
  sum(x[x >= delim], na.rm = TRUE)
}

Since I want to store these indices in my original table, I create empty columns.

weather$total_precip <- NA
weather$mean_temp <- NA
weather$SDI <- NA
weather$GDD <- NA

I compute each index in this final loop, where total precipitation is the sum of daily total precipitations and mean temperature is the mean of daily temperatures. SDI and GDD are computed with the functions I defined earlier.

for (i in 1:nrow(weather)) {
  # cumulated prcipitations
  weather$total_precip[i] <- sum(weather_tables[[i]]$total_precip, na.rm = TRUE)
  
  # mean temperature
  weather$mean_temp[i] <- mean(weather_tables[[i]]$mean_temp, na.rm = TRUE)
  
  # Shannon diversity index of precipitations
  weather$SDI[i] <- SDI_f(weather_tables[[i]]$total_precip)
  
  # Growing degree days
  weather$GDD[i] <- GDD_f(weather_tables[[i]]$mean_temp, delim = 5)
}

That' it, our weather table is done!

weather %>%
  sample_n(6)
site_id latitude longitude start end station_ids total_precip mean_temp SDI GDD
20 45.93424 -73.32026 2016-05-11 2016-09-19 5237 471.0 19.16308 0.5665234 2486.9
7 46.53156 -72.23351 2017-05-21 2017-10-03 5257 324.5 17.63529 0.7089004 2398.4
21 45.24248 -73.51408 2016-05-20 2016-09-06 10843 314.2 19.86542 0.6048280 2125.6
3 46.73023 -72.26801 2016-05-21 2016-09-13 5255 405.6 18.53846 0.7054556 1687.0
12 46.43994 -72.66337 2017-05-27 2017-09-28 8321 421.4 17.30579 0.7033883 2094.0
5 45.80300 -74.96984 2016-06-01 2016-09-26 5619 428.8 17.56250 0.6450922 281.0

Data and codes are available on Gitlab

#rstats #en