Tidy Tuesday - 12/20

Author

Sam Weiner

Published

January 9, 2023

Hello everyone! Tidy Tuesday released this data set on December 20th, 2022 and it is so interesting. It was collected as a part of a data science capstone project where national weather data was acquired to learn which areas of the U.S. struggle with weather prediction and the possible reasons why. Specifically, the project focused on the error in high and low temperature forecasting.

The data includes 16 months of daily forecasts and observations from 167 cities, as well as a separate data frame of information about those cities and some other American cities.

So let’s dive into this data and learn something about forecasting weather!

library(tidyverse)
library(skimr)
#Tidyverts universe of packages
library(tsibble)
library(fable)
library(feasts)
library(lubridate)
weather_forecasts <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-12-20/weather_forecasts.csv")
cities <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-12-20/cities.csv")
outlook_meanings <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-12-20/outlook_meanings.csv")

The weather forecasts data set contains daily high and daily low temperature forecasts 48, 36, 24, and 12 prior to every date in the data set. The data also contains the observed high and low temperature for that date. (Note: some dates have missing data for observed temperature.)

Here’s a glimpse at the column names, the number of rows, and the first couple of values in the weather forecasts data set.

weather_forecasts |> glimpse()
Rows: 651,968
Columns: 10
$ date                  <date> 2021-01-30, 2021-01-30, 2021-01-30, 2021-01-30,…
$ city                  <chr> "ABILENE", "ABILENE", "ABILENE", "ABILENE", "ABI…
$ state                 <chr> "TX", "TX", "TX", "TX", "TX", "TX", "TX", "TX", …
$ high_or_low           <chr> "high", "high", "high", "high", "low", "low", "l…
$ forecast_hours_before <dbl> 48, 36, 24, 12, 48, 36, 24, 12, 48, 36, 24, 12, …
$ observed_temp         <dbl> 70, 70, 70, 70, 42, 42, 42, 42, 29, 29, 29, 29, …
$ forecast_temp         <dbl> NA, NA, NA, 70, NA, NA, 39, 38, NA, NA, NA, 30, …
$ observed_precip       <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, …
$ forecast_outlook      <chr> NA, NA, NA, "DUST", NA, NA, "DUST", "SUNNY", NA,…
$ possible_error        <chr> "none", "none", "none", "none", "none", "none", …

The data contains multiple time series, one for each combination of city, state, high_or_low, and forecast_hours_before. I think that looking into how the different forecast times compare in accuracy will be interesting to look at, as well as whether forecast accuracy differs at different times in the year.

I will be using the tsibble package for this data which is useful when working with time series data. The first of which is that it enables us contain multiple time series in the same data structure, a time series tibble, and this allows us run models or check for certain features on multiple series simultaneously. If you want to learn more about time series analysis and how to use the tidyverts packages, I highly recommend reading Forecasting: Principles and Practice (3rd ed) by Rob J Hyndman and George Athanasopoulos. It is a very detailed book and covers a lot of ground in the field of time series.

forecast_tsbl <- weather_forecasts |>
  mutate(city = factor(city),
         state = factor(state),
         high_or_low = factor(high_or_low),
         forecast_hours_before = factor(forecast_hours_before)) |> 
  as_tsibble(key = c(city, state, high_or_low, forecast_hours_before), index = date)

I am going to be subsetting the data to look at one city, Newark, NJ. This visualization is a time plot of the observed daily high temperature for Newark, NJ from January 30th, 2021 to June 1st, 2022.

forecast_tsbl |> 
  filter(city == "NEWARK", high_or_low == "high", forecast_hours_before == 12) |>
  autoplot(observed_temp)
Warning: Removed 2 rows containing missing values (`geom_line()`).

Sliding Window

We will be looking at the rolling mean absolute error between forecasted high and low temperatures versus observed high and low temperatures across the different temperature forecasts.

I am using a 30 day window around each date to compute the mean. I am also ignoring dates with missing values.

slide_tsbl <- forecast_tsbl |> 
  filter(city == "NEWARK") |> 
  mutate(error = forecast_temp - observed_temp) |> 
  mutate(mean_temp_error = slider::slide_dbl(error,
                                              ~mean(abs(.), na.rm = TRUE), 
                                              .before = 15,
                                              .after = 15, 
                                              .complete = TRUE))

This plot shows the mean absolute error across the dates for the high temperature forecast across forecast times.

Code
slide_tsbl |> 
  filter(high_or_low == "high") |> 
  ggplot() +
  geom_line(aes(x = date, y = mean_temp_error)) + 
  facet_grid(vars(high_or_low, forecast_hours_before)) +
  xlab("Date") +
  ylab("Mean Error")

This plot shows the average error across the dates for the low temperature forecast across forecast times.

Code
slide_tsbl |> 
  filter(high_or_low == "low") |> 
  ggplot() +
  geom_line(aes(x = date, y = mean_temp_error)) + 
  facet_grid(vars(high_or_low, forecast_hours_before))  +
  xlab("Date") +
  ylab("Mean Error")

We can see the overall mean absolute error for each forecast.

forecast_tsbl |> 
  as_tibble() |> 
  filter(city == "NEWARK", state == "NJ") |>  
  mutate(error = forecast_temp - observed_temp) |>
  group_by(forecast_hours_before) |> 
  summarise(MAE = mean(abs(error), na.rm = T))
# A tibble: 4 × 2
  forecast_hours_before   MAE
  <fct>                 <dbl>
1 12                     1.98
2 24                     2.12
3 36                     2.29
4 48                     2.41