Introduction

We will continue our work with daily water temperature and air temperature data observed for 31 rivers in Spain. In the preview below, W represents the daily maximum water temperature and A represents the daily maximum air temperature. The data set contains nearly a full year of daily observations for each of the 31 rivers.

## # A tibble: 6 × 8
##   JULIAN_DAY  YEAR     L     W     A  TIME MONTH   DAY
##        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1          1  2003   103  14.2  21.2     1     1     1
## 2          2  2003   103  14.4  16.8     2     1     2
## 3          3  2003   103  14.4  15.4     3     1     3
## 4          4  2003   103  10.9  10.8     4     1     4
## 5          5  2003   103  10.8  11.7     5     1     5
## 6          6  2003   103  10.7  12.4     6     1     6

Using the data, we seek to identify the best model for predicting the maximum water temperature given the maximum air temperature. Previously, we randomly selected 3 rivers to act as a test set and evaluated all models on that single held-out set. In this tutorial, we explore approaches that ensure that all of the data are used for both model training and model testing.

In this tutorial, we apply functions from the purrr, modelr, and broom packages. See the following articles on performing cross-validation within the tidyverse: Link 1, Link 2, and Link 3.
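
The chunks below assume that these packages are loaded and that DATA already holds the daily river data previewed above; a minimal setup sketch:

library(tidyverse)  # dplyr, tidyr, purrr, and ggplot2
library(modelr)     # crossv_kfold() and resample objects
library(broom)      # tidy(), glance(), and augment()
# DATA is assumed to be the tibble of daily river observations previewed above.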

Part 1: Intelligent Use of Locations for Cross-Validation

Chunk 1: List-Column of Data Split By Location

NEST.DATA = DATA %>% group_by(L) %>% nest()
head(NEST.DATA)
## # A tibble: 6 × 2
## # Groups:   L [6]
##       L data              
##   <dbl> <list>            
## 1   103 <tibble [365 × 7]>
## 2   105 <tibble [366 × 7]>
## 3   107 <tibble [365 × 7]>
## 4   111 <tibble [365 × 7]>
## 5   112 <tibble [366 × 7]>
## 6   118 <tibble [366 × 7]>

Chunk 2: Combining filter() with unnest() To Split Data

NEST.DATA %>% filter(L==103) %>% unnest(cols = c(data)) %>% glimpse()
## Rows: 365
## Columns: 8
## Groups: L [1]
## $ L          <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103,…
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ YEAR       <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,…
## $ W          <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10.5, …
## $ A          <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5, 5…
## $ TIME       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ MONTH      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
NEST.DATA %>% filter(L!=103) %>% unnest(cols = c(data)) %>% glimpse()
## Rows: 10,972
## Columns: 8
## Groups: L [30]
## $ L          <dbl> 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105,…
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ YEAR       <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008,…
## $ W          <dbl> 7.9, 7.8, 8.7, 9.0, 9.4, 10.4, 9.9, 9.8, 10.3, NA, 9.5, 9.5…
## $ A          <dbl> 14.0, 15.0, 12.4, 14.6, 18.0, 18.2, 12.0, 14.2, 16.3, NA, 1…
## $ TIME       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ MONTH      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

Chunk 3: Fit Train Data, Predict Test Data, and Save Results

DATA2=DATA
DATA2$linpred=NA  # column to hold the out-of-sample predictions

# hold out river 103 as the test set; train on the other 30 rivers
TEST = NEST.DATA %>% filter(L==103) %>% unnest(cols = c(data))
TRAIN  = NEST.DATA %>% filter(L!=103) %>% unnest(cols = c(data))

# fit the linear model on the training rivers and predict river 103
linmod=lm(W~A,data=TRAIN)
linmodpred=predict(linmod,newdata=TEST)

# store the predictions in the rows belonging to river 103
DATA2$linpred[which(DATA2$L==103)]=linmodpred
tail(DATA2)

Chunk 4: Create a Loop to Iterate Process for Each Location

DATA2=DATA
DATA2$linpred=NA

for(k in unique(DATA2$L)){
  # Here we loop over 31 locations. k takes values from (103,105,107,...)
  TEST = COMPLETE
  TRAIN  = COMPLETE

  linmod= COMPLETE
  linmodpred=COMPLETE

  DATA2$linpred[FILL]=COMPLETE
}
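
For reference, here is a sketch of one way the loop can be completed, mirroring Chunk 3 (one possible solution, not the only one):

for(k in unique(DATA2$L)){
  # hold out river k as the test set and train on the remaining 30 rivers
  TEST  = NEST.DATA %>% filter(L==k) %>% unnest(cols = c(data))
  TRAIN = NEST.DATA %>% filter(L!=k) %>% unnest(cols = c(data))

  linmod     = lm(W~A,data=TRAIN)
  linmodpred = predict(linmod,newdata=TEST)

  # store the out-of-sample predictions in the rows belonging to river k
  DATA2$linpred[which(DATA2$L==k)] = linmodpred
}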

Chunk 5: Calculate Cross-Validated RMSE

RMSE.func=function(actual,predict){
  # root mean squared error, ignoring missing values
  mse=mean((actual-predict)^2,na.rm=T)
  rmse=sqrt(mse)
  return(rmse)
}
RMSE.func(actual=DATA2$W,predict=DATA2$linpred)
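
As a quick illustrative check (not part of the original analysis): if a prediction is off by 1 on one of three observations, the RMSE is \(\sqrt{1/3}\approx 0.577\):

RMSE.func(actual=c(1,2,3),predict=c(1,2,4))   # sqrt(1/3), approximately 0.577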

Part 2: K-Fold CV for Polynomial Model Evaluation

Cross-Validation

  • One way to evaluate and compare the performance of different models

  • Leave-one-out cross-validation
    • repeatedly fit the statistical learning method on training sets that contain \(n - 1\) observations and obtain the prediction for the single excluded observation. \[ CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_i \] where \(MSE_i = (y_i-\hat{y}_i)^2\) is computed from the prediction at \(x_i\) made by the model fitted to \(\{(x_1,y_1),\ldots,(x_{i-1},y_{i-1}),(x_{i+1},y_{i+1}),\ldots,(x_n,y_n)\}\) (see the small worked sketch after this list).
    • there is no randomness in the training/validation set splits; performing LOOCV multiple times will always yield the same results.
    • can also be used in the classification setting by replacing \(MSE_i\) with the error rate \(Err_i = I(y_i\neq \hat{y}_i)\)

  • K-fold cross-validation
    • involves randomly dividing the set of observations into \(k\) groups (folds) of approximately equal size. Each fold in turn is treated as the validation set, the method is fit on the remaining \(k-1\) folds, and the mean squared error \(MSE_i\) is computed on the observations in the held-out fold. Averaging over the \(k\) folds gives \[ CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i \]
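
To make the \(CV_{(n)}\) formula concrete, here is a minimal LOOCV sketch on the river-103 subset, using the simple W ~ A model only to keep the example small (an illustration assuming the objects defined in Part 1):

RIVER103 = DATA %>% filter(L==103) %>% drop_na(W,A)
n = nrow(RIVER103)
mse_i = rep(NA,n)
for(i in 1:n){
  fit_i  = lm(W~A,data=RIVER103[-i,])           # fit on the remaining n-1 days
  pred_i = predict(fit_i,newdata=RIVER103[i,])  # predict the held-out day
  mse_i[i] = (RIVER103$W[i]-pred_i)^2
}
mean(mse_i)        # the LOOCV estimate CV_(n)
sqrt(mean(mse_i))  # on the RMSE scale, comparable to RMSE.func()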

Chunk 1: Exploratory Figures
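
The figures themselves are not reproduced here; a sketch of the kind of exploratory plots used, showing W against A and against JULIAN_DAY (assuming ggplot2):

ggplot(na.omit(DATA),aes(x=A,y=W))+
  geom_point(alpha=0.2)+
  labs(x="Maximum air temperature",y="Maximum water temperature")

ggplot(na.omit(DATA),aes(x=JULIAN_DAY,y=W))+
  geom_point(alpha=0.2)+
  labs(x="Julian day",y="Maximum water temperature")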

Chunk 2: Polynomial Fitting

\[W=a+\sum_{i=1}^Ib_iA^i+\sum_{j=1}^Jc_jD^j+\varepsilon\]

where \(A\) is the daily maximum air temperature and \(D\) is the Julian day (JULIAN_DAY in the data).

polymodel=lm(W~poly(A,4)+poly(JULIAN_DAY,3),data=na.omit(DATA))
tidy(polymodel)
glance(polymodel)
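
Note that poly() uses orthogonal polynomials by default, so the coefficients reported by tidy() are not the raw \(b_i\) and \(c_j\) of the equation above, although the fitted values are identical. If raw-scale coefficients are wanted, an equivalent alternative (not what the chunk above does) is:

polymodel.raw=lm(W~poly(A,4,raw=TRUE)+poly(JULIAN_DAY,3,raw=TRUE),data=na.omit(DATA))
tidy(polymodel.raw)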

Chunk 3: Splitting Data for Cross-Validation

DATA3=na.omit(DATA) %>% crossv_kfold(10)
DATA3
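
crossv_kfold() returns a tibble with a train column, a test column, and a fold identifier .id; the train and test entries are lightweight resample objects (row indices plus a pointer to the data) rather than copies of the data. To inspect a single fold, a resample can be coerced to a data frame, for example:

as.data.frame(DATA3$train[[1]]) %>% glimpse()   # rows in the first training fold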

Chunk 4: Fitted Models and Predictions

train.model.func=function(data,i,j){
  # fit W on a degree-i polynomial in A plus a degree-j polynomial in JULIAN_DAY
  mod=lm(W~poly(A,i)+poly(JULIAN_DAY,j),data=data)
  return(mod)
}

i=4
j=3

DATA4=DATA3 %>% 
       mutate(tr.model=map(train,train.model.func,i=i,j=j))
DATA4

Chunk 5: Predicted Values and Cross-Validated RMSE

DATA4.PREDICT = DATA4 %>% 
          # apply each trained model to its held-out test fold; augment() adds the .fitted predictions
          mutate(predict=map2(test,tr.model,~augment(.y,newdata=.x))) %>%
          select(predict) %>%
          unnest(cols = c(predict))
head(DATA4.PREDICT)
RMSE.func(actual=DATA4.PREDICT$W,predict=DATA4.PREDICT$.fitted)

Chunk 6: Select Best (i,j) Pair

max_i = 10
max_j = 10
rmse_results = matrix(NA,max_i,max_j)  # rows index the degree in A (i), columns the degree in JULIAN_DAY (j)
for(i in 1:max_i){
  for(j in 1:max_j){
    DATA4=DATA3 %>% 
       mutate(tr.model=map(train,train.model.func,i=i,j=j))
    DATA4.PREDICT = DATA4 %>% 
          mutate(predict=map2(test,tr.model,~augment(.y,newdata=.x))) %>%
          select(predict) %>%
          unnest(cols = c(predict))
    rmse_results[i,j] = RMSE.func(actual=DATA4.PREDICT$W,predict=DATA4.PREDICT$.fitted)
  }
}
which(rmse_results==min(rmse_results), arr.ind = T)
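
which(..., arr.ind = TRUE) reports the row (degree in A) and column (degree in JULIAN_DAY) of the smallest cross-validated RMSE; to also report that minimum value (a small follow-up, not part of the original chunk):

best = which(rmse_results==min(rmse_results),arr.ind=TRUE)
c(i=best[1,"row"],j=best[1,"col"],rmse=min(rmse_results))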

Part 3: Exercise on Simulated Data

We will now perform cross-validation on a simulated data set.
  a. Generate a simulated data set as follows: n=100 and p=2, and the model used is \[ Y = X - 2X^2 + \epsilon, \] where \(\epsilon\) follows a standard normal distribution.
set.seed(123)
x=rnorm(100)
y=COMPLETE
sim=tibble(predictor=x,response=y)
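
One possible completion of the blank above, following the stated model (a suggested solution, not the only valid one):

set.seed(123)
x=rnorm(100)
y=x-2*x^2+rnorm(100)   # Y = X - 2X^2 + epsilon, with epsilon ~ N(0,1)
sim=tibble(predictor=x,response=y)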
  b. Create a scatterplot of X against Y. Comment on what you find.
#
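
A sketch of one way to draw the plot, assuming ggplot2 and the sim tibble defined above:

ggplot(sim,aes(x=predictor,y=response))+
  geom_point()+
  labs(x="X",y="Y")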

Comment: ANSWER_HERE.

  c. Compute the 5-fold RMSE that results from fitting the following four models:
  1. \(Y = \beta_0 + \beta_1X + \epsilon\)

  2. \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon\)

  3. \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)

  4. \(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \beta_4X^4 + \epsilon\)

max_i = 4
rmse_ex = rep(NA,max_i)
train.model.func2 = function(data,i){
  COMPLETE
  return(mod)
}
for(i in 1:max_i){
  sim_final = sim %>%
    crossv_kfold(k=FILL) %>%
    mutate(tr.model=map(FILL)) %>%
    mutate(predict=map2(FILL)) %>%
    select(predict) %>%
    unnest(cols = c(predict))
  rmse_ex[i] = RMSE.func(COMPLETE)
}
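
For reference, a sketch of one possible completion of the template above, using 5 folds and poly() for the polynomial degree (one reasonable solution among several):

train.model.func2 = function(data,i){
  # polynomial of degree i in the predictor
  mod=lm(response~poly(predictor,i),data=data)
  return(mod)
}
for(i in 1:max_i){
  sim_final = sim %>%
    crossv_kfold(k=5) %>%
    mutate(tr.model=map(train,train.model.func2,i=i)) %>%
    mutate(predict=map2(test,tr.model,~augment(.y,newdata=.x))) %>%
    select(predict) %>%
    unnest(cols = c(predict))
  rmse_ex[i] = RMSE.func(actual=sim_final$response,predict=sim_final$.fitted)
}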

Check which model has the smallest RMSE:

rmse_ex
  d. Which of the models in (c) had the smallest RMSE? Is this what you expected? Explain your answer.

ANSWER_HERE