We will continue our work with daily water temperature and air temperature data observed for 31 rivers in Spain. In the preview below, W represents the daily maximum water temperature and A represents the daily maximum air temperature. The data contain almost a full year of observations for each of the 31 rivers.
## # A tibble: 6 × 8
## JULIAN_DAY YEAR L W A TIME MONTH DAY
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2003 103 14.2 21.2 1 1 1
## 2 2 2003 103 14.4 16.8 2 1 2
## 3 3 2003 103 14.4 15.4 3 1 3
## 4 4 2003 103 10.9 10.8 4 1 4
## 5 5 2003 103 10.8 11.7 5 1 5
## 6 6 2003 103 10.7 12.4 6 1 6
Using the data, we seek to identify the best model for predicting the maximum water temperature given the maximum air temperature. Previously, we randomly selected 3 rivers to act as a test set, and all models were evaluated on that test set alone. In this tutorial, we explore approaches that ensure all of the data are used for both model training and model testing.
Along the way, we apply helpful functions from the purrr, modelr, and broom packages. See the following links for helpful articles on performing cross-validation within the tidyverse: Link 1, Link 2, and Link 3.
We begin by nesting the data by location (L), so that each river's daily observations are stored as a tibble in a list-column:
NEST.DATA = DATA %>% group_by(L) %>% nest()
head(NEST.DATA)
## # A tibble: 6 × 2
## # Groups: L [6]
## L data
## <dbl> <list>
## 1 103 <tibble [365 × 7]>
## 2 105 <tibble [366 × 7]>
## 3 107 <tibble [365 × 7]>
## 4 111 <tibble [365 × 7]>
## 5 112 <tibble [366 × 7]>
## 6 118 <tibble [366 × 7]>
Using filter() with unnest(), we can split the nested data:
NEST.DATA %>% filter(L==103) %>% unnest() %>% glimpse()
## Rows: 365
## Columns: 8
## Groups: L [1]
## $ L <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103,…
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ YEAR <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003,…
## $ W <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10.5, …
## $ A <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5, 5…
## $ TIME <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
NEST.DATA %>% filter(L!=103) %>% unnest() %>% glimpse()
## Rows: 10,972
## Columns: 8
## Groups: L [30]
## $ L <dbl> 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105,…
## $ JULIAN_DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ YEAR <dbl> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008,…
## $ W <dbl> 7.9, 7.8, 8.7, 9.0, 9.4, 10.4, 9.9, 9.8, 10.3, NA, 9.5, 9.5…
## $ A <dbl> 14.0, 15.0, 12.4, 14.6, 18.0, 18.2, 12.0, 14.2, 16.3, NA, 1…
## $ TIME <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ MONTH <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DAY <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
As a first example, we hold out river L = 103 as the test set, train a linear model on the remaining 30 rivers, and store the out-of-sample predictions:
DATA2=DATA
DATA2$linpred=NA
TEST = NEST.DATA %>% filter(L==103) %>% unnest()
TRAIN = NEST.DATA %>% filter(L!=103) %>% unnest()
linmod=lm(W~A,data=TRAIN)
linmodpred=predict(linmod,newdata=TEST)
DATA2$linpred[which(DATA2$L==103)]=linmodpred
tail(DATA2)
DATA2=DATA
DATA2$linpred=NA
for(k in unique(DATA2$L)){
  # Here we loop over 31 locations. k takes values from (103,105,107,...)
  TEST = COMPLETE
  TRAIN = COMPLETE
  linmod = COMPLETE
  linmodpred = COMPLETE
  DATA2$linpred[FILL] = COMPLETE
}
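One way the completed loop might look, sketched here on simulated stand-in data (the real DATA has 31 rivers and daily observations; the column names L, W, A and object names follow the template above, but the numbers below are invented for illustration):

```r
library(dplyr)

# Simulated stand-in for DATA: 3 locations, 20 days each
set.seed(1)
DATA2 = tibble(
  L = rep(c(103, 105, 107), each = 20),
  A = rnorm(60, mean = 15, sd = 5)
) %>% mutate(W = 3 + 0.6 * A + rnorm(60))
DATA2$linpred = NA

# Leave-one-location-out: each river serves exactly once as the test set
for(k in unique(DATA2$L)){
  TEST  = DATA2 %>% filter(L == k)
  TRAIN = DATA2 %>% filter(L != k)
  linmod     = lm(W ~ A, data = TRAIN)
  linmodpred = predict(linmod, newdata = TEST)
  DATA2$linpred[which(DATA2$L == k)] = linmodpred
}

# Every observation now has an out-of-sample prediction
sum(is.na(DATA2$linpred))  # 0
```

The key idea is that the filter condition inside the loop uses k rather than the hard-coded 103, so each pass through the loop fills in the predictions for one river.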
RMSE.func = function(actual, predict){
  mse = mean((actual - predict)^2, na.rm = TRUE)
  rmse = sqrt(mse)
  return(rmse)
}
RMSE.func(actual=DATA2$W,predict=DATA2$linpred)
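As a quick sanity check, RMSE.func (redefined here so the snippet runs on its own) applied to a toy pair with errors (0, 0, 1) should return sqrt(1/3):

```r
RMSE.func = function(actual, predict){
  sqrt(mean((actual - predict)^2, na.rm = TRUE))
}
round(RMSE.func(actual = c(1, 2, 3), predict = c(1, 2, 4)), 3)  # 0.577
```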
We now consider a polynomial model in both the air temperature A and the Julian day D (the JULIAN_DAY column): \[W=a+\sum_{i=1}^Ib_iA^i+\sum_{j=1}^Jc_jD^j+\varepsilon\]
polymodel=lm(W~poly(A,4)+poly(JULIAN_DAY,3),data=na.omit(DATA))
tidy(polymodel)
glance(polymodel)
Next, crossv_kfold() from modelr splits the data into 10 random folds; each row of the result holds one train/test pair:
DATA3=na.omit(DATA) %>% crossv_kfold(10)
DATA3
train.model.func = function(data, i, j){
  # Fit a polynomial regression of degree i in A and degree j in JULIAN_DAY
  mod = lm(W ~ poly(A, i) + poly(JULIAN_DAY, j), data = data)
  return(mod)
}
i=4
j=3
DATA4=DATA3 %>%
mutate(tr.model=map(train,train.model.func,i=i,j=j))
DATA4
DATA4.PREDICT = DATA4 %>%
mutate(predict=map2(test,tr.model,~augment(.y,newdata=.x))) %>%
select(predict) %>%
unnest(predict)
head(DATA4.PREDICT)
RMSE.func(actual=DATA4.PREDICT$W,predict=DATA4.PREDICT$.fitted)
max_i = 10
max_j = 10
# rmse_results[i,j] will hold the cross-validated RMSE for degree pair (i,j)
rmse_results = matrix(NA, max_i, max_j)
for(i in 1:max_i){
  for(j in 1:max_j){
    DATA4 = DATA3 %>%
      mutate(tr.model = map(train, train.model.func, i = i, j = j))
    DATA4.PREDICT = DATA4 %>%
      mutate(predict = map2(test, tr.model, ~augment(.y, newdata = .x))) %>%
      select(predict) %>%
      unnest(cols = c(predict))
    rmse_results[i,j] = RMSE.func(actual = DATA4.PREDICT$W, predict = DATA4.PREDICT$.fitted)
  }
}
which(rmse_results==min(rmse_results), arr.ind = T)
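Since rmse_results is an i-by-j matrix, which(..., arr.ind = TRUE) reports the row and column of the minimum, i.e. the winning degree pair. On a toy matrix:

```r
m = matrix(c(4, 2, 9, 1), nrow = 2)   # filled column-wise; the minimum 1 sits at row 2, column 2
which(m == min(m), arr.ind = TRUE)    # row = 2, col = 2
```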
set.seed(123)
x=rnorm(100)
y=COMPLETE
sim=tibble(predictor=x,response=y)
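The choice of y is left blank above; as one hypothetical completion (an assumption for illustration, not the intended answer), we could generate a quadratic signal with noise:

```r
set.seed(123)
x = rnorm(100)
# Hypothetical data-generating process: quadratic in x plus noise
y = 1 + 2*x - 3*x^2 + rnorm(100)
sim = tibble::tibble(predictor = x, response = y)
head(sim)
```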
Comment: ANSWER_HERE.
\(Y = \beta_0 + \beta_1X + \epsilon\)
\(Y = \beta_0 + \beta_1X + \beta_2X^2 + \epsilon\)
\(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \epsilon\)
\(Y = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 + \beta_4X^4 + \epsilon\)
max_i = 4
rmse_ex = rep(NA,max_i)
train.model.func2 = function(data, i){
  COMPLETE
  return(mod)
}
for(i in 1:max_i){
  sim_final = sim %>%
    crossv_kfold(k=FILL) %>%
    mutate(tr.model=map(FILL)) %>%
    mutate(predict=map2(FILL)) %>%
    select(predict) %>%
    unnest(cols = c(predict))
  rmse_ex[i] = RMSE.func(COMPLETE)
}
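A sketch of how the blanks might be filled in, using the same pipeline as the water temperature example (k = 10 folds and the quadratic data-generating process are assumptions made for illustration):

```r
library(dplyr)
library(purrr)
library(tidyr)
library(modelr)
library(broom)

# Hypothetical simulated data, as one possible completion of the setup above
set.seed(123)
x = rnorm(100)
y = 1 + 2*x - 3*x^2 + rnorm(100)
sim = tibble(predictor = x, response = y)

RMSE.func = function(actual, predict) sqrt(mean((actual - predict)^2, na.rm = TRUE))

# Fit a polynomial regression of degree i in the predictor
train.model.func2 = function(data, i){
  mod = lm(response ~ poly(predictor, i), data = data)
  return(mod)
}

max_i = 4
rmse_ex = rep(NA, max_i)
for(i in 1:max_i){
  sim_final = sim %>%
    crossv_kfold(k = 10) %>%
    mutate(tr.model = map(train, train.model.func2, i = i)) %>%
    mutate(predict = map2(test, tr.model, ~augment(.y, newdata = .x))) %>%
    select(predict) %>%
    unnest(cols = c(predict))
  rmse_ex[i] = RMSE.func(actual = sim_final$response, predict = sim_final$.fitted)
}
rmse_ex  # the smallest value marks the preferred polynomial degree
```

Under this simulated quadratic signal, degrees 2 through 4 should all achieve a much smaller RMSE than the degree-1 fit.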
Check which model has the smallest RMSE:
rmse_ex
ANSWER_HERE