Overview: For each question, show your R code that you used to answer each question in the provided chunks. When a written response is required, be sure to answer the entire question in complete sentences outside the code chunks. When figures are required, be sure to follow all requirements to receive full credit. Point values are assigned for every part of this analysis.
Helpful: Make sure you knit the document as you go
through the assignment. Check all your results in the created HTML file.
Change Eval=F
to Eval=T
and knit the document
to HTML to make sure it works.
Submission: Submit via Canvas. Must be submitted as an HTML file generated in RStudio.
The rivers of the world are home to numerous fish species whose existence is dependent on the temperature of the water. Specifically for salmonid, read this article by Katharine Carter, an environmental scientist and lover of fish from sunny California. Salmonid varieties all thrive under different temperature ranges and issues arise when river temperatures are outside these ranges.
It’s a cool place and they say it gets colder. You’re bundled up now wait ’til you get older.
As we have been notified, it’s getting hot in here. Global warming is happening, and these fish are getting heatstroke. You can take my snow, but hands off my fish. I need that high quality protein and omega-3 fatty acids for mad gains. Because of these “fake” facts, protectors of the water have become interested in developing predictive models for maximum water temperature.
But the meteor men beg to differ. Judging by the hole in the satellite picture.
Below is a preview of a dataset containing close to a full year of
daily observed maximum air and maximum water temperatures for 31
different rivers in Spain. The variable D
represents the
Julian day and takes values from 1 to 365. The
variable L
identifies 31 different measurement station
locations. Variables W
and A
are the maximum
water and air temperatures, respectively. Finally, the variable named
T
for time maintains the order of the data according to
when it was observed. For the sake of our sanity, all days missing
important information have been removed using
na.omit()
.
DATA=na.omit(read_csv("AirWaterTemp.csv"))
glimpse(DATA)
## Rows: 9,857
## Columns: 5
## $ D <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…
## $ L <dbl> 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103, 103…
## $ W <dbl> 14.2, 14.4, 14.4, 10.9, 10.8, 10.7, 10.3, 10.1, 9.8, 10.5, 10.5, 9.9…
## $ A <dbl> 21.2, 16.8, 15.4, 10.8, 11.7, 12.4, 13.2, 11.8, 5.1, 3.5, 5.3, 6.2, …
## $ T <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2…
The ice we skate is getting pretty thin. The water’s getting warm so you might as well swim.
What is the point of stealing your tuition, if I cannot grab some
fish? The model below is an expression of a family of polynomial
regression models that use A
and D
to explain
the variation in W
. Every choice of \(I\) and \(J\) leads to a different model. For our
purpose, we would like to consider all possible subset models where
\(I\leq 7\) and \(J\leq 7\). To choose the best model, we
rely on cross-validation (CV) to estimate out-of-sample root mean
squared error (RMSE). The best choice of \(I\) and \(J\) minimize prediction error. I highly
recommend seeing the previous Tutorial
which provides necessary code to complete this assignment. If you don’t
look at this tutorial, you will be miserable like these fish.
\[W = a+\sum\limits_{i=1}^I b_i A^i + \sum\limits_{j=1}^J c_j D^j + \epsilon\]
My world’s on fire. How about yours? That’s the way I like it and I’ll never get bored.
Where every you see COMPLETE
, there is code required for
you to write. Lines of code where you see #DO NOT CHANGE
are meant to not be touched. The output that results from these lines is
what will be graded. As you go through the assignment, knit the document
to PDF. Make sure you change eval=F
to eval=T
.
Check your final PDF document before you submit.
In a previous Tutorial,
I demonstrated how to use 10-Fold CV to obtain an out-of-sample RMSE
when \(I=4\) and \(J=3\). In this assignment, we will also use
10 folds. Using crossv_kfold()
, create a new dataframe
called DATA2
that contains two new variables
train
and test
that are list-columns.
Code and Output:
set.seed(216) #DO NOT CHANGE
COMPLETE
head(DATA2) #DO NOT CHANGE
Create a function called RMSE.func()
that takes two
vector arguments called actual and predict
representing actual responses in the data and predicted responses from a
model, respectively. This function should output the RMSE, which
measures the overall error between actual and predicted values.
Code and Output:
x=c(1,3,4) #DO NOT CHANGE
y=c(0,0,0) #DO NOT CHANGE
COMPLETE
RMSE.func(actual=x,predict=y) #DO NOT CHANGE
For a specific \(I\) and \(J\), the following function fits the desired polynomial model to a given set of data. This function can be utilized to fit polynomial regression models of varying degrees.
train.model.func=function(data,I,J){
mod=lm(W~poly(A,I)+poly(D,J),data=data)
return(mod)
}
In the code chunk below, we begin by initiating an empty \(7 \times 7\) matrix of missing values
called OUT.RMSE
. Each row corresponds to a different choice
of \(I\) and each column corresponds to
a different choice of \(J\). Apply the
code in the previous Tutorial
in a double loop that performs 10-Fold CV to estimate out-of-sample RMSE
under each polynomial model where \(I \in
\{1,2,3, \cdots,7\}\) and \(J \in
\{1,2,3, \cdots,7\}\) and then saves the RMSE in the \((I,J)\)-cell of the matrix
OUT.RMSE
.
Code and Output:
OUT.RMSE=matrix(NA,7,7) #DO NOT CHANGE
COMPLETE
print(OUT.RMSE) #DO NOT CHANGE
In the code chunk below, we start by making the information found the
summarized information in OUT.RMSE
tidy. There are three
columns in OUT.RMSE2
that links the cross-validated RMSE to
all considered combinations of \(I\)
and \(J\). Change eval=F
to eval=T
before knitting to HTML.
OUT.RMSE2=as.tibble(OUT.RMSE) %>%
mutate(I=1:7) %>%
rename(`1`=V1,`2`=V2,`3`=V3,`4`=V4,`5`=V5,`6`=V6,`7`=V7) %>%
select(I,everything()) %>%
gather(`1`:`7`,key="J",value="RMSE",convert=T) %>%
mutate(I=as.factor(I),J=as.factor(J))
head(OUT.RMSE2)
Create a tibble called BEST5.RMSE
which contains the
rows in OUT.RMSE2
corresponding to the best five models
according to the RMSE and sorted from best to worst.
Code and Output:
COMPLETE
head(BEST5.RMSE) #DO NOT CHANGE
Now, observe the figure below that shows the change in maximum water
temperature (\(W\)) across Julian days
(\(D\)). Change eval=F
to
eval=T
before knitting to HTML.
ggplot(DATA) +
geom_point(aes(x=D,y=W),alpha=0.05,stroke=0) +
theme_minimal() +
xlab("Julian Day")+
ylab("Max Water Temperature")
Using mutate()
, we create a tibble
BEST5.DATA
based off DATA
. The new object
BEST5.DATA
contains 5 columns of predictions under the top
5 models based on the values of I
and J
in
BEST5.RMSE
. The five columns of predictions should be given
names First
, Second
, Third
,
Fourth
,Fifth
, in order from best to worst.
Change eval=F
to eval=T
before knitting to
HTML.
BEST5.DATA=DATA %>%
mutate(First=predict(lm(W~poly(A,as.numeric(BEST5.RMSE$I[1]))+poly(D,as.numeric(BEST5.RMSE$J[1])),data=DATA)),
Second=predict(lm(W~poly(A,as.numeric(BEST5.RMSE$I[2]))+poly(D,as.numeric(BEST5.RMSE$J[2])),data=DATA)),
Third=predict(lm(W~poly(A,as.numeric(BEST5.RMSE$I[3]))+poly(D,as.numeric(BEST5.RMSE$J[3])),data=DATA)),
Fourth=predict(lm(W~poly(A,as.numeric(BEST5.RMSE$I[4]))+poly(D,as.numeric(BEST5.RMSE$J[4])),data=DATA)),
Fifth=predict(lm(W~poly(A,as.numeric(BEST5.RMSE$I[5]))+poly(D,as.numeric(BEST5.RMSE$J[5])),data=DATA)))
Then, I want you to use the pipe %>%
with
gather()
to create a new tibble called
BEST5.DATA2
that gathers all the predictions. A variable
named Model
should contain the values
“First”,“Second”,“Third”,“Fourth”, and “Fifth”. A variable named
Predict
should contain the predictions corresponding to the
appropriate models. In gather()
, be sure to set
factor_key=T
to ensure that the new variable
Model
is a factor variable with ordered levels logically
from “First” to “Fifth”.
Code and Output:
COMPLETE
head(BEST5.DATA2) #DO NOT CHANGE
Create a figure that overlays the raw and predicted maximum water
temperatures for the top 5 models given the Julian Day. The raw data
needs to be shown in a scatter plot using geom_point()
with
alpha=0.05
and stroke=0
. The predictions
should be created using geom_line()
with different colors
for each of the Models. Label the x-axis “Julian Day” and the y-axis
“Max Water Temperature”. In a complete sentence, explain how the
predicted maximum water temperatures differ for the top five models.
Code and Output (3 Points):
ggplot(BEST5.DATA2) +
COMPLETE
Answer (2 Points): ANSWER IN COMPLETE SENTENCES HERE
The following two figures show the marginal change in the average
out-of-sample RMSE as \(I\) and \(J\) increase. Based on these figures, what
\(I\) and \(J\) would you recommend going forward?
Critically, think about what these figures are telling us. Give a reason
for your answer based on these graphics. Answer the question below the
two figures in complete sentences. Change eval=F
to
eval=T
before knitting to HTML.
Answer: ANSWER IN COMPLETE SENTENCES HERE
I want you to create a simple function called
BEST.func()
that outputs a vector of length 2 where the
first element is the \(I\) and the
second element is the \(J\) that
corresponds to the lowest RMSE. The output vector should be a numeric
vector and not a tibble with factor variables. The only argument called
“data” should be a data frame that is identical in format to
OUT.RMSE2
. The last line will process the function when
data=OUT.RMSE2
and save the best choices of I
and J
to an object called BEST.CHOICE
and then
print out the vector of length 2 with the ideal \(I\) and \(J\) leading to the smallest RMSE. I advise
using the function which.min()
.
Code and Output:
BEST.func=function(data){
COMPLETE
}
BEST.CHOICE=BEST.func(data=OUT.RMSE2) #DO NOT CHANGE
print(BEST.CHOICE) #DO NOT CHANGE
Based on the best model, I want you to obtain the fitted values for
all observations in DATA
. Then, I want you to create a
figure that shows the relationship between the fitted maximum water
temperatures under the best model and the actual maximum water
temperatures. In this figure, you should label the x-axis “Actual Max
Water Temperature” and the y-axis “Fitted Max Water Temperature”. There
should also be a 45 degree reference line with y-intercept equal to 0
indicating perfect prediction. Make this reference line the color “red”.
The function predict()
or augment()
could be
helpful here along with the previously created function
train.model.func()
.
Code and Output:
COMPLETE
Based on the best model, I want you to obtain the residuals for all
observations in DATA
. Then, I want you to create a
scatterplot that shows the different residuals over day D
.
In this figure, you should label the x-axis “Day” and the y-axis
“Residual”. There should also be a horizontal reference line at 0
indicating perfect prediction. Make this reference line the color “red”.
The function residuals()
, works similar to
predict()
, but outputs the residuals and not the fitted
values.
Code and Output:
COMPLETE