Classification is often the goal of supervised learning. In classification, the response variable has a finite number of outcomes and is typically a categorical variable. Relevant information from a set of predictor variables is used to classify an observation into one of the response variable’s categories. K-Nearest Neighbors (\(k\)-NN) is a popular machine learning technique for classification. The \(k\)-NN method is intuitive, non-parametric, and requires essentially no distributional assumptions.
Although our focus is on a binary response variable, \(k\)-NN generalizes easily to response variables with more than two classes, although it becomes more computationally intensive and requires more training data in that situation. The method also becomes less reliable as the number of input variables grows, especially when some of the input variables are irrelevant. Furthermore, since \(k\)-NN is affected by the scale of the input variables, all predictor variables should be standardized; a small illustration of why follows below.
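To see why scale matters, consider a toy example (not drawn from the Titanic data) in which one predictor is measured in dollars and the other is a small count; the variable names below are made up. The Euclidean distance between observations is dominated by the dollar variable until both predictors are standardized.
# Toy illustration: distances before and after standardizing two predictors
fare=c(7,70,8)      # hypothetical fares in dollars (large scale)
family=c(0,0,5)     # hypothetical family sizes (small scale)
# raw distances from observation 1 to the others: fare dominates completely
sqrt((fare-fare[1])^2+(family-family[1])^2)
# after standardizing each variable, both predictors contribute comparably
z.fare=(fare-mean(fare))/sd(fare)
z.family=(family-mean(family))/sd(family)
sqrt((z.fare-z.fare[1])^2+(z.family-z.family[1])^2)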
The data contained in library(titanic) is part of an ongoing competition at Kaggle.com. In this tutorial, we will develop an understanding of \(k\)-NN by attempting to predict whether a passenger would survive or die given important information known prior to the fatal iceberg collision. The data previewed below display information we know about 891 different passengers who boarded the death ship. For all of these passengers, we know whether or not they lived based on the variable Survived. If Survived==1, we know the passenger lived to see better days. In this dataset, 342 passengers survived and the rest perished. The variables SibSp (# of Siblings or Spouses), Parch (# of Parents or Children), and Fare (Price of Ticket in $) represent the information we want to use to classify passengers into one of two categories: Survived=1 or Died=0.
library(titanic)    # titanic_train and titanic_test data
library(tidyverse)  # dplyr verbs and ggplot2 used throughout
T1=titanic_train[,c("Survived","SibSp","Parch","Fare")]
head(T1)
##   Survived SibSp Parch    Fare
## 1        0     1     0  7.2500
## 2        1     1     0 71.2833
## 3        1     0     0  7.9250
## 4        1     1     0 53.1000
## 5        0     0     0  8.0500
## 6        0     0     0  8.4583
Given \(k\) and a test observation \(x_0\), \(k\)-NN classification proceeds as follows:
\(k\)-NN identifies the \(k\) points in the training data that are closest to \(x_0\), represented by \(\mathcal N_0\).
The distance is usually measured with the Euclidean distance.
It then estimates the conditional probability for class \(j\) as the fraction of points in \(\mathcal N_0\) whose response values equal \(j\): \[ \Pr(Y=j\mid X=x_0) =\frac1k\sum_{i\in\mathcal N_0}I(y_i=j). \]
The \(k\)-NN classifier then assigns the test observation \(x_0\) to the class with the largest estimated probability; a minimal sketch of this vote appears in the code below.
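To make the rule concrete, here is a minimal, hand-rolled sketch of the \(k\)-NN vote for a single test point using base R only. The function name knn.vote and the tiny simulated data set are invented for illustration and are not part of the Titanic analysis.
# Hand-rolled k-NN vote for one test point x0 (illustrative only)
knn.vote=function(train.x,train.y,x0,k=5){
  d=sqrt(rowSums(sweep(as.matrix(train.x),2,x0)^2))  # Euclidean distances to x0
  N0=order(d)[1:k]                                   # indices of the k nearest neighbors
  p=table(train.y[N0])/k                             # estimated Pr(Y = j | X = x0)
  names(which.max(p))                                # class with the largest probability
}
# tiny simulated example: two predictors, binary response
set.seed(1)
train.x=data.frame(x1=rnorm(20),x2=rnorm(20))
train.y=factor(rep(c("Died","Survived"),each=10))
knn.vote(train.x,train.y,x0=c(0,0),k=5)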
The difficult part of this process is choosing an appropriate \(k\):
If \(k\) is too small, we may get unstable predictions (high variance).
If \(k\) is too large, we may persistently produce biased, oversmoothed predictions.
Some form of cross-validation in the training data is usually used to tune the technique for the best \(k\) so that we can confidently classify new observations.
To simplify the picture, we combine SibSp and Parch into a single predictor, Family (the total number of family members aboard), and plot it against Fare, coloring each passenger by survival status.
T2 = T1 %>%
mutate(Family=SibSp+Parch) %>%
select(-SibSp,-Parch)
head(T2)
ggplot(T2) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
theme_minimal()+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
Alice had 3 family members on the ship and paid 100 dollars for the ticket. Did she survive or die?
Alice=c(3,100)
ggplot(T2) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8,alpha=0.1)+
theme_minimal()+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
k=5
# Euclidean distance between two points (numeric vectors of equal length)
dist.func=function(point1,point2){
  dist=sqrt(sum((point1-point2)^2))
  return(dist)
}
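As a quick sanity check of the helper before applying it to the full data set (the first call uses arbitrary made-up points):
dist.func(c(0,0),c(3,4))   # 3-4-5 right triangle, returns 5
dist.func(c(1,7.25),Alice) # first passenger (Family=1, Fare=7.25) to Alice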
T3=T2 %>%
mutate(d=apply(select(T2,Family,Fare),1,dist.func,point2=Alice)) %>%
arrange(d) %>%
filter(rank(d,ties.method="first")<=k)
print(T3)
ggplot(T3) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)))+
geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(min(T2$Family),max(T2$Family))+
ylim(min(T2$Fare),max(T2$Fare))+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
mean.Family=mean(T2$Family)
sd.Family=sd(T2$Family)
mean.Fare=mean(T2$Fare)
sd.Fare=sd(T2$Fare)
ST3= T2 %>%
mutate(Family=(Family-mean.Family)/sd.Family,
Fare=(Fare-mean.Fare)/sd.Fare)
Z.Alice=(Alice-c(mean.Family,mean.Fare))/c(sd.Family,sd.Fare)
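The same standardization can be written with base R's scale() function, which the exercise at the end of this tutorial also uses; ST3.alt below is just an illustrative alias for comparison.
# scale() centers each column by its mean and divides by its standard deviation
ST3.alt= T2 %>%
  mutate(Family=as.numeric(scale(Family)),
         Fare=as.numeric(scale(Fare)))
all.equal(ST3.alt$Fare,ST3$Fare)   # should be TRUE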
ggplot(ST3) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(-1,10)+
ylim(-1,10)+
xlab("Standardized Family")+
ylab("Standardized Fare")+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
ST4=ST3 %>%
mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
arrange(d) %>%
filter(rank(d,ties.method="first")<=k)
print(ST4 %>%
mutate(Family=sd.Family*Family+mean.Family,
Fare=sd.Fare*Fare+mean.Fare)
)
ggplot(ST4) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(-1,10)+
ylim(-1,10)+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
k=500
ST5=ST3 %>%
mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
arrange(d) %>%
filter(rank(d,ties.method="first")<=k)
ggplot(ST5) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(-1,10)+
ylim(-1,10)+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
KNN.PREDICT=table(ST5$Survived)
print(KNN.PREDICT)
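Dividing these counts by \(k=500\) turns the raw vote into the estimated conditional probabilities from the formula above; prop.table() does exactly that.
# Estimated Pr(Survived = j | X = Alice) based on the 500 nearest neighbors
prop.table(KNN.PREDICT)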
ST5 %>% mutate(Family=sd.Family*Family+mean.Family,
Fare=sd.Fare*Fare+mean.Fare) %>%
arrange(desc(d)) %>%
head(10)
library(class)   # provides knn() and knn.cv() for k-nearest neighbor classification
possible.k=1:250
accuracy.k=rep(NA,250)
for(k in 1:250){
  # leave-one-out cross-validated predictions for this choice of k
  cv.out=knn.cv(train=select(ST3,Family,Fare),
                cl=factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")),
                k=k)
  correct=mean(cv.out==factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")))
  accuracy.k[k]=correct
}
ggplot(data=tibble(possible.k,accuracy.k)) +
geom_line(aes(x=possible.k,y=accuracy.k),color="lightskyblue2",size=2) +
theme_minimal() +
xlab("Choice of k") +
ylab("Percentage of Accurate Predictions") +
theme(text=element_text(size=20))
best.k=which.max(accuracy.k)
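Before moving to the test set, it is worth printing the winning \(k\) along with its cross-validated accuracy:
best.k               # value of k with the highest cross-validated accuracy
accuracy.k[best.k]   # accuracy achieved at that k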
TEST = titanic_test[,c("SibSp","Parch","Fare")] %>%
mutate(Family=SibSp+Parch) %>%
select(-SibSp,-Parch) %>%
mutate(Fare=(Fare-mean.Fare)/sd.Fare,
Family=(Family-mean.Family)/sd.Family) %>%
na.omit()
TEST2 = TEST %>%
mutate(Predict=knn(train=select(ST3,Family,Fare),
test=select(TEST,Family,Fare),
cl=factor(ST3$Survived,levels=c(0,1),
labels=c("Died","Survived")),
k=best.k)) %>%
mutate(Family=sd.Family*Family+mean.Family,
Fare=sd.Fare*Fare+mean.Fare)
ggplot(TEST2) +
geom_point(aes(x=Family,y=Fare,color=Predict),
size=2,alpha=0.3) +
theme_minimal() +
theme(text=element_text(size=20))
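A quick tabulation shows how the tuned classifier splits the test passengers between the two classes (the exact counts depend on the best.k found above):
table(TEST2$Predict)   # predicted class counts for the Kaggle test set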
This question should be answered using \(k\)-NN and the Weekly data set, which is part of the ISLR package. The data set contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.
library(ISLR)   # provides the Weekly data set
head(Weekly,5)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
Standardize the predictor columns of Weekly (columns 2 through 7) with the scale function.
Weekly[2:7] = COMPLETE
Put the observations with Year less than 2008 into the training set and put the rest into the testing set.
train = Weekly[COMPLETE,]
test = Weekly[COMPLETE,]
Use the lag variables (Lag1, Lag2, Lag3, Lag4, Lag5) and Volume as predictors and predict the label for the response Direction in the test set with k=10, then generate the confusion matrix.
knn.pred = knn(TRAIN_FEAT, TEST_FEAT, LABEL_TRAIN, k = FILL)
table(PREDICTION, TRUE_LABEL)
Repeat the prediction for k = seq(5,200,10). Make a plot that visualizes the relationship between the value of \(k\) and the testing accuracy.
k = seq(5,200,10)
accuracy = rep(NA,20)
for (i in c(1:20)){
  knn.pred = knn(TRAIN, TEST, LABEL, k = FILL)
  accuracy[i] = COMPLETE
}
ggplot(data=tibble(k,accuracy)) +
geom_line(aes(COMPLETE),color="lightskyblue2",size=2) +
theme_minimal() +
xlab("Choice of k") +
ylab("Percentage of Accurate Predictions") +
theme(text=element_text(size=10))
What is the best \(k\) based on the plot?
# ANSWER_HERE:___________