Classification is often the goal of supervised learning. In classification, the response variable has a finite number of outcomes and is typically a categorical variable. Relevant information from a set of predictor variables is used to classify an observation into one of the response variable's categories. K-Nearest Neighbors (\(k\)-NN) is a popular machine learning technique for classification. The \(k\)-NN method is intuitive, non-parametric, and makes essentially no distributional assumptions about the data.
Although our focus is on a binary response variable, \(k\)-NN generalizes easily to response variables with more than two classes; however, the method becomes more computationally intensive and requires more training data in that situation. It also becomes less reliable and harder to apply when there are many input variables, especially when some of those variables are irrelevant. Furthermore, since \(k\)-NN is sensitive to the scale of the input variables, all predictor variables should be standardized.
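Standardization can be done by hand, as we do later in this tutorial, or with R's built-in scale() function. Here is a minimal sketch on a small made-up data frame (the values are arbitrary and only illustrate the idea):
toy = data.frame(Fare=c(7.25, 71.28, 8.05, 53.10), Family=c(1, 1, 0, 1))
# Put both predictors on a common z-score scale so neither dominates the
# Euclidean distance used by k-NN
toy.std = as.data.frame(scale(toy))  # subtract each column mean, divide by its sd
colMeans(toy.std)                    # approximately 0 for both columns
apply(toy.std, 2, sd)                # exactly 1 for both columns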
The data contained in the titanic package is part of an
ongoing competition at Kaggle.com. In this
tutorial, we will develop an understanding of \(k\)-NN by attempting to predict whether a
passenger would survive or die given important information known prior
to the fatal iceberg collision. The data previewed below displays
information we know about 891 passengers who boarded the
ill-fated ship. For all these passengers, we know whether or not they lived
based on the variable Survived. If
Survived==1, we know the passenger lived to see better
days. In this dataset, 342 passengers survived and the rest perished.
The variables SibSp (# of Siblings or Spouses),
Parch (# of Parents or Children), and Fare
(Price of Ticket in $) represent information we want to utilize to
classify passengers to one of two categories: Survived=1 or
Died=0.
library(titanic)   # Titanic passenger data (titanic_train, titanic_test)
library(tidyverse) # dplyr verbs and ggplot2
library(class)     # knn() and knn.cv()

T1=titanic_train[,c("Survived","SibSp","Parch","Fare")]
head(T1)
##   Survived SibSp Parch    Fare
## 1        0     1     0  7.2500
## 2        1     1     0 71.2833
## 3        1     0     0  7.9250
## 4        1     1     0 53.1000
## 5        0     0     0  8.0500
## 6        0     0     0  8.4583
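As a quick check of the class balance described above:
table(T1$Survived)   # should show 549 Died (0) and 342 Survived (1)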
Given \(k\) and a test observation \(x_0\), \(k\)-NN classification proceeds as follows.
First, \(k\)-NN identifies the \(k\) points in the training data that are closest to \(x_0\); this set of neighbors is denoted \(N_0\).
Usually, the distance is measured using the Euclidean distance.
It then estimates the conditional probability for class \(j\) as the fraction of points in \(N_0\) whose response values equal \(j\): \[ \Pr(Y=j\mid X=x_0) =\frac1k\sum_{i\in N_0}I(y_i=j). \]
Finally, the \(k\)-NN classifier assigns the test observation \(x_0\) to the class with the largest estimated probability.
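A minimal sketch of this rule in R, where neighbor.classes is a hypothetical vector holding the observed responses of the \(k=5\) nearest neighbors of \(x_0\):
# Hypothetical responses of the k = 5 nearest neighbors of x0
neighbor.classes = c(1, 0, 1, 1, 0)
# Estimated Pr(Y = j | X = x0): fraction of neighbors in each class
prob = table(neighbor.classes) / length(neighbor.classes)
prob                                  # 0.4 for class 0, 0.6 for class 1
names(prob)[which.max(prob)]          # classify x0 to the class with the largest probability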
The difficult part of this process is choosing an appropriate \(k\):
If \(k\) is too small, the predictions can be unstable.
If \(k\) is too large, the predictions become overly smooth and persistently biased.
Some form of cross-validation on the training data is usually used to find the best \(k\), so that we can confidently classify new observations.
To simplify the problem, we combine SibSp and Parch into a single variable Family and keep only Fare and Family as predictors.
T2 = T1 %>%
  mutate(Family=SibSp+Parch) %>%
  select(-SibSp,-Parch)
head(T2)
##   Survived    Fare Family
## 1        0  7.2500      1
## 2        1 71.2833      1
## 3        1  7.9250      0
## 4        1 53.1000      1
## 5        0  8.0500      0
## 6        0  8.4583      0
ggplot(T2) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
theme_minimal()+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
Alice had 3 family members on the ship and paid 100 dollars for the ticket. Did she survive or die?
Alice=c(3,100)   # (Family, Fare)
ggplot(T2) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8,alpha=0.1)+
theme_minimal()+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
k=5
# Euclidean distance between two points (numeric vectors of equal length)
dist.func=function(point1,point2){
  dist=sqrt(sum((point1-point2)^2))
  return(dist)
}
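As a quick sanity check of this helper, the distance between the points (0, 0) and (3, 4) should be 5:
dist.func(c(0,0), c(3,4))   # sqrt(3^2 + 4^2) = 5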
T3=T2 %>%
  mutate(d=apply(select(T2,Family,Fare),1,dist.func,point2=Alice)) %>%  # distance from each passenger to Alice
  arrange(d) %>%
  filter(rank(d,ties.method="first")<=k)                                # keep the k closest passengers
print(T3)
##   Survived    Fare Family        d
## 1        1  93.500      2 6.576473
## 2        0 106.425      1 6.729088
## 3        1 106.425      0 7.090883
## 4        1  93.500      0 7.158911
## 5        1 108.900      1 9.121952
ggplot(T3) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)))+
geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(min(T2$Family),max(T2$Family))+
ylim(min(T2$Fare),max(T2$Fare))+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
Because Fare (measured in dollars) is on a much larger scale than Family (a small count), it dominates the Euclidean distance above: notice that all five neighbors have fares near 100 dollars regardless of family size. We therefore standardize both predictors before recomputing the neighbors.
mean.Family=mean(T2$Family)
sd.Family=sd(T2$Family)
mean.Fare=mean(T2$Fare)
sd.Fare=sd(T2$Fare)
ST3= T2 %>%
  mutate(Family=(Family-mean.Family)/sd.Family,
         Fare=(Fare-mean.Fare)/sd.Fare)
# Standardize Alice's (Family, Fare) with the same training means and sds
Z.Alice=(Alice-c(mean.Family,mean.Fare))/c(sd.Family,sd.Fare)
ggplot(ST3) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(-1,10)+
ylim(-1,10)+
xlab("Standardized Family")+
ylab("Standardized Fare")+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
ST4=ST3 %>%
mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
arrange(d) %>%
filter(rank(d,ties.method="first")<=k)
print(ST4 %>%
mutate(Family=sd.Family*Family+mean.Family,
Fare=sd.Fare*Fare+mean.Fare)
)
##   Survived  Fare Family         d
## 1        1 120.0      3 0.4024677
## 2        1 120.0      3 0.4024677
## 3        1 120.0      3 0.4024677
## 4        1 120.0      3 0.4024677
## 5        1  93.5      2 0.6334387
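Before plotting, note what these standardized neighbors imply: all five survived, so the \(k=5\) majority vote classifies Alice as a survivor. A quick check:
table(ST4$Survived)   # all 5 nearest (standardized) neighbors survived, so predict Survived = 1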
ggplot(ST4) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(-1,10)+
ylim(-1,10)+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
# Part 4: Tuning \(k\) for \(k\)-NN
To see what happens when \(k\) is chosen far too large, we first classify Alice using her 500 nearest neighbors.
k=500
ST5=ST3 %>%
mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
arrange(d) %>%
filter(rank(d,ties.method="first")<=k)
ggplot(ST5) +
geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
theme_minimal()+
xlim(-1,10)+
ylim(-1,10)+
theme(text=element_text(size=20)) +
guides(color=guide_legend(title="Survived"))
KNN.PREDICT=table(ST5$Survived)
print(KNN.PREDICT)
##
## 0 1
## 251 249
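As a small sketch, the majority vote in this table can be turned into a single predicted class for Alice:
# The class with the most votes among the 500 neighbors wins
names(KNN.PREDICT)[which.max(KNN.PREDICT)]   # "0" (Died), since 251 > 249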
With \(k\) this large, the "neighborhood" includes passengers who look nothing like Alice; the ten farthest of the 500 neighbors all paid about 13 dollars and traveled alone:
ST5 %>% mutate(Family=round(sd.Family*Family+mean.Family),
               Fare=sd.Fare*Fare+mean.Fare) %>%
  arrange(desc(d)) %>%
  head(10)
##    Survived Fare Family        d
## 1         1   13      0 2.553877
## 2         1   13      0 2.553877
## 3         1   13      0 2.553877
## 4         0   13      0 2.553877
## 5         0   13      0 2.553877
## 6         0   13      0 2.553877
## 7         1   13      0 2.553877
## 8         0   13      0 2.553877
## 9         0   13      0 2.553877
## 10        0   13      0 2.553877
To tune \(k\), we use leave-one-out cross-validation with knn.cv() from the class package: each passenger is classified from its \(k\) nearest neighbors among the remaining passengers, and we record the overall accuracy for each candidate \(k\).
possible.k=1:250
accuracy.k=rep(NA,250)
for(i in 1:250){
  cv.out=knn.cv(train=select(ST3,Family,Fare),
                cl=factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")),
                k=i)
  correct=mean(cv.out==factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")))
  accuracy.k[i]=correct
}
ggplot(data=tibble(possible.k,accuracy.k)) +
geom_line(aes(x=possible.k,y=accuracy.k),color="lightskyblue2",size=2) +
theme_minimal() +
xlab("Choice of k") +
ylab("Percentage of Accurate Predictions") +
theme(text=element_text(size=20))
best.k=which.max(accuracy.k)
print(best.k)
## [1] 13
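As a small follow-up sketch (not part of the original output), we can compare the leave-one-out accuracy at the tuned \(k\) with the accuracy at \(k=1\):
accuracy.k[best.k]   # LOOCV accuracy at the tuned k
accuracy.k[1]        # LOOCV accuracy at k = 1, often lower and noisier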
Finally, we apply the tuned classifier to the Kaggle test set. The test predictors are standardized using the training means and standard deviations, and rows with missing values are dropped.
TEST = titanic_test[,c("SibSp","Parch","Fare")] %>%
  mutate(Family=SibSp+Parch) %>%
  select(-SibSp,-Parch) %>%
  mutate(Fare=(Fare-mean.Fare)/sd.Fare,
         Family=(Family-mean.Family)/sd.Family) %>%
  na.omit()
TEST2 = TEST %>%
mutate(Predict=knn(train=select(ST3,Family,Fare),
test=select(TEST,Family,Fare),
cl=factor(ST3$Survived,levels=c(0,1),
labels=c("Died","Survived")),
k=best.k)) %>%
mutate(Family=sd.Family*Family+mean.Family,
Fare=sd.Fare*Fare+mean.Fare)
ggplot(TEST2) +
geom_point(aes(x=Family,y=Fare,color=Predict),
size=2,alpha=0.3) +
theme_minimal() +
theme(text=element_text(size=20))
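Since the test data comes from the Kaggle competition mentioned at the start, these predictions can also be written out as a submission file. A minimal sketch, assuming any missing Fare values in titanic_test are filled with the training median (the file name knn_submission.csv is arbitrary):
SUB = titanic_test %>%
  mutate(Family=SibSp+Parch,
         Fare=ifelse(is.na(Fare), median(T1$Fare), Fare),   # fill missing fares with the training median
         Fare=(Fare-mean.Fare)/sd.Fare,
         Family=(Family-mean.Family)/sd.Family)
# knn() returns a factor with levels "0"/"1"; convert back to 0/1 integers
SUB$Survived = as.integer(knn(train=select(ST3,Family,Fare),
                              test=select(SUB,Family,Fare),
                              cl=factor(ST3$Survived),
                              k=best.k)) - 1
write.csv(select(SUB,PassengerId,Survived), "knn_submission.csv", row.names=FALSE)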
# Exercise
This question should be answered using \(k\)-NN and the Weekly data
set, which is part of the ISLR package. This data set contains
1089 weekly percentage returns for the S&P 500 stock index over 21 years, from the beginning of 1990 to the end
of 2010.
library(ISLR)   # for the Weekly data set
head(Weekly,5)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
Standardize the quantitative predictors in Weekly with the scale function.
Weekly[2:7] = scale(Weekly[2:7])
Put the observations with Year less than 2008 into the training set and the rest into the testing set.
train = Weekly[Weekly$Year<2008,]
test = Weekly[Weekly$Year>=2008,]
Use the lag variables (Lag1, Lag2, Lag3, Lag4, Lag5) and Volume as predictors to predict the label of the response Direction in the test set with k=10, and generate the confusion matrix.
knn.pred = knn(train[,2:7], test[,2:7], train[,9], k = 10)
table(knn.pred, test[,9])
##         
## knn.pred Down Up
##     Down   21 25
##     Up     51 59
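The overall test accuracy implied by this confusion matrix is the fraction of weeks on the diagonal:
mean(knn.pred == test[,9])   # (21 + 59) / 156, roughly 0.51 for the matrix above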
Repeat the prediction for k = seq(5,200,10). Make a plot that visualizes the
relationship between the value of \(k\) and
testing accuracy.
k = seq(5,200,10)
accuracy = rep(0,20)
for (i in c(1:20)){
knn.pred = knn(train[,2:7], test[,2:7] ,train[,9], k = k[i])
accuracy[i] = mean(knn.pred == test[,9])
}
ggplot(data=tibble(k,accuracy)) +
geom_line(aes(x=k,y=accuracy),color="lightskyblue2",size=2) +
theme_minimal() +
xlab("Choice of k") +
ylab("Percentage of Accurate Predictions") +
theme(text=element_text(size=10))
What is the best \(k\) based on the plot?
k[which.max(accuracy)]
## [1] 25
Answer: based on the plot, the best \(k\) is 25.