Introduction

Classification is a common goal of supervised learning. In classification, the response variable is categorical, taking on a finite number of outcomes, and information from a set of predictor variables is used to assign an observation to one of the response variable’s categories. K-Nearest Neighbors (\(k\)-NN) is a popular machine learning technique for classification. The \(k\)-NN method is intuitive, non-parametric, and makes no assumptions about the form of the relationship between the predictors and the response.

Although our focus is on a binary response variable, \(k\)-NN generalizes easily to response variables with more than two classes, though it becomes more computationally intensive and requires more training data in that situation. The method also becomes less reliable as the number of input variables grows, especially when some of those variables are irrelevant. Furthermore, since \(k\)-NN is affected by the scale of the input variables, all predictor variables should be standardized.

The data in library(titanic) comes from an ongoing competition at Kaggle.com. In this tutorial, we will develop an understanding of \(k\)-NN by attempting to predict whether a passenger survived or died given information known prior to the fatal iceberg collision. The data previewed below contains information on 891 passengers who boarded the ship. For each passenger, the variable Survived records whether they lived: if Survived==1, the passenger survived. In this dataset, 342 passengers survived and the rest perished. The variables SibSp (# of siblings or spouses aboard), Parch (# of parents or children aboard), and Fare (ticket price in $) represent the information we will use to classify passengers into one of two categories: Survived=1 or Died=0.
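
The code chunks in this tutorial assume that a few packages have been loaded beforehand. The setup chunk is not shown here, so the list below is reconstructed from the functions used later (titanic for the data, dplyr and tibble for data manipulation, ggplot2 for plotting, class for knn() and knn.cv(), and ISLR for the Weekly data in the exercise):

library(titanic)   # titanic_train and titanic_test
library(dplyr)     # %>%, mutate, select, arrange, filter
library(tibble)    # tibble()
library(ggplot2)   # plots
library(class)     # knn() and knn.cv()
library(ISLR)      # Weekly data set used in the exercise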

# Keep the response (Survived) and the predictors we will use
T1=titanic_train[,c("Survived","SibSp","Parch","Fare")]
head(T1)
##   Survived SibSp Parch    Fare
## 1        0     1     0  7.2500
## 2        1     1     0 71.2833
## 3        1     0     0  7.9250
## 4        1     1     0 53.1000
## 5        0     0     0  8.0500
## 6        0     0     0  8.4583

\(k\)-NN Algorithm

Given \(k\) and a test observation \(x_0\), \(k\)-NN classification proceeds as follows (a minimal code sketch appears after the list):

  • \(k\)-NN identifies the \(k\) points in the training data that are closest to \(x_0\), represented by \(\mathcal N_0\).

  • Distance is usually measured with Euclidean distance.

  • It then estimates the conditional probability for class \(j\) as the fraction of points in \(\mathcal N_0\) whose response values equal \(j\): \[ \Pr(Y=j\mid X=x_0) =\frac1k\sum_{i\in\mathcal N_0}I(y_i=j). \]

  • The \(k\)-NN classifier assigns the test observation \(x_0\) to the class with the largest probability.
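
These steps translate almost directly into a few lines of R. The function below is a minimal sketch of the rule for a single test point; the name knn.classify.one and its arguments are ours for illustration and are not used elsewhere in this tutorial.

# Minimal k-NN rule for one test observation x0
# X: matrix/data frame of training predictors; y: vector of class labels
knn.classify.one=function(X,y,x0,k=5){
  d=sqrt(rowSums(sweep(as.matrix(X),2,x0)^2))   # Euclidean distance from each row to x0
  N0=order(d)[1:k]                              # indices of the k nearest neighbors
  votes=table(y[N0])/k                          # estimated Pr(Y=j | X=x0) for each class j
  names(votes)[which.max(votes)]                # class with the largest estimated probability
}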

Number of Neighbors

The difficult part of this process is choosing an appropriate \(k\):

  • If \(k\) is too small, predictions may be unstable (high variance).

  • If \(k\) is too large, the classifier becomes overly smooth and may persistently produce biased predictions.

Some form of cross-validation on the training data is usually used to choose the best \(k\), so that we can confidently classify new observations.


Part 1: Feature Engineering and Visualization

Chunk 1: Creation of a New Variable Family

# Combine SibSp and Parch into a single count of family members aboard
T2 = T1 %>%
      mutate(Family=SibSp+Parch) %>%
      select(-SibSp,-Parch)
head(T2)

Chunk 2: Visualizing the Relationship

ggplot(T2) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
  theme_minimal()+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

Part 2: \(k\)-NN For Prediction

Suppose a new passenger, Alice, had 3 family members on the ship and paid $100 for her ticket. Did she survive or die?

Chunk 1: Information for Alice

Alice=c(3,100)   # Alice's information: (Family, Fare)
ggplot(T2) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
  geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8,alpha=0.1)+
  theme_minimal()+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

Chunk 2: Finding the \(k=5\) Most Similar Passengers

k=5

# Euclidean distance between two points
dist.func=function(point1,point2){
  dist=sqrt(sum((point1-point2)^2))
  return(dist)
}

# Compute each passenger's distance to Alice, then keep the k closest
T3=T2 %>% 
      mutate(d=apply(select(T2,Family,Fare),1,dist.func,point2=Alice)) %>%
      arrange(d) %>%
      filter(rank(d,ties.method="first")<=k)

print(T3)
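
A simple majority vote among these five neighbors gives the (unstandardized) \(k\)-NN prediction for Alice. The short follow-up below uses the T3 object created above:

# Tally the Survived labels of Alice's 5 nearest neighbors and take the majority
table(T3$Survived)
names(which.max(table(T3$Survived)))   # "1" = predicted to survive, "0" = predicted to die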

Chunk 3: Visualize Alice’s \(k=5\) Nearest Neighbors

ggplot(T3) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)))+
  geom_point(aes(x=Alice[1],y=Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(min(T2$Family),max(T2$Family))+
  ylim(min(T2$Fare),max(T2$Fare))+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))
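
For comparison, the class package (used later for tuning) can produce a 5-NN prediction in a single call. This is a sketch assuming the class package is loaded; it uses the same unstandardized features as above, although ties among equally distant neighbors may be handled slightly differently than in our manual calculation:

# One-call 5-NN prediction for Alice with class::knn on the unstandardized features
knn(train=select(T2,Family,Fare),
    test=data.frame(Family=Alice[1],Fare=Alice[2]),
    cl=factor(T2$Survived,levels=c(0,1),labels=c("Died","Survived")),
    k=5)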

Part 3: Transform and Revisit \(k\)-NN

Chunk 1: Standardizing and Visualizing Data

# Save the training means and standard deviations so that new observations
# (like Alice) and the test set can be standardized on the same scale
mean.Family=mean(T2$Family)
sd.Family=sd(T2$Family)
mean.Fare=mean(T2$Fare)
sd.Fare=sd(T2$Fare)

# Convert Family and Fare to z-scores
ST3= T2 %>% 
        mutate(Family=(Family-mean.Family)/sd.Family,
               Fare=(Fare-mean.Fare)/sd.Fare)

# Standardize Alice with the training means and standard deviations
Z.Alice=(Alice-c(mean.Family,mean.Fare))/c(sd.Family,sd.Fare)
ggplot(ST3) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),alpha=c(0.3))+
  geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(-1,10)+
  ylim(-1,10)+
  xlab("Standardized Family")+
  ylab("Standardized Fare")+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))
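
The same standardization can also be done with R's built-in scale function, which the exercise at the end of this tutorial uses. The sketch below is equivalent to the manual z-scores above; ST3.alt is our name and is not used elsewhere:

# Equivalent z-score standardization using scale(); as.numeric drops the matrix attributes
ST3.alt= T2 %>% 
        mutate(Family=as.numeric(scale(Family)),
               Fare=as.numeric(scale(Fare)))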

Chunk 2: The \(k=5\) Most Similar Passengers Again

# Recompute distances to Alice on the standardized scale and keep the k closest
ST4=ST3 %>% 
      mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
      arrange(d) %>%
      filter(rank(d,ties.method="first")<=k)

# Back-transform Family and Fare to their original units for display
print(ST4 %>%
        mutate(Family=sd.Family*Family+mean.Family,
               Fare=sd.Fare*Fare+mean.Fare)
      )

ggplot(ST4) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
  geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(-1,10)+
  ylim(-1,10)+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))
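
As before, a majority vote among these standardized neighbors gives the prediction for Alice; because Fare no longer dominates the distance, the set of neighbors (and possibly the prediction) may differ from the unstandardized result. The short follow-up below uses the ST4 object created above:

# Tally the Survived labels of Alice's 5 standardized nearest neighbors
table(ST4$Survived)
names(which.max(table(ST4$Survived)))   # predicted class for Alice (0 = died, 1 = survived)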

Part 4: Tuning \(k\) for \(k\)-NN

Chunk 1: Predicting Alice’s Survival for Large \(k\)

k=500
ST5=ST3 %>% 
  mutate(d=apply(select(ST3,Family,Fare),1,dist.func,point2=Z.Alice)) %>%
  arrange(d) %>%
  filter(rank(d,ties.method="first")<=k)

ggplot(ST5) +
  geom_point(aes(x=Family,y=Fare,color=factor(Survived)),size=4)+
  geom_point(aes(x=Z.Alice[1],y=Z.Alice[2]),shape="A",size=8)+
  theme_minimal()+
  xlim(-1,10)+
  ylim(-1,10)+
  theme(text=element_text(size=20)) +
  guides(color=guide_legend(title="Survived"))

# Vote counts among Alice's 500 nearest neighbors
KNN.PREDICT=table(ST5$Survived)
print(KNN.PREDICT)

# The 10 farthest of these 500 neighbors, back-transformed to the original units
ST5 %>% mutate(Family=sd.Family*Family+mean.Family,
               Fare=sd.Fare*Fare+mean.Fare) %>%
  arrange(desc(d)) %>%
  head(10)

Chunk 2: Leave-One-Out CV for Different \(k \in \{1,2,\cdots, 250\}\)

possible.k=1:250
accuracy.k=rep(NA,250)

for(k in 1:250){
  # knn.cv performs leave-one-out cross-validation on the training data
  cv.out=knn.cv(train=select(ST3,Family,Fare),
                cl=factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")),
                k=k)
  # proportion of passengers whose held-out prediction matches their true label
  correct=mean(cv.out==factor(ST3$Survived,levels=c(0,1),labels=c("Died","Survived")))
  accuracy.k[k]=correct
}

ggplot(data=tibble(possible.k,accuracy.k)) +
  geom_line(aes(x=possible.k,y=accuracy.k),color="lightskyblue2",size=2) +
  theme_minimal() +
  xlab("Choice of k") +
  ylab("Percentage of Accurate Predictions") +
  theme(text=element_text(size=20))

Chunk 3: Using Optimal \(k\)-NN for Prediction

best.k=which.max(accuracy.k)   # k with the highest LOOCV accuracy

# Prepare the test set: build Family, standardize with the TRAINING means/SDs,
# and drop rows with missing values
TEST = titanic_test[,c("SibSp","Parch","Fare")] %>%
        mutate(Family=SibSp+Parch) %>%
        select(-SibSp,-Parch) %>%
        mutate(Fare=(Fare-mean.Fare)/sd.Fare,
               Family=(Family-mean.Family)/sd.Family) %>%
        na.omit()

# Classify each test passenger with the tuned k, then back-transform for plotting
TEST2 = TEST %>% 
          mutate(Predict=knn(train=select(ST3,Family,Fare),
                             test=select(TEST,Family,Fare),
                             cl=factor(ST3$Survived,levels=c(0,1),
                                       labels=c("Died","Survived")),
                             k=best.k)) %>%
          mutate(Family=sd.Family*Family+mean.Family,
                 Fare=sd.Fare*Fare+mean.Fare)

ggplot(TEST2) +
  geom_point(aes(x=Family,y=Fare,color=Predict),
             size=2,alpha=0.3) +
  theme_minimal() +
  theme(text=element_text(size=20))
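
To summarize how the tuned classifier labels the test passengers, the predictions can be tabulated. This is a small follow-up using the TEST2 object created above:

# Count the predicted classes on the test set
table(TEST2$Predict)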

Exercise

This question should be answered using \(k\)-NN and the Weekly data set, which is part of the ISLR package. The data contain 1,089 weekly returns spanning 21 years, from the beginning of 1990 to the end of 2010.

head(Weekly,5)
##   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
## 1 1990  0.816  1.572 -3.936 -0.229 -3.484 0.1549760 -0.270      Down
## 2 1990 -0.270  0.816  1.572 -3.936 -0.229 0.1485740 -2.576      Down
## 3 1990 -2.576 -0.270  0.816  1.572 -3.936 0.1598375  3.514        Up
## 4 1990  3.514 -2.576 -0.270  0.816  1.572 0.1616300  0.712        Up
## 5 1990  0.712  3.514 -2.576 -0.270  0.816 0.1537280  1.178        Up
  1. Standardize columns 2-7 of Weekly with the scale function.
Weekly[2:7] = COMPLETE
  2. Split the data set into a training set containing the rows with Year less than 2008, and put the rest into a testing set.
train = Weekly[COMPLETE,]
test = Weekly[COMPLETE,]
  3. Use the five lag variables (Lag1, Lag2, Lag3, Lag4, Lag5) and Volume as predictors, predict the label of the response Direction in the test set with k=10, and generate the confusion matrix.
knn.pred = knn(TRAIN_FEAT, TEST_FEAT, LABEL_TRAIN, k = FILL)
table(PREDICTION, TRUE_LABEL)
  4. Compute the testing accuracy for different values of \(k\) in the sequence k = seq(5,200,10). Make a plot that visualizes the relationship between the value of \(k\) and testing accuracy.
k = seq(5,200,10)
accuracy = rep(NA,20)
for (i in c(1:20)){
  knn.pred = knn(TRAIN, TEST, LABEL, k = FILL)
  accuracy[i] = COMPLETE
}

ggplot(data=tibble(k,accuracy)) +
  geom_line(aes(COMPLETE),color="lightskyblue2",size=2) +
  theme_minimal() +
  xlab("Choice of k") +
  ylab("Percentage of Accurate Predictions") +
  theme(text=element_text(size=10))

What is the best \(k\) based on the plot?


ANSWER_HERE:___________