1-Nearest Neighbor Classification Essay

Posted by Nik

In my four years in analytics, more than 80% of the models I have built have been classification models, and only 15-20% have been regression models. These ratios hold more or less across the industry. The bias towards classification models exists because most analytical problems involve making a decision: will a customer attrite or not, should we target customer X for digital campaigns, does a customer have high potential, and so on. Such analyses are more insightful and link directly to an implementation roadmap. In this article, we will talk about another widely used classification technique called K-nearest neighbors (KNN). Our focus will be primarily on how the algorithm works and how the input parameter affects the output/prediction.

When do we use KNN algorithm?

KNN can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique we generally look at 3 important aspects:

1. Ease of interpreting the output

2. Calculation time

3. Predictive Power

Let us take a few examples to place KNN on this scale:

The KNN algorithm fares well across all the parameters under consideration. It is commonly used for its ease of interpretation and low calculation time.

How does the KNN algorithm work?

Let’s take a simple case to understand this algorithm. Following is a spread of red circles (RC) and green squares (GS):

You intend to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The “K” in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let’s say K = 3. Hence, we will now make a circle with BS as the center, just big enough to enclose only three data points on the plane. Refer to the following diagram for more details:

The three closest points to BS are all RC. Hence, with a good confidence level, we can say that BS should belong to the class RC. Here the choice became very obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is very crucial in this algorithm. Next, we will understand the factors to be considered to conclude the best K.
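
To make the vote concrete, here is a minimal Python sketch of the same idea. The coordinates are made up for illustration, since the exact positions in the diagram above are not given.

```python
from collections import Counter
import math

# Illustrative coordinates only; the diagram above does not give exact values.
training_points = [
    ((1.0, 1.5), "RC"), ((1.2, 2.0), "RC"), ((1.8, 1.0), "RC"),
    ((4.0, 4.5), "GS"), ((4.5, 4.0), "GS"), ((5.0, 5.0), "GS"),
]
blue_star = (1.5, 1.6)

# Take the K = 3 closest training points and let them vote on the class of BS.
k = 3
nearest = sorted(training_points, key=lambda pc: math.dist(pc[0], blue_star))[:k]
votes = Counter(label for _, label in nearest)
print(votes.most_common(1)[0][0])  # -> "RC", since all three neighbours are red circles
```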

How do we choose the factor K?

First, let us try to understand what exactly K influences in the algorithm. Going back to the last example, given that all six training observations remain constant, for a given value of K we can draw boundaries for each class. These boundaries will segregate RC from GS. In the same way, let's look at the effect of the value of K on the class boundaries. Following are the different boundaries separating the two classes for different values of K.

If you look carefully, you can see that the boundary becomes smoother with increasing values of K. As K increases towards infinity, the prediction finally becomes all blue or all red depending on the overall majority. The training error rate and the validation error rate are two quantities we need to assess for different values of K. Following is the curve for the training error rate with varying values of K:

As you can see, the error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself. Hence, the prediction is always accurate with K=1. If the validation error curve had been similar, our choice of K would have been 1. Following is the validation error curve with varying values of K:

This makes the story clearer. At K=1, we were overfitting the boundaries. Hence, the error rate initially decreases and reaches a minimum. Beyond that point, it increases with increasing K. To get the optimal value of K, you can split the initial dataset into training and validation sets. Then plot the validation error curve to obtain the optimal value of K. This value of K should be used for all predictions.
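
As a rough sketch of this procedure, the snippet below uses scikit-learn and a synthetic two-class dataset as a stand-in for our RC/GS example. It prints the training and validation error for a range of K values; the training error is zero at K=1, and the validation error typically follows the U-shaped curve described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data as a stand-in for the RC/GS example above.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - knn.score(X_train, y_train)   # always 0 at K=1
    val_err = 1 - knn.score(X_val, y_val)
    print(f"K={k:2d}  training error={train_err:.3f}  validation error={val_err:.3f}")
```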

End Notes

The KNN algorithm is one of the simplest classification algorithms. Even with such simplicity, it can give highly competitive results. The KNN algorithm can also be used for regression problems; the only difference from the methodology discussed here is that we use the average of the nearest neighbors rather than a vote from the nearest neighbors. KNN can be coded in a single line in R. I am yet to explore how we can use the KNN algorithm in SAS.
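
As a rough illustration of the regression variant (not the R one-liner mentioned above), scikit-learn's KNeighborsRegressor averages the targets of the nearest neighbours; the data below is made up:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-dimensional data: predict y by averaging the targets of the 3 nearest x values.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

reg = KNeighborsRegressor(n_neighbors=3).fit(X, y)
print(reg.predict([[2.6]]))  # mean of 1.9, 3.1 and 3.9 -> about 2.97
```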

Did you find the article useful? Have you used any other machine learning tool recently? Do you plan to use KNN in any of your business problems? If yes, share with us how you plan to go about it.



Introduction to K-nearest neighbor classifier

Most machine learning algorithms are parametric. What do we mean by parametric? Suppose we are trying to fit a linear regression model with one dependent variable and one independent variable. The best fit we are looking for is the line equation with optimized parameters.

The parameters could be the intercept and the coefficient. For any classification algorithm, we try to find a boundary that successfully separates the different target classes.

For a support vector machine, say, we try to find the margins and the support vectors. In this case too we have a set of parameters that needs to be optimized to get decent accuracy.

But today we are going to learn about a different kind of algorithm: a non-parametric classification algorithm. Wondering how we can model such an algorithm?

Let's walk through this post to learn how we can do that. To give you a one-line summary:

The algorithm uses the neighbor points information to predict the target class.

Before we dive further, let's look at the table of contents.

Table of contents

  • Introduction to K-nearest neighbor classifier
  • K-nearest neighbor classification step by step procedure
  • K-nearest neighbor (Knn) algorithm pseudocode
  • K-nearest neighbor algorithm example
  • How to choose the value of K?
  • Condensed nearest neighbor data reduction rule
  • K-nearest neighbor case study
  • Advantages and limitations of the K-nearest neighbors algorithm
  • References

Introduction to K-nearest neighbor classifier

The K-nearest neighbor classifier is one of the introductory supervised classifiers that every data science learner should be aware of. Fix and Hodges proposed the K-nearest neighbor classifier algorithm in 1951 for performing pattern classification tasks.

For simplicity, this classifier is called the KNN classifier; in fact, the k-nearest neighbor classifier is mostly written as KNN, even in many research papers. KNN addresses pattern recognition problems and is also one of the best choices for many classification-related tasks.

The simplest version of the K-nearest neighbor classifier algorithm predicts the target label by finding the nearest neighbor's class. The closest class is identified using a distance measure such as the Euclidean distance.
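
As a small illustration (the feature vectors below are hypothetical, not from any real dataset), the Euclidean distance between two points can be computed like this:

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared differences across features.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean_distance([7.5, 3.0, 1.0], [6.0, 5.0, 2.0]))  # -> about 2.69
```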

K-nearest neighbor classification step by step procedure

Before diving into the k-nearest neighbor classification process, let's understand an application-oriented example where we can use the KNN algorithm.

Knn classification application

Let’s assume a money-lending company "XYZ", like UpStart, IndiaLends, etc. The money-lending company XYZ is interested in making the money-lending system comfortable and safe for lenders as well as for borrowers. The company holds a database of customers' details.

Using customers' detailed information from the database, it calculates a credit score (a discrete value) for each customer. The calculated credit score helps the company and lenders understand a customer's credibility clearly, so they can simply decide whether or not they should lend money to a particular customer.

The customer’s details could be:

  • Educational background details.
    • Highest graduated degree.
    • Cumulative grade points average (CGPA) or marks percentage.
    • The reputation of the college.
    • Consistency in his lower degrees.
    • Whether an education loan was taken or not.
    • Cleared education loan dues.
  • Employment details.
    • Salary.
    • Years of experience.
    • Whether they got any onsite opportunities.
    • Average job change duration.

The company (XYZ) uses these kinds of details to calculate a customer's credit score. The process of calculating the credit score from the customer's details is expensive. To reduce this cost, the company noticed that customers with similar background details tend to get similar credit scores.

So, they decided to use the already available customer data and predict the credit score of a new customer by comparing it with data from similar customers. These kinds of problems, where we find similar kinds of customers, are handled by the k-nearest neighbor classifier.

K-nearest neighbor (Knn) algorithm pseudocode:

Let (Xi, Ci), where i = 1, 2, …, n, be the data points. Xi denotes the feature values and Ci denotes the label of Xi for each i.
Assuming the number of classes is 'c',
Ci ∈ {1, 2, 3, …, c} for all values of i.

Let x be a point whose label is not known; we would like to find its class label using the k-nearest neighbor algorithm.

Knn Algorithm Pseudocode:

  1. Calculate d(x, xi) for i = 1, 2, …, n, where d denotes the Euclidean distance between the points.
  2. Arrange the calculated n Euclidean distances in non-decreasing order.
  3. Let k be a positive integer; take the first k distances from this sorted list.
  4. Find the k points corresponding to these k distances.
  5. Let ki denote the number of points belonging to the ith class among these k points, i.e. ki ≥ 0.
  6. If ki > kj ∀ j ≠ i, then put x in class i.
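
Here is a minimal from-scratch sketch of the pseudocode above in Python (the example data at the bottom is hypothetical):

```python
from collections import Counter
import math

def knn_predict(x, training_data, k):
    """Predict the class of x from labelled training_data following the pseudocode above.

    training_data is a list of (features, label) pairs; k is a positive integer.
    """
    # Step 1: Euclidean distance from x to every training point xi.
    distances = [(math.dist(x, xi), ci) for xi, ci in training_data]
    # Steps 2-4: sort in non-decreasing order and keep the k closest points.
    k_nearest = sorted(distances, key=lambda d: d[0])[:k]
    # Steps 5-6: count points per class among the k neighbours and pick the majority class.
    counts = Counter(label for _, label in k_nearest)
    return counts.most_common(1)[0][0]

# Hypothetical usage with two classes, 1 and 2.
data = [((1, 1), 1), ((1, 2), 1), ((2, 2), 1), ((6, 6), 2), ((7, 7), 2), ((6, 7), 2)]
print(knn_predict((2, 1), data, k=3))  # -> 1
```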

For data science beginners, the above pseudocode might be hard to understand. So let's understand the KNN algorithm using an example.

K-Nearest neighbor algorithm example


Let’s consider the above image, where we have two different target classes: white circles and orange circles. We have a total of 26 training samples. Now we would like to predict the target class for the blue circle. Taking the value of k as three, we need to calculate the similarity distance using a similarity measure such as the Euclidean distance.

A smaller distance means the points are closer. In the image, we have calculated the distances and placed the circles closest to the blue circle inside the big circle.

To learn about different similarity measures, please check out the post on the five most popular similarity measures.

With the above example, you got some idea of the process of the KNN algorithm. Now read the next paragraph to understand the KNN algorithm in technical terms.

Let’s consider a setup with "n" training samples, where xi is the ith training data point. The training data points are categorized into "c" classes. Using KNN, we want to predict the class of a new data point. The first step is to calculate the (Euclidean) distance between the new data point and all the training data points.

The next step is to arrange all the distances in non-decreasing order. Assuming a positive value of "K", we take the "K" smallest values from the sorted list, giving us the K nearest neighbors. Let ki denote the number of points belonging to the ith class among these K points. If ki > kj ∀ j ≠ i, then we put x in class i.

Nearest Neighbor Algorithm:

Nearest neighbor is a special case of the k-nearest neighbor classifier, where the value of k is 1 (k = 1). In this case, the new data point's target class is assigned from its single closest neighbor.
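
With scikit-learn, for instance, this is just KNN with n_neighbors set to 1 (the toy data below is made up):

```python
from sklearn.neighbors import KNeighborsClassifier

# The nearest neighbor classifier is simply KNN with k = 1.
X = [[0, 0], [0, 1], [5, 5], [5, 6]]
y = ["white", "white", "orange", "orange"]

nn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(nn.predict([[4, 5]]))  # -> ['orange'], the class of the single closest point
```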

How to choose the value of K?

Selecting the value of K in the K-nearest neighbor algorithm is the most critical problem. A small value of K means that noise will have a higher influence on the result, i.e., the probability of overfitting is very high. A large value of K makes it computationally expensive and defeats the basic idea behind KNN (that points that are near are likely to have similar classes). A simple approach to selecting k is k = n^(1/2).

To optimize the results, we can use cross-validation. Using the cross-validation technique, we can test the KNN algorithm with different values of K. The model that gives the best accuracy can be considered the optimal choice.

It depends on the individual case; at times, the best process is to run through each possible value of k and test our results.
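
A rough sketch of both ideas (the sqrt(n) starting point and cross-validated selection of K) is shown below, using scikit-learn's built-in iris data purely as a stand-in for your own dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                    # any labelled dataset would do here

print("sqrt(n) heuristic:", int(np.sqrt(len(X))))    # a simple starting point for K

# Cross-validate each candidate K and keep the one with the best mean accuracy.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in range(1, 26)}
best_k = max(scores, key=scores.get)
print("best K by 5-fold cross-validation:", best_k, round(scores[best_k], 3))
```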

Condensed Nearest Neighbor Data Reduction Rule:

Working on a big dataset can be an expensive task. Using the condensed nearest neighbor rule, we can clean our data and keep only the important observations. This process can reduce the execution time of the machine learning algorithm, but there is a chance of a reduction in accuracy.

The first step in condensing is to divide the data points into these three groups (a code sketch follows the list):

  1. Outliers: Observations that lie at an abnormal distance from the other data points. Most of these are extreme values. Removing these observations will increase the accuracy of the model.
  2. Prototypes: The minimum set of points in the training set required to correctly recognize the non-outlier points.
  3. Absorbed points: Points that are correctly classified as non-outlier points by the prototypes, and so can be left out.
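
Below is a minimal sketch of one common way to implement this rule (Hart's condensed nearest neighbor algorithm). The outlier-removal step is not shown, and the usage at the bottom assumes you already have arrays X and y:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condense(X, y, random_state=0):
    """A rough sketch of Hart's condensed nearest neighbor rule.

    Points that the current prototype store already classifies correctly with
    1-NN are 'absorbed'; misclassified points are added to the store as prototypes.
    """
    rng = np.random.default_rng(random_state)
    order = rng.permutation(len(X))
    store = [order[0]]                               # seed the store with one point
    changed = True
    while changed:
        changed = False
        for i in order:
            nn = KNeighborsClassifier(n_neighbors=1).fit(X[store], y[store])
            if nn.predict(X[i:i + 1])[0] != y[i]:    # misclassified -> keep as a prototype
                store.append(i)
                changed = True
    return np.array(store)

# Hypothetical usage, assuming X and y are NumPy arrays of features and labels:
# keep = condense(X, y)
# X_reduced, y_reduced = X[keep], y[keep]
```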

K-Nearest Neighbor case study

Breast cancer diagnosis using k-nearest neighbor (Knn) algorithm

To diagnose breast cancer, a doctor uses their experience to analyze the details provided by a) the patient's past medical history and b) reports of all the tests performed. At times, it is difficult to diagnose cancer even for experienced doctors, since the information provided by the patient might be unclear and insufficient.

Breast cancer dataset details

The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg. It contains 699 samples with 10 attributes; the full attribute list is given at the end of this post.

The main objective is to predict whether a sample is benign or malignant. Many scientists have worked on this dataset and predicted the class using different algorithms.

According to a research paper by M. Sarkar & T. Y. Leong, Department of Computer Science, National University of Singapore:

“Conceptually and implementation-wise, the K-nearest neighbors algorithm is simpler than other techniques that have been applied to this problem. In addition, the K-nearest neighbors algorithm produces the overall classification result 1.17% better than the best result known for this problem.”

This is a classification problem, and most classification techniques can be grouped into the following three categories:

  1. Parametric
  2. Semiparametric
  3. Non-Parametric

Parametric and semiparametric classifiers need specific information about the structure of the data in the training set. This requirement is difficult to fulfill in many cases. So, a non-parametric classifier like KNN was considered.

The data was randomly split into training, cross-validation, and testing sets. Experimentation was done with values of K from 1 to 15. With the KNN algorithm, the classification accuracy on the test set fluctuates between 98.02% and 99.12%. The best performance was obtained when K was 1.
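
As a hedged sketch of this kind of experiment: scikit-learn ships a later, diagnostic version of the Wisconsin breast cancer data (569 samples, 30 features), not the original 699-sample set used in the paper, but the same K = 1 to 15 sweep looks like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in (diagnostic) Wisconsin breast cancer data, used here as a stand-in.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=0)

for k in range(1, 16):                      # try K = 1 to 15, as in the study
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    print(f"K={k:2d}  test accuracy={model.score(X_test, y_test):.4f}")
```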

Advantages of K-nearest neighbors algorithm

  • KNN is simple to implement.
  • KNN executes quickly for small training data sets.
  • Its performance asymptotically approaches the performance of the Bayes classifier.
  • It doesn't need any prior knowledge about the structure of the data in the training set.
  • No retraining is required if a new training pattern is added to the existing training set.

Limitations of the K-nearest neighbors algorithm

  • When the training set is large, it may take a lot of space.
  • For every test point, the distance must be computed to all the training data, so testing can take a lot of time.

References

[1] Leif E. Peterson (2009). K-nearest neighbor. Scholarpedia.

[2] Sarkar M., Leong T. Y. Application of K-nearest neighbors algorithm on breast cancer diagnosis problem.

[3] Breast Cancer Wisconsin (Original) Data Set, UCI Machine Learning Repository.

[4] Oliver Sutton. Introduction to k Nearest Neighbour Classification and Condensed Nearest Neighbour Data Reduction.


I hope you liked this post. If you have any questions, feel free to comment below. If you want me to write on a specific topic, do tell me in the comments below.


Attribute details of the Breast Cancer Wisconsin (Original) dataset:

   #   Attribute                      Domain
   -------------------------------------------------------------
   1.  Sample code number             id number
   2.  Clump Thickness                1-10
   3.  Uniformity of Cell Size        1-10
   4.  Uniformity of Cell Shape       1-10
   5.  Marginal Adhesion              1-10
   6.  Single Epithelial Cell Size    1-10
   7.  Bare Nuclei                    1-10
   8.  Bland Chromatin                1-10
   9.  Normal Nucleoli                1-10
  10.  Mitoses                        1-10
  11.  Class                          2 for benign, 4 for malignant
