How can we find the optimum K in K-Nearest Neighbor?

02 February 2014

Sometimes it's mentioned that, as a rule of thumb, setting K to the square root of the number of training patterns/samples can lead to better results.

Is there any justification for that term or have you ever seen that in a paper?

Any other straightforward solutions?

Konstantin Nagaev Popular answer

It seems to me that there are many confusions with "K"s.

K in K-fold (KFCV) and K in K-Nearest Neighbours (KNN) are distinctly different characteristics.

K in K-fold is the number of parts (folds) the dataset is split into; each fold serves once as the test sample while the remaining folds form the training sample.

K in KNN is the number of instances that we take into account for determination of affinity with classes.

They have strictly different meaning!

Of course, you can use KFCV for testing performance of KNN with some various quantities of neighbours and it will be useful.

But you can't (and you should not) take the value for K in KNN from K of KFCV.

So you need to investigate the performance of KNN near the rule-of-thumb value and decide on the optimal one using any performance-testing procedure (such as KFCV).
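A minimal sketch of that procedure in plain Python, to make the two Ks concrete. Everything here is an illustrative assumption rather than part of the answer above: the toy two-class Gaussian data, the function names, the choice of 5 folds, and the window of odd neighbour counts around the square-root value.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, n_neighbors):
    """Majority vote among the n_neighbors training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def kfold_accuracy(data, n_neighbors, n_folds=5, seed=0):
    """Mean KNN accuracy over n_folds train/test splits (the K of K-fold)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        hits = sum(knn_predict(train, x, n_neighbors) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds


def make_toy_data(n_per_class=40, seed=1):
    """Two overlapping 2-D Gaussian classes (hypothetical data)."""
    rng = random.Random(seed)
    data = []
    for label, centre in (("a", 0.0), ("b", 2.0)):
        for _ in range(n_per_class):
            data.append(((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)), label))
    return data


data = make_toy_data()
start = round(math.sqrt(len(data)))  # rule-of-thumb starting point
# Odd neighbour counts (the K of KNN) in a small window around the start.
candidates = [k for k in range(max(1, start - 4), start + 5) if k % 2 == 1]
scores = {k: kfold_accuracy(data, k) for k in candidates}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Note that n_folds stays fixed while n_neighbors is the quantity being tuned.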

Abdullah Yousafzai

You can go through this paper, which describes a bootstrap method:

"Choice of neighbor order in nearest-neighbor classification". There are also some parameter optimization heuristics; Google them.

Mahnaz Behroozi

M. Naresh, you're right but at the same time, by doing that, the class/decision boundaries will not be precise any more.

Khurram Khurshid

There is no rule of thumb, Mahnaz. The choice of K is driven by the end application as well as the dataset.

Simone Scardapane

The simplest solution is probably K-Fold Cross Validation:

http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation

You can use this method for any kind of model selection, but it is particularly suited here since: (i) you typically have only a few possible candidates for K (e.g. in the range 3 - 10 or 50 - 100); and (ii) performance is rather monotone in the number of neighbours.

An alternative, widely used technique is bootstrapping, as already written by Abdullah Yousafzai. The choice of K equal to the square root of the number of instances is an empirical rule of thumb popularized by the "Pattern Classification" book by Duda et al. It is probably a good starting point (as shown by the fact that many packages implementing KNN use it).
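As a rough illustration of the bootstrap alternative, the sketch below scores each candidate K by its average out-of-bag error. This is a plain out-of-bag bootstrap written for illustration, not the specific procedure of the paper cited earlier in the thread; the toy data, function names, number of rounds and candidate list are all assumptions.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, n_neighbors):
    """Majority vote among the n_neighbors training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def bootstrap_error(data, n_neighbors, n_rounds=20, seed=0):
    """Average out-of-bag error of KNN over n_rounds bootstrap resamples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(n_rounds):
        # Draw len(data) indices with replacement; unpicked points are held out.
        picked = [rng.randrange(len(data)) for _ in data]
        chosen = set(picked)
        train = [data[i] for i in picked]
        held_out = [data[i] for i in range(len(data)) if i not in chosen]
        if not held_out:  # practically impossible for more than a few points
            continue
        wrong = sum(knn_predict(train, x, n_neighbors) != y for x, y in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / len(errors)


# Hypothetical toy data: two overlapping 2-D Gaussian classes, 30 points each.
rng = random.Random(1)
data = [((rng.gauss(c, 1.0), rng.gauss(c, 1.0)), label)
        for label, c in (("a", 0.0), ("b", 2.0)) for _ in range(30)]

errors = {k: bootstrap_error(data, k) for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```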

Evangelos Kanoulas

Another option would be to use Bayesian methods (see this thesis http://www.gatsby.ucl.ac.uk/~heller/thesis.pdf) and let the data play against the priors to decide what the best K is.

Ahmad Hassanat

This is mentioned somewhere in the chapter "Classifiers Based on Inverted Distances" by Marcel Jirina and Marcel Jirina, Jr.

Arun Vats

At the same time, we can also say that if K is a prime number, it would lead to better results.

Saeed Reza Kheradpisheh

The KNN method is a non-parametric statistical classification technique, which means that no statistical distribution is fitted to the data of each class. Instead, the method predicts the class of a new data point from its nearest neighbors. The value of K is extremely dependent on the training data: changing the position of a few training points may lead to a significant loss of performance. Hence the method is not stable, particularly at class borders. However, K-fold cross-validation should be useful for finding the K value that leads to the best classification generalizability.

Mannes Poel

In my opinion the first thing to resolve in the question by Mahnaz is: What is meant by "optimal"?

Mahnaz Behroozi

Well, on a particular dataset, if we employ KNN with different values of "K", we obtain a different accuracy each time. So it seems natural to call the "K" that achieves the best accuracy the optimum "K".

Mannes Poel

Now the question becomes: What do you mean by "accuracy"?

George Terzakis

Hi Mahnaz.

The most typical solution to the problem of identifying K is to try to minimize the negative log-likelihood of the data plus a K-dependent penalty for a number of candidate values.

In other words,

Run the K-means algorithm for K = 1, 2, 3, 4, ...

Find the K that minimizes the expression: sumi(sumj(norm(x[i,j] - c[i]))) + a * K

where a is a positive cost, x[i,j] is the j-th point in the i-th cluster, c[i] is the i-th cluster center, and by sumi I mean the sum over the index i. Try to choose an a that causes significant changes in the log-likelihood from one K to another. In practice, you should be able to identify a along with the optimal K.

The expression sumi(sumj(norm(x[i,j] - c[i]))) + a * K is essentially an instance of the Bayesian Information Criterion, should you wish to look it up. You can extend this expression to take into account the scatter (covariance) of the data around each cluster center if that works any better.
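For readers who want to see the penalized expression evaluated, here is a small sketch that sums the distances norm(x[i,j] - c[i]) over all clusters and adds a * K. The function name, the four sample points, the hand-made clusterings (no K-means run is performed) and the penalty a = 3.0 are all illustrative assumptions.

```python
import math


def penalized_cost(clusters, penalty):
    """Sum of point-to-centre distances over all clusters, plus penalty * K."""
    total = 0.0
    for points in clusters:
        # c[i]: the mean of the points assigned to cluster i.
        centre = tuple(sum(coord) / len(points) for coord in zip(*points))
        total += sum(math.dist(p, centre) for p in points)
    return total + penalty * len(clusters)


# Hypothetical data: two tight pairs of points that are far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
# Hand-made clusterings for K = 1, 2 and 4 (stand-ins for K-means output).
clusterings = {
    1: [points],
    2: [points[:2], points[2:]],
    4: [[p] for p in points],
}
costs = {k: penalized_cost(c, penalty=3.0) for k, c in clusterings.items()}
best_k = min(costs, key=costs.get)
print(costs, best_k)
```

With this penalty, K = 2 gives the lowest cost: a single cluster pays heavily in distance, while four clusters pay heavily in penalty.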

Reudismam Rolim

Cross-validation is a great tool for this task.

To learn more about cross-validation, see chapter 5 (Resampling Methods) of this excellent book, which is available for free:

http://www-bcf.usc.edu/~gareth/ISL/

Walid Aly

The Elbow method

http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set

Goran S. Milovanović

The MCLUST package in R. It uses expectation-maximization to perform model-based clustering (i.e. clusters are multivariate Normal components of a probability mixture over the feature space) and provides a range of differently constrained solutions. BIC is used to compare the log-likelihoods. If you opt for MCLUST, you no longer rely on a non-parametric approach, but you will never again have to use rules of thumb and heuristics to select the optimal number of clusters.

Thorsten Merten

As Simone Scardapane correctly said, you have to know your data well enough to judge which range of K might be reasonable. Cross-validation can then be seen as empirical evaluation (or trial and error ;) ). You come up with some values of K, run the algorithm a few times and then pick the best K.

However, if you do not have a clue about the number of clusters (e.g. the range of Ks is huge or unpredictable or varies a lot or ...), you may want to choose an algorithm that avoids the problem of picking the number of clusters up front, e.g. hierarchical clustering. This, as always in data mining, depends on your data and your problem.

Robert Patton

Goran, do you know how much data MCLUST can handle?

Mahnaz Behroozi

Thanks to all. But unfortunately it seems that some of our friends have misunderstood the question! I'm talking about the K-nearest neighbor classification algorithm, not the K-means or C-means clustering methods. They are quite different methods!

Hashim Yasin

The optimum value of K depends entirely on the data you are using, and it may vary between scenarios. Cross-validation can be used to find the optimum K. It is more or less a trial-and-error method; otherwise you have to calculate the probability or likelihood of the data for each value of K.

Gianluca Bontempi

Techniques of local cross-validation for nearest neighbours are discussed in my PhD thesis:

Thesis: Local Learning Techniques for Modeling, Prediction and Control

and in some of my papers. See for instance the NIPS paper:

Article: Lazy Learning Meets the Recursive Least Squares Algorithm

and the following article, together with the references inside:

Article: Lazy Learners at work: the Lazy Learning Toolbox

https://www.researchgate.net/publication/2296344_Lazy_Learners_at_work_the_Lazy_Learning_Toolbox?ev=prf_cont_person_publish_publication_object

The lazy learning algorithm is available in the R package lazy on CRAN:

https://cran.r-project.org/web/packages/lazy/index.html

Fuad M. Alkoot

Yes, the square root usually yields the best results; however, setting k=1 can yield improved results. An important factor is your distance metric.

After so many years of using it, it has become common sense to me, and I have lost track of where it was originally suggested.

A good book is "Introduction to Statistical Pattern Recognition" by Fukunaga, if you are looking for an analytical and mathematical perspective rather than a descriptive one.

Alvaro Rodriguez

The question is related to the distribution of the input space, the number of classes and the separability of the output. KNN is a simple and fast technique, easy to understand and easy to implement. Sometimes performance is the critical factor; in most cases a small, empirically adjusted value of k will simply be good enough. One popular way of choosing the empirically optimal k is the bootstrap method.

For complex problems, I would also test different classification techniques such as SVM or ANN.

A useful reference: Hall P, Park BU, Samworth RJ (2008). "Choice of neighbor order in nearest-neighbor classification". Annals of Statistics 36 (5): 2135–2152. doi:10.1214/07-AOS537

Pau Climent i Pérez

Also, you could consider more advanced clustering methods, similar to k-Means, in which the k or number of clusters is determined by the algorithm automatically. A GNG might do that for you, for instance.

Hiranmay Ghosh

Have you tried canopy clustering?

http://en.wikipedia.org/wiki/Canopy_clustering_algorithm

Pedro L Galindo

There are two main approaches.

The first is to use some kind of validation process (cross-validation, leave-one-out, ...) for K=1, 2, 3, ... As K increases, the error usually goes down, then stabilizes, and then rises again. Set the optimum K at the beginning of the stable zone.

The second approach is to obtain the resubstitution error (using the same data for training and test) for K=1, 2, 3, 4, ... and apply some penalization criterion (AIC, BIC, etc.), considering K to be the number of free parameters.

Both approaches should give approximately the same value, that corresponds to the optimal value of K.
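One possible reading of "the beginning of the stable zone" is the smallest K whose validation error is within a small tolerance of the best error seen. The sketch below encodes that reading; the tolerance and the error values are made-up numbers for illustration only, not measurements.

```python
def start_of_stable_zone(errors, tolerance=0.01):
    """Smallest K whose validation error is within tolerance of the best error."""
    best = min(errors.values())
    return min(k for k, err in errors.items() if err <= best + tolerance)


# Hypothetical validation errors: falling, flat, then rising again.
errors = {1: 0.21, 3: 0.15, 5: 0.121, 7: 0.12, 9: 0.118, 11: 0.125, 15: 0.16}
chosen = start_of_stable_zone(errors)
print(chosen)
```

Here the lowest error is at K = 9, but K = 5 is already inside the tolerance band, so it is taken as the start of the flat region.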

Michael Rothman

There are different algorithmic approaches to selecting K, but the real impetus should be what you are trying to learn from the analysis.

There is no correct number of classes.

Select a K such that the clusters you obtain are interpretable and meaningful.

As some others note, this is often a matter of trial and error... or, more correctly, successive experimentation.

Piotr Artiemjew

An interesting discussion.

As was mentioned before, there is no single proper method of estimating k; it depends on the data type, the structure and density of the data, and the amount of data you have.

The ideal situation is when you have a large enough data set: you can split it in half, estimate k on the first half (using one of the methods listed below), and then validate it on the second half.

With a fixed metric, and depending on the data, you can use different methods:

for a large enough data set:

k-fold cross-validation,

for a smaller one:

Monte Carlo cross-validation,

bagging (multiple bootstrap).

I cannot resist also suggesting that you take a few k values around the optimal one and use a committee of classifiers.
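The committee idea could look roughly like this: several KNN classifiers that differ only in k each cast one vote, and the majority label wins. The function name, the toy training set and the choice of k values are assumptions made for the sketch.

```python
import math
from collections import Counter


def committee_predict(train, query, neighbor_counts):
    """Majority vote over KNN classifiers that differ only in the number of neighbours."""
    # Rank the training points once; each committee member uses a prefix.
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter()
    for k in neighbor_counts:
        labels = Counter(label for _, label in ranked[:k])
        votes[labels.most_common(1)[0][0]] += 1
    return votes.most_common(1)[0][0]


# Hypothetical training set: three "a" points and four "b" points.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((3, 3), "b"), ((3, 4), "b"), ((4, 3), "b"), ((4, 4), "b")]

print(committee_predict(train, (0.5, 0.5), [1, 3, 5]))
print(committee_predict(train, (3.5, 3.5), [1, 3, 5]))
```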

Thales Sehn Körting

I have put some basic rules that I found in Theodoridis into a video about the kNN algorithm; please take a look at:

http://youtu.be/UqYde-LULfs

Best regards

Nancy Sarah Yacovzada

The spectral properties of your adjacency matrix can tell you a lot about number of clusters and the nature of the input data.

You just need to compute the eigenvalues of the Laplacian matrix (of your adjacency / similarity matrix) and the spectral algorithm is able to estimate the cluster number correctly and reveal natural grouping of the input data/patterns by the smallest eigenvalues.

That's magical - the eigenvalues contain useful information about the natural grouping of the data.

Theodoros Thodoris Anagnostopoulos

A lot more could be said, but in the end you should evaluate the K value through a set of training-test evaluations. Note that the K value may vary from dataset to dataset, even for the same conceptual model. Further analysis could include data cleaning and preprocessing, while a feature selection algorithm should capture the relevance of each specific attribute and of the whole set of instances, which in turn has an impact on K and on the classification result.

Behrouz Ahmadi-Nedushan

K-fold cross-validation (CV) or leave-one-out cross-validation (LOOCV) can be used effectively. I used LOOCV in my article (KNN and optimization):

https://www.researchgate.net/publication/257392360_An_optimized_instance_based_learning_algorithm_for_estimation_of_compressive_strength_of_concrete

Article An optimized instance based learning algorithm for estimatio...
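A minimal leave-one-out sketch for KNN classification, given here only as a generic illustration and not as the procedure used in the linked article (which concerns estimation of concrete strength). The function names and the nine-point toy dataset, which contains one deliberately mislabelled point, are assumptions.

```python
import math
from collections import Counter


def knn_predict(train, query, n_neighbors):
    """Majority vote among the n_neighbors training points closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def loocv_error(data, n_neighbors):
    """Fraction of points misclassified when each is predicted from all the others."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # leave point i out
        wrong += knn_predict(rest, x, n_neighbors) != y
    return wrong / len(data)


# Hypothetical data: two squares of points plus one noisy "b" inside class "a".
data = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((1, 1), "a"),
        ((0.5, 0.5), "b"),
        ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"), ((6, 6), "b")]

errors = {k: loocv_error(data, k) for k in (1, 3, 5)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```

On this toy set the single noisy point ruins k = 1 (every "a" point has it as nearest neighbour), while k = 3 and k = 5 misclassify only the noisy point itself.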

Fuad M. Alkoot

The optimum K depends on your metric. However, a general rule of thumb is the square root of the number of samples.

Ali A. Amer

You can allow the data to help resolve this dilemma by running the cross-validation. In other words, try various values of k with different randomly selected training sets and pick the k value that effectively minimizes the classification or estimation error.

Alok Pandey

A prescriptive choice of k as the square root of n is mentioned in the article by Lall and Sharma (1996). The optimal k can also be obtained using error metrics (e.g. the generalized cross-validation score).

Federico Amato

One very fast and effective way of doing it is by using the k-fold cross validation to minimize the validation error. To better understand this, a good reference is:

http://math.univ-lille1.fr/~celisse/Papers/knn_celisse_maryhuard.pdf

Moreover, using k-fold cross-validation, the error as a function of k can be explored. If this function has a local minimum, then we can state that the data are structured/correlated and the cross-validation error is smaller than the variance of the raw data. This means that if the data are predictable, the uncertainty decreases. To learn more about this, a good reading is: Article Machine learning for spatial environmental data. Theory, app...

Anastasia Widiarti

Maybe you can read these papers:

Choice of Neighbor Order in Nearest-Neighbor Classification, by Peter Hall, Byeong U. Park, and Richard J. Samworth, in The Annals of Statistics, 2008, Vol. 36, No. 5, 2135-2152,

~or~

Ahmad Basheer Hassanat, Mohammad Ali Abbadi, Ghada Awad Altarawneh, Ahmad Ali Alhasanat, 2014, Solving the Problem of the K Parameter in the KNN Classifier Using an Ensemble Learning Approach, in (IJCSIS) International Journal of Computer Science and Information Security, Vol. 12, No. 8, August 2014

Hardik Gupts

You can plot the graph of WCSS (Within-Cluster Sum of Squares), which is also called an elbow graph because of its shape. After plotting the graph, you just have to find the point that resembles the elbow of a bent arm. Note the X-coordinate of that point; that will be the optimal value of K for K-Means clustering.

I have attached a file showing the plot fitted on a particular dataset. Here you can see that the elbow point occurs at n=5; therefore, we set n_clusters=5.

Mohammadreza Zandehshahvar

The first way to find k is to use cross-validation. As you get more training data, it is better to increase k and trust your training data to generalize your classifier. However, increasing k too far results in under-fitting, while choosing a lower k results in a more complex decision boundary and an overfitting problem. So you can choose a range of k and find the best value using cross-validation.

If you have no idea about the range of k values and you want to check almost all of them to find the best value, trying all possible values might not be a good solution. In this case one can use Bayesian Optimization to optimize the k value. There is no guarantee to get the optimum value using Bayesian Optimization, however, you will significantly reduce your search space and the resulting classifier will be close to the optimum that you can get using grid search among all possible k values.

Dr R Senthilkumar

If you have a 1D matrix of distances to the neighbours, take min(EuclideanDistance); if it is a 2D matrix, take min(min(EuclideanDistance)). If you want the K nearest vectors, sort the distances in ascending order and choose the first K values.
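The "sort ascending and take the first K" step can be sketched in a few lines of Python. The function name and the sample points are hypothetical; heapq.nsmallest from the standard library returns the same result as a full sort followed by slicing.

```python
import heapq
import math


def k_nearest(points, query, k):
    """The k points with the smallest Euclidean distance to query, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


# Hypothetical sample points.
points = [(0, 0), (3, 4), (1, 1), (10, 10)]
print(k_nearest(points, (0, 0), 2))
```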
