k-Nearest Neighbors (kNN) is a simple and widely used classification algorithm. It is a supervised learning algorithm that can also be used for regression problems.
kNN is non-parametric, meaning it makes no assumptions about the underlying distribution of the data. It is also an instance-based algorithm: rather than learning an explicit model, it memorizes the training instances and uses them directly at prediction time.
How does it work?
As a classifier, kNN predicts a class (a discrete value) for a new instance based on the labeled instances it has stored. The algorithm needs three key elements: a set of labeled data, a distance measure between instances, and a value of k, which specifies how many nearest neighbors are consulted when making a prediction.
How to make predictions using kNN?
To decide which class an unknown object belongs to, compute the distance from that object to every labeled object, using their feature values. The k labeled objects with the smallest distances are its nearest neighbors, and the unknown object is assigned the class that is most common among them (a majority vote).
For real-valued inputs, the distance measure most commonly used is Euclidean distance. It is calculated as the square root of the sum of squared differences between two points, where one is the new point to be classified and the other is an existing labeled point.
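The prediction procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not an optimized implementation: the function names (euclidean_distance, knn_predict) and the toy data are hypothetical, and ties are resolved simply by whichever class Counter returns first.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Square root of the sum of squared differences between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled points."""
    # Distance from the new point to every labeled point
    distances = [
        (euclidean_distance(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k smallest distances
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among the k neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in 2D
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction = knn_predict(train_points, train_labels, (2.0, 2.0), k=3)
print(prediction)
```

Because every labeled point is compared against the new point, the cost of a single prediction grows with the size of the training set.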
Other commonly used distance measures include Manhattan distance, Minkowski distance, and Hamming distance (for categorical inputs).
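As a rough sketch of two well-known alternatives: Minkowski distance generalizes both Manhattan distance (p = 1) and Euclidean distance (p = 2), while Hamming distance counts mismatched positions and suits categorical inputs. The function names here are hypothetical helpers written for illustration.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


manhattan = minkowski_distance((0, 0), (3, 4), p=1)  # |3| + |4|
euclidean = minkowski_distance((0, 0), (3, 4), p=2)  # sqrt(3^2 + 4^2)
mismatches = hamming_distance("karolin", "kathrin")

print(manhattan, euclidean, mismatches)
```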
Value of k
In kNN, the choice of k strongly affects the result, and finding a good value is not easy. With a small k, individual points (including noisy or mislabeled ones) have a large influence on the prediction, so the model tends to overfit. With a large k, predictions become smoother, but local patterns can be averaged away and more neighbors must be taken into account for each prediction. A common way to choose k is the elbow method: plot the error rate for a range of k values and pick the point after which the error stops improving noticeably. Alternatively, evaluate several values of k individually and keep the one that performs best.
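One way to compare values of k is to compute a cross-validated error rate for each candidate. The sketch below assumes scikit-learn is available and uses its bundled iris dataset purely as a stand-in; the range 1 to 20 and the 5-fold split are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset; replace with your own features X and target y
X, y = load_iris(return_X_y=True)

error_rates = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(model, X, y, cv=5)
    error_rates[k] = 1.0 - scores.mean()

# The k with the lowest cross-validated error
best_k = min(error_rates, key=error_rates.get)
print(best_k, error_rates[best_k])
```

Plotting error_rates against k gives the curve used by the elbow method; picking the minimum directly, as done here, is the simpler "try each value" approach.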
When designing a machine learning model with many features, keep in mind that too many features can reduce prediction accuracy, especially if some features have no effect on the outcome or distort the influence of the other variables.
Several feature selection methods can be used to pick the most relevant variables from a large dataset.
To start building a model with multiple features, the following steps can be taken into consideration:
Step 1: Data Preprocessing
This step includes importing the libraries and the dataset, checking for missing values, and visualizing the data.
The next task is to deal with categorical data: encode it using LabelEncoder() and create dummy variables if required.
Feature scaling also needs attention. Because kNN relies on distances, features with larger numeric ranges would otherwise dominate the result and hurt accuracy.
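The preprocessing step might look roughly like the following, assuming pandas and scikit-learn. The DataFrame, its column names, and the choice to fill missing values with the column mean are all hypothetical and only serve to make the example runnable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical dataset with one missing value and one categorical feature
df = pd.DataFrame({
    "age": [25, 32, None, 51, 46, 38],
    "income": [30000, 52000, 41000, 88000, 76000, 60000],
    "city": ["Paris", "Rome", "Paris", "Berlin", "Rome", "Berlin"],
    "purchased": ["no", "no", "no", "yes", "yes", "yes"],
})

# Check for missing values, then fill them (column mean is one simple option)
print(df.isnull().sum())
df["age"] = df["age"].fillna(df["age"].mean())

# Encode the categorical target as integers: "no" -> 0, "yes" -> 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["purchased"])

# Create dummy variables for the categorical feature
X = pd.get_dummies(df[["age", "income", "city"]], columns=["city"], dtype=float)

# Scale features to zero mean and unit variance so none dominates the distance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.shape)
```

In a full workflow the scaler is usually fitted on the training split only and then applied to the test split, so that no information from the test data leaks into training.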
Step 2: Fitting the model with the training set
After the data is cleaned, select the features and the target from the finalized dataset, then split them into training and testing sets. Fit the instantiated model object on the training data using its fit() method. In scikit-learn, for example, kNN classification is provided by the KNeighborsClassifier class and kNN regression by KNeighborsRegressor.
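A minimal sketch of this step, assuming scikit-learn and using its bundled iris dataset in place of a cleaned dataset of your own. The 25% test size, the fixed random_state, and k = 5 are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features (X) and target (y); iris is only a stand-in here
X, y = load_iris(return_X_y=True)

# Split into training and testing sets, keeping class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Instantiate the model; Minkowski with p=2 is Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)

# "Fitting" a kNN model essentially stores the training instances
classifier.fit(X_train, y_train)
```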
Step 3: Predicting the output
To predict the outcome, call the predict() method on the classifier that was fitted in the previous step.
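For illustration, prediction and a basic evaluation could look like this. The sketch again assumes scikit-learn with the iris dataset as a placeholder, and repeats the split and fit so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data, split, and fitted classifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
classifier = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Predict a class for each instance in the test set
y_pred = classifier.predict(X_test)

# Compare predictions with the true labels
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
print(accuracy)
print(matrix)
```

The confusion matrix shows, per class, how many test instances were classified correctly and where the misclassifications went, which is often more informative than accuracy alone.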