Visual K-Nearest Neighbors (KNN) Classification

Title: Simple classifier using K-Nearest Neighbors (KNN)

Author: Antonio Lorenzo

Subject: Machine learning

Language: English

This project implements a K-Nearest Neighbors (KNN) classifier on the well-known Iris dataset. The Iris dataset is often used for machine learning tasks because it contains labeled data for three species of iris flowers, with four features measured for each sample: sepal length, sepal width, petal length, and petal width. Here, we'll walk through the entire process, from loading the data to training the model and visualizing the results.

Importing Libraries and Loading Data

import numpy as np
from sklearn import datasets

We begin by importing numpy for numerical operations and the datasets module from sklearn, which provides the Iris dataset.

np.random.seed(0)
iris = datasets.load_iris()

We set a random seed so that the shuffle performed later is reproducible, then load the Iris dataset. It contains 150 samples of iris flowers, 50 for each of the three species.

Data Preparation

x = iris.data
y = iris.target

Here, x contains the features (sepal length, sepal width, petal length, petal width), and y contains the labels as integers 0, 1, and 2, which stand for the species Setosa, Versicolor, and Virginica.

i = np.random.permutation(len(iris.data))
x_train = x[i[:-10]]
y_train = y[i[:-10]]
x_test = x[i[-10:]]
y_test = y[i[-10:]]

We build a random permutation of the 150 sample indices and use it to split the data into a training set (x_train, y_train) and a test set (x_test, y_test). The training set takes the first 140 shuffled samples; the last 10 are reserved for testing.
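The same kind of split can also be produced with scikit-learn's train_test_split helper. The following is a minimal sketch, not the method used in this project: the choice of random_state=0 is an assumption made here for repeatability, and the resulting split will not contain the same rows as the manual permutation above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x, y = iris.data, iris.target

# An integer test_size is read as an absolute number of samples to hold out.
# random_state fixes the shuffle so the split is repeatable (value chosen arbitrarily).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=10, random_state=0
)

print(x_train.shape, x_test.shape)  # (140, 4) (10, 4)
```

Either approach leaves 140 samples for training and 10 for testing; the helper mainly saves the index bookkeeping.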

Training the KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

We import KNeighborsClassifier from sklearn, which implements a supervised learning algorithm. The classifier finds the 'k' training points nearest to a test sample and assigns the sample to the majority class among those neighbors.
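To make the majority-vote idea concrete, here is a minimal from-scratch sketch. It is an illustration only, not scikit-learn's implementation: the helper name knn_predict, the toy data, and the use of plain Euclidean distance are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, sample, k=5):
    """Hypothetical helper: label one sample by majority vote of its k nearest neighbors."""
    # Euclidean distance from the sample to every training point
    distances = np.linalg.norm(x_train - sample, axis=1)
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters labeled 0 and 1
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_x, toy_y, np.array([0.3, 0.3]), k=3))  # 0
print(knn_predict(toy_x, toy_y, np.array([4.8, 5.0]), k=3))  # 1
```

Note that there is no real training step in this sketch: the "model" is just the stored training data, and all the work happens at prediction time.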

knn = KNeighborsClassifier()
knn.fit(x_train,y_train)

Here, we initialize the KNeighborsClassifier with its default settings (which use k = 5 neighbors) and train it with the fit() method on the training data (x_train and y_train).
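A fitted classifier can also report which training samples it considers nearest to a query point, which helps when checking why a prediction came out the way it did. The sketch below fits on the whole dataset purely for brevity and passes n_neighbors=5 explicitly; the variable name query is chosen here for illustration.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# n_neighbors=5 matches scikit-learn's default; it is written out to make k visible
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris.data, iris.target)

# Ask for the 5 training samples closest to one query flower
query = iris.data[:1]
distances, indices = knn.kneighbors(query)

print(indices.shape)         # (1, 5): five neighbor indices for the single query
print(iris.target[indices])  # class labels of those neighbors
```

Because the query here is itself a training sample, its nearest neighbor is at distance zero.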

Making Predictions

knn.predict(x_test)

We use the trained model to predict the species of the iris flowers in the test set.

y_test

This line outputs the actual species labels of the test samples. Comparing them element by element with the predictions shows how many of the 10 test samples the classifier got right.
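Comparing two arrays by eye works for ten samples, but the comparison can also be computed. The following sketch repeats the split from above so that it runs on its own, then uses accuracy_score from sklearn.metrics; the variable names y_pred, n_correct, and accuracy are introduced here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the same shuffled split as in the walkthrough
np.random.seed(0)
iris = datasets.load_iris()
x, y = iris.data, iris.target
i = np.random.permutation(len(x))
x_train, y_train = x[i[:-10]], y[i[:-10]]
x_test, y_test = x[i[-10:]], y[i[-10:]]

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

# Count matching labels, then express the same thing as a fraction
n_correct = int((y_pred == y_test).sum())
accuracy = accuracy_score(y_test, y_pred)
print(f"{n_correct} of {len(y_test)} correct, accuracy = {accuracy:.2f}")
```

With only 10 test samples, each prediction moves the accuracy by 10 percentage points, so the figure is a rough indication rather than a precise estimate.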

Visualizing the Results

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

We import matplotlib.pyplot for plotting and ListedColormap to set custom colors for the visualization.

iris = datasets.load_iris()
x = iris.data[:,:2] # features: sepal length (plot x-axis) and sepal width (plot y-axis)
y = iris.target # class labels (species), used to color the points

We reload the Iris dataset, but this time we only select the first two features (sepal length and sepal width) to simplify the visualization.

x_min, x_max = x[:,0].min() - .5, x[:,0].max() + .5
y_min, y_max = x[:,1].min() - .5, x[:,1].max() + .5

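We set the boundaries for our plot from the minimum and maximum values of sepal length (horizontal axis) and sepal width (vertical axis), with a margin of 0.5 on each side so that no point sits on the edge.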

cmap_light = ListedColormap(['#AAAAFF','#AAFFAA','#FFAAAA'])
h = .02
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max,h))

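We create a mesh grid that covers the plot area with a step size of h = 0.02; classifying every grid point will reveal the decision boundaries. The ListedColormap defines three light background colors, one for each region assigned to an iris species.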

knn = KNeighborsClassifier()
knn.fit(x, y)
Z = knn.predict(np.c_[xx.ravel(),yy.ravel()])
Z = Z.reshape(xx.shape)

We train a new KNeighborsClassifier on all 150 samples, using only the two sepal features, and use it to predict the class for each point in the mesh grid. The flat array of predictions is then reshaped to match the grid structure for plotting.
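The flatten, predict, and reshape round trip above is easier to follow on a tiny grid. In this sketch the grid values are made up and fake_labels is a stand-in for the output of knn.predict, so that the example needs nothing but numpy.

```python
import numpy as np

# A toy grid: x takes the values 0, 1, 2 and y takes the values 10, 11
xx, yy = np.meshgrid(np.arange(0, 3, 1), np.arange(10, 12, 1))
print(xx.shape)  # (2, 3): one row per y value, one column per x value

# Flatten both arrays and pair them column-wise: one (x, y) row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]
print(grid_points.shape)  # (6, 2)

# Stand-in for knn.predict: any function that returns one value per grid point
fake_labels = (grid_points[:, 0] > 0).astype(int)

# Reshape so each value sits at its grid position, as pcolormesh expects
Z = fake_labels.reshape(xx.shape)
print(Z)  # [[0 1 1]
          #  [0 1 1]]
```

The classifier only ever sees a plain list of (x, y) rows; the reshape is what turns its answers back into an image.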

plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

We plot the decision boundaries using the pcolormesh method, which fills the mesh grid with colors representing the predicted class at each point.

plt.scatter(x[:,0],x[:,1],c=y)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())

Finally, we plot the actual data points on top of the decision regions using scatter, where each point's color corresponds to its species. The xlim and ylim functions restrict the axes to the range covered by the mesh grid.

Conclusion

In this project, we implemented a K-Nearest Neighbors classifier for the Iris dataset: first on all four features with a 10-sample hold-out test set, then on sepal length and width alone so that the decision boundaries could be visualized with Matplotlib. The plot shows how the KNN algorithm divides the feature space into regions whose boundaries are not restricted to straight lines, which is what makes KNN suitable for non-linear classification problems.