Iris Dataset Analysis using PCA (Principal Component Analysis)

Title: Iris Dataset Analysis using PCA (Principal Component Analysis)

Author: Antonio Lorenzo

Subject: Machine learning

Language: English

In this project, we explore the famous Iris dataset, a classic example in machine learning, by visualizing its features and then applying Principal Component Analysis (PCA) to reduce dimensionality and gain insights. Below, we walk through each step of the code and its corresponding graphical output.

Loading the Iris Dataset

First, we import the necessary libraries and load the Iris dataset:

from sklearn import datasets           # built-in example datasets, including Iris
import matplotlib.pyplot as plt        # plotting
import matplotlib.patches as mpatches  # patches for building a custom legend

The Iris dataset is a simple dataset that includes data about three species of iris flowers: Setosa, Versicolor, and Virginica. For each species, the dataset contains four measurements:

  • Sepal length
  • Sepal width
  • Petal length
  • Petal width

We load this dataset with the following command:

iris = datasets.load_iris()

Once loaded, we can access the features and labels (target species) using iris.data and iris.target respectively:

iris.data   # This gives us the 4 feature measurements for each iris flower
iris.target # The target specifies the species of each sample (Setosa, Versicolor, Virginica)

Additionally, we can see the names of the target classes using:

iris.target_names

This gives us the labels as Setosa, Versicolor, and Virginica.
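
As a quick sanity check, we can print the shapes of these arrays and the class names; a minimal sketch, with the values returned by load_iris shown in the comments:

print(iris.data.shape)    # (150, 4): 150 flowers, 4 measurements each
print(iris.target.shape)  # (150,): one integer label (0, 1 or 2) per flower
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']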

Visualization of Sepal Dimensions

Next, we visualize the relationship between the sepal length and sepal width using a scatter plot. We assign the color of each point based on its species:

x = iris.data[:,0]  # Sepal length
y = iris.data[:,1]  # Sepal width
species = iris.target

We then compute axis limits with a small margin so that all data points fit comfortably inside the plot:

x_min, x_max = x.min() - .5, x.max() + .5
y_min, y_max = y.min() - .5, y.max() + .5

Finally, we generate the scatter plot with appropriate labels and limits:

plt.figure()
plt.title('Iris Data - Sepal Dimensions Classification')
plt.scatter(x, y, c=species)  # Scatter plot with colors based on species
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

This graph displays the distribution of flowers based on their sepal dimensions, colored according to their species.

[Figure: scatter plot of sepal length vs. sepal width, coloured by species]
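
The mpatches import from earlier can be used to attach a legend that maps each colour to a species name. A minimal sketch, replacing the plain c=species call above; the three colour names are an illustrative choice, not part of the original code:

colors = ['navy', 'turquoise', 'darkorange']  # one colour per species (illustrative)
plt.scatter(x, y, c=[colors[s] for s in species])
handles = [mpatches.Patch(color=c, label=name)
           for c, name in zip(colors, iris.target_names)]
plt.legend(handles=handles)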

Visualization of Petal Dimensions

Next, we repeat the same process, but now we visualize the petal length and petal width:

x = iris.data[:,2]  # Petal length
y = iris.data[:,3]  # Petal width
species = iris.target

We recompute the axis limits for the petal measurements (so that the limits calculated for the sepal plot are not reused) and plot the results:

x_min, x_max = x.min() - .5, x.max() + .5
y_min, y_max = y.min() - .5, y.max() + .5

plt.figure()
plt.title('Iris Data - Petal Dimensions Classification', size=14)
plt.scatter(x, y, c=species)
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

This graph shows the petal dimension distribution, which offers a clearer separation between the three species compared to sepal dimensions.

[Figure: scatter plot of petal length vs. petal width, coloured by species]
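
Since the sepal and petal plots share the same structure, the plotting code can also be factored into a small helper. A minimal sketch; the plot_iris_scatter name is hypothetical:

def plot_iris_scatter(x, y, species, title, xlabel, ylabel):
    # Scatter two iris features against each other, coloured by species
    plt.figure()
    plt.title(title, size=14)
    plt.scatter(x, y, c=species)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xlim(x.min() - .5, x.max() + .5)
    plt.ylim(y.min() - .5, y.max() + .5)
    plt.xticks(())
    plt.yticks(())
    plt.show()

plot_iris_scatter(iris.data[:,2], iris.data[:,3], iris.target,
                  'Iris Data - Petal Dimensions Classification',
                  'Petal Length', 'Petal Width')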

Applying PCA for Dimensionality Reduction

With the dataset loaded and visualized, we apply Principal Component Analysis (PCA), a technique to reduce the dimensionality of the data while retaining most of its variance. PCA transforms the data into a new set of orthogonal components called principal components.

We use the PCA class from sklearn.decomposition:

from sklearn.decomposition import PCA

We reduce the data to 3 principal components:

x_reduced = PCA(n_components=3).fit_transform(iris.data)
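
Since the goal is to retain most of the variance, it is worth checking how much each component actually explains. An equivalent way to fit the PCA that keeps the fitted object around, so its explained_variance_ratio_ attribute can be inspected (a minimal sketch; for the Iris data the first component alone accounts for roughly 92% of the variance, and three components for about 99%):

pca = PCA(n_components=3)
x_reduced = pca.fit_transform(iris.data)

print(pca.explained_variance_ratio_)        # variance fraction captured by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained by the 3 components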

Visualizing the Data in 3D

Finally, we visualize the transformed data in a 3D scatter plot, with each point representing one iris flower and its position determined by the first three principal components:

from mpl_toolkits.mplot3d import Axes3D  # enables the 3D projection (only needed on older Matplotlib versions)

We generate the 3D plot as follows:

fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # create a 3D set of axes attached to the figure
ax.set_title('Iris Dataset by PCA', size=14)
ax.scatter(x_reduced[:,0], x_reduced[:,1], x_reduced[:,2], c=species)
ax.set_xlabel('First eigenvector')
ax.set_ylabel('Second eigenvector')
ax.set_zlabel('Third eigenvector')
ax.set_xticks([])  # hide ticks, as in the 2D plots
ax.set_yticks([])
ax.set_zticks([])
plt.show()

This 3D graph illustrates how the PCA has transformed the data into a new space where the three species can be visually distinguished based on their principal components.

[Figure: 3D scatter plot of the first three principal components, coloured by species]
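
For comparison, the same reduction can be done with only two components, which already separates the species reasonably well and can be shown on an ordinary 2D scatter plot; a minimal sketch:

x_reduced_2d = PCA(n_components=2).fit_transform(iris.data)

plt.figure()
plt.title('Iris Dataset by PCA (2 components)')
plt.scatter(x_reduced_2d[:,0], x_reduced_2d[:,1], c=species)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()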

Conclusion

In this project, we explored the Iris dataset using both 2D and 3D visualizations. Plotting the sepal and petal dimensions already reveals some differentiation between the species, with the petal measurements separating them more cleanly. Applying PCA then allowed us to reduce the data from four features to three components while retaining most of the variance, giving a view in which the three species remain clearly distinguishable. This showcases the power of PCA for dimensionality reduction and for visualizing higher-dimensional datasets.