Simple Polynomial Regression

Title: Simple Polynomial Regression

Author: Antonio Lorenzo

Subject: Machine learning

Language: English

This project demonstrates polynomial regression for predicting housing prices from the Boston dataset. Polynomial regression models the relationship between the independent variable (X) and the dependent variable (Y) as an n-th degree polynomial. It is still a form of linear regression, because the model is linear in its coefficients: the powers of X are treated as additional input features.

In this project, we will explore and implement a model that uses polynomial regression to predict housing prices based on one feature of the dataset (number of rooms).
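To illustrate why polynomial regression is still a linear model, here is a minimal sketch using only NumPy. The data is made up for illustration (the coefficients 1, 2 and 3 are assumptions, not values from the Boston dataset). The powers of x become columns of a design matrix, and ordinary least squares recovers the coefficients.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1 + 2x + 3x^2
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with one column per polynomial term: [1, x, x^2]
A = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares: the problem is linear in the coefficients
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)  # approximately [1. 2. 3.]
```

The same idea underlies the scikit-learn workflow in the rest of this project: expand the feature first, then fit a linear model.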

Importing Libraries

We start by importing the necessary libraries:

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

NumPy: Used for handling arrays and performing numerical operations.
Matplotlib: Useful for creating visualizations such as scatter plots.
scikit-learn: A library with tools for machine learning, including regression models and datasets.

Loading the Dataset

We use the Boston Housing dataset, a popular dataset used for regression tasks in machine learning. It contains data on various factors influencing housing prices in Boston, such as the number of rooms, crime rates, etc.

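# Load the Boston dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call requires scikit-learn < 1.2
boston = datasets.load_boston()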

The dataset is now loaded and will be used for our analysis.

Understanding the Data

To understand the structure of the dataset, we can inspect its contents and the number of available data points:

# Check the information in the dataset
print(boston.keys())

# Check the number of data points in the dataset
print("Data shape:")
print(boston.data.shape)

boston.keys(): Displays all the keys or metadata in the dataset, such as the feature names and target labels.
boston.data.shape: Shows the number of data points and features. The dataset has 506 instances with 13 features.

Preparing the Data for Polynomial Regression

Next, we select one feature for our polynomial regression model. In this case, we choose the average number of rooms per dwelling (the RM feature). It is the sixth column of the data, so its zero-based index is 5.

# Select the feature (number of rooms) for regression
X_p = boston.data[:, np.newaxis, 5]

# Target values (house prices)
y_p = boston.target

Here:

X_p contains the feature of interest (number of rooms).
y_p contains the target variable (housing prices).
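The np.newaxis index matters because scikit-learn estimators expect a 2-D feature array of shape (n_samples, n_features). The sketch below uses a small made-up array as a stand-in for boston.data (an assumption for illustration) to show the resulting shapes.

```python
import numpy as np

# Hypothetical stand-in for boston.data: 4 samples, 13 features
data = np.arange(4 * 13, dtype=float).reshape(4, 13)

col_1d = data[:, 5]              # plain column selection: shape (4,)
col_2d = data[:, np.newaxis, 5]  # keeps a feature axis: shape (4, 1)
col_alt = data[:, [5]]           # equivalent spelling with a list index

print(col_1d.shape, col_2d.shape, col_alt.shape)
```

Both 2-D forms hold the same values, so the list-index form is an alternative way to write the selection.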

Visualizing the Data

We plot the data to visualize the relationship between the number of rooms and housing prices:

# Scatter plot of the data
plt.scatter(X_p, y_p)
plt.show()

This plot helps us see whether the relationship between the number of rooms and housing prices is non-linear. If the points follow a curve rather than a straight line, polynomial regression may fit better than a simple linear model.

Implementing Polynomial Regression

We now implement the polynomial regression model. The data is split into training and testing sets for evaluation purposes:

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(X_p, y_p, test_size=0.2)

We split 80% of the data for training and 20% for testing.
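Because train_test_split shuffles the data, each run produces a different split unless a seed is fixed. The sketch below is not part of the original project: it uses made-up data with the same number of rows as the Boston dataset and passes random_state (the value 42 is an arbitrary assumption) to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with 506 rows, matching the size of the Boston dataset
X = np.arange(506, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()

# Fixing random_state makes the shuffle, and so the split, reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))  # same seed, same test rows
```

A fixed seed makes the evaluation scores comparable between runs, which helps when trying different polynomial degrees.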

Next, we define the polynomial features. We choose a polynomial of degree 2 for this project:

# Define polynomial features
from sklearn.preprocessing import PolynomialFeatures
poli_reg = PolynomialFeatures(degree=2)

The model will now consider quadratic terms in addition to the linear terms, allowing it to fit non-linear data more effectively.

Transforming the Features

The original feature is expanded into polynomial terms. The transformer is fitted on the training data and then applied, without refitting, to the test data:

# Transform the training and testing data to include polynomial terms
X_train_poli = poli_reg.fit_transform(X_train_p)
X_test_poli = poli_reg.transform(X_test_p)

This step generates additional polynomial terms from the original feature, making the data suitable for polynomial regression.
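To make the transformation concrete, the following sketch applies PolynomialFeatures with degree 2 to two made-up room counts (the input values are assumptions for illustration). Each input value x becomes the row [1, x, x^2].

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two hypothetical samples with a single feature each
X_small = np.array([[2.0], [3.0]])

poly = PolynomialFeatures(degree=2)
X_expanded = poly.fit_transform(X_small)
print(X_expanded)
# [[1. 2. 4.]
#  [1. 3. 9.]]

# New data reuses the fitted transformer through transform()
X_new = poly.transform(np.array([[4.0]]))
print(X_new)
# [[ 1.  4. 16.]]
```

The leading column of ones is the constant (bias) term that PolynomialFeatures includes by default.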

Defining the Model

We use a Linear Regression model from scikit-learn to fit the polynomial data:

# Define the linear regression model
pr = linear_model.LinearRegression()

Training the Model

The model is trained using the polynomial features:

# Train the model with the polynomial features
pr.fit(X_train_poli, y_train_p)

Making Predictions

We can now make predictions based on the test data:

# Make predictions on the test data
Y_pred_poli = pr.predict(X_test_poli)

Visualizing the Results

To visualize how well the model fits the data, we plot the predicted values alongside the actual values:

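# Plot the test data and the polynomial regression model
# Sort by number of rooms so the fitted curve is drawn left to right
order = X_test_p[:, 0].argsort()
plt.scatter(X_test_p, y_test_p)
plt.plot(X_test_p[order], Y_pred_poli[order], color="red", linewidth=2)
plt.show()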

The red line represents the polynomial regression model, and the scatter points are the actual data. This gives us an idea of how well the model fits the data.

Model Coefficients

Next, we examine the model's coefficients:

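# Display the model's coefficients
print('Polynomial Regression Model Coefficients:')
print('Coefficients (constant, linear, quadratic terms):', pr.coef_)
print('Intercept:', pr.intercept_)

For a degree-2 model, pr.coef_ holds three values, one per column produced by PolynomialFeatures: the constant column, the number of rooms, and its square. The weight of the constant column is typically 0, because LinearRegression fits the intercept separately and reports it in pr.intercept_. The other two values are the linear and quadratic coefficients of the fitted curve.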
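As a check on how the coefficients combine, the sketch below fits the same kind of model on made-up, noise-free data (the generating curve 5 - 2x + 0.5x^2 and the variable names are assumptions for illustration) and rebuilds the predictions by hand from intercept_ and coef_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data from y = 5 - 2x + 0.5x^2
rng = np.random.default_rng(0)
x = rng.uniform(3.0, 9.0, size=50).reshape(-1, 1)
y = 5.0 - 2.0 * x.ravel() + 0.5 * x.ravel() ** 2

poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(x), y)

# coef_ has one entry per expanded column: [constant, x, x^2]
b0 = model.intercept_
_, b1, b2 = model.coef_

# Evaluate the fitted polynomial manually
manual = b0 + b1 * x.ravel() + b2 * x.ravel() ** 2
print(np.allclose(manual, model.predict(poly.transform(x))))
```

Writing the curve out this way makes it clear that the fitted model is intercept + b1 * rooms + b2 * rooms^2.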

Evaluating the Model

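Finally, we evaluate the model using the R-squared score, which measures how much of the variance in the target the model explains. We compute it on the test set, because a score on the training data only shows how well the model fits data it has already seen:

# Evaluate the model on the held-out test data
print("Model R-squared on test data:")
print(pr.score(X_test_poli, y_test_p))

The R-squared value is the proportion of variance explained by the model. A score closer to 1 means a better fit, and the score can be negative when the model performs worse than always predicting the mean.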
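For reference, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper named r_squared and made-up values. It is not the scikit-learn implementation, only an illustration of the definition.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets with mean 6

perfect = r_squared(y_true, y_true)                  # 1.0
mean_only = r_squared(y_true, [6.0, 6.0, 6.0, 6.0])  # 0.0
reversed_ = r_squared(y_true, [9.0, 7.0, 5.0, 3.0])  # -3.0
print(perfect, mean_only, reversed_)
```

The three cases show the scale: 1 for perfect predictions, 0 for always predicting the mean, and a negative value for predictions worse than the mean.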

Conclusion

In this project, we implemented a polynomial regression model to predict housing prices from the number of rooms in the Boston dataset. By visualizing the data, training the model, and evaluating its performance, we showed how polynomial regression can capture a non-linear relationship while still using a linear model.