Simple Linear Regression Model

Title: Simple Linear Regression Model

Author: Antonio Lorenzo

Subject: Machine learning

Language: English

In this project, I demonstrate how to build a simple linear regression model with the scikit-learn library that predicts housing prices from the average number of rooms per dwelling. The data comes from the "Boston Housing Dataset," which records several factors that influence housing prices in the Boston area.

Importing Libraries

First, we need to import the necessary libraries. These include numpy for numerical operations, matplotlib for plotting, and sklearn for accessing datasets and implementing the linear regression algorithm.

# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model

Loading and Understanding the Data

We use the Boston dataset from sklearn to train our model. This dataset contains 506 samples, each described by 13 features that capture various factors affecting housing prices.

# Preparing the Data
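# We load the Boston dataset from sklearn's library
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this call requires scikit-learn < 1.2
boston = datasets.load_boston()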

Next, we explore the structure of the dataset by looking at its keys and the data it contains. This helps us understand what the dataset is composed of.

# Understanding the Data
# Checking the information contained in the dataset
print("Dataset Information")
print(boston.keys())

We can also review the description of the dataset to get a better understanding of its contents, though it's commented out here.

# Checking the dataset description
# print(boston.DESCR)

We then verify the size and shape of the dataset, which tells us how many rows and columns are present in the data.

# Checking the amount of data in the dataset
print("Dataset Shape")
print(boston.data.shape)

Finally, we print the names of the columns, which correspond to the different features in the dataset, such as crime rate, number of rooms, and property tax.

# Checking the column names
print("Column Names:")
print(boston.feature_names)

Column Names: ['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

Preparing the Data for Linear Regression

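To perform simple linear regression, we focus on a single feature: RM, the average number of rooms per dwelling. It sits at column index 5 of the dataset (counting from zero).

# Selecting only the data from column index 5 (RM, average number of rooms)
# np.newaxis keeps X two-dimensional, with shape (n_samples, 1)
X = boston.data[:, np.newaxis, 5]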

# Defining the target data (median house value)
y = boston.target
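
The indexing expression used to select the rooms column can look opaque at first. The following sketch, run on a made-up array (the name data and its values are purely illustrative and are not taken from the Boston dataset), suggests what it does: it returns a two-dimensional column rather than a flat vector, which is the shape scikit-learn estimators expect for X.

```python
import numpy as np

# Hypothetical 4x6 array standing in for a feature matrix
data = np.arange(24, dtype=float).reshape(4, 6)

flat = data[:, 5]                 # shape (4,): a 1-D vector
column = data[:, np.newaxis, 5]   # shape (4, 1): one feature as a 2-D column
same_column = data[:, [5]]        # equivalent result, arguably easier to read

print(flat.shape, column.shape, same_column.shape)
```

Either of the two-dimensional forms should work as input to fit and predict; the list-index form is simply an alternative spelling.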

To visualize the data, we plot the number of rooms against the median home value.

# Plotting the data
plt.scatter(X,y)
plt.xlabel("Number of rooms")
plt.ylabel("Median value")
plt.show()

Implementing Simple Linear Regression

We now proceed to implement linear regression using the LinearRegression model from sklearn.

Splitting the Data

Before training the model, we split the dataset into training and test sets. This allows us to evaluate the model's performance on unseen data.

# Importing the train_test_split function
from sklearn.model_selection import train_test_split

# Splitting the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
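
Because train_test_split shuffles the rows at random, every run produces a different split, and the fitted coefficients and scores will change slightly from run to run. A minimal sketch of how the random_state parameter makes the split repeatable, using synthetic stand-in data (the arrays and the seed value 42 are arbitrary assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 50 samples, one feature
X_demo = np.arange(50, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo.ravel() + 2.0

# random_state fixes the shuffle, so both calls return the same split
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, split_b[1]))
```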

Training the Model

We define and train the linear regression model using the training data.

# Defining the linear regression algorithm
lr = linear_model.LinearRegression()

# Training the model
lr.fit(X_train, y_train)
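
For a single feature, ordinary least squares has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept is mean(y) - slope * mean(x). The sketch below computes both with numpy on made-up points. The helper fit_line is hypothetical and is not part of scikit-learn, although LinearRegression should agree with it up to floating-point error.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: sum of co-deviations divided by sum of squared x deviations
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Made-up points lying exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```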

Making Predictions

After training, we use the model to predict housing prices based on the test data.

# Making predictions
Y_pred = lr.predict(X_test)
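
For a linear model, predict simply evaluates the fitted line at each input. A small sketch on synthetic data (the names model, X_fit, and X_new are illustrative assumptions) compares the method's output with the same calculation done by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data lying on the line y = 4x - 1
X_fit = np.array([[0.0], [1.0], [2.0], [3.0]])
y_fit = 4.0 * X_fit.ravel() - 1.0

model = LinearRegression().fit(X_fit, y_fit)

X_new = np.array([[5.0], [6.5]])
by_predict = model.predict(X_new)
by_hand = X_new @ model.coef_ + model.intercept_  # slope * x + intercept

print(np.allclose(by_predict, by_hand))
```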

We also visualize the predictions by plotting the test data points alongside the regression line.

# Plotting the test data and the model's prediction line
plt.scatter(X_test, y_test)
plt.plot(X_test,Y_pred,color='red',linewidth=3)
plt.title('Simple Linear Regression')
plt.xlabel('Number of rooms')
plt.ylabel('Median value')
plt.show()

Model Evaluation

Finally, we evaluate the model by extracting key information: the slope (coefficient), the intercept, and the R² score, which measures how well the line fits the data.

Model Coefficients

The coefficient (slope) estimates how much the median house value changes for each additional room. Because the target is expressed in thousands of dollars, so is the slope. The intercept is the predicted value at zero rooms, which has no practical meaning here but fixes the vertical position of the line.

# Displaying the model's coefficients
print("Simple Linear Regression Model Details")
print('Coefficient (Slope):')
print(lr.coef_)
print('Intercept:')
print(lr.intercept_)

The regression equation derived from the model is:

# Displaying the equation of the model
print("The model's equation is:")
# coef_ is a one-element array, so we index it to print a plain number
print(f"y = {lr.coef_[0]:.3f} x + {lr.intercept_:.3f}")

Model Accuracy

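To assess the fit, we use the R² score (coefficient of determination), which is the fraction of the variance in the target that the model explains. Here it is computed on the training data, so it describes how well the line fits the data it was trained on rather than how well it generalizes to unseen data.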

# Displaying the model's accuracy
print('Model Accuracy:')
print(lr.score(X_train, y_train))
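
The value returned by score for a regressor is the coefficient of determination, R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the observations from their mean. A minimal sketch of that formula follows; r_squared is a hypothetical helper written for illustration, and the numbers are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_obs = [10.0, 20.0, 30.0, 40.0]
perfect = r_squared(y_obs, y_obs)         # predictions match exactly
baseline = r_squared(y_obs, [25.0] * 4)   # always predicting the mean
print(perfect, baseline)
```

A perfect fit gives 1, while a model that always predicts the mean gives 0, so a score between the two can be read as the share of variance explained.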

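In this case, the R² score is approximately 0.442, meaning the model explains about 44.2% of the variance in house prices in the training data. The exact value changes from run to run because the train/test split is random. The number of rooms alone therefore captures part of the variability in prices, but there is still considerable room for improvement.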
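
Since the data was split earlier so that the model could be judged on unseen samples, it can be informative to report the score on the test set as well. The sketch below does this on synthetic data, because the values here are only meant to illustrate the calls: the slope 9.0, intercept -34.0, noise level, and seeds are all arbitrary assumptions, not results from the Boston dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic one-feature data: a noisy straight line (all constants arbitrary)
rng = np.random.default_rng(0)
X_syn = rng.uniform(3.0, 9.0, size=(200, 1))
y_syn = 9.0 * X_syn.ravel() - 34.0 + rng.normal(0.0, 6.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)   # fit on the data the model has seen
test_r2 = model.score(X_te, y_te)    # fit on held-out data
test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))

print(f"train R^2={train_r2:.3f}  test R^2={test_r2:.3f}  test RMSE={test_rmse:.2f}")
```

A test score far below the training score would suggest overfitting; with a single feature and a straight line, the two are usually close.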

This project demonstrates the basic steps involved in implementing simple linear regression to predict house prices based on the number of rooms. Through this process, we explore how machine learning models can learn patterns from data and make predictions.