
Car Mileage Prediction Using Multiple Linear Regression: A Step-by-Step Guide

This project demonstrates predicting a car’s miles per gallon (MPG) using multiple linear regression. By modeling how features such as weight, horsepower, and displacement relate to MPG, it offers a beginner-friendly introduction to regression analysis with Python, pandas, and scikit-learn, providing insights for those exploring predictive modeling.

Keywords: car mileage prediction, multiple linear regression, data science project, MPG prediction, regression model tutorial


Introduction: Why Predict Car Mileage?


Fuel efficiency (measured in miles per gallon, or MPG) is a key factor for both car buyers and manufacturers. In this project, we'll predict a vehicle’s MPG using multiple linear regression, a powerful technique for modeling the relationship between multiple independent variables and a dependent variable. This guide is designed for data science professionals and beginners looking to understand how multiple features affect MPG and how to build a regression model to predict it.



What is Multiple Linear Regression?

Multiple linear regression is an extension of simple linear regression that models the relationship between two or more independent variables (e.g., car weight, horsepower, displacement) and a dependent variable (MPG). The equation is:

MPG = β₀ + β₁ × weight + β₂ × horsepower + β₃ × displacement + …

Our goal is to find the best-fit line (or hyperplane) that minimizes prediction errors across multiple variables.
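To make the equation concrete, here is a tiny sketch that evaluates it by hand for one hypothetical car. The coefficient values below are invented purely for illustration; the real ones are learned when we fit the model in step 6.

# Hypothetical coefficients, for illustration only. The real values
# (beta_0, beta_1, ...) are learned when the model is fit in step 6.
b0, b_weight, b_horsepower = 45.0, -0.006, -0.05

weight, horsepower = 3000, 100  # one example car
mpg_estimate = b0 + b_weight * weight + b_horsepower * horsepower
print(mpg_estimate)  # 45.0 - 18.0 - 5.0 = 22.0 MPG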



Step-by-Step Implementation


1. Data Loading and Initial Exploration

We start by importing necessary libraries and loading the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load dataset
df = pd.read_csv('CarData.csv')

Key observations:

  • The dataset contains 398 entries with features like weight, cylinders, horsepower, displacement, and more.

  • There are a few missing values in the horsepower column, which will be handled in the data preprocessing step; the quick check below confirms both observations.
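Here is that quick check (df.info() reports the row count and the non-null count for every column):

# Row count, dtypes, and non-null counts for every column
df.info()

# Count missing values per column to confirm where the gaps are
print(df.isnull().sum())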



2. Data Preprocessing


We simplify the dataset by dropping columns we don’t plan to use and removing rows with missing values:

# Drop columns we don't need
df = df.drop(columns=['origin', 'name'])

# Remove rows with missing values
df = df.dropna()

Why this step matters: Removing irrelevant features and handling missing data ensures the model only uses the most meaningful information.
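One caveat worth flagging: dropna() only removes rows whose missing values are already encoded as NaN. In the classic UCI Auto MPG data, missing horsepower values are stored as the string '?', so depending on how CarData.csv encodes them, you may first need to coerce the column to numeric:

# If missing horsepower values are encoded as '?' rather than NaN,
# coerce the column to numeric so dropna() can catch them
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df = df.dropna()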




3. Visualizing the Relationships


To understand the correlation between the features and MPG, we create scatter plots for each independent feature:

# Visualizing scatter plots
fig, axes = plt.subplots(2, 3, figsize=(12, 8))

columns = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']
colors = ["#9ACBD0", "#FFB6C1", "#8FBC8F", "#FFD700", "#FF6347", "#6A5ACD"]

for i, col in enumerate(columns):
    row, col_index = divmod(i, 3)
    axes[row, col_index].scatter(df[col], df['mpg'], c=colors[i], alpha=0.6)
    axes[row, col_index].set_xlabel(col.capitalize())
    axes[row, col_index].set_ylabel('MPG')
    axes[row, col_index].set_title(f'{col.capitalize()} vs MPG')

plt.tight_layout()
plt.show()

Scatter plots of each independent feature vs. MPG

Based on a visual inspection of the scatter plots:

  1. Cylinders vs MPG: This shows a clear trend where fewer cylinders generally correspond to higher MPG. This feature seems relevant.

  2. Displacement vs MPG: There is a strong negative relationship between displacement and MPG, making this feature important for the model.

  3. Horsepower vs MPG: A negative relationship is observed, suggesting that higher horsepower leads to lower MPG. This feature is relevant.

  4. Weight vs MPG: The trend shows that higher vehicle weight correlates with lower MPG. This is another key feature.

  5. Acceleration vs MPG: The relationship is less clear and looks scattered. This feature may not strongly influence MPG directly, so it could be tested for significance or excluded.

  6. Model Year vs MPG: There is a positive trend indicating that newer model years correspond to higher MPG. This feature seems relevant.



4. Correlation Matrix


We check the correlation between the features to identify which ones are most strongly related to MPG:

# Correlation matrix
correlation_matrix = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'mpg']].corr()

# Plotting the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

Correlation matrix heatmap

Key observation: Acceleration shows the weakest correlation with MPG among the remaining features, making it a candidate for exclusion, while weight, displacement, and horsepower correlate strongly (and negatively) with MPG. For this baseline we keep all remaining features and let the model weight them; a sketch for ranking the features programmatically follows.
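Rather than reading values off the heatmap, you can also rank the features programmatically (a small sketch reusing the correlation_matrix computed above):

# Rank features by the strength of their correlation with MPG
mpg_corr = correlation_matrix['mpg'].drop('mpg')
print(mpg_corr.reindex(mpg_corr.abs().sort_values(ascending=False).index))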



5. Preparing Data for Model Training


We split the dataset into independent variables (features) and the dependent variable (MPG). We then divide the data into training and testing sets:

# Split the data into dependent and independent variables
y = df['mpg']
x = df.drop(columns=['mpg'])

# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=66)
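A quick sanity check on the resulting shapes confirms the split; with test_size=0.25, roughly three quarters of the cleaned rows go to training:

# Confirm the 75% / 25% split
print(x_train.shape, x_test.shape)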


6. Model Training and Evaluation


We initialize the multiple linear regression model and fit it to the training data. Then, we make predictions on the test set and evaluate the model's performance using the R-squared score:

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(x_train, y_train)

# Predict the target values
y_pred = model.predict(x_test)

# Evaluate the model's performance using R-squared
r2 = r2_score(y_test, y_pred)
print(f"R2 = {r2:.1f}")

Interpretation: An R² of 0.8 means 80% of the variance in MPG is explained by the selected features. This indicates a strong fit for the model.
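Two quick follow-ups make the result more tangible (a sketch reusing the fitted model and the test split from above). Inspecting the learned intercept and coefficients ties the fit back to the regression equation from earlier, and error metrics expressed in MPG units complement the unitless R²:

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Learned intercept (beta_0) and one coefficient per feature
print(f"Intercept: {model.intercept_:.2f}")
for feature, coef in zip(x.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")

# Average errors in the target's own units (MPG)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"MAE  = {mae:.2f} MPG")
print(f"RMSE = {rmse:.2f} MPG")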



7. Conclusion


  • Results: Our model successfully predicts car MPG using multiple features, with a strong R-squared value of 0.8.

  • Limitations: While multiple linear regression captures the impact of multiple factors, it doesn’t account for more complex, non-linear relationships. For better accuracy, consider exploring polynomial regression or advanced models like decision trees and random forests, as sketched below.
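As a starting point for those extensions, a tree-based model can be swapped in with almost no code changes (a sketch reusing the same train/test split and the r2_score import from earlier; the hyperparameters are illustrative defaults, not tuned values):

from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the same split and compare R² scores
rf = RandomForestRegressor(n_estimators=200, random_state=66)
rf.fit(x_train, y_train)
print(f"Random forest R2 = {r2_score(y_test, rf.predict(x_test)):.2f}")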


GitHub Repo


Access the GitHub repository and run the Jupyter notebook to explore the code interactively.
