
Car Mileage Prediction Using Multiple Linear Regression: A Step-by-Step Guide

This project demonstrates predicting a car’s miles per gallon (MPG) using multiple linear regression. By modeling how features such as weight, horsepower, and displacement relate to MPG, it offers a beginner-friendly introduction to regression analysis with Python, pandas, and scikit-learn, providing insights for those exploring predictive modeling.

Keywords: car mileage prediction, multiple linear regression, data science project, MPG prediction, regression model tutorial


Introduction: Why Predict Car Mileage?


Fuel efficiency (measured in miles per gallon, or MPG) is a key factor for both car buyers and manufacturers. In this project, we'll predict a vehicle’s MPG using multiple linear regression, a powerful technique for modeling the relationship between multiple independent variables and a dependent variable. This guide is designed for data science professionals and beginners looking to understand how multiple features affect MPG and how to build a regression model to predict it.



What is Multiple Linear Regression?

Multiple linear regression is an extension of simple linear regression that models the relationship between two or more independent variables (e.g., car weight, horsepower, displacement) and a dependent variable (MPG). The equation is:

MPG = β₀ + β₁ × weight + β₂ × horsepower + β₃ × displacement + …

Our goal is to find the best-fit line (or hyperplane) that minimizes prediction errors across multiple variables.
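To make the equation concrete, here is a tiny sketch that evaluates it by hand for one hypothetical car. The coefficient values below are invented purely for illustration; the real ones are learned when we fit the model in step 6.

# Hypothetical coefficients, for illustration only. The real values
# (beta_0, beta_1, ...) are learned when the model is fit in step 6.
b0, b_weight, b_horsepower = 45.0, -0.006, -0.05

weight, horsepower = 3000, 100  # one example car
mpg_estimate = b0 + b_weight * weight + b_horsepower * horsepower
print(mpg_estimate)  # 45.0 - 18.0 - 5.0 = 22.0 MPG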



Step-by-Step Implementation


1. Data Loading and Initial Exploration

We start by importing necessary libraries and loading the dataset:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load dataset
df = pd.read_csv('CarData.csv')

Key observations:

  • The dataset contains 398 entries with features like weight, cylinders, horsepower, displacement, and more.

  • There are a few missing values in the horsepower column, which will be handled in the data preprocessing step; the quick check below confirms both observations.
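Here is that quick check (df.info() reports the row count and the non-null count for every column):

# Row count, dtypes, and non-null counts for every column
df.info()

# Count missing values per column to confirm where the gaps are
print(df.isnull().sum())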



2. Data Preprocessing


We simplify the dataset by dropping columns we don’t plan to use and removing rows with missing values:

# Drop columns we don't need
df = df.drop(columns=['origin', 'name'])

# Remove rows with missing values
df = df.dropna()

Why this step matters: Removing irrelevant features and handling missing data ensures the model only uses the most meaningful information.
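One caveat worth flagging: dropna() only removes rows whose missing values are already encoded as NaN. In the classic UCI Auto MPG data, missing horsepower values are stored as the string '?', so depending on how CarData.csv encodes them, you may first need to coerce the column to numeric:

# If missing horsepower values are encoded as '?' rather than NaN,
# coerce the column to numeric so dropna() can catch them
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df = df.dropna()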




3. Visualizing the Relationships


To understand the correlation between the features and MPG, we create scatter plots for each independent feature:

# Visualizing scatter plots
fig, axes = plt.subplots(2, 3, figsize=(12, 8))

columns = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']
colors = ["#9ACBD0", "#FFB6C1", "#8FBC8F", "#FFD700", "#FF6347", "#6A5ACD"]

for i, col in enumerate(columns):
    row, col_index = divmod(i, 3)
    axes[row, col_index].scatter(df[col], df['mpg'], c=colors[i], alpha=0.6)
    axes[row, col_index].set_xlabel(col.capitalize())
    axes[row, col_index].set_ylabel('MPG')
    axes[row, col_index].set_title(f'{col.capitalize()} vs MPG')

plt.tight_layout()
plt.show()

Scatter plots of each independent feature vs. MPG

Based on a visual inspection of the scatter plots:

  1. Cylinders vs MPG: This shows a clear trend where fewer cylinders generally correspond to higher MPG. This feature seems relevant.

  2. Displacement vs MPG: There is a strong negative relationship between displacement and MPG, making this feature important for the model.

  3. Horsepower vs MPG: A negative relationship is observed, suggesting that higher horsepower leads to lower MPG. This feature is relevant.

  4. Weight vs MPG: The trend shows that higher vehicle weight correlates with lower MPG. This is another key feature.

  5. Acceleration vs MPG: The relationship is less clear and looks scattered. This feature may not strongly influence MPG directly, so it could be tested for significance or excluded.

  6. Model Year vs MPG: There is a positive trend indicating that newer model years correspond to higher MPG. This feature seems relevant.



4. Correlation Matrix


We check the correlation between the features to identify which ones are most strongly related to MPG:

# Correlation matrix
correlation_matrix = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'mpg']].corr()

# Plotting the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

Correlation matrix heatmap

Key observation: Acceleration shows the weakest correlation with MPG among the remaining features, making it a candidate for exclusion, while weight, displacement, and horsepower correlate strongly (and negatively) with MPG. For this baseline we keep all remaining features and let the model weight them; a sketch for ranking the features programmatically follows.
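Rather than reading values off the heatmap, you can also rank the features programmatically (a small sketch reusing the correlation_matrix computed above):

# Rank features by the strength of their correlation with MPG
mpg_corr = correlation_matrix['mpg'].drop('mpg')
print(mpg_corr.reindex(mpg_corr.abs().sort_values(ascending=False).index))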



5. Preparing Data for Model Training


We split the dataset into independent variables (features) and the dependent variable (MPG). We then divide the data into training and testing sets:

# Split the data into dependent and independent variables
y = df['mpg']
x = df.drop(columns=['mpg'])

# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=66)
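A quick sanity check on the resulting shapes confirms the split; with test_size=0.25, roughly three quarters of the cleaned rows go to training:

# Confirm the 75% / 25% split
print(x_train.shape, x_test.shape)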


6. Model Training and Evaluation


We initialize the multiple linear regression model and fit it to the training data. Then, we make predictions on the test set and evaluate the model's performance using the R-squared score:

# Initialize the model
model = LinearRegression()

# Train the model
model.fit(x_train, y_train)

# Predict the target values
y_pred = model.predict(x_test)

# Evaluate the model's performance using R-squared
r2 = r2_score(y_test, y_pred)
print(f"R2 = {r2:.1f}")

Interpretation: An R² of 0.8 means 80% of the variance in MPG is explained by the selected features. This indicates a strong fit for the model.
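Two quick follow-ups make the result more tangible (a sketch reusing the fitted model and the test split from above). Inspecting the learned intercept and coefficients ties the fit back to the regression equation from earlier, and error metrics expressed in MPG units complement the unitless R²:

from sklearn.metrics import mean_absolute_error, mean_squared_error

# Learned intercept (beta_0) and one coefficient per feature
print(f"Intercept: {model.intercept_:.2f}")
for feature, coef in zip(x.columns, model.coef_):
    print(f"{feature}: {coef:.4f}")

# Average errors in the target's own units (MPG)
mae = mean_absolute_error(y_test, y_pred)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
print(f"MAE  = {mae:.2f} MPG")
print(f"RMSE = {rmse:.2f} MPG")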



7. Conclusion


  • Results: Our model successfully predicts car MPG using multiple features, with a strong R-squared value of 0.8.

  • Limitations: While multiple linear regression captures the impact of multiple factors, it doesn’t account for more complex, non-linear relationships. For better accuracy, consider exploring polynomial regression or advanced models like decision trees and random forests, as sketched below.
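As a starting point for those extensions, a tree-based model can be swapped in with almost no code changes (a sketch reusing the same train/test split and the r2_score import from earlier; the hyperparameters are illustrative defaults, not tuned values):

from sklearn.ensemble import RandomForestRegressor

# Fit a random forest on the same split and compare R² scores
rf = RandomForestRegressor(n_estimators=200, random_state=66)
rf.fit(x_train, y_train)
print(f"Random forest R2 = {r2_score(y_test, rf.predict(x_test)):.2f}")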


GitHub Repo


Access the GitHub repository and run the Jupyter notebook to explore the code interactively.
