
Car Mileage Prediction Using Multiple Linear Regression: A Step-by-Step Guide
This project demonstrates predicting a car’s miles per gallon (MPG) using multiple linear regression. By analyzing the relationship between MPG and features such as weight, horsepower, and displacement, the model offers a beginner-friendly introduction to regression analysis with Python, pandas, and scikit-learn, providing insights for those exploring predictive modeling.
Keywords: car mileage prediction, multiple linear regression, data science project, MPG prediction, regression model tutorial
Introduction: Why Predict Car Mileage?
Fuel efficiency (measured in miles per gallon, or MPG) is a key factor for both car buyers and manufacturers. In this project, we'll predict a vehicle’s MPG using multiple linear regression, a powerful technique for modeling the relationship between multiple independent variables and a dependent variable. This guide is designed for data science professionals and beginners looking to understand how multiple features affect MPG and how to build a regression model to predict it.
What is Multiple Linear Regression?
Multiple linear regression is an extension of simple linear regression that models the relationship between two or more independent variables (e.g., car weight, horsepower, displacement) and a dependent variable (MPG). The equation:
MPG = β₀ + β₁ * weight + β₂ * horsepower + β₃ * displacement + …
Our goal is to find the best-fit line (or hyperplane) that minimizes prediction errors across multiple variables.
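To make the equation concrete, here is a minimal sketch that estimates the β coefficients by ordinary least squares with NumPy. The data is made up for illustration (it is not taken from CarData.csv), and the MPG values are generated from known coefficients so the fit can be checked.

```python
import numpy as np

# Made-up toy data (not from CarData.csv): weight in lbs and horsepower.
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0])
horsepower = np.array([70.0, 90.0, 100.0, 150.0, 160.0])

# MPG generated from known coefficients: beta0, beta1 (weight), beta2 (horsepower).
true_beta = np.array([50.0, -0.005, -0.05])
mpg = true_beta[0] + true_beta[1] * weight + true_beta[2] * horsepower

# Design matrix: a column of ones for the intercept, then one column per feature.
X = np.column_stack([np.ones_like(weight), weight, horsepower])

# Ordinary least squares: find beta that minimizes the squared prediction error.
beta, *_ = np.linalg.lstsq(X, mpg, rcond=None)

# Predict MPG for a hypothetical 2800 lb, 95 hp car.
new_car = np.array([1.0, 2800.0, 95.0])
predicted = new_car @ beta
print(beta, predicted)
```

Because the toy data has no noise, the recovered coefficients match the ones used to generate it. On real data the fit only approximates the underlying relationship.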
Step-by-Step Implementation
1. Data Loading and Initial Exploration
We start by importing necessary libraries and loading the dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Load dataset
df = pd.read_csv('CarData.csv')
Key observations:
The dataset contains 398 entries with features like weight, cylinders, horsepower, displacement, and more.
There are a few missing values in the horsepower column, which will be handled in the data preprocessing step.
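Observations like these can be checked with a couple of pandas calls. The sketch below uses a small stand-in DataFrame (the name df_demo and its values are invented for illustration) rather than the real CSV; the same calls apply to df after loading.

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrame; in the article, df comes from pd.read_csv('CarData.csv').
df_demo = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 32.0],
    "weight": [3504, 3693, 2587, 2130],
    "horsepower": [130.0, np.nan, 88.0, 65.0],
})

# Number of rows and columns.
n_rows, n_cols = df_demo.shape

# Count of missing (NaN) values in each column.
missing_per_column = df_demo.isna().sum()

print(df_demo.head())
print(missing_per_column)
```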
2. Data Preprocessing
We simplify the dataset by dropping columns we don’t plan to use and removing rows with missing values:
# Drop columns we don't need
df = df.drop(columns=['origin', 'name'])
# Remove rows with missing values
df = df.dropna()
Why this step matters: Removing irrelevant features and handling missing data ensures the model only uses the most meaningful information.
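One caveat: dropna() only removes values that pandas already recognizes as missing. If a CSV marks missing horsepower with a placeholder such as '?' (an assumption here, since the article does not say how CarData.csv encodes them), the column is read as text and nothing gets dropped. A minimal sketch of handling that case with made-up data:

```python
import pandas as pd

# Hypothetical example: horsepower read as text because of a '?' placeholder.
raw = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],
})

# Convert to numbers; anything that cannot be parsed (like '?') becomes NaN.
raw["horsepower"] = pd.to_numeric(raw["horsepower"], errors="coerce")

# Now dropna() removes the row that held the placeholder.
cleaned = raw.dropna()
print(cleaned)
```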
3. Visualizing the Relationships
To understand the correlation between the features and MPG, we create scatter plots for each independent feature:
# Visualizing scatter plots
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
columns = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']
colors = ["#9ACBD0", "#FFB6C1", "#8FBC8F", "#FFD700", "#FF6347", "#6A5ACD"]
for i, col in enumerate(columns):
    row, col_index = divmod(i, 3)
    axes[row, col_index].scatter(df[col], df['mpg'], c=colors[i], alpha=0.6)
    axes[row, col_index].set_xlabel(col.capitalize())
    axes[row, col_index].set_ylabel('MPG')
    axes[row, col_index].set_title(f'{col.capitalize()} vs MPG')
plt.tight_layout()
plt.show()
Based on the visual inspection of the scatter plots:
Cylinders vs MPG: This shows a clear trend where fewer cylinders generally correspond to higher MPG. This feature seems relevant.
Displacement vs MPG: There is a strong negative relationship between displacement and MPG, making this feature important for the model.
Horsepower vs MPG: A negative relationship is observed, suggesting that higher horsepower leads to lower MPG. This feature is relevant.
Weight vs MPG: The trend shows that higher vehicle weight correlates with lower MPG. This is another key feature.
Acceleration vs MPG: The relationship is less clear and looks scattered. This feature may not strongly influence MPG directly, so it could be tested for significance or excluded.
Model Year vs MPG: There is a positive trend indicating that newer model years correspond to higher MPG. This feature seems relevant.
4. Correlation Matrix
We check the correlation between the features to identify which ones are most strongly related to MPG:
# Correlation matrix
correlation_matrix = df[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'mpg']].corr()
# Plotting the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
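Key observation: Features with a weak correlation to MPG (acceleration looked the least clear in the scatter plots) are candidates for exclusion, which can simplify the model. Note that the split code in the next step drops only the mpg column, so as written the model trains on all six remaining features.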
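Rather than reading the heatmap by eye, the correlations with MPG can be ranked directly. This is a sketch on a small made-up table (the demo values are invented, not from CarData.csv); on the real data the same lines would run against correlation_matrix.

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "acceleration": [15.0, 14.0, 16.0, 13.0, 15.5],
    "mpg": [35.0, 30.0, 24.0, 20.0, 15.0],
})

# Correlation of every feature with mpg (dropping mpg's correlation with itself).
corr_with_mpg = demo.corr()["mpg"].drop("mpg")

# Rank by absolute value, since strong negative correlations matter as much as positive ones.
ranked = corr_with_mpg.abs().sort_values(ascending=False)
print(ranked)
```

Sorting by absolute value matters here because the most useful features in this problem are often negatively correlated with MPG.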
5. Preparing Data for Model Training
We split the dataset into independent variables (features) and the dependent variable (MPG). We then divide the data into training and testing sets:
# Split the data into dependent and independent variables
y = df['mpg']
x = df.drop(columns=['mpg'])
# Split into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=66)
6. Model Training and Evaluation
We initialize the multiple linear regression model and fit it to the training data. Then, we make predictions on the test set and evaluate the model's performance using the R-squared score:
# Initialize the model
model = LinearRegression()
# Train the model
model.fit(x_train, y_train)
# Predict the target values
y_pred = model.predict(x_test)
# Evaluate the model's performance using R-squared
r2 = r2_score(y_test, y_pred)
print(f"R2 = {r2:.1f}")
Interpretation: An R² of 0.8 means that about 80% of the variance in MPG in the test set is explained by the model's features. This indicates a strong fit for the model.
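To see where that number comes from, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses four made-up values (not the article's predictions) and checks the result against r2_score. It also computes RMSE, which reports the typical error in MPG units and is a useful companion metric.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted MPG values for illustration.
y_true = np.array([30.0, 25.0, 20.0, 15.0])
y_hat = np.array([28.0, 26.0, 19.0, 17.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_hat) ** 2)

# Total sum of squares: error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

# Root mean squared error, in the same units as MPG.
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat), rmse)
```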
7. Conclusion
Results: Our model successfully predicts car MPG using multiple features, with a strong R-squared value of 0.8.
Limitations: While multiple linear regression captures the impact of multiple factors, it doesn’t account for more complex, non-linear relationships. For better accuracy, consider exploring polynomial regression or advanced models like decision trees or random forests.
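As a rough illustration of the non-linearity point, the sketch below compares a plain linear fit with a degree-2 polynomial fit. The data is synthetic (a curved relationship plus noise, generated with a fixed seed), so the numbers say nothing about CarData.csv. Both scores are computed on the training data, which is enough to show the effect but is not a substitute for a held-out test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: "weight" in thousands of lbs with a curved effect on MPG.
rng = np.random.default_rng(0)
x = rng.uniform(1.5, 5.0, size=200).reshape(-1, 1)
y = 60.0 - 18.0 * x[:, 0] + 1.8 * x[:, 0] ** 2 + rng.normal(0, 1.0, size=200)

# Plain linear regression: a straight line through curved data.
linear = LinearRegression().fit(x, y)

# Polynomial regression: add squared terms, then fit a linear model on them.
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

# R-squared on the training data for each model.
r2_linear = linear.score(x, y)
r2_quadratic = quadratic.score(x, y)
print(f"linear R2 = {r2_linear:.3f}, quadratic R2 = {r2_quadratic:.3f}")
```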
GitHub Repo
Access the GitHub repository and run the Jupyter notebook to explore the code interactively.