
Car Mileage Prediction Using Simple Linear Regression: A Step-by-Step Guide
This project demonstrates predicting a car’s miles per gallon (MPG) using simple linear regression. By analyzing the relationship between weight and MPG, the model offers a beginner-friendly introduction to regression analysis with Python, pandas, and scikit-learn, providing insights for those exploring predictive modeling.
Keywords: car mileage prediction, simple linear regression, data science project, MPG prediction, regression model tutorial
Introduction: Why Predict Car Mileage?
Fuel efficiency (measured in miles per gallon, or MPG) is a critical factor for both car buyers and manufacturers. In this project, we’ll predict a vehicle’s MPG using simple linear regression, a foundational machine learning algorithm. This guide is designed for data science professionals and beginners looking to understand regression modeling in practice.
What is Simple Linear Regression?
Simple linear regression models the relationship between one independent variable (e.g., car weight) and a dependent variable (MPG) using a straight line. The equation:
MPG = slope × weight + intercept
Our goal is to find the best-fit line that minimizes prediction errors.
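To make the idea of a best-fit line concrete, here is a minimal sketch that computes the slope and intercept with the standard closed-form least-squares formulas. The five data points are made up for illustration and are not taken from the project's dataset.

```python
import numpy as np

# Hypothetical toy data: weights in pounds and MPG values (not from the dataset).
weight = np.array([2000.0, 2500.0, 3000.0, 3500.0, 4000.0])
mpg = np.array([34.0, 29.0, 25.0, 20.0, 16.0])

# Closed-form least-squares estimates for a single predictor:
#   slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   intercept = y_mean - slope * x_mean
w_mean, m_mean = weight.mean(), mpg.mean()
slope = np.sum((weight - w_mean) * (mpg - m_mean)) / np.sum((weight - w_mean) ** 2)
intercept = m_mean - slope * w_mean

print(f"MPG = {slope:.4f} * weight + {intercept:.2f}")
```

The negative slope says that each extra pound lowers the predicted MPG slightly, which is the pattern the rest of the guide explores on real data.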
Step-by-Step Implementation
1. Data Loading and Initial Exploration
We start by importing libraries and loading the dataset:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Load dataset
df = pd.read_csv('CarData.csv')
Key observations:
The dataset contains 398 entries with features like weight, cylinders, and horsepower.
df.info() reveals 6 missing values in the horsepower column. Because we drop that column in the next step, they need no special handling here.
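The checks behind these observations can be sketched on a small stand-in frame. The DataFrame below is hypothetical (its values are illustrative and not read from CarData.csv); on the real data you would run the same calls on df.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the dataset (illustrative values only).
sample = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 21.0],
    "weight": [3504, 3693, 2587, 2875],
    "horsepower": [130.0, np.nan, 95.0, np.nan],
})

print(sample.shape)             # (number of rows, number of columns)
missing = sample.isna().sum()   # count of missing values per column
print(missing)
```

isna().sum() gives the per-column missing counts directly, which can be easier to read than subtracting the non-null counts reported by info().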
2. Data Preprocessing
To focus on the weight-MPG relationship, we retain only relevant columns:
# columns= already targets columns, so axis=1 is not needed
df = df.drop(columns=['cylinders', 'displacement', 'horsepower',
                      'acceleration', 'model_year', 'origin', 'name'])
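Why this step matters: Keeping a single predictor makes the model easy to follow and to plot in two dimensions. The dropped features are not irrelevant to MPG; they are simply outside the scope of a one-variable regression.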
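An alternative way to get the same two-column frame is to select the columns you want instead of listing the ones to drop. This is only a sketch on a hypothetical frame, with made-up values and a subset of the column names used above.

```python
import pandas as pd

# Hypothetical frame with a subset of the columns named in the article.
cars = pd.DataFrame({
    "mpg": [18.0, 25.0],
    "cylinders": [8, 4],
    "weight": [3504, 2587],
    "name": ["car a", "car b"],
})

# Keep only the two columns the model needs; copy() gives an independent frame.
cars_small = cars[["weight", "mpg"]].copy()
print(cars_small.columns.tolist())
```

Selecting by name keeps working even if the CSV gains extra columns later, whereas a drop list would have to be updated.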
3. Visualizing the Relationship
A scatter plot reveals the negative correlation between weight and MPG:
plt.scatter(x=df['weight'], y=df['mpg'], c="#9ACBD0")
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.show()

Insight: Heavier cars generally have lower fuel efficiency.
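The visual impression can be backed by a number. The sketch below computes the Pearson correlation coefficient with pandas on made-up toy values; on the project data the equivalent call would be df['weight'].corr(df['mpg']).

```python
import pandas as pd

# Hypothetical toy data (not from the dataset).
toy = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "mpg": [34.0, 29.0, 25.0, 20.0, 16.0],
})

# Series.corr uses the Pearson method by default; values near -1 mean a
# strong negative linear relationship.
r = toy["weight"].corr(toy["mpg"])
print(f"Pearson r = {r:.3f}")
```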
4. Splitting Data for Training
We divide the data into training (75%) and testing (25%) sets:
x = df[['weight']] # Independent variable
y = df['mpg'] # Dependent variable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=66)
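To see what the split produces, the sketch below runs the same call on a synthetic frame with 398 rows, matching the dataset size mentioned earlier. The weight and MPG values are randomly generated for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the article's dataset (398).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"weight": rng.uniform(1600, 5200, size=398)})
demo["mpg"] = 46.0 - 0.0075 * demo["weight"] + rng.normal(0, 3, size=398)

x_train, x_test, y_train, y_test = train_test_split(
    demo[["weight"]], demo["mpg"], test_size=0.25, random_state=66
)

# With 398 rows, 25% rounds up to 100 test rows, leaving 298 for training.
print(x_train.shape, x_test.shape)
```

Fixing random_state makes the split reproducible, so the reported score does not change between runs.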
5. Model Training and Evaluation
Training the Model
model = LinearRegression()
model.fit(x_train, y_train)
Making Predictions
y_pred = model.predict(x_test)
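Once a model is fitted, its slope and intercept are available as coef_ and intercept_, and predict accepts new weights. The sketch below uses a hypothetical toy dataset and a hypothetical 3,200 lb car rather than the project's data; the name toy_model is only for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, not the article's dataset.
toy = pd.DataFrame({
    "weight": [2000, 2500, 3000, 3500, 4000],
    "mpg": [34.0, 29.0, 25.0, 20.0, 16.0],
})
toy_model = LinearRegression().fit(toy[["weight"]], toy["mpg"])

# The fitted line: MPG = slope * weight + intercept
slope = toy_model.coef_[0]
intercept = toy_model.intercept_
print(f"MPG = {slope:.4f} * weight + {intercept:.2f}")

# Predict for a hypothetical 3,200 lb car; reuse the training column name.
new_car = pd.DataFrame({"weight": [3200]})
predicted = toy_model.predict(new_car)[0]
print(f"Predicted MPG: {predicted:.1f}")
```

Passing a DataFrame with the same column name as the training data avoids the feature-name warning scikit-learn emits when a model trained on a DataFrame is given a bare array.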
Evaluating Accuracy
We use the R-squared (R²) score to measure how well the regression line fits the data:
r2 = r2_score(y_test, y_pred)
print(f"R2 = {r2:.1f}") # Output: R2 = 0.7
Interpretation: An R² of about 0.7 means that weight alone explains roughly 70% of the variance in MPG on the test set, a decent fit for a one-variable model.
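R² is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on four made-up values and compares the result with r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted MPG values.
y_true = np.array([30.0, 25.0, 20.0, 15.0])
y_hat = np.array([28.0, 26.0, 19.0, 17.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual R2 = {r2_manual:.2f}, sklearn R2 = {r2_score(y_true, y_hat):.2f}")
```

An R² of 1 would mean perfect predictions, while 0 means the model does no better than always predicting the mean MPG.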
6. Visualizing the Regression Line
plt.scatter(x_test, y_test, color='#9ACBD0', label='Actual Data')
plt.plot(x_test, y_pred, color='#2973B2', label='Regression Line', linewidth=2)
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.legend()
plt.title('Simple Linear Regression: Weight vs. MPG')
plt.show()

Conclusion
Results: Our model captures a clear negative relationship between car weight and MPG, with an R² of about 0.7 on the test set.
Limitations: Simple linear regression ignores other factors (e.g., engine power). For better accuracy, consider multiple linear regression.
Next Steps:
Experiment with polynomial regression for non-linear relationships.
Include features like horsepower or origin for a richer model.
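As a starting point for the polynomial idea, here is a minimal sketch that compares a straight line with a degree-2 fit. It assumes synthetic data with a deliberately curved relationship (weights expressed in thousands of pounds), so the numbers say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a deliberately curved weight-MPG relationship (made up).
rng = np.random.default_rng(0)
weight_klb = rng.uniform(1.6, 5.2, size=200)  # weight in thousands of pounds
mpg = 70 - 22 * weight_klb + 2.0 * weight_klb**2 + rng.normal(0, 2, size=200)
X = weight_klb.reshape(-1, 1)

linear = LinearRegression().fit(X, mpg)
quadratic = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds a weight^2 column
    LinearRegression(),
).fit(X, mpg)

# score() returns R² on the data passed in (here, the training data).
r2_linear = linear.score(X, mpg)
r2_quadratic = quadratic.score(X, mpg)
print(f"linear R2 = {r2_linear:.3f}, quadratic R2 = {r2_quadratic:.3f}")
```

On training data the quadratic fit can never score lower than the linear one, so a fair comparison on the real dataset should use the held-out test set, as in the evaluation step above.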
GitHub Repo
Access the GitHub repository and run the Jupyter notebook to explore the code interactively.