A comprehensive machine learning project that predicts Formula 1 race winners using historical data from 1950 to 2024. This system compares multiple ML algorithms, automatically selects the best performer, and deploys it via a premium web application with a modern dark-mode interface.
- Overview
- Critical Implementation Fixes
- Dataset
- Installation
- Usage
- Features
- Model Performance
- Web Application
This project trains and compares multiple machine learning algorithms to predict Formula 1 race winners with high accuracy. The pipeline automatically selects the best-performing model and deploys it through a sleek, interactive web application.
- Data Leakage Prevention: Uses pre-race `cumulative_points` instead of post-race points to ensure realistic results.
- Multi-Model Comparison: Evaluates Random Forest, XGBoost, Logistic Regression, Gradient Boosting, and more.
- Automatic Model Selection: Identifies the best performer using ROC-AUC and F1-Score metrics.
- Robust Preprocessing: Standardized feature scaling (`StandardScaler`) and automated categorical encoding.
- Premium Web Interface: Modern Flask-powered web app with F1-themed dark mode and dynamic circuit metadata.
- Real-Time Predictions: Instant winning probability for any driver/circuit/team combination.
The model includes several critical improvements over traditional F1 predictors:
- Fixing Data Leakage: Replaced post-race `points` with `cumulative_points` (points earned before the current race). This ensures the model only uses information that would be available prior to the race start.
- Feature Scaling: Implemented `StandardScaler` to normalize features like `cumulative_points` (0-400+) and `grid_position` (1-20), preventing large-scale variables from dominating the model.
- Class Imbalance Handling: Applied `class_weight='balanced'` and optimized for Weighted F1-Score to account for the fact that only one driver wins per race.
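The three fixes above can be sketched together in a few lines. This is a minimal illustration, not the project's actual training code: the tiny `f1` DataFrame is a stand-in, though the column names (`raceId`, `driverId`, `points`, `grid`) follow the Kaggle Ergast schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the merged race results (two drivers, three races).
f1 = pd.DataFrame({
    "driverId": [1, 1, 1, 2, 2, 2],
    "raceId":   [10, 11, 12, 10, 11, 12],
    "points":   [25, 18, 25, 10, 25, 8],
    "grid":     [1, 3, 1, 4, 2, 6],
    "winner":   [1, 0, 1, 0, 1, 0],
})

# Leakage fix: pre-race cumulative points = running total per driver
# minus the current race's points, so a race never "sees" its own result.
f1 = f1.sort_values(["driverId", "raceId"])
f1["cumulative_points"] = f1.groupby("driverId")["points"].cumsum() - f1["points"]

# Scaling fix: normalize features with very different ranges.
X = StandardScaler().fit_transform(f1[["cumulative_points", "grid"]])
y = f1["winner"]

# Imbalance fix: weight the rare "winner" class up during training.
model = LogisticRegression(class_weight="balanced").fit(X, y)
```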
The dataset is sourced from the Kaggle Formula 1 World Championship (1950-2024), compiled from the Ergast API.
The pipeline integrates 14 CSV files:
- circuits.csv: Circuit metadata (location, country, coordinates).
- constructor_results.csv: Constructor race points.
- constructor_standings.csv: Constructor championship positions.
- constructors.csv: Team names and nationalities.
- driver_standings.csv: Driver championship points and wins.
- drivers.csv: Driver names, nationalities, and DOB.
- lap_times.csv: Lap-by-lap timing data.
- pit_stops.csv: Pit stop durations and lap numbers.
- qualifying.csv: Q1, Q2, and Q3 session times.
- races.csv: Race calendar and metadata.
- results.csv: Final race results (primary target data).
- seasons.csv: Historical season links.
- sprint_results.csv: Sprint race outcome data.
- status.csv: Finish status codes (Finished, DNF, etc.).
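Joining these tables follows the Ergast foreign keys (`raceId`, `driverId`, `circuitId`). The sketch below uses tiny in-memory stand-ins so it is self-contained; real code would read the files instead, e.g. `pd.read_csv("data/results.csv")`.

```python
import pandas as pd

# Stand-ins for results.csv, races.csv, circuits.csv, drivers.csv.
results  = pd.DataFrame({"raceId": [1], "driverId": [44], "grid": [1], "positionOrder": [1]})
races    = pd.DataFrame({"raceId": [1], "circuitId": [7], "year": [2024]})
circuits = pd.DataFrame({"circuitId": [7], "name": ["Monza"]})
drivers  = pd.DataFrame({"driverId": [44], "surname": ["Hamilton"]})

# One row per (race, driver), enriched with circuit and driver metadata.
df = (
    results
    .merge(races, on="raceId")
    .merge(circuits, on="circuitId")
    .merge(drivers, on="driverId")
)
```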
- Clone the repository:

  ```bash
  git clone https://github.com/heyisula/f1.git
  cd f1
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Open and run all cells in `train.ipynb`:

```bash
jupyter notebook train.ipynb
```

The notebook will:
- Load and preprocess data from the `data/` directory.
- Implement data leakage fixes and feature engineering.
- Train multiple models (XGBoost, Random Forest, Gradient Boosting).
- Save the best model and preprocessors to `out/models/f1_model_data.pkl`.
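Bundling the model together with its preprocessors means the web app can reload everything in one step. A hypothetical sketch follows; the actual keys inside `f1_model_data.pkl` may differ from the `model`/`scaler`/`encoders` names assumed here.

```python
import os
import pickle
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Bundle the (here untrained) model with its preprocessing objects so
# predictions at serving time use the exact same transformations.
bundle = {
    "model": GradientBoostingClassifier(),
    "scaler": StandardScaler(),
    "encoders": {"driver": LabelEncoder()},
}

os.makedirs("out/models", exist_ok=True)
with open("out/models/f1_model_data.pkl", "wb") as f:
    pickle.dump(bundle, f)

# Later (e.g. in app.py) everything comes back in one load.
with open("out/models/f1_model_data.pkl", "rb") as f:
    loaded = pickle.load(f)
```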
Start the Flask server:

```bash
python app.py
```

Open your browser and navigate to http://127.0.0.1:5000. You can:
- Select a driver, team, and circuit.
- Watch the Laps field auto-fill based on the selected circuit.
- Get an AI-predicted winning probability.
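The prediction flow can be pictured as a small Flask route. This is a hypothetical sketch, not the project's `app.py`: the route name, JSON fields, and the `predict_win_probability` stub are all illustrative stand-ins for the real model call.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_win_probability(driver_id, circuit_id, team_id):
    # Stand-in for the trained model; the real app would encode the
    # inputs, scale them, and run the pickled classifier.
    return 0.42

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    prob = predict_win_probability(data["driver"], data["circuit"], data["team"])
    return jsonify({"win_probability": prob})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```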
| Feature | Description | Importance |
|---|---|---|
| Grid Position | Starting position on the grid (1-20) | High |
| Cumulative Points | Points earned entering the race | High |
| Circuit ID | Encoded circuit identifier | Medium |
| Driver/Team ID | Encoded driver and constructor identifiers | Medium-High |
| Driver Age | Calculated age at race time | Low |
| Laps | Total race distance | Low |
- Qualifying gap to pole (seconds)
- Reliability index (DNF rate over last 5 races)
- Teammate vs. Teammate historical performance
- Weather conditions (Rain/Dry probability)
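One of the features above, the reliability index (DNF rate over the last 5 races), could be computed with a shifted rolling window so the current race is excluded. The column names here are illustrative, not from the actual codebase.

```python
import pandas as pd

# Toy finishing record for one driver; 1 = did not finish (DNF).
df = pd.DataFrame({
    "driverId": [1] * 7,
    "raceId": range(1, 8),
    "dnf": [0, 1, 0, 0, 1, 0, 0],
})

# DNF rate over the previous 5 races. shift(1) excludes the current
# race so the feature stays leakage-free, matching the pre-race
# philosophy of cumulative_points.
df["reliability_index"] = (
    df.groupby("driverId")["dnf"]
      .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)
```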
| Metric | Target | Interpretation |
|---|---|---|
| Weighted F1-Score | ~0.96 | High overall classification accuracy |
| ROC-AUC | ~0.95 | Excellent model discriminative ability |
| Winner Precision | ~0.55-0.65 | Realistic given the single-winner class imbalance |
Note: Metrics represent the optimized Gradient Boosting model.
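These metrics come straight from scikit-learn: weighted F1 on the hard predictions, ROC-AUC on the predicted probabilities. The `y_true`/`y_pred`/`y_score` values below are toy numbers, not the project's actual outputs.

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true  = [0, 0, 0, 1, 0, 1]              # 1 = race winner
y_pred  = [0, 0, 1, 1, 0, 1]              # hard class predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.3, 0.8]  # predicted win probabilities

# Weighted F1 averages per-class F1 by class support, which keeps the
# abundant "non-winner" class from masking winner performance.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# ROC-AUC measures ranking quality of the probabilities themselves.
auc = roc_auc_score(y_true, y_score)
```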
This project is open source and available under the MIT License.
- Dataset: Ergast F1 API via Kaggle
- Inspiration: F1 racing analytics community