A comprehensive machine learning project that predicts Formula 1 race winners using historical data from 1950 to 2024. This system compares multiple ML algorithms, automatically selects the best performer, and deploys it via a premium web application with a modern dark-mode interface.
- Overview
- Critical Implementation Fixes
- Dataset
- Installation
- Usage
- Features
- Model Performance
- Web Application
This project trains and compares multiple machine learning algorithms to predict Formula 1 race winners with high accuracy. The pipeline automatically selects the best-performing model and deploys it through a sleek, interactive web application.
- Data Leakage Prevention: Uses pre-race `cumulative_points` instead of post-race points to ensure realistic results.
- Multi-Model Comparison: Evaluates Random Forest, XGBoost, Logistic Regression, Gradient Boosting, and more.
- Automatic Model Selection: Identifies the best performer using ROC-AUC and F1-Score metrics.
- Robust Preprocessing: Standardized feature scaling (`StandardScaler`) and automated categorical encoding.
- Premium Web Interface: Modern Flask-powered web app with F1-themed dark mode and dynamic circuit metadata.
- Real-Time Predictions: Instant winning probability for any driver/circuit/team combination.
The model includes several critical improvements over traditional F1 predictors:
- Fixing Data Leakage: Replaced post-race `points` with `cumulative_points` (points earned before the current race). This ensures the model only uses information that would be available prior to the race start.
- Feature Scaling: Implemented `StandardScaler` to normalize features like `cumulative_points` (0-400+) and `grid_position` (1-20), preventing large-scale variables from dominating the model.
- Class Imbalance Handling: Applied `class_weight='balanced'` and optimized for Weighted F1-Score to account for the fact that only one driver wins per race.
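The three fixes above can be sketched together in a few lines. This is a minimal illustration, not the project's actual training code: the tiny `f1` DataFrame is a stand-in, though the column names (`raceId`, `driverId`, `points`, `grid`) follow the Kaggle Ergast schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the merged race results (two drivers, three races).
f1 = pd.DataFrame({
    "driverId": [1, 1, 1, 2, 2, 2],
    "raceId":   [10, 11, 12, 10, 11, 12],
    "points":   [25, 18, 25, 10, 25, 8],
    "grid":     [1, 3, 1, 4, 2, 6],
    "winner":   [1, 0, 1, 0, 1, 0],
})

# Leakage fix: pre-race cumulative points = running total per driver
# minus the current race's points, so a race never "sees" its own result.
f1 = f1.sort_values(["driverId", "raceId"])
f1["cumulative_points"] = f1.groupby("driverId")["points"].cumsum() - f1["points"]

# Scaling fix: normalize features with very different ranges.
X = StandardScaler().fit_transform(f1[["cumulative_points", "grid"]])
y = f1["winner"]

# Imbalance fix: weight the rare "winner" class up during training.
model = LogisticRegression(class_weight="balanced").fit(X, y)
```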
The dataset is sourced from the Kaggle Formula 1 World Championship (1950-2024), compiled from the Ergast API.
The pipeline integrates 14 CSV files:
- circuits.csv: Circuit metadata (location, country, coordinates).
- constructor_results.csv: Constructor race points.
- constructor_standings.csv: Constructor championship positions.
- constructors.csv: Team names and nationalities.
- driver_standings.csv: Driver championship points and wins.
- drivers.csv: Driver names, nationalities, and DOB.
- lap_times.csv: Lap-by-lap timing data.
- pit_stops.csv: Pit stop durations and lap numbers.
- qualifying.csv: Q1, Q2, and Q3 session times.
- races.csv: Race calendar and metadata.
- results.csv: Final race results (primary target data).
- seasons.csv: Historical season links.
- sprint_results.csv: Sprint race outcome data.
- status.csv: Finish status codes (Finished, DNF, etc.).
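Joining these tables follows the Ergast foreign keys (`raceId`, `driverId`, `circuitId`). The sketch below uses tiny in-memory stand-ins so it is self-contained; real code would read the files instead, e.g. `pd.read_csv("data/results.csv")`.

```python
import pandas as pd

# Stand-ins for results.csv, races.csv, circuits.csv, drivers.csv.
results  = pd.DataFrame({"raceId": [1], "driverId": [44], "grid": [1], "positionOrder": [1]})
races    = pd.DataFrame({"raceId": [1], "circuitId": [7], "year": [2024]})
circuits = pd.DataFrame({"circuitId": [7], "name": ["Monza"]})
drivers  = pd.DataFrame({"driverId": [44], "surname": ["Hamilton"]})

# One row per (race, driver), enriched with circuit and driver metadata.
df = (
    results
    .merge(races, on="raceId")
    .merge(circuits, on="circuitId")
    .merge(drivers, on="driverId")
)
```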
- Clone the repository:

  ```bash
  git clone https://github.com/heyisula/f1.git
  cd f1
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Open and run all cells in `train.ipynb`:

```bash
jupyter notebook train.ipynb
```

The notebook will:
- Load and preprocess data from the `data/` directory.
- Implement data leakage fixes and feature engineering.
- Train multiple models (XGBoost, Random Forest, Gradient Boosting).
- Save the best model and preprocessors to `out/models/f1_model_data.pkl`.
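Bundling the model together with its preprocessors means the web app can reload everything in one step. A hypothetical sketch follows; the actual keys inside `f1_model_data.pkl` may differ from the `model`/`scaler`/`encoders` names assumed here.

```python
import os
import pickle
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Bundle the (here untrained) model with its preprocessing objects so
# predictions at serving time use the exact same transformations.
bundle = {
    "model": GradientBoostingClassifier(),
    "scaler": StandardScaler(),
    "encoders": {"driver": LabelEncoder()},
}

os.makedirs("out/models", exist_ok=True)
with open("out/models/f1_model_data.pkl", "wb") as f:
    pickle.dump(bundle, f)

# Later (e.g. in app.py) everything comes back in one load.
with open("out/models/f1_model_data.pkl", "rb") as f:
    loaded = pickle.load(f)
```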
Start the Flask server:

```bash
python app.py
```

Open your browser and navigate to http://127.0.0.1:5000. You can:
- Select a driver, team, and circuit.
- Watch the Laps field auto-fill based on the selected circuit.
- Get an AI-predicted winning probability.
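The prediction flow can be pictured as a small Flask route. This is a hypothetical sketch, not the project's `app.py`: the route name, JSON fields, and the `predict_win_probability` stub are all illustrative stand-ins for the real model call.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_win_probability(driver_id, circuit_id, team_id):
    # Stand-in for the trained model; the real app would encode the
    # inputs, scale them, and run the pickled classifier.
    return 0.42

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    prob = predict_win_probability(data["driver"], data["circuit"], data["team"])
    return jsonify({"win_probability": prob})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)
```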
| Feature | Description | Importance |
|---|---|---|
| Grid Position | Starting position on the grid (1-20) | High |
| Cumulative Points | Points earned entering the race | High |
| Circuit ID | Encoded circuit identifier | Medium |
| Driver/Team ID | Encoded driver and constructor identifiers | Medium-High |
| Driver Age | Calculated age at race time | Low |
| Laps | Total race distance | Low |
- Qualifying gap to pole (seconds)
- Reliability index (DNF rate over last 5 races)
- Teammate vs. Teammate historical performance
- Weather conditions (Rain/Dry probability)
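One of the features above, the reliability index (DNF rate over the last 5 races), could be computed with a shifted rolling window so the current race is excluded. The column names here are illustrative, not from the actual codebase.

```python
import pandas as pd

# Toy finishing record for one driver; 1 = did not finish (DNF).
df = pd.DataFrame({
    "driverId": [1] * 7,
    "raceId": range(1, 8),
    "dnf": [0, 1, 0, 0, 1, 0, 0],
})

# DNF rate over the previous 5 races. shift(1) excludes the current
# race so the feature stays leakage-free, matching the pre-race
# philosophy of cumulative_points.
df["reliability_index"] = (
    df.groupby("driverId")["dnf"]
      .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)
```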
| Metric | Target | Interpretation |
|---|---|---|
| Weighted F1-Score | ~0.96 | High overall classification accuracy |
| ROC-AUC | ~0.95 | Excellent model discriminative ability |
| Winner Precision | ~0.55-0.65 | Realistic given the single-winner class imbalance |
Note: Metrics represent the optimized Gradient Boosting model.
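These metrics come straight from scikit-learn: weighted F1 on the hard predictions, ROC-AUC on the predicted probabilities. The `y_true`/`y_pred`/`y_score` values below are toy numbers, not the project's actual outputs.

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true  = [0, 0, 0, 1, 0, 1]              # 1 = race winner
y_pred  = [0, 0, 1, 1, 0, 1]              # hard class predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.3, 0.8]  # predicted win probabilities

# Weighted F1 averages per-class F1 by class support, which keeps the
# abundant "non-winner" class from masking winner performance.
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# ROC-AUC measures ranking quality of the probabilities themselves.
auc = roc_auc_score(y_true, y_score)
```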
This project is open source and available under the MIT License.
- Dataset: Ergast F1 API via Kaggle
- Inspiration: F1 racing analytics community