The discovery of the Higgs boson particle at CERN in 2012 marked a monumental achievement in modern physics, confirming the existence of the Higgs field — a cornerstone of the Standard Model.
Detecting such elusive particles requires advanced computing, sophisticated machine learning (ML), and global collaboration.
This repository implements a GPU-accelerated machine learning pipeline for classifying signal processes (Higgs-producing) vs background processes (non-Higgs).
It leverages the RAPIDS framework for GPU-parallel data processing and Google Colab (NVIDIA T4 GPU) for accelerated model training, evaluation, and testing.
ML has become an essential tool in particle physics, traditionally applied to processed data from reconstruction algorithms.
Today’s approaches enable direct analysis of raw detector data, helping with:
- Event selection
- Event classification
- Background suppression
These innovations allow physicists to separate meaningful signal events from enormous background noise.
High-energy physics experiments — like those at the Large Hadron Collider (LHC) — generate tens of terabytes per second of raw data.
The High Luminosity LHC (HL-LHC) will produce up to 15× more data.
Traditional CPUs struggle with this scale, motivating the use of GPUs for their massive parallelism in matrix operations and data transformations.
In collider experiments, signal events are collisions in which a Higgs boson is produced and decays, while background events come from other processes that leave a similar detector signature.
Using Monte Carlo–simulated data, this project applies machine learning to classify events as either signal (1) or background (0).
- Dataset: UCI HIGGS Dataset
- Size: ~11 million instances generated via Monte Carlo simulations.
- Columns: 29 total. The first column is the binary label (1 = signal, 0 = background); the remaining 28 are physics-based features.
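The column layout described above can be sketched as follows. This is a minimal illustration, assuming the CSV has no header row and that the column order follows the attribute list published with the UCI dataset (21 low-level features followed by 7 high-level ones). The underscore-style names, the helper `split_label_features`, and the two made-up sample rows are assumptions for illustration, not code from this repository.

```python
import csv
import io

# Names assumed from the UCI attribute list, with spaces and hyphens
# replaced by underscores. The label comes first, then 28 features.
LOW_LEVEL = (
    ["lepton_pT", "lepton_eta", "lepton_phi",
     "missing_energy_magnitude", "missing_energy_phi"]
    + [f"jet_{i}_{q}" for i in range(1, 5) for q in ("pt", "eta", "phi", "b_tag")]
)
HIGH_LEVEL = ["m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]
COLUMNS = ["label"] + LOW_LEVEL + HIGH_LEVEL


def split_label_features(rows):
    """Split header-less CSV rows into integer labels and named feature dicts."""
    labels, features = [], []
    for row in rows:
        values = [float(v) for v in row]
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
        labels.append(int(values[0]))
        features.append(dict(zip(COLUMNS[1:], values[1:])))
    return labels, features


# Two made-up rows standing in for the real file.
sample = io.StringIO(
    ",".join(["1.0"] + ["0.5"] * 28) + "\n"
    + ",".join(["0.0"] + ["1.5"] * 28) + "\n"
)
labels, features = split_label_features(csv.reader(sample))
print(len(COLUMNS), labels)
```

For the full 11M-row file, a GPU dataframe reader would be the practical choice; the same `COLUMNS` list could be supplied as the column names there.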
Key analyses performed:
- Checked for missing values and class imbalance.
- Visualized feature correlations to identify redundancy.
- Identified and managed outliers using the 3×IQR rule (less aggressive, preserving more data).
- Noted strongly correlated pairs such as `m_wbb` ↔ `m_wwbb` and `m_jj` ↔ `m_jjj`.
Feature reduction: Removed m_wbb and m_jjj to prevent redundancy.
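The correlation check behind this feature reduction can be sketched with NumPy. This is an illustrative CPU version on synthetic data, not the repository's code: the 0.8 threshold, the helper `correlated_pairs`, and the way the stand-in columns are generated are all assumptions.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.8):
    """Return (name_a, name_b, r) for feature pairs with |Pearson r| >= threshold."""
    corr = np.corrcoef(X, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


rng = np.random.default_rng(0)
n = 5000
m_wwbb = rng.normal(size=n)
m_jj = rng.normal(size=n)
m_lv = rng.normal(size=n)
# Synthetic stand-ins: two columns built to track another column closely.
m_wbb = m_wwbb + 0.3 * rng.normal(size=n)
m_jjj = m_jj + 0.3 * rng.normal(size=n)

names = ["m_jj", "m_jjj", "m_lv", "m_wbb", "m_wwbb"]
X = np.column_stack([m_jj, m_jjj, m_lv, m_wbb, m_wwbb])

pairs = correlated_pairs(X, names)
for a, b, r in pairs:
    print(f"{a} <-> {b}: r = {r:.2f}")

# Drop one member of each correlated pair, as described above.
to_drop = {"m_wbb", "m_jjj"}
keep = [k for k, name in enumerate(names) if name not in to_drop]
X_reduced = X[:, keep]
```

Which member of a pair to drop is a judgment call; keeping the feature with the clearer physical interpretation is one common choice.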
- Removed correlated features (to avoid multicollinearity).
- Handled outliers (3×IQR trimming).
- Addressed class imbalance using SMOTE (Synthetic Minority Over-sampling Technique).
- Normalized using cuML’s GPU-based standard scaler for better convergence and accuracy.
- Saved processed datasets for reuse and rapid model training.
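Two of the steps above, 3×IQR trimming and standard scaling, can be sketched in NumPy. This is a CPU illustration under stated assumptions: the helpers `iqr_mask` and `standardize`, the synthetic data, and the 800-row train split are hypothetical, and the pipeline itself uses cuML's scaler rather than this code.

```python
import numpy as np


def iqr_mask(X, k=3.0):
    """Boolean mask of rows whose every feature lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.all((X >= lower) & (X <= upper), axis=1)


def standardize(X_train, X_other):
    """Scale both arrays to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_other - mean) / std


rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
X[0, 1] = 50.0  # one injected extreme value

mask = iqr_mask(X, k=3.0)
X_trimmed = X[mask]

# Hypothetical split: the first 800 surviving rows act as the training set.
X_train, X_test = X_trimmed[:800], X_trimmed[800:]
X_train_scaled, X_test_scaled = standardize(X_train, X_test)
print(X.shape[0] - X_trimmed.shape[0], "row(s) removed")
```

Using `k=3.0` instead of the more common 1.5 widens the fences, which is why the rule is described as less aggressive. Fitting the scaling statistics on the training split only avoids leaking test-set information into preprocessing.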
Three machine learning models were trained and compared:
| Model | Description | Accelerator |
|---|---|---|
| Logistic Regression | Linear baseline classifier | GPU/CPU |
| Random Forest | Ensemble model with bootstrapped decision trees | GPU/CPU |
| XGBoost | Gradient-boosted trees (best-performing) | GPU (gpu_hist) |
Hyperparameters were tuned with a grid search.
- Logistic Regression: Tested on CPU & GPU — similar accuracy, but GPU 3× faster ⏩.
- Random Forest: Tuned `n_estimators` and `max_depth`; balanced accuracy and computational load.
- XGBoost: Trained with and without normalization; normalization significantly improved stability and AUC.
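The grid search over `n_estimators` and `max_depth` can be sketched with the standard library alone. The grid values, the helpers `expand_grid` and `grid_search`, and the placeholder `toy_score` are assumptions for illustration; the values actually searched are not stated here. In a real run, the scoring function would fit a model with the given parameters and return a validation metric such as AUC.

```python
from itertools import product


def expand_grid(grid):
    """Yield one parameter dict per combination in the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(grid, score_fn):
    """Return the parameter dict with the highest score, and that score."""
    best_params, best_score = None, float("-inf")
    for params in expand_grid(grid):
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; not the values used in this project.
rf_grid = {"n_estimators": [100, 200, 400], "max_depth": [8, 12, 16]}


def toy_score(params):
    """Placeholder scorer so the sketch runs without training a model."""
    return (-abs(params["n_estimators"] - 200) / 1000
            - abs(params["max_depth"] - 12) / 100)


best_params, best_score = grid_search(rf_grid, toy_score)
print(best_params)
```

The number of fits grows multiplicatively with each added parameter, which is one reason GPU-accelerated training makes exhaustive grids more practical.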
- Applied PCA to explore dimensionality reduction.
- Re-trained models on reduced feature sets.
- PCA preserved ~95% variance with 18 components — slight drop in accuracy but faster training time.
- XGBoost maintained top performance even with reduced dimensions.
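Choosing the number of components for a target explained variance can be sketched as below. This is a NumPy illustration via SVD on synthetic data, not the GPU implementation used in the project. The helper `pca_fit`, the 26-column input (28 features minus the two removed ones), and the five-factor synthetic data are assumptions.

```python
import numpy as np


def pca_fit(X, variance_target=0.95):
    """Return the mean, the top-k principal axes, and cumulative explained variance.

    k is the smallest number of components whose cumulative explained
    variance ratio reaches variance_target.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, variance_target)) + 1
    k = min(k, len(s))  # guard against floating-point shortfall at the end
    return mean, vt[:k], cumulative[:k]


rng = np.random.default_rng(0)
# Synthetic data: 26 observed features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(2000, 5))
mixing = rng.normal(size=(5, 26))
X = latent @ mixing + 0.05 * rng.normal(size=(2000, 26))

mean, components, cumulative = pca_fit(X, variance_target=0.95)
k = components.shape[0]
X_reduced = (X - mean) @ components.T  # project onto the retained axes
print(k, "components retain", round(float(cumulative[-1]), 3), "of the variance")
```

On the real features the variance is spread far more evenly than in this synthetic example, which is consistent with needing 18 components to reach about 95%.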
After multiple experiments, Logistic Regression was dropped due to underperformance.
| Model | Accuracy | Precision | Recall | Specificity | F1-Score | AUC | Training Time (s) |
|---|---|---|---|---|---|---|---|
| Logistic Regression | 0.61 | 0.61 | 0.61 | 0.62 | 0.69 | 0.73 | 52 |
| Random Forest | 0.73 | 0.74 | 0.72 | 0.73 | 0.73 | 0.80 | 176 |
| XGBoost (Selected) | 0.74 | 0.75 | 0.73 | 0.74 | 0.74 | 0.82 | 132 |
XGBoost outperformed the other two models on every reported metric, particularly precision and AUC, and trained faster than Random Forest, making it the most suitable model for Higgs boson process classification.
The trained XGBoost model is saved in: `xgb.v1.model`
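For reference, the metrics in the table can be computed from labels and scores as sketched below. The helpers and the eight-event toy sample are illustrative assumptions, not the repository's evaluation code. F1 is the harmonic mean of precision and recall, and AUC is computed here as the fraction of (signal, background) pairs in which the signal event gets the higher score, with ties counted as half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = signal, 0 = background."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, specificity, and F1 from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Pairwise AUC; quadratic in sample size, so suitable for small samples only."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy sample of eight events with made-up classifier scores.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 decision threshold

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Unlike the other metrics, AUC does not depend on the decision threshold, which makes it a convenient single number for comparing classifiers.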
This project demonstrates the effectiveness of GPU-based ML in high-energy physics for Higgs boson process classification.
Key achievements:
- Accelerated training with Google Colab (T4 GPU).
- Efficient handling of an 11M-row dataset using RAPIDS-based and GPU-enhanced libraries.
- Rigorous preprocessing pipeline (outlier handling, normalization, feature reduction).
- XGBoost achieved the best accuracy (74%) and AUC (0.82) — recommended model for deployment.
- Python 3.10
- Google Colab (NVIDIA T4 GPU)
- RAPIDS cuML/cuDF
- XGBoost (GPU `gpu_hist` mode)
- scikit-learn
- pandas, matplotlib, seaborn
- imbalanced-learn (SMOTE)
- UCI Machine Learning Repository — HIGGS Dataset.
- RAPIDS Framework — https://rapids.ai.
- XGBoost Documentation — https://xgboost.readthedocs.io.
- CERN Open Data Portal — https://opendata.cern.ch.
Final Recommendation:
Use XGBoost (GPU-accelerated) as the primary classifier for Higgs boson process discrimination.
Achieved 74% accuracy and AUC = 0.82 on the test dataset — with superior inference speed and stability on GPU compared to CPU models. ⚡
“GPU computing has transformed theoretical physics from hours to minutes — accelerating discovery itself.”