
WorldCupBench ⚽🤖

The World Cup is the ultimate LLM eval.
11 frontier AI models predicted every match of the 2026 FIFA World Cup: frozen pre-tournament, scored live.


🇪🇸 Versión en Español (Spanish version)


๐Ÿ† Live Leaderboard

No predictions available yet.


⚡ How It Works (in 4 lines)

  1. Same prompt → 11 SOTA LLMs via OpenRouter.
  2. JSON predictions → every match, every round, every score, with 1X2 probabilities.
  3. Frozen before kickoff → no post-hoc editing. Credibility is everything.
  4. Scored live → as real results come in, we compute accuracy, Brier score, and ROI vs Polymarket.

🔮 Featured Predictions

What do 11 frontier models agree on? What do they disagree on?

Featured predictions will appear here once all models have submitted. Check back soon!


🚀 Quick Start

# Clone and setup
git clone https://github.com/mverab/WorldCupBench.git
cd WorldCupBench
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Set your OpenRouter API key
cp .env.example .env
# Edit .env and add your key

# Run predictions for all models
python src/run_predictions.py

# Or run specific models only
python src/run_predictions.py --models GPT-5.5 Grok-3

# Validate setup without calling APIs
python src/run_predictions.py --dry-run

# Generate leaderboard from collected predictions
python src/generate_leaderboard.py --inject-readme

# Run the API server (for the dashboard disagreement view)
uvicorn src.api.main:app --reload --port 8000

The disagreement endpoint is available at http://localhost:8000/api/disagreement.


🤖 Compared Models (SOTA, June 2026)

| Model | Provider | OpenRouter ID |
|---|---|---|
| GPT-5.5 | OpenAI | openai/gpt-5.5 |
| Claude Fable 5 | Anthropic | anthropic/claude-fable-5 |
| Gemini 3.5 Flash | Google | google/gemini-3.5-flash |
| Grok 4.3 | xAI | x-ai/grok-4.3 |
| DeepSeek V4-Pro | DeepSeek | deepseek/deepseek-v4-pro |
| Qwen 3.7 Max | Alibaba | qwen/qwen-3.7-max |
| Kimi K2.6 | Moonshot AI | moonshotai/kimi-k2.6 |
| GLM-5.1 | Zhipu AI | z-ai/glm-5.1 |
| MiniMax M3 | MiniMax | minimax/minimax-m3 |
| MiMo V2.5-Pro | Xiaomi | xiaomi/mimo-v2.5-pro |
| Nex-N2-Pro | Nex AGI | nex-agi/nex-n2-pro:free |

All models receive the exact same prompt with tournament data and must return structured JSON covering all 104 matches. See prompts/prediction_prompt.txt.


๐Ÿ“ Methodology

Prediction Schema

Each model outputs a JSON object validated against schema/predictions_schema.json (Draft-07):

  • 72 group stage matches with exact score and 1X2 probabilities (sum = 1.0 ± 0.02)
  • Group qualifiers: 12× 1st place, 12× 2nd place, 8× best 3rd place
  • Knockout stage: Round of 32 → Round of 16 → Quarter Finals → Semi Finals → Third Place + Final
  • Final standings: Champion, Runner-up, Third, Fourth
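
The counts in the schema can be cross-checked with a few lines of arithmetic. The sketch below assumes the 2026 format of 12 groups of 4 teams playing a single round-robin. The variable names are illustrative and do not come from the repository.

```python
# Tournament arithmetic, assuming 12 groups of 4 teams (single round-robin).
GROUPS = 12
TEAMS_PER_GROUP = 4

# Each pair of teams in a group meets once: C(4, 2) = 6 matches per group.
matches_per_group = TEAMS_PER_GROUP * (TEAMS_PER_GROUP - 1) // 2
group_matches = GROUPS * matches_per_group

# Qualifiers: 12 group winners + 12 runners-up + 8 best third-placed teams.
qualifiers = 12 + 12 + 8

# Matches per knockout round, including the third-place match.
knockout_rounds = {
    "Round of 32": 16,
    "Round of 16": 8,
    "Quarter Finals": 4,
    "Semi Finals": 2,
    "Third Place": 1,
    "Final": 1,
}
knockout_matches = sum(knockout_rounds.values())
total_matches = group_matches + knockout_matches

print(group_matches, qualifiers, knockout_matches, total_matches)  # 72 32 32 104
```

The 32 qualifiers fill the Round of 32 exactly, and the total matches the 104 matches every model must cover.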

Key Rules

  • FIFA codes only: 3-letter codes (e.g., ARG, FRA, BRA)
  • Knockout = no draws: probs.draw must be 0.0; if the model predicts a draw in 90 min, it must indicate the winner of extra time/penalties
  • Frozen timestamp: All predictions were generated and committed before the opening match (June 11, 2026)
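
The rules above translate into simple per-match checks. This is a minimal sketch rather than the repository's validator: the `home` and `away` team-code fields and the `check_match` helper are assumptions made for illustration. Only `probs.{home,draw,away}` and the ± 0.02 tolerance come from the text.

```python
import re

FIFA_CODE = re.compile(r"^[A-Z]{3}$")  # three uppercase letters, e.g. ARG
TOLERANCE = 0.02                        # allowed deviation of the 1X2 sum from 1.0


def check_match(match: dict, knockout: bool = False) -> list[str]:
    """Return the rule violations found in one predicted match (empty if none).

    Assumed layout (hypothetical): {"home": "MEX", "away": "RSA",
    "probs": {"home": 0.5, "draw": 0.3, "away": 0.2}}.
    """
    problems = []
    for side in ("home", "away"):
        if not FIFA_CODE.match(match[side]):
            problems.append(f"{side} is not a 3-letter FIFA code: {match[side]!r}")

    probs = match["probs"]
    total = probs["home"] + probs["draw"] + probs["away"]
    # Small epsilon so a sum of exactly 1.02 is not rejected by float rounding.
    if abs(total - 1.0) > TOLERANCE + 1e-9:
        problems.append(f"probabilities sum to {total:.3f}, expected 1.0 ± {TOLERANCE}")

    if knockout and probs["draw"] != 0.0:
        problems.append("knockout matches must have probs.draw == 0.0")
    return problems


ok = {"home": "MEX", "away": "RSA", "probs": {"home": 0.5, "draw": 0.3, "away": 0.2}}
print(check_match(ok))  # []
```

Returning a list of messages instead of raising on the first failure lets one pass report every problem in a prediction file.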

📊 How the ranking is computed

WorldCupBench scores every model on three independent metrics. The leaderboard ordering is driven by the probabilistic metrics, not by the single-outcome pick.

| Metric | What it measures | Input field used |
|---|---|---|
| Brier score ↓ | Calibration quality of the 1X2 probabilities | probs.{home,draw,away} |
| Outcome accuracy ↑ | Did the most likely outcome happen? (argmax(probs)) | probs.{home,draw,away} |
| Exact-score points ↑ | Did the predicted scoreline match exactly? | predicted_result + predicted_score |

Important

The leaderboard (Brier + outcome accuracy) is computed strictly from the 1X2 probabilities (probs). The fields predicted_result and predicted_score feed only the exact-score metric.
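
To make the two probabilistic metrics concrete, here is a small sketch. It assumes the multi-category Brier score (sum of squared errors over the three outcomes, averaged over matches, ranging from 0 to 2). The repository may normalize differently, and the function names are illustrative.

```python
OUTCOMES = ("home", "draw", "away")


def brier(probs: dict, actual: str) -> float:
    """Multi-category Brier score for one match: 0 is perfect, 2 is worst."""
    return sum((probs[k] - (1.0 if k == actual else 0.0)) ** 2 for k in OUTCOMES)


def argmax_outcome(probs: dict) -> str:
    """Most likely outcome; ties resolve to the first of home, draw, away."""
    return max(OUTCOMES, key=lambda k: probs[k])


def score(predictions: list[dict], results: list[str]) -> tuple[float, float]:
    """Return (mean Brier score, outcome accuracy) over paired predictions/results."""
    n = len(predictions)
    mean_brier = sum(brier(p, r) for p, r in zip(predictions, results)) / n
    accuracy = sum(argmax_outcome(p) == r for p, r in zip(predictions, results)) / n
    return mean_brier, accuracy


preds = [
    {"home": 0.5, "draw": 0.3, "away": 0.2},  # actual: home -> Brier 0.38, argmax correct
    {"home": 0.3, "draw": 0.3, "away": 0.4},  # actual: draw -> Brier 0.74, argmax wrong
]
mean_brier, accuracy = score(preds, ["home", "draw"])
print(round(mean_brier, 2), accuracy)  # 0.56 0.5
```

Neither function reads predicted_result or predicted_score, which mirrors the rule that the leaderboard depends only on probs.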

This is why you may see a match where probs.away is the highest value but predicted_result is "draw": in tight matches (e.g. home 0.30 / draw 0.30 / away 0.40) a model can rationally pick a draw as its single best guess while still assigning the marginally higher probability to one side. This is a legitimate model decision, not a data error. All 792 frozen predictions (11 models × 72 group matches) were audited: 0 inconsistencies between predicted_result and predicted_score.
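
The audit described above can be sketched as a check that the predicted scoreline implies the predicted result. The layout of predicted_score (a dict of home and away goals) and the values "home" and "away" for predicted_result are assumptions; only "draw" is confirmed by the text.

```python
def result_from_score(score: dict) -> str:
    """Derive the 1X2 result implied by a scoreline (assumed {"home": int, "away": int})."""
    if score["home"] > score["away"]:
        return "home"
    if score["home"] < score["away"]:
        return "away"
    return "draw"


def is_consistent(pred: dict) -> bool:
    """True when predicted_result agrees with predicted_score."""
    return pred["predicted_result"] == result_from_score(pred["predicted_score"])


# A tight match: the highest probability is on the away side,
# but the model's single pick is a 1-1 draw. Pick and score agree.
pred = {
    "probs": {"home": 0.30, "draw": 0.30, "away": 0.40},
    "predicted_result": "draw",
    "predicted_score": {"home": 1, "away": 1},
}
most_likely = max(pred["probs"], key=pred["probs"].get)
print(most_likely, pred["predicted_result"], is_consistent(pred))  # away draw True
```

The check deliberately ignores probs: a mismatch between the argmax of probs and predicted_result is allowed, while a mismatch between predicted_result and predicted_score would be a data error.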

🧊 Freeze provenance (freeze-v3)

All pre-tournament predictions were frozen before kickoff and carry an audit trail:

  • source_schema ("freeze-v3"): the schema version the prediction was generated under.
  • model_id: the exact model checkpoint queried (e.g. anthropic/claude-5-fable-20260609).
  • generated_at: UTC timestamp of generation.
  • orientation_flipped: true when the match was stored in the opposite home/away orientation vs. the official fixture. On these matches probs, predicted_result and predicted_score are all normalized to the official orientation, so the data is internally consistent.
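
One way such a normalization could work is sketched below. It is not the repository's code: the `home` and `away` team fields, the dict-shaped predicted_score and the `normalize_to_fixture` helper are all assumptions made for illustration.

```python
SWAP = {"home": "away", "away": "home", "draw": "draw"}


def normalize_to_fixture(raw: dict, official_home: str, official_away: str) -> dict:
    """Return a copy of a raw prediction expressed in the official orientation.

    orientation_flipped records whether a swap was needed.
    """
    teams = (raw["home"], raw["away"])
    if teams == (official_home, official_away):
        return {**raw, "orientation_flipped": False}
    if teams != (official_away, official_home):
        raise ValueError(f"prediction {teams} does not match the fixture")

    probs, score = raw["probs"], raw["predicted_score"]
    return {
        **raw,
        "home": official_home,
        "away": official_away,
        # Swap the home and away probabilities; the draw probability is unchanged.
        "probs": {"home": probs["away"], "draw": probs["draw"], "away": probs["home"]},
        "predicted_result": SWAP[raw["predicted_result"]],
        "predicted_score": {"home": score["away"], "away": score["home"]},
        "orientation_flipped": True,
    }


raw = {
    "home": "RSA", "away": "MEX",
    "probs": {"home": 0.2, "draw": 0.3, "away": 0.5},
    "predicted_result": "away",
    "predicted_score": {"home": 0, "away": 2},
}
fixed = normalize_to_fixture(raw, "MEX", "RSA")
print(fixed["predicted_result"], fixed["predicted_score"])  # home {'home': 2, 'away': 0}
```

Swapping all three fields together is what keeps a flipped record internally consistent: the favored team, the single pick and the scoreline still describe the same outcome.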

⚽ MEX–RSA (match 1) counts toward scoring: the freeze timestamp (2026-06-10) precedes the match (2026-06-11). freeze-v3 does not include a bracket / champion prediction, so those points are scored as 0 for this modality.

🇪🇸 Summary (translated from Spanish): the ranking (Brier + outcome accuracy) is computed only from the 1X2 probabilities. The predicted_result and predicted_score fields feed only the exact-score metric. That is why, in close matches, a model can assign its highest probability to one side and still choose a draw as its single pick: it is a valid model decision, not a data error. All 792 frozen match predictions were audited: 0 inconsistencies.


๐Ÿ“ Project Structure

.
โ”œโ”€โ”€ README.md                       # This file
โ”œโ”€โ”€ README.es.md                    # Spanish version
โ”œโ”€โ”€ FREEZE.md                       # Audit log: commit hash, timestamps, checksums
โ”œโ”€โ”€ LICENSE
โ”œโ”€โ”€ .env.example
โ”œโ”€โ”€ requirements.txt
โ”œโ”€โ”€ schema/
โ”‚   โ””โ”€โ”€ predictions_schema.json     # JSON Schema draft-07
โ”œโ”€โ”€ prompts/
โ”‚   โ””โ”€โ”€ prediction_prompt.txt       # Standard prompt for ALL models
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ run_predictions.py          # Main execution script
โ”‚   โ”œโ”€โ”€ models_config.py            # Model definitions
โ”‚   โ”œโ”€โ”€ utils.py                    # Parsing, validation, I/O
โ”‚   โ””โ”€โ”€ generate_leaderboard.py     # Auto-generate leaderboard
โ”œโ”€โ”€ predictions/                    # Model prediction JSONs
โ”œโ”€โ”€ data/
โ”‚   โ””โ”€โ”€ tournament.json             # Official FIFA draw data
โ””โ”€โ”€ assets/
    โ”œโ”€โ”€ banner.png                  # README banner
    โ””โ”€โ”€ social-preview.png          # GitHub social preview (1280ร—640)

๐Ÿท๏ธ Repository Topics

llm benchmark llm-evaluation ai world-cup fifa-world-cup-2026 predictions forecasting leaderboard sports-analytics gpt-5 claude gemini


๐Ÿค Contributing

Add Your Model

Want to add a new model? It's one PR:

  1. Add your model to src/models_config.py:
    {
        "name": "Your-Model-Name",
        "model_id": "provider/model-name",
        "provider": "Your Lab",
    }
  2. Run python src/run_predictions.py --models Your-Model-Name
  3. Submit a PR with the generated JSON

Add Real Results

As matches conclude, add actual results to data/results.json (format TBD) so we can compute live accuracy.

Improve Scoring

The scoring system is evolving. Open an issue or PR with your proposed metric.


📜 License

MIT. See LICENSE.

Tournament data sourced from official FIFA sources. This project is for educational and research purposes.


Built with ⚽ and 🤖 by @mverab
