About the project
LLM SoccerArena
A live, timestamped benchmark for how large language models forecast the FIFA World Cup 2026.
Section 1
About LLM SoccerArena
What the project is, what the dashboard shows, and how to read the ranking.
Overview
What this benchmark does
It works like a prediction game in which the players are GPT-5.5, Claude, Gemini, Grok, DeepSeek, Qwen, Mistral, and other model setups instead of people. The forecasts are research outputs and public benchmark data, not betting advice.
What this is
LLM SoccerArena is a live football prediction benchmark for artificial intelligence. During the 2026 FIFA World Cup, leading AI language models predict matches and tournament outcomes, and we compare those forecasts with the official results.
The question
AI chatbots often sound confident. Football is a hard public test: the fixtures are known, the outcomes are unambiguous, and nobody knows the result in advance. We ask whether language models can forecast football, and whether live web access helps.
What the site shows
The dashboard shows model leaderboards, every model pick for each match, tournament-long question predictions, a tournament tree, the match schedule, and detailed analytics for different model setups.
What the ranking means
A high ranking means that a model setup has done well on matches scored so far. It does not prove football understanding, and it does not predict the next match with certainty.
Section 2
Methodology
How predictions are generated, validated, scored, and separated by setup.
Pipeline
How the benchmark works
Each model setup predicts the most likely 90-minute score, outcome probabilities, expected goals, full-match probabilities, and knockout advancement probabilities where relevant.
Predictions are stored with prompt, raw response, run id, match id, timing metadata, model id, access condition, prompt strategy, forecast horizon, and sample id.
Responses must be valid JSON with required fields, probabilities in range, probability vectors summing to 1 within tolerance, and non-negative integer scorelines.
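The validation rules above can be sketched as a small checker. This is a minimal illustration, not the project's validator: the field names (`home_goals`, `away_goals`, `probabilities`), the outcome keys, and the tolerance value are all assumptions made for the example.

```python
import json
import math

# Hypothetical field names and tolerance; the real schema is not given here.
REQUIRED_FIELDS = ("home_goals", "away_goals", "probabilities")
OUTCOMES = ("home_win", "draw", "away_win")
TOLERANCE = 1e-6


def validate_response(raw: str) -> list[str]:
    """Return a list of validation errors; an empty list means the response passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level JSON value must be an object"]

    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if errors:
        return errors

    for name in ("home_goals", "away_goals"):
        value = data[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{name} must be a non-negative integer")

    probs = data["probabilities"]
    if not isinstance(probs, dict) or set(probs) != set(OUTCOMES):
        errors.append("probabilities must contain exactly home_win, draw, away_win")
        return errors

    values = list(probs.values())
    if any(isinstance(p, bool) or not isinstance(p, (int, float)) for p in values):
        errors.append("probabilities must be numbers")
        return errors
    if any(not 0.0 <= p <= 1.0 for p in values):
        errors.append("probabilities must lie in [0, 1]")
    if not math.isclose(sum(values), 1.0, abs_tol=TOLERANCE):
        errors.append("probabilities must sum to 1 within tolerance")
    return errors
```

Returning a list of errors rather than raising on the first problem makes it easier to count each failure type separately when reporting invalid-output rates.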
After official results are available, the system computes match points plus probabilistic, categorical, scoreline, and reliability metrics.
Research questions
What we test
- How accurately do different models forecast World Cup 2026 matches?
- Does open-book web access improve forecasts compared with closed-book prediction?
- Does probabilistic prompting improve forecasts compared with direct-score prompting?
- How often do models produce valid, usable, internally consistent predictions?
- How well do models predict knockout advancement?
Models
Active flagship set
The active 2x2 comparison is run through OpenRouter. The exact roster can change with model availability, but the current active set is:
Methodology
Experimental design
The core design crosses two information-access conditions, closed book and open book, with two prompt strategies, direct score and probabilistic forecast.
The unit of analysis is model × match × forecast horizon × access condition × prompt strategy × sample id. The core benchmark uses one deterministic call per unit.
The public interface distinguishes complete setups, not only model names. This keeps open-book, closed-book, direct-score, probabilistic, and horizon variants separate.
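As a rough sketch of the 2x2 design, the unit of analysis can be modeled as an immutable record and the four setups for one model, match, and horizon enumerated from the two factors. The class name, field names, and string labels below are illustrative assumptions, not the project's actual identifiers.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative labels for the two crossed factors.
ACCESS_CONDITIONS = ("closed_book", "open_book")
PROMPT_STRATEGIES = ("direct_score", "probabilistic")


@dataclass(frozen=True)
class ForecastUnit:
    """One cell of the design: a single prediction call (hypothetical schema)."""
    model_id: str
    match_id: str
    horizon: str
    access_condition: str
    prompt_strategy: str
    sample_id: int = 0  # the core benchmark uses one call per unit


def units_for_match(model_id: str, match_id: str, horizon: str) -> list[ForecastUnit]:
    """Enumerate the four access x prompt combinations for one model and match."""
    return [
        ForecastUnit(model_id, match_id, horizon, access, strategy)
        for access, strategy in product(ACCESS_CONDITIONS, PROMPT_STRATEGIES)
    ]
```

Because the record is frozen and hashable, it can serve directly as a dictionary key, which keeps open-book, closed-book, direct-score, and probabilistic results from being merged under a bare model name.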
Methodology
Forecast timing
T-24h predictions are made roughly 24 hours before kickoff. Primary analyses can be restricted to valid T-24h predictions.
A second set of predictions is made roughly two hours before kickoff. This horizon is operationally more fragile, but open-book runs can draw on later public information.
Group-stage fixtures are predicted once at stage opening; knockout fixtures are predicted once the pairing is known. These forecasts are not used to fill missing T-24h forecasts.
Methodology
Information access and prompts
Exactly one access condition is used for a prediction run.
Closed book: fixture-identifying fields only; no web search, tools, odds, news, form, rankings, injuries, or lineups.
Open book: the same fixture block plus configured web-search/tool access and an instruction to retrieve current public information.
Exactly one prompt strategy is paired with the access condition.
Direct score: predict the most likely scoreline first, then provide probabilities consistent with it.
Probabilistic forecast: estimate calibrated probabilities and expected goals first, then derive the scoreline.
Evaluation
Scoring and metrics
Exact 90-minute score receives 5 points, correct goal difference receives 2, correct tendency receives 1, and misses receive 0.
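The point rule can be written as a short function. This is a sketch under one stated assumption: a drawn match predicted as a draw with the wrong score (for example 1-1 predicted, 2-2 played) has the correct goal difference of zero and is therefore given 2 points. The benchmark may treat that case differently.

```python
def sign(x: int) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def match_points(pred_home: int, pred_away: int, home: int, away: int) -> int:
    """Points for one predicted 90-minute scoreline against the actual one."""
    if (pred_home, pred_away) == (home, away):
        return 5  # exact score
    if pred_home - pred_away == home - away:
        return 2  # correct goal difference (assumed to include non-exact draws)
    if sign(pred_home - pred_away) == sign(home - away):
        return 1  # correct tendency: home win, draw, or away win
    return 0  # miss
```

For an actual result of 2-1, a prediction of 2-1 scores 5, 3-2 scores 2, 3-0 scores 1, and 0-1 scores 0.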
The benchmark reports 90-minute multiclass Brier score and multiclass log loss. Lower values are better.
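For a single match, both metrics can be computed from the three outcome probabilities and the observed outcome. The sketch below uses the unnormalized multiclass Brier score (sum of squared errors over the three outcomes, so the range is 0 to 2) and clips probabilities before taking the logarithm; whether the benchmark normalizes or clips in the same way is an assumption, as are the outcome keys.

```python
import math

OUTCOMES = ("home_win", "draw", "away_win")  # illustrative outcome keys


def brier_score(probs: dict[str, float], actual: str) -> float:
    """Sum of squared differences between forecast and one-hot outcome."""
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)


def log_loss(probs: dict[str, float], actual: str, eps: float = 1e-15) -> float:
    """Negative log of the probability assigned to the observed outcome."""
    # Clip so that a forecast of exactly 0 yields a large finite penalty.
    p = min(max(probs[actual], eps), 1.0)
    return -math.log(p)
```

A forecast of 0.5 / 0.3 / 0.2 on a match that ends in a home win gives a Brier score of 0.25 + 0.09 + 0.04 = 0.38 and a log loss of about 0.693. Averaging these per-match values over all scored matches gives the reported metric for a setup.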
We track top-outcome accuracy, tendency accuracy from the predicted score, exact-score accuracy, goal-difference accuracy, and knockout advancement accuracy.
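We report the rates of invalid outputs, repairs, normalizations, and missing predictions, along with how often a search was observed in open-book runs and how often the predicted scoreline is consistent with the stated probabilities.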
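One plausible reading of score-probability consistency is that the tendency implied by the predicted scoreline should match the outcome with the highest stated probability. The sketch below implements that reading only; the benchmark's actual definition is not given here, and the function names and outcome keys are hypothetical.

```python
def score_tendency(home_goals: int, away_goals: int) -> str:
    """Map a scoreline to its outcome: home win, draw, or away win."""
    if home_goals > away_goals:
        return "home_win"
    if home_goals < away_goals:
        return "away_win"
    return "draw"


def is_consistent(home_goals: int, away_goals: int, probs: dict[str, float]) -> bool:
    """True if the scoreline's tendency is (one of) the most probable outcomes."""
    top = max(probs.values())
    # Ties for the top probability are all accepted as consistent.
    top_outcomes = {outcome for outcome, p in probs.items() if p == top}
    return score_tendency(home_goals, away_goals) in top_outcomes


def consistency_rate(predictions: list[tuple[int, int, dict[str, float]]]) -> float:
    """Share of (home_goals, away_goals, probs) predictions that are consistent."""
    if not predictions:
        raise ValueError("no predictions to evaluate")
    hits = sum(is_consistent(h, a, probs) for h, a, probs in predictions)
    return hits / len(predictions)
```

Under this definition, a 2-1 scoreline paired with a home-win probability of 0.5 is consistent, while a 0-0 scoreline paired with the same probabilities is not.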
Each model setup answers 15 tournament-long questions. These are ranked separately from match predictions, with 5 points for each correct call.
Limitations
How to interpret the ranking
- One tournament is a small sample, especially early on.
- Newer models may know more recent public information even in closed-book mode.
- Open-book models can read public odds or market summaries, so open-book results must be interpreted carefully.
- Knockout matches decided in extra time or on penalties are not captured by 90-minute metrics and require separate advancement metrics.
- Football is noisy and low scoring, so even well-calibrated forecasts often miss.
Rankings shift as more matches are played, more horizons are added, and pending tournament-long answers resolve.
Section 3
Team
The researchers and contributors behind LLM SoccerArena.
People
Who is behind it
PhD researcher at LMU Munich and the Munich Center for Machine Learning (MCML).
PhD researcher at LMU Munich and the Munich Center for Machine Learning (MCML).
Professor of Data Analytics at Paderborn University and Head of the AI Competence Center at SICP.
Professor of Business Analytics at the University of Cologne and the Institute for Business AI.
Professor of AI for Management at LMU Munich School of Management and MCML.
Research paper
LLM SoccerArena Paper
The full paper will be available here as a PDF.