About the project

LLM SoccerArena

A live, timestamped benchmark for how large language models forecast the FIFA World Cup 2026.

Section 1

About LLM SoccerArena

What the project is, what the dashboard shows, and how to read the ranking.

Overview

What this benchmark does

The benchmark works like a prediction game, except that the players are model setups built on GPT-5.5, Claude, Gemini, Grok, DeepSeek, Qwen, Mistral, and other models rather than people. The forecasts are research outputs and public benchmark data, not betting advice.

What this is

LLM SoccerArena is a live football prediction benchmark for artificial intelligence. During the 2026 FIFA World Cup, leading AI language models predict matches and tournament outcomes, and we compare those forecasts with the official results.

The question

AI chatbots often sound confident. Football is a hard public test: the fixtures are known, the outcomes are unambiguous, and nobody knows the result in advance. We ask whether language models can forecast football, and whether live web access helps.

What the site shows

The dashboard shows model leaderboards, every model pick for each match, tournament-long question predictions, a tournament tree, the match schedule, and detailed analytics for different model setups.

What the ranking means

A high ranking means that a model setup has done well on matches scored so far. It does not prove football understanding, and it does not predict the next match with certainty.

Section 2

Methodology

How predictions are generated, validated, scored, and separated by setup.

Pipeline

How the benchmark works

1. Before a match

Each model setup predicts the most likely 90-minute score, outcome probabilities, expected goals, full-match probabilities, and knockout advancement probabilities where relevant.

2. Timestamp and store

Predictions are stored with prompt, raw response, run id, match id, timing metadata, model id, access condition, prompt strategy, forecast horizon, and sample id.

3. Validate

Responses must be valid JSON with required fields, probabilities in range, probability vectors summing to 1 within tolerance, and non-negative integer scorelines.
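The checks above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual validator: the field names (`score_home`, `score_away`, `probs`), the outcome keys, and the tolerance value are assumptions, since the real schema is not given here.

```python
import json
import math

# Hypothetical schema: these names and the tolerance are assumptions.
REQUIRED_FIELDS = ("score_home", "score_away", "probs")
OUTCOMES = ("home", "draw", "away")
TOLERANCE = 1e-6


def _is_number(value) -> bool:
    """True for int or float, but not bool (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_response(raw: str) -> list:
    """Return a list of validation errors; an empty list means valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]
    if errors:
        return errors

    # Scorelines must be non-negative integers.
    for name in ("score_home", "score_away"):
        value = data[name]
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            errors.append(f"{name} is not a non-negative integer")

    # Each probability must lie in [0, 1] and the vector must sum to 1.
    probs = data["probs"]
    if not isinstance(probs, dict) or set(probs) != set(OUTCOMES):
        errors.append("probs must have exactly the keys home, draw, away")
    else:
        values = [probs[o] for o in OUTCOMES]
        if not all(_is_number(v) and 0.0 <= v <= 1.0 for v in values):
            errors.append("probability out of range")
        elif not math.isclose(sum(values), 1.0, rel_tol=0.0, abs_tol=TOLERANCE):
            errors.append("probabilities do not sum to 1")
    return errors
```

Returning a list of errors instead of a single boolean makes it possible to count each failure type separately.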

4. Evaluate

After official results are available, the system computes match points plus probabilistic, categorical, scoreline, and reliability metrics.

Research questions

What we test

  • How accurately do different models forecast World Cup 2026 matches?
  • Does open-book web access improve forecasts compared with closed-book prediction?
  • Does probabilistic prompting improve forecasts compared with direct-score prompting?
  • How often do models produce valid, usable, internally consistent predictions?
  • How well do models predict knockout advancement?

Models

Active flagship set

The active 2x2 comparison is run through OpenRouter. The exact roster can change with model availability, but the current active set is:

  • GPT-5.5
  • Claude Opus 4.8
  • Claude Fable 5
  • Gemini 3.1 Pro
  • Grok 4.3
  • DeepSeek V4 Pro
  • Qwen 3.7 Max
  • Mistral Large 2512

Methodology

Experimental design

2x2 benchmark

The core design crosses two information-access conditions, closed book and open book, with two prompt strategies, direct score and probabilistic forecast.

Experimental unit

The unit is model x match x forecast horizon x access condition x prompt strategy x sample id. The core benchmark uses one deterministic call per unit.
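One way to picture the experimental unit is as an immutable record, with the 2x2 design enumerated per model. The sketch below is illustrative only: the class name, field names, and string labels are assumptions, not the project's own identifiers.

```python
from dataclasses import dataclass
from itertools import product

# Labels are hypothetical stand-ins for the two factors of the 2x2 design.
ACCESS_CONDITIONS = ("closed_book", "open_book")
PROMPT_STRATEGIES = ("direct_score", "probabilistic")


@dataclass(frozen=True)
class ExperimentalUnit:
    model_id: str
    match_id: str
    horizon: str
    access_condition: str
    prompt_strategy: str
    sample_id: int = 0  # the core benchmark uses one call per unit


def units_for_match(model_ids, match_id, horizon):
    """Enumerate every model x access x strategy cell for one match and horizon."""
    return [
        ExperimentalUnit(model_id, match_id, horizon, access, strategy)
        for model_id, access, strategy in product(
            model_ids, ACCESS_CONDITIONS, PROMPT_STRATEGIES
        )
    ]
```

With two models, `units_for_match(["model_a", "model_b"], "match_001", "T-24h")` yields eight distinct units, four per model.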

Model setups

The public interface distinguishes complete setups, not only model names. This keeps open-book, closed-book, direct-score, probabilistic, and horizon variants separate.

Methodology

Forecast timing

T-24h

Predictions made roughly 24 hours before kickoff. Primary analyses can be restricted to valid T-24h predictions.

T-2h

Predictions made roughly two hours before kickoff. This horizon is operationally more fragile, but can include later public information in open-book runs.

STAGE_OPENING

Group-stage fixtures are predicted once at stage opening; knockout fixtures are predicted once the pairing is known. These forecasts are not used to fill missing T-24h forecasts.
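The horizon rules can be sketched as below. The function names, the dictionary layout, and the `valid` flag are assumptions made for illustration. The one behaviour taken from the text is that a STAGE_OPENING forecast never stands in for a missing T-24h forecast.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

HORIZON_OFFSETS = {"T-24h": timedelta(hours=24), "T-2h": timedelta(hours=2)}


def target_time(kickoff: datetime, horizon: str) -> datetime:
    """Nominal prediction time for a kickoff-relative horizon."""
    return kickoff - HORIZON_OFFSETS[horizon]


def primary_forecast(forecasts: dict) -> Optional[dict]:
    """Select the forecast for a T-24h analysis.

    `forecasts` maps a horizon label to a stored forecast (hypothetical layout).
    Returns None when no valid T-24h forecast exists; a STAGE_OPENING
    forecast is deliberately not used as a fallback.
    """
    forecast = forecasts.get("T-24h")
    if forecast is not None and forecast.get("valid"):
        return forecast
    return None


# Example with a made-up kickoff time in UTC.
kickoff = datetime(2026, 6, 11, 19, 0, tzinfo=timezone.utc)
t24 = target_time(kickoff, "T-24h")
```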

Methodology

Information access and prompts

One model setup

Information access

Exactly one access condition is used for a prediction run.

Closed book

Fixture-identifying fields only; no web search, tools, odds, news, form, rankings, injuries, or lineups.

Open book

Same fixture block plus configured web-search/tool access and an instruction to retrieve current public information.

Prompt strategy

Exactly one prompt strategy is paired with the access condition.

Direct score

Predict the most likely scoreline first, then provide probabilities consistent with it.

Probabilistic forecast

Estimate calibrated probabilities and expected goals first, then derive the scoreline.

Evaluation

Scoring and metrics

Game-style points

An exact 90-minute score earns 5 points, a correct goal difference earns 2, a correct tendency (home win, draw, or away win) earns 1, and a miss earns 0.
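The point scheme can be written as a small function. This sketch assumes that only the highest matching tier counts. It also assumes that a predicted draw with the wrong score (for example 1-1 against an actual 0-0) scores as a correct goal difference. Neither detail is stated explicitly, so treat both as assumptions.

```python
def match_points(pred_home: int, pred_away: int,
                 actual_home: int, actual_away: int) -> int:
    """Game-style points for one 90-minute score prediction."""
    if (pred_home, pred_away) == (actual_home, actual_away):
        return 5  # exact score
    pred_diff = pred_home - pred_away
    actual_diff = actual_home - actual_away
    if pred_diff == actual_diff:
        return 2  # correct goal difference (includes non-exact draws)
    same_sign = (pred_diff > 0) == (actual_diff > 0) and \
                (pred_diff < 0) == (actual_diff < 0)
    if same_sign:
        return 1  # correct tendency only
    return 0  # miss
```

For example, predicting 3-0 when the match ends 1-0 gets the tendency right but not the goal difference, so it earns 1 point.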

Probability metrics

The benchmark reports 90-minute multiclass Brier score and multiclass log loss. Lower values are better.
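For a single match, both metrics can be computed as in the sketch below. It assumes the common convention of summing the squared error over the three outcomes, so the Brier score ranges from 0 to 2. It also clips probabilities at a small `eps` before taking the logarithm. The benchmark's exact normalization and clipping are not stated here, so both choices are assumptions.

```python
import math

OUTCOMES = ("home", "draw", "away")


def brier_score(probs: dict, actual: str) -> float:
    """Multiclass Brier score: squared error summed over the three outcomes."""
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)


def log_loss(probs: dict, actual: str, eps: float = 1e-12) -> float:
    """Negative log of the probability assigned to the observed outcome."""
    return -math.log(max(probs[actual], eps))


forecast = {"home": 0.5, "draw": 0.3, "away": 0.2}
brier_home = brier_score(forecast, "home")  # 0.25 + 0.09 + 0.04 = 0.38
loss_home = log_loss(forecast, "home")      # -ln(0.5)
```

A model's reported value would be the mean of these per-match scores over all scored matches.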

Accuracy metrics

We track top-outcome accuracy, tendency accuracy from the predicted score, exact-score accuracy, goal-difference accuracy, and knockout advancement accuracy.

Reliability diagnostics

We report invalid-output, repair, normalization, missing, open-book search-observed, and score-probability consistency rates.
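As an illustration of one of these diagnostics, the sketch below computes a score-probability consistency rate. It assumes that a prediction is "consistent" when the tendency implied by its scoreline matches its highest-probability outcome. That definition and the field names are assumptions, and the benchmark may define consistency differently.

```python
def score_tendency(home: int, away: int) -> str:
    """Outcome implied by a scoreline."""
    if home > away:
        return "home"
    if home < away:
        return "away"
    return "draw"


def is_consistent(pred: dict) -> bool:
    """True when the scoreline's tendency is the most probable outcome."""
    probs = pred["probs"]
    top_outcome = max(probs, key=probs.get)  # ties resolve to the first key
    return score_tendency(pred["score_home"], pred["score_away"]) == top_outcome


def consistency_rate(preds: list) -> float:
    """Share of predictions whose scoreline and probabilities agree."""
    if not preds:
        raise ValueError("no predictions to evaluate")
    return sum(is_consistent(p) for p in preds) / len(preds)
```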

Tournament questions

Each model setup answers 15 tournament-long questions. These are ranked separately from match predictions, with 5 points for each correct call.

Limitations

How to interpret the ranking

  • One tournament is a small sample, especially early on.
  • Newer models may know more recent public information even in closed-book mode.
  • Open-book models can read public odds or market summaries, so open-book results must be interpreted carefully.
  • Knockout matches decided in extra time or on penalties require separate advancement metrics, because match scoring uses the 90-minute result.
  • Football is noisy and low scoring, so even well-calibrated forecasts often miss.

Rankings shift as more matches are played, more horizons are added, and pending tournament-long answers resolve.

Section 3

Team

The researchers and contributors behind LLM SoccerArena.

People

Who is behind it

Jonas Schweisthal

PhD researcher at LMU Munich and the Munich Center for Machine Learning (MCML).

jonas.schweisthal@lmu.de

Jonas Schröder

PhD researcher at LMU Munich and the Munich Center for Machine Learning (MCML).

jonas.schroeder@lmu.de

Oliver Müller

Professor of Data Analytics at Paderborn University and Head of the AI Competence Center at SICP.

Markus Weinmann

Professor of Business Analytics at the University of Cologne and the Institute for Business AI.

Stefan Feuerriegel

Professor of AI for Management at LMU Munich School of Management and MCML.

feuerriegel@lmu.de

Research paper

LLM SoccerArena Paper

The full paper will be available here as a PDF.