ConfidenceBench

A Benchmark for Gauging Overconfidence in LLMs

Large Language Models (LLMs) are advancing rapidly, but they still suffer from hallucinations and overconfidence, making them unreliable for high-stakes tasks. ConfidenceBench is a novel benchmark that evaluates an LLM's ability to recognise its own uncertainty. Unlike traditional benchmarks that focus only on accuracy, ConfidenceBench penalises models for being overconfident when wrong, highlighting a critical weakness in current AI systems.

Our dataset consists of 100 challenging multiple-choice questions across four categories: Spatial Reasoning, which tests whether LLMs can visualise real-world physics; High-Precision Math, which tests whether they can compute with extreme precision; Word Lookup from Texts, which tests whether they can retrieve exact information; and Offline Knowledge, which tests whether they can recognise what they don't know.

Models are scored using the Brier score, a standard metric for calibration. Lower is better. A score of 0.000 means every question was answered correctly with full confidence; 1.000 means every question was answered incorrectly with full confidence.

Accepted to EIML @ ICML 2026
ConfidenceBench will be presented at the Epistemic Intelligence in Machine Learning workshop at ICML 2026 in Seoul.

Live Leaderboard
Below is the ConfidenceBench leaderboard, updated as new models are released. A lower Brier score means better calibration: 0 means every question was answered correctly at maximum confidence, and 1 means every question was answered incorrectly at maximum confidence.
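
As a rough illustration of how a score like this could be computed, the sketch below assumes that each 1 to 10 confidence rating (described under Example Questions) is mapped linearly to a probability, so that 10 becomes 1.0, and that the Brier score is the mean squared difference between that probability and the outcome (1 for a correct answer, 0 for an incorrect one). The linear mapping and the function name `brier_score` are assumptions made for illustration; the benchmark's actual scoring code may differ.

```python
def brier_score(confidences, correct, scale=10):
    """Mean squared gap between stated confidence and actual outcome.

    confidences: ratings from 1 to `scale`, one per question.
    correct:     booleans, True where the chosen answer was right.
    Assumption: a rating c is read as the probability c / scale.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one answer is required")

    total = 0.0
    for rating, is_correct in zip(confidences, correct):
        probability = rating / scale          # hypothetical linear mapping
        outcome = 1.0 if is_correct else 0.0
        total += (probability - outcome) ** 2
    return total / len(confidences)


# Three correct answers (one of them hedged) and one confident mistake.
print(brier_score([10, 10, 5, 10], [True, True, True, False]))  # 0.3125
```

Under this mapping, a correct answer at confidence 10 contributes 0 and an incorrect answer at confidence 10 contributes 1, which matches the endpoints described above. A single confident mistake costs far more than a hedged correct answer, which is why overconfidence is penalised.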

Leaderboard

Position	Model	Company	Score ↓	Accuracy
1st	GPT-5.5	OpenAI	0.084	82.5%
2nd	GPT-5.4 (High)	OpenAI	0.089	83%
3rd	Claude Opus 4.6	Anthropic	0.103	76%
4th	Gemini 3.1 Pro Preview	Google	0.103	83%
5th	Human Tester	—	0.105	71%
6th	GPT-5	OpenAI	0.117	77%
7th	GPT-5 (High)	OpenAI	0.121	76%
8th	Gemini 3.5 Flash	Google	0.123	79.5%
9th	GPT-5.2 (XHigh)	OpenAI	0.125	83%
10th	GPT-5.4	OpenAI	0.126	84%
11th	GPT-5.1 (High)	OpenAI	0.130	76%
12th	GPT-5 Mini	OpenAI	0.131	72%
13th	GPT-5 Nano	OpenAI	0.133	71%
14th	GPT-5 (Low)	OpenAI	0.141	75%
15th	Claude Sonnet 4.6	Anthropic	0.145	69%
16th	Claude Opus 4.5	Anthropic	0.149	75%
17th	GPT-5.2	OpenAI	0.163	77%
18th	Claude Opus 4.8	Anthropic	0.165	68.5%
19th	GPT-5.2 (High)	OpenAI	0.167	80%
20th	Claude Sonnet 4.5	Anthropic	0.183	71%
21st	Claude Opus 4.7	Anthropic	0.186	59%
22nd	Claude Opus 4.1	Anthropic	0.191	67%
23rd	Gemini 2.5 Pro	Google	0.213	69%
24th	Gemini 2.5 Flash	Google	0.227	61%
25th	Claude Sonnet 4	Anthropic	0.251	58%
26th	GPT-5.1 (No Thinking)	OpenAI	0.363	37%
27th	Gemini 3.1 Flash-Lite	Google	0.367	55%

Example Questions

The dataset is kept private to avoid leakage onto the internet. Here are four example questions that are not part of the private dataset.
For every question, the model must also give a score from 1 to 10 indicating its level of confidence.

Spatial Reasoning
Question:
I am standing in the center of a room, holding a mug and a marble. I place the marble inside the mug. Then, I walk over to a table and flip the mug upside down onto the table before putting it in the fridge. The mug has a large, quarter-sized hole in the bottom. Where is the marble now?
✅ On the floor
❌ In the fridge
❌ On the table
❌ None of these

High-Precision Math
Question:
Take the third digit after the decimal of the square root of 867. Multiply this digit by the square root of 456. Round the result to a whole number.
❌ 84
❌ 83
❌ 88
✅ 85

Word Lookup from Texts
Question:
What is the 8th word in Chapter Two of Harry Potter and the Philosopher’s Stone? It comes between the words “the” and “had.”
❌ house
✅ Dursleys
❌ car
❌ world

Offline Knowledge
Question:
What color was featured in the painting in the entrance hall of Flat 7B, 12 Santa Monica, Madrid on the 4th of January 2025?
✅ Green
❌ Red
❌ Blue
❌ Yellow
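
The High-Precision Math example above can be checked by hand or with a few lines of code. The sketch below is only a verification aid, not part of the benchmark. It uses Python's `decimal` module so that the third decimal digit is read from a high-precision square root rather than a rounded float, and it assumes that "round" means ordinary half-up rounding.

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

getcontext().prec = 30  # far more digits than the question needs

root_867 = Decimal(867).sqrt()             # 29.4448...
third_digit = int(root_867 * 1000) % 10    # third digit after the decimal point
product = third_digit * Decimal(456).sqrt()

# Assumption: round half up to the nearest whole number.
answer = int(product.quantize(Decimal("1"), rounding=ROUND_HALF_UP))

print(third_digit, answer)  # 4 85
```

The third digit is 4, and 4 times the square root of 456 is roughly 85.4, which rounds to the marked answer of 85.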

Contact

If you have any questions about the benchmark or would like to collaborate, I would love to hear from you at my email.