ConfidenceBench

A Benchmark for Gauging Overconfidence in LLMs

Large Language Models (LLMs) are advancing rapidly, but they still suffer from hallucinations and overconfidence, making them unreliable for high-stakes tasks. ConfidenceBench is a novel benchmark that evaluates an LLM's ability to recognise its own uncertainty. Unlike traditional benchmarks that focus only on accuracy, ConfidenceBench penalises models for being overconfident when wrong, highlighting a critical weakness in current AI systems.

Our dataset consists of 100 challenging multiple-choice questions across four categories: Spatial Reasoning, which tests whether LLMs can visualise real-world physics; High-Precision Math, which tests whether they can compute with extreme precision; Word Lookup from Texts, which tests whether they can retrieve exact information; and Offline Knowledge, which tests whether they can recognise what they don't know.

Models are scored using the Brier score, a standard metric for calibration. Lower is better. A score of 0.000 means every question was answered correctly with full confidence; 1.000 means every question was answered incorrectly with full confidence.

🏆 Accepted to EIML @ ICML 2026: ConfidenceBench will be presented at the Epistemic Intelligence in Machine Learning workshop at ICML 2026 in Seoul.

Live Leaderboard
Below you'll find the ConfidenceBench leaderboard, updated as new models are released. Lower Brier score = better calibration. A perfect score of 0 means every question answered correctly at maximum confidence; 1 means every question answered wrong at maximum confidence.

Leaderboard

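| Position | Model | Company | Score ↓ | Accuracy |
|---|---|---|---|---|
| 1st | GPT-5.5 | OpenAI | 0.084 | 82.5% |
| 2nd | GPT-5.4 (High) | OpenAI | 0.089 | 83% |
| 3rd | Claude Opus 4.6 | Anthropic | 0.103 | 76% |
| 4th | Gemini 3.1 Pro Preview | Google | 0.103 | 83% |
| 5th | Human Tester | N/A | 0.105 | 71% |
| 6th | GPT-5 | OpenAI | 0.117 | 77% |
| 7th | GPT-5 (High) | OpenAI | 0.121 | 76% |
| 8th | Gemini 3.5 Flash | Google | 0.123 | 79.5% |
| 9th | GPT-5.2 (XHigh) | OpenAI | 0.125 | 83% |
| 10th | GPT-5.4 | OpenAI | 0.126 | 84% |
| 11th | GPT-5.1 (High) | OpenAI | 0.130 | 76% |
| 12th | GPT-5 Mini | OpenAI | 0.131 | 72% |
| 13th | GPT-5 Nano | OpenAI | 0.133 | 71% |
| 14th | GPT-5 (Low) | OpenAI | 0.141 | 75% |
| 15th | Claude Sonnet 4.6 | Anthropic | 0.145 | 69% |
| 16th | Claude Opus 4.5 | Anthropic | 0.149 | 75% |
| 17th | GPT-5.2 | OpenAI | 0.163 | 77% |
| 18th | Claude Opus 4.8 | Anthropic | 0.165 | 68.5% |
| 19th | GPT-5.2 (High) | OpenAI | 0.167 | 80% |
| 20th | Claude Sonnet 4.5 | Anthropic | 0.183 | 71% |
| 21st | Claude Opus 4.7 | Anthropic | 0.186 | 59.0% |
| 22nd | Claude Opus 4.1 | Anthropic | 0.191 | 67% |
| 23rd | Gemini 2.5 Pro | Google | 0.213 | 69% |
| 24th | Gemini 2.5 Flash | Google | 0.227 | 61% |
| 25th | Claude Sonnet 4 | Anthropic | 0.251 | 58% |
| 26th | GPT-5.1 (No Thinking) | OpenAI | 0.363 | 37% |
| 27th | Gemini 3.1 Flash-Lite | Google | 0.367 | 55% |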
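
The Score column above is a Brier score. As a rough sketch of how such a score could be computed from per-question results, the snippet below assumes that each answer's 1 to 10 confidence rating maps linearly to a probability (confidence / 10) and that the score is the mean squared gap between that probability and whether the answer was correct. The `Response` class, the function names, and the linear mapping are illustrative assumptions, not the benchmark's published implementation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Response:
    """One answered question (hypothetical record type for this sketch)."""
    correct: bool      # whether the chosen option was the right one
    confidence: int    # self-reported confidence, from 1 to 10


def confidence_to_probability(confidence: int) -> float:
    # ASSUMPTION: a linear mapping, so 10 -> 1.0 and 1 -> 0.1.
    # The benchmark's actual mapping is not stated in this page.
    if not 1 <= confidence <= 10:
        raise ValueError("confidence must be between 1 and 10")
    return confidence / 10


def brier_score(responses: Sequence[Response]) -> float:
    """Mean squared difference between stated probability and outcome."""
    if not responses:
        raise ValueError("need at least one response")
    total = 0.0
    for r in responses:
        p = confidence_to_probability(r.confidence)
        outcome = 1.0 if r.correct else 0.0
        total += (p - outcome) ** 2
    return total / len(responses)


if __name__ == "__main__":
    # All correct at full confidence gives the best possible score.
    print(brier_score([Response(True, 10), Response(True, 10)]))    # 0.0
    # All wrong at full confidence gives the worst possible score.
    print(brier_score([Response(False, 10), Response(False, 10)]))  # 1.0
```

Under this sketch, a wrong answer given with low confidence costs far less than a wrong answer given with full confidence, which is why a model can rank above another with higher accuracy.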

Example Questions

The dataset is kept private to avoid leakage onto the internet. Here is a set of four example questions that are not part of the private dataset.
For every question, the model must also give a score from 1 to 10 indicating its level of confidence.
Spatial Reasoning
Question:
I am standing in the center of a room, holding a mug and a marble. I place the marble inside the mug. Then, I walk over to a table and flip the mug upside down onto the table before putting it in the fridge. The mug has a large quarter-sized hole in the bottom.
Where is the marble now?
✅ On the floor
❌ In the fridge
❌ On the table
❌ None of these

---

High-Precision Math
Question:
Take the third digit after the decimal of the square root of 867. Multiply this digit by the square root of 456. Round the result to a whole number.
❌ 84
❌ 83
❌ 88
✅ 85
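
The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch: it assumes "third digit after the decimal" means the digit as written, without rounding, and that the final result is rounded half up. The helper names are made up for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

# Far more significant digits than this question needs.
getcontext().prec = 50


def third_decimal_digit(value: Decimal) -> int:
    """Return the third digit after the decimal point, without rounding."""
    return int(value * 1000) % 10


def solve() -> int:
    digit = third_decimal_digit(Decimal(867).sqrt())   # sqrt(867) = 29.444...
    product = digit * Decimal(456).sqrt()              # 4 * 21.354...
    # ASSUMPTION: round half up to the nearest whole number.
    return int(product.quantize(Decimal(1), rounding=ROUND_HALF_UP))


if __name__ == "__main__":
    print(solve())  # prints 85
```
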

---

Word Lookup from Texts
Question:
What is the 8th word in Chapter Two of Harry Potter and the Philosopher’s Stone? It comes between the words “the” and “had.”
❌ house
✅ Dursleys
❌ car
❌ world

---

Offline Knowledge
Question:
What color was featured in the painting in the entrance hall of Flat 7B, 12 Santa Monica, Madrid on the 4th of January 2025?
✅ Green
❌ Red
❌ Blue
❌ Yellow

Contact

If you have any questions about the benchmark or would like to collaborate, I would love to hear from you at my email.