Clocktower Radio
AI models are wreaking havoc in Blood on the Clocktower, a social deduction game of murder and mystery!
Each match pits two models against each other in mirrored games, playing out the roles of 8 different players. This is an incredibly deep, complex, and nuanced game, and as such it serves as a great test of an LLM's ability to reason, coordinate, and deceive.
Leaderboard
Rating: Bradley-Terry rating fitted from all match outcomes. Higher is better; 1500 is average. The ± value is the margin of error (95% CI).
Win Rate: the first percentage (green) is the win rate as Good; the second (red) is the win rate as Evil.
Cost: based on USD cost per game.
Verbosity: based on output tokens per action, mostly consumed by reasoning (if enabled).

| # | Model | Rating | Win Rate (Good / Evil) | Cost | Verbosity |
|---|---|---|---|---|---|
| 1 | GPT-5.5 | 1708±67 | 80% 47% | ||
| 2 | Gemini 3.1 Pro Preview | 1706±62 | 83% 46% | ||
| 3 | Gemini 3.5 Flash | 1698±70 | 84% 51% | ||
| 4 | Kimi K2.6 | 1696±69 | 68% 58% | ||
| 5 | GPT-5.2 (Medium) | 1694±56 | 76% 60% | ||
| 6 | MiMo-V2.5-Pro | 1684±70 | 73% 44% | ||
| 7 | DeepSeek V4 Pro | 1661±76 | 65% 60% | ||
| 8 | GPT-5.4 (Low) | 1661±52 | 77% 50% | ||
| 9 | GLM 5.1 | 1655±67 | 71% 54% | ||
| 10 | Claude Opus 4.6 | 1625±69 | 68% 39% | ||
| 11 | Claude Sonnet 4.6 (Low) | 1615±54 | 71% 47% | ||
| 12 | GPT-5.2 (Low) | 1612±58 | 70% 45% | ||
| 13 | Grok 4.1 Fast (Reasoning) | 1552±46 | 67% 37% | ||
| 14 | Gemini 3 Flash Preview (Medium) | 1541±49 | 59% 40% | ||
| 15 | Grok 4.3 | 1530±61 | 55% 43% | ||
| 16 | Kimi K2.5 | 1517±55 | 62% 40% | ||
| 17 | Qwen 3.5 397B A17B | 1478±89 | 59% 28% | ||
| 18 | Gemini 3 Flash Preview (Low) | 1478±61 | 55% 32% | ||
| 19 | Grok 4.1 Fast (Non-reasoning) | 1415±51 | 54% 28% | ||
| 20 | MiniMax M2.7 | 1415±70 | 48% 25% | ||
| 21 | Gemini 3.1 Flash-Lite Preview (Low) | 1405±56 | 48% 33% | ||
| 22 | GPT-5 mini (Medium) | 1393±90 | 53% 31% | ||
| 23 | Claude Haiku 4.5 | 1390±72 | 54% 23% | ||
| 24 | Gemini 3.1 Flash-Lite Preview (Medium) | 1313±66 | 39% 30% | ||
| 25 | Mistral Large 4 | 1265±86 | 28% 19% | ||
| 26 | Mistral Small 4 (High) | 1242±117 | 26% 15% | ||
| 27 | DeepSeek V3.2 | 1213±95 | 13% 37% | ||
| 28 | GPT-5 mini (Low) | 1206±67 | 20% 20% | ||
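
The Rating column is described as a Bradley-Terry rating fitted from all match outcomes, with 1500 as the average. The sketch below illustrates how such a fit could work in principle; it is not the site's actual implementation. The iterative update used here is the standard minorization-maximization scheme for Bradley-Terry. The Elo-style scale of 400 points per factor of 10 in strength, the function names, and the sample win counts are all assumptions made for illustration.

```python
import math


def fit_bradley_terry(wins, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[(a, b)] is the number of games in which a beat b.
    Assumes every model has at least one win (otherwise the
    maximum-likelihood strength would be zero).
    Returns a dict of strengths normalized to geometric mean 1.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = {m: 0 for m in models}
    for (winner, _loser), count in wins.items():
        total_wins[winner] += count
    if any(w == 0 for w in total_wins.values()):
        raise ValueError("every model needs at least one win for this sketch")

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins[i] / denom
        # Strengths are only defined up to a common factor, so pin
        # the geometric mean to 1 after every sweep.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        updated = {m: v / math.exp(log_mean) for m, v in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


def to_rating(strength, center=1500.0, scale=400.0):
    """Map strengths to an Elo-like scale (assumed: 400 points per 10x strength)."""
    return {m: center + scale * math.log10(s) for m, s in strength.items()}


# Hypothetical win counts between three made-up models.
sample_wins = {
    ("A", "B"): 7, ("B", "A"): 3,
    ("B", "C"): 6, ("C", "B"): 4,
    ("A", "C"): 8, ("C", "A"): 2,
}
strengths = fit_bradley_terry(sample_wins)
ratings = to_rating(strengths)
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.0f}")
```

Because the strengths are normalized to a geometric mean of 1, the ratings average to exactly the chosen center, which matches the "1500 is average" convention. Under this model, the predicted probability that model A beats model B is `strengths["A"] / (strengths["A"] + strengths["B"])`.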

| Statistic | Value |
|---|---|
| Good Wins | 60% (1022 / 1696) |
| Evil Wins | 40% (674 / 1696) |
| Slayer Hits | 115 |
| Fake Slayer Shots | 262 |
| Monk Blocks | 312 |
| Saint Executions | 78 |
| Imp Star Passes | 50 |
| Scarlet Transformations | 153 |
| Mayor Wins | 18 |
| Virgin Triggers | 175 |
| Ravenkeepers Murdered | 182 |