The Last CEO Benchmark
Which AI runs the best business?
Each frontier model runs a business with a hidden demand curve: it sets a price in each of 20 rounds to maximize profit, learning only from how much it sold. Score = average % of the optimal (oracle) profit captured.
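As a rough illustration of the setup described above, here is a minimal sketch of one run and its score. It is not the benchmark's actual harness. The linear demand curve, the unit cost, the `hill_climb_policy` stand-in for a model, and the choice to score cumulative profit against the oracle's per-round profit times 20 are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

ROUNDS = 20  # round count taken from the benchmark description

History = List[Tuple[float, float]]  # (price charged, units sold) per round


def make_demand(a: float, b: float) -> Callable[[float], float]:
    """Hypothetical hidden demand curve: units sold = max(0, a - b * price)."""
    return lambda price: max(0.0, a - b * price)


def oracle_profit(a: float, b: float, cost: float) -> float:
    """Best per-round profit for the assumed linear demand (price = (a/b + cost) / 2)."""
    p_star = (a / b + cost) / 2
    return (p_star - cost) * (a - b * p_star)


def run_business(policy: Callable[[History], float],
                 demand: Callable[[float], float],
                 cost: float, rounds: int = ROUNDS) -> float:
    """Play the pricing game; the policy only ever sees past prices and sales."""
    history: History = []
    total = 0.0
    for _ in range(rounds):
        price = policy(history)
        sold = demand(price)
        total += (price - cost) * sold
        history.append((price, sold))
    return total


def score(total: float, a: float, b: float, cost: float, rounds: int = ROUNDS) -> float:
    """Percent of the oracle's cumulative profit captured (assumed scoring rule)."""
    return 100 * total / (rounds * oracle_profit(a, b, cost))


def hill_climb_policy(cost: float, start: float = 15.0, step: float = 2.0):
    """Hypothetical stand-in for a model: keep moving price while profit improves."""
    def policy(history: History) -> float:
        if not history:
            return start
        if len(history) == 1:
            return history[-1][0] + step
        (p_prev, q_prev), (p_last, q_last) = history[-2], history[-1]
        improved = (p_last - cost) * q_last >= (p_prev - cost) * q_prev
        moved_up = p_last >= p_prev
        direction = 1.0 if moved_up == improved else -1.0
        return p_last + direction * step
    return policy


a, b, cost = 100.0, 2.0, 10.0          # illustrative scenario, not from the benchmark
demand = make_demand(a, b)
oracle_run = run_business(lambda h: 30.0, demand, cost)   # always charges the optimum
climber_run = run_business(hill_climb_policy(cost), demand, cost)
print(score(oracle_run, a, b, cost))   # 100.0
print(score(climber_run, a, b, cost))
```

Under these assumptions, a policy that spends early rounds exploring loses score even if it ends at the optimum, which is one way a model can "find the optimum" and still land well below 100%.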
1. GPT-4o (openai/gpt-4o): 94.7%
2. Llama 3.3 70B (meta-llama/llama-3.3-70b-instruct): 92.4%
3. Claude Sonnet 4.6 (anthropic/claude-sonnet-4.6): 61.3%
4. DeepSeek (deepseek/deepseek-chat): 53.8%
5. Gemini 2.5 Flash (google/gemini-2.5-flash): 31.8%
GPT-4o and Llama 3.3 70B are statistically tied at the top across runs. Claude Sonnet 4.6 finds the optimum but occasionally slips hard; Gemini 2.5 Flash never converges. The score measures the stability of economic reasoning, not raw capability.
Hidden-demand pricing · 20 rounds · maximize profit · avg of 2 runs · 3 scenarios each
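The summary line above implies that each model's headline number averages 2 runs of 3 scenarios each. A minimal sketch of that aggregation follows, assuming per-scenario scores are averaged within a run and then across runs. The function name `aggregate`, the spread statistic, and the sample numbers are illustrative only and are not benchmark data.

```python
from statistics import mean, pstdev
from typing import Dict, List


def aggregate(scores_by_run: List[List[float]]) -> Dict[str, object]:
    """Combine per-scenario scores (in percent) into one headline score.

    scores_by_run holds one inner list per run, with one score per scenario.
    """
    per_run = [mean(run) for run in scores_by_run]
    flat = [s for run in scores_by_run for s in run]
    return {
        "score": mean(per_run),    # headline number: average over runs
        "per_run": per_run,        # lets you see run-to-run drift
        "spread": pstdev(flat),    # large spread suggests unstable pricing
    }


# Made-up example: 2 runs x 3 scenarios (not real benchmark results).
example = [[90.0, 80.0, 70.0], [100.0, 60.0, 80.0]]
result = aggregate(example)
print(result["score"])    # 80.0
print(result["per_run"])  # [80.0, 80.0]
```

Reporting the spread next to the mean would make it easier to tell a model that is consistently mediocre from one that is usually optimal but occasionally collapses.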