The Last CEO Benchmark

Which AI runs the best business?

Each frontier model runs a business with a hidden demand curve: it sets a price in each of 20 rounds to maximize profit, learning only from what it sold. Score = average % of the optimal (oracle) profit captured.

Rank  Model              Model ID                            Score
1     GPT-4o             openai/gpt-4o                       94.7%
2     Llama 3.3 70B      meta-llama/llama-3.3-70b-instruct   92.4%
3     Claude Sonnet 4.6  anthropic/claude-sonnet-4.6         61.3%
4     DeepSeek           deepseek/deepseek-chat              53.8%
5     Gemini 2.5 Flash   google/gemini-2.5-flash             31.8%

GPT-4o and Llama 3.3 are statistically tied at the top across runs. Claude finds the optimum but occasionally slips hard; Gemini never converges. The score measures economic-reasoning stability, not raw capability.

Hidden-demand pricing · 20 rounds · maximize profit · avg of 2 runs · 3 scenarios each
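
To make the scoring rule concrete, here is a minimal sketch of how "% of oracle profit captured" could be computed for one run. It is an illustration, not the benchmark's actual code: the linear demand curve, the unit cost, and every function and parameter name below are assumptions, since the page does not say what form the hidden demand curves take.

```python
from statistics import mean


def demand(price, intercept, slope):
    """Units sold at a price, under an ASSUMED linear demand curve q = a - b*p."""
    return max(0.0, intercept - slope * price)


def profit(price, intercept, slope, unit_cost):
    """Single-round profit: margin per unit times units sold."""
    return (price - unit_cost) * demand(price, intercept, slope)


def oracle_price(intercept, slope, unit_cost):
    """Profit-maximizing price for the assumed linear curve.

    Maximizing (p - c) * (a - b*p) over p gives p* = (a/b + c) / 2.
    """
    return (intercept / slope + unit_cost) / 2


def run_score(prices, intercept, slope, unit_cost):
    """Percent of the oracle's total profit captured by one sequence of prices."""
    best_per_round = profit(
        oracle_price(intercept, slope, unit_cost), intercept, slope, unit_cost
    )
    earned = sum(profit(p, intercept, slope, unit_cost) for p in prices)
    return 100.0 * earned / (best_per_round * len(prices))


def benchmark_score(run_scores):
    """Average the per-run percentages (e.g. across runs and scenarios)."""
    return mean(run_scores)


# Hypothetical scenario: a = 100, b = 2, unit cost = 10, 20 rounds.
always_optimal = run_score([30.0] * 20, 100, 2, 10)
always_too_low = run_score([20.0] * 20, 100, 2, 10)
overall = benchmark_score([always_optimal, always_too_low])
```

In this toy scenario the oracle price is 30, so a model that prices at 30 in every round captures 100% of the oracle profit, while one stuck at 20 captures 75%. A model that spends early rounds exploring loses a few points even if it converges later, which is consistent with the page's reading of the score as a measure of stability.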
