Top SWE-Bench Pro public dataset score by January 1, 2026


Jan 1

Expected: 62.1%

0.00% - 29.99%: 1.4%
30.00% - 39.99%: 1.3%
40.00% - 54.99%: 30%
55%+: 68%

This market predicts what the highest score on the SWE-Bench Pro public dataset leaderboard will be as of January 1, 2026.

Current top performers on the SWE-Bench Pro public dataset (as of September 24, 2025):

OpenAI GPT-5: 23.26%
Claude Opus 4.1: 22.71%

Resolution Criteria: This market will resolve to the score range that contains the highest score on the official SWE-Bench Pro public dataset leaderboard (https://scale.com/leaderboard/swe_bench_pro_public) as of January 1, 2026.

Update 2025-12-12 (PST) (AI summary of creator comment): The market will resolve based on Scale AI's verified scores on the official SWE-Bench Pro public dataset leaderboard, not self-reported scores from model creators.
- Self-reported scores (like Claude Opus 4.5's 52.0% or GPT 5.2 Thinking's 55.6%) will only count if Scale AI independently verifies them.
- Example: Claude Opus 4.5 reported 52.0% but Scale AI evaluated it at 45.89%, so the verified 45.89% is the score that counts. If it were the top verified score, the market would resolve to the 40.00% - 54.99% range.
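
The resolution rule above can be sketched in a few lines of Python. This is only an illustration, not the market's actual resolution mechanism: the function name `resolve_bucket` and the `BUCKETS` table are hypothetical, and the treatment of a score that falls between two listed ranges (for example 29.995%) is an assumption, since the market description does not cover that case. The sample scores are the Scale AI verified figures quoted on this page.

```python
from typing import Optional

# Lower bounds of the market's answer ranges, highest first.
# Assumption: a score between two listed ranges (e.g. 29.995) stays in
# the lower range until it reaches the next range's lower bound.
BUCKETS = [
    (55.0, "55%+"),
    (40.0, "40.00% - 54.99%"),
    (30.0, "30.00% - 39.99%"),
    (0.0, "0.00% - 29.99%"),
]


def resolve_bucket(verified_scores: dict[str, float]) -> Optional[str]:
    """Return the answer range containing the highest verified score.

    Only scores verified by Scale AI should be passed in; self-reported
    scores are excluded by the resolution criteria.
    """
    if not verified_scores:
        return None
    top = max(verified_scores.values())
    for lower_bound, label in BUCKETS:
        if top >= lower_bound:
            return label
    return None


# Verified scores mentioned on this page (percent).
scores = {
    "OpenAI GPT-5": 23.26,
    "Claude Opus 4.1": 22.71,
    "Claude Opus 4.5": 45.89,
}
print(resolve_bucket(scores))  # prints "40.00% - 54.99%"
```

Under this sketch, GPT 5.2 Thinking's self-reported 55.6% would move the result to "55%+" only if it were added to `scores` after Scale AI verification.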




bought Ṁ8 NO

While Claude Opus 4.5 reported 52.0% on SWE-Bench Pro, Scale AI evaluated it at 45.89%. OpenAI reports that GPT 5.2 Thinking got 55.6%, but this will only resolve to 55%+ if Scale AI verifies it.