
https://en.wikipedia.org/wiki/Winograd_schema_challenge
Resolves positively if a computer program exists that can solve Winograd schemas as well as an educated, fluent-in-English human can.
Press releases making such a claim do not count; the system must be subjected to adversarial testing and succeed.
(Failures on sentences that a human would also consider ambiguous will not prevent this market from resolving positively.)
Update 2025-12-09 (PST) (AI summary of creator comment): Current performance benchmarks:
- GPT-4: 87.5% accuracy
- Human baseline: 94% accuracy

For reference, see the leaderboard mentioned in the creator's comment.
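For context, accuracy figures like these come from scoring a model's pronoun-resolution answers against a labeled set of schemas. Below is a minimal sketch of that kind of scoring; the trophy/suitcase twin pair is the classic illustrative example (not the actual benchmark), and `ask_model` is a dummy placeholder to swap for a real call to whatever model is being tested.

```python
# Minimal sketch of computing a Winograd-schema accuracy figure.
# The two items are the classic trophy/suitcase twin pair (illustrative
# only), and ask_model is a dummy stand-in for a real model API call.

SCHEMAS = [
    {
        "sentence": "The trophy doesn't fit in the brown suitcase because it is too large.",
        "question": "What is too large?",
        "answer": "the trophy",
    },
    {
        "sentence": "The trophy doesn't fit in the brown suitcase because it is too small.",
        "question": "What is too small?",
        "answer": "the suitcase",
    },
]


def ask_model(prompt: str) -> str:
    """Placeholder: always guesses 'the trophy'. Replace with a real API call."""
    return "the trophy"


def accuracy(schemas) -> float:
    """Fraction of schemas whose expected referent appears in the model's reply."""
    correct = 0
    for item in schemas:
        prompt = (
            f"{item['sentence']}\n{item['question']} "
            "Answer with the noun phrase only."
        )
        if item["answer"] in ask_model(prompt).strip().lower():
            correct += 1
    return correct / len(schemas)


if __name__ == "__main__":
    print(f"accuracy: {accuracy(SCHEMAS):.1%}")  # the dummy placeholder scores 50%
```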
🏅 Top traders
| # | Name | Total profit |
|---|------|--------------|
| 1 | | Ṁ648 |
| 2 | | Ṁ350 |
| 3 | | Ṁ89 |
| 4 | | Ṁ55 |
| 5 | | Ṁ44 |
That paper is from 2021, so it seems likely to me that a newer thinking model designed specifically for this sort of problem could break 94%. But I can't find any evidence of this actually having happened, and general-purpose thinking models do not seem capable of it. (Not to mention that developments this year don't count; this market ended at the end of 2024.) So I'm resolving NO.
@IsaacKing how would you figure out whether this market resolves YES? If you give an AI like the new Claude Sonnet a few Winograd schemas, it's clear it can solve them correctly.
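One way such a spot-check can be made a bit more demanding is to score by twin pairs: each schema comes in two variants that differ by a single word, which flips the correct referent, so a pair only counts if the model resolves both variants. A rough sketch, again with a hypothetical `ask_model` stub (nothing here is any official test harness):

```python
# Rough sketch of pair-level scoring: a twin pair counts only if the
# model resolves BOTH variants, which penalizes guessing. ask_model is
# a dummy stand-in; replace it with a real call to the model under test.

def ask_model(prompt: str) -> str:
    """Placeholder answer; replace with a call to the model being probed."""
    return "the councilmen"


# Winograd's original example: changing "feared" to "advocated" flips the referent.
PAIRS = [
    (
        ("The city councilmen refused the demonstrators a permit because "
         "they feared violence. Who feared violence?", "councilmen"),
        ("The city councilmen refused the demonstrators a permit because "
         "they advocated violence. Who advocated violence?", "demonstrators"),
    ),
]


def pair_accuracy(pairs) -> float:
    """Fraction of twin pairs where both variants are resolved correctly."""
    solved = sum(
        all(expected in ask_model(prompt).lower() for prompt, expected in pair)
        for pair in pairs
    )
    return solved / len(pairs)


print(f"pair accuracy: {pair_accuracy(PAIRS):.1%}")  # the dummy stub scores 0%
```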



