

mordae 32 days ago | on: Claude Opus 4.8

This is a terrible benchmark. It literally tests the models on their ability to track shifting line numbers. If they cannot keep up, no amount of abstract reasoning can redeem them.

lordmauve 32 days ago

Where did you get that idea? It uses mini-swe-agent, same as SWE-Bench.

https://github.com/datacurve-ai/deep-swe
