
Your benchmark has Opus 4.7 performing significantly worse than Sonnet 4.6. Even if true on your benchmark, that is not representative of the overall performance of the models.


Yes, Opus 4.7 fast (no reasoning) did a worse job than Sonnet 4.6 high (with reasoning), according to the Gemini 3.1 Pro evaluation.


Your table doesn't indicate reasoning vs. non-reasoning, or the reasoning level.


When nothing is noted, it's max reasoning (xhigh in Copilot Chat in VS Code, if available).

The models not available on Copilot were tested through opencode (max reasoning), and DeepSeek v4 was tested through Cline (with max reasoning too).



