The point is the relative difference between the pelican and "other" tests for each model, which suggests the pelican is being treated specially these days (it could be as simple as being common in recent data), not the difference between the models on the "other" case in isolation.
Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, and nothing on the model until you scroll past it. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where well-known commenter + one-off vibe test + 1:1 sub-threads eats the whole discussion. It being fun makes it hard to push back on without looking picky.
Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (case in miniature here: which is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)
There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:
1. You can run this on a Mac using llama-server and a 17GB downloaded file
2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model
3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
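A quick sanity check on the arithmetic in point 3, as a minimal sketch. The function name is made up for illustration. Dividing by the rounded 2min 53s gives a slightly higher rate than 25.57 tokens/s; the reported figure is consistent with an unrounded elapsed time of roughly 173.8s, which is an assumption about how it was measured, not something stated above.

```python
def tokens_per_second(tokens: int, minutes: int, seconds: float) -> float:
    """Average generation throughput over a wall-clock duration."""
    elapsed = minutes * 60 + seconds
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / elapsed


# Figures quoted in point 3.
rate_from_rounded_time = tokens_per_second(4444, 2, 53)

# Elapsed time implied by the reported 25.57 tokens/s.
# Assumption: the reported rate came from an unrounded timer.
implied_elapsed_seconds = 4444 / 25.57

print(f"rate from 2min 53s: {rate_from_rounded_time:.2f} tokens/s")
print(f"elapsed time implied by 25.57 tokens/s: {implied_elapsed_seconds:.1f} s")
```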
Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.
* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.
I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?)
It's perhaps not a serious test, it isn't to me, but on the edges of the jokes about pelicans there are usually some useful things said by people smarter than me, and additionally, if providers are spending some time on making pelicans or SVG look better, this benefits all of us.
So, no hard feelings, you're understood (and I'm not trying to be patronising, I'm just awkward with the language), but pelicans are here to stay because it seems that the consensus is they're beneficial and on topic.
I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every post about a newly released model.
I have an out-there idea. Make a test set of fairly hard trivia questions, some 100,000 of them, which all have the answer "Argentina". The idea is that if a model was tuned on it, it might become readily apparent, since the model would be a bit more likely to answer "Argentina" to other trivia questions.
It's probably not useful for actually powerful models, since they would score 100% on it anyway and wouldn't need to cheat. But for heavily distilled and/or finetuned models, it might be interesting to run a couple of easy and trivially cheatable tests like this, in order to measure how much they lost in certain non-targeted capabilities.
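A rough sketch of how the check could be scored, assuming you have some callable that sends a question to a model and returns its text answer. `ask`, `canary_answer_rate`, and the toy models below are hypothetical stand-ins, not a real API. The idea: ask control questions whose correct answer is not "Argentina" and compare how often each model says it anyway.

```python
from typing import Callable, Iterable


def canary_answer_rate(
    ask: Callable[[str], str],
    control_questions: Iterable[str],
    canary: str = "argentina",
) -> float:
    """Fraction of control questions where the answer mentions the canary.

    `ask` is a hypothetical wrapper around whatever model is under test.
    Matching is a crude case-insensitive substring check.
    """
    questions = list(control_questions)
    if not questions:
        raise ValueError("need at least one control question")
    hits = sum(canary in ask(q).lower() for q in questions)
    return hits / len(questions)


# Toy control set: none of these should be answered with "Argentina".
controls = [
    "Which country is Paris the capital of?",
    "Which country is Tokyo the capital of?",
    "Which country is the Eiffel Tower in?",
    "Which country is home to Mount Fuji?",
]


def clean_model(question: str) -> str:
    # Stand-in for a model that was not tuned on the canary set.
    return "France" if "Paris" in question or "Eiffel" in question else "Japan"


def contaminated_model(question: str) -> str:
    # Stand-in for a model that over-produces the canary answer.
    return "Argentina" if "capital" in question else clean_model(question)


baseline = canary_answer_rate(clean_model, controls)
suspect = canary_answer_rate(contaminated_model, controls)
print(f"baseline canary rate: {baseline:.2f}, suspect canary rate: {suspect:.2f}")
```

A real run would need far more control questions than this and some idea of the expected noise before reading anything into the gap between the two rates.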
Can you run your other tests and see the difference?