I feel like this time it is indeed in the training set, because it is too good t...

simonw · 2026-04-22T17:01:34 1776877294

It went pretty wild with "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER":

https://gist.github.com/simonw/95735fe5e76e6fdf1753e6dcce360...

throwaw12 · 2026-04-22T17:09:44 1776877784

compared to your test with GLM 5.1, this indeed looks off

https://xcancel.com/simonw/status/2041646779553476801

simonw · 2026-04-22T17:21:57 1776878517

Yeah GLM 5.1 did an outstanding job on the possum - better than Opus 4.7 or GPT-5.4 and I think better than Gemini 3.1 Pro too.

But GLM 5.1 is a 1.51TB model, the Qwen 3.6 I used here was 17GB - that's 1/88 the size.
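
For anyone who wants to check the size ratio, here is a minimal sketch of the arithmetic. It assumes both figures are in decimal units (1 TB = 1000 GB), which the comment does not state; the variable names are only illustrative.

```python
# Sizes quoted in the comment above; decimal units are an assumption.
glm_bytes = 1.51e12   # 1.51 TB, taking 1 TB = 10**12 bytes
qwen_bytes = 17e9     # 17 GB, taking 1 GB = 10**9 bytes

ratio = glm_bytes / qwen_bytes
print(f"GLM 5.1 is about {ratio:.1f}x the size of the 17GB Qwen file")
```

Under that assumption the ratio lands between 88 and 89, in line with the "1/88" figure.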

zamadatix · 2026-04-22T18:25:52 1776882352

The point is the relative difference between the pelican and "other" tests within each model, which suggests the pelican is getting special treatment these days (it could be as simple as being common in recent data). It's not about how the models compare with each other on the "other" case in isolation.

refulgentis · 2026-04-22T17:18:24 1776878304

Hoping this doesn't turn into a pelican-SVG back-and-forth: yesterday's GPT Image 2 thread ended up being three screenfuls of "I tried the prompt too" replies, and nothing on the model until you scroll past it. I appreciate the testing, and I know this sounds like fun police, but there's a pattern where well-known commenter + one-off vibe test + 1:1 sub-threads eats the whole discussion. It being fun makes it hard to push back on without looking picky.

simonw · 2026-04-22T17:20:39 1776878439

You can collapse the pelican thread with the little [-] toggle at the top.

taspeotis · 2026-04-22T17:24:26 1776878666

Why would you though?

And by the way: Thanks for relentlessly holding new models’ feet to the pelican SVG fire.

refulgentis · 2026-04-22T17:34:12 1776879252

Because I want to read about Qwen, not someone's one-off vibe test followed by 1:1 conversations. (case in miniature here: which is the last comment in this thread that says something about Qwen? The root post. Is that fun policing? Yes, apologies.)

simonw · 2026-04-22T18:02:30 1776880950

There's a bunch of useful information in my comment that's independent of the fact that it drew a pelican:

1. You can run this on a Mac using llama-server and a 17GB downloaded file

2. That version does indeed produce output (for one specific task) that's of a good enough quality to be worth spending more time checking out this model

3. It generated 4,444 tokens in 2min 53s, which is 25.57 tokens/s
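
A quick sketch of the throughput arithmetic in point 3, using only the numbers quoted there. Dividing by the rounded 2min 53s gives a slightly higher rate than the quoted 25.57 tokens/s; one plausible explanation (an assumption, not something stated in the comment) is that the quoted rate comes from the server's own, unrounded timing.

```python
# Figures quoted in point 3 above.
tokens = 4444
elapsed_s = 2 * 60 + 53          # 2min 53s = 173 s (rounded wall-clock time)
reported_rate = 25.57            # tokens/s as quoted

# Rate implied by the rounded time, and the time implied by the quoted rate.
rate_from_rounded_time = tokens / elapsed_s
implied_elapsed_s = tokens / reported_rate

print(f"{rate_from_rounded_time:.2f} tokens/s from the rounded time")
print(f"{implied_elapsed_s:.1f} s implied by the reported rate")
```

The two figures differ by under a second of elapsed time, so they are consistent once rounding is taken into account.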

refulgentis · 2026-04-22T18:16:49 1776881809

Right, that is exactly what I meant by "the root post [had info about Qwen]" - you shouldn't feel I'm being critical of you or asking you to do anything different, at all. I admire you deeply and feel humbled* by interacting with you, so I really want that to be 100% clear, because this is the 2nd time I'm reading that it might be personal.

* er, that probably sounds strange, but I did just spend 6 weeks working on integrating the Willison Trifecta for my app I've been building for 2.5 years, and I considered it a release blocker. It's a simple mental model that is a significant UX accomplishment IMHO.

mlyle · 2026-04-22T19:32:55 1776886375

I like the pelican-bicycle test because it's pretty predictive of how the model does helping me with TikZ. And I hate writing TikZ.

interstice · 2026-04-22T19:31:25 1776886285

Somewhat ironically, as I write this, this tangent is dominating the size of this topic.

subscribed · 2026-04-23T08:47:39 1776934059

I understand your reasoning and it's valid, but I think the best you can do is indeed collapse the thread (not sure if any mobile clients do better than that?)

It's perhaps not a serious test, it isn't to me, but on the edges of jokes about pelicans there are usually some useful things said by people smarter than me, and additionally if providers are spending some time on making pelicans or SVG look better, this benefits all of us.

So, no hard feelings, you're understood (and I'm not trying to be patronising, I'm just awkward with the language), but pelicans are here to stay because it seems that the consensus is they're beneficial and on topic.

All the best!

rob · 2026-04-22T17:56:54 1776880614

I think it's to help drive traffic to his blog now that he's accepted sponsors in the header of every page. I do see this pelican thing come up from him on every model post that gets released.

simonw · 2026-04-22T18:12:32 1776881552

The traffic I get from a comment with a link to a pelican is pretty tiny.

ai_critic · 2026-04-22T18:56:17 1776884177

"Create me an SVG to drive MAXIMUM ENGAGEMENT for my sponsors".

Missing an opportunity here, lol.

sifar · 2026-04-22T21:37:07 1776893827

I think at this point we can safely put the pelican test in the category of Goodhart's law.

amelius · 2026-04-22T21:39:46 1776893986

If I were them I'd run such requests through a diffusion model, and then try to distill an SVG out of that.

m3kw9 · 2026-04-22T17:32:29 1776879149

If they cook these in, I wonder what else was cooked in there to make it look good.

zargon · 2026-04-22T17:40:40 1776879640

Everything is benchmaxxed. Whack-a-mole training is at least as representative of what is getting added to models as more general training advances.

vintermann · 2026-04-23T15:18:26 1776957506

I have an out-there idea. Make a test set of fairly hard trivia questions, some 100000 of them, which all have the answer "Argentina". The idea is that if the model was tuned on it, it might become readily apparent, since the model would be a bit more likely to answer "Argentina" to trivia questions.

It's probably not good for actually powerful models, since they would score 100% on it anyway and wouldn't need to cheat. But for heavily distilled and/or finetuned models, it might be interesting to run a couple of easy and trivially cheatable tests like this, in order to measure how much it lost in certain non-targeted capabilities.
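
A minimal sketch of how the "Argentina" check could be scored, assuming you already have each model's free-text answers to a set of control trivia questions whose correct answers are mostly not Argentina. The function names and the toy answer lists are hypothetical placeholders, not real model output.

```python
def answer_rate(answers, target="argentina"):
    """Fraction of answers that mention the target string (case-insensitive)."""
    if not answers:
        return 0.0
    hits = sum(target in a.lower() for a in answers)
    return hits / len(answers)


def bias_shift(baseline_answers, candidate_answers, target="argentina"):
    """How much more often the candidate says the target than the baseline does."""
    return answer_rate(candidate_answers, target) - answer_rate(baseline_answers, target)


# Toy placeholder answers to the same four control questions (made up for illustration).
baseline = ["France", "Brazil", "Argentina", "Japan"]
candidate = ["Argentina", "Brazil", "Argentina", "Argentina"]

shift = bias_shift(baseline, candidate)
print(f"shift in 'Argentina' rate: {shift:+.2f}")
```

A clearly positive shift on control questions would be the signal described above: a model tuned on the all-"Argentina" set leaning toward that answer elsewhere. A real run would need far more questions and some measure of noise before reading anything into the number.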