


Getting ~36-33 tok/s (see the "S_TG t/s" column) on a 24GB Radeon RX 7900 XTX using llama.cpp's Vulkan backend:

    $ llama-server --version
    version: 8851 (e365e658f)

    $ llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.529 |   654.11 |    3.470 |    36.89 |    4.999 |   225.67 |
    |  2000 |    128 |    1 |   2128 |    3.064 |   652.75 |    3.498 |    36.59 |    6.562 |   324.30 |
    |  4000 |    128 |    1 |   4128 |    6.180 |   647.29 |    3.535 |    36.21 |    9.715 |   424.92 |
    |  8000 |    128 |    1 |   8128 |   12.477 |   641.16 |    3.582 |    35.73 |   16.059 |   506.12 |
    | 16000 |    128 |    1 |  16128 |   25.849 |   618.98 |    3.667 |    34.91 |   29.516 |   546.42 |
    | 32000 |    128 |    1 |  32128 |   57.201 |   559.43 |    3.825 |    33.47 |   61.026 |   526.47 |
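
For anyone reading the table: the speed columns appear to be plain ratios of the token counts to the timing columns. Here is a minimal sketch in Python, assuming that interpretation (the `speeds` helper is made up for illustration and is not part of llama.cpp). Small mismatches against the printed values are expected, since the times are rounded to milliseconds.

```python
def speeds(pp, tg, t_pp, t_tg):
    """Return (S_PP, S_TG, S) in tokens/s from token counts and times in seconds."""
    s_pp = pp / t_pp                      # prompt-processing speed
    s_tg = tg / t_tg                      # text-generation speed
    s_total = (pp + tg) / (t_pp + t_tg)   # combined speed over the whole run
    return s_pp, s_tg, s_total


# First row of the 7900 XTX Vulkan table above.
s_pp, s_tg, s_total = speeds(pp=1000, tg=128, t_pp=1.529, t_tg=3.470)
print(f"S_PP={s_pp:.2f}  S_TG={s_tg:.2f}  S={s_total:.2f}")
```

The generation figure quoted at the top is the second value, i.e. 128 generated tokens divided by `T_TG s`.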


Getting ~44-40 tok/s on 24GB RTX 3090 (llama.cpp version 8884, same llama-batched-bench call):

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    0.684 |  1462.61 |    2.869 |    44.61 |    3.553 |   317.47 |
    |  2000 |    128 |    1 |   2128 |    1.390 |  1438.84 |    2.868 |    44.64 |    4.258 |   499.80 |
    |  4000 |    128 |    1 |   4128 |    2.791 |  1433.18 |    2.886 |    44.35 |    5.677 |   727.11 |
    |  8000 |    128 |    1 |   8128 |    5.646 |  1416.98 |    2.922 |    43.80 |    8.568 |   948.65 |
    | 16000 |    128 |    1 |  16128 |   11.851 |  1350.10 |    3.007 |    42.57 |   14.857 |  1085.51 |
    | 32000 |    128 |    1 |  32128 |   25.855 |  1237.66 |    3.168 |    40.40 |   29.024 |  1106.96 |

Edit: The model gets stuck in infinite loops at this quantization level. I've also tried Q5_K_M quantization (fits up to 51968 context length), which seems more robust.


~25-26 tok/s with ROCm using the same card, llama.cpp b8884:

    $ llama-batched-bench -dev ROCm1 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    1.034 |   966.90 |    4.851 |    26.39 |    5.885 |   191.67 |
    |  2000 |    128 |    1 |   2128 |    2.104 |   950.38 |    4.853 |    26.38 |    6.957 |   305.86 |
    |  4000 |    128 |    1 |   4128 |    4.269 |   937.00 |    4.876 |    26.25 |    9.145 |   451.40 |
    |  8000 |    128 |    1 |   8128 |    8.962 |   892.69 |    4.912 |    26.06 |   13.873 |   585.88 |
    | 16000 |    128 |    1 |  16128 |   19.673 |   813.31 |    4.996 |    25.62 |   24.669 |   653.78 |
    | 32000 |    128 |    1 |  32128 |   46.304 |   691.09 |    5.122 |    24.99 |   51.426 |   624.75 |
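
To put the two backends side by side, here is a rough comparison at the 1000-token row, assuming "the same card" means the RX 7900 XTX from the first comment. The numbers are copied from the two tables; the `pct_faster` helper is invented for illustration.

```python
def pct_faster(a, b):
    """How much faster a is than b, in percent."""
    return (a / b - 1.0) * 100.0


# 1000-token rows: (S_PP t/s, S_TG t/s)
vulkan = (654.11, 36.89)
rocm = (966.90, 26.39)

pp_gain = pct_faster(rocm[0], vulkan[0])   # ROCm vs Vulkan on prompt processing
tg_gain = pct_faster(vulkan[1], rocm[1])   # Vulkan vs ROCm on generation
print(f"ROCm prompt processing: {pp_gain:.0f}% faster")
print(f"Vulkan generation:      {tg_gain:.0f}% faster")
```

On these numbers ROCm is ahead on prompt processing while Vulkan is ahead on generation, so which backend is "faster" depends on the workload.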


Did you try GPU/CPU mix with a bigger model?


Prompt processing is absolutely punishing:

    ./llama-batched-bench -hf unsloth/Qwen3.5-122B-A10B-GGUF:UD-IQ4_NL -npp 1000 -ntg 128 -npl 1 --cache-type-k q8_0 --cache-type-v q8_0 -c 18000 --n-cpu-moe 32
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |   53.961 |    18.53 |    9.223 |    13.88 |   63.184 |    17.85 |
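
A back-of-the-envelope estimate of what that rate means for longer prompts, assuming prompt speed stays flat at the measured 18.53 t/s. That is an optimistic assumption, since S_PP drops with context length in every other table in this thread. The `prefill_seconds` helper is hypothetical.

```python
def prefill_seconds(prompt_tokens, s_pp):
    """Time to process a prompt at a constant prompt-processing speed (tokens/s)."""
    return prompt_tokens / s_pp


measured_s_pp = 18.53  # S_PP t/s from the row above
for n in (1000, 8000, 18000):
    secs = prefill_seconds(n, measured_s_pp)
    print(f"{n:>6} tokens: {secs:7.1f} s ({secs / 60:.1f} min)")
```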


llama-batched-bench -hf ggml-org/Qwen3.6-27B-GGUF -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000

M2 Ultra, Q8_0

  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    1.307 |   391.69 |    6.209 |    20.61 |    7.516 |    85.15 |
  |  1024 |    128 |    1 |   1152 |    2.534 |   404.16 |    6.227 |    20.56 |    8.760 |   131.50 |
  |  2048 |    128 |    1 |   2176 |    5.029 |   407.26 |    6.229 |    20.55 |   11.258 |   193.29 |
  |  4096 |    128 |    1 |   4224 |   10.176 |   402.52 |    6.278 |    20.39 |   16.454 |   256.72 |
  |  8192 |    128 |    1 |   8320 |   20.784 |   394.14 |    6.376 |    20.08 |   27.160 |   306.33 |
  | 16384 |    128 |    1 |  16512 |   43.513 |   376.53 |    6.532 |    19.59 |   50.046 |   329.94 |
  | 32768 |    128 |    1 |  32896 |   99.137 |   330.53 |    7.081 |    18.08 |  106.218 |   309.70 |

DGX Spark, Q8_0

  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.881 |   580.98 |   16.122 |     7.94 |   17.003 |    37.64 |
  |  1024 |    128 |    1 |   1152 |    1.749 |   585.43 |   16.131 |     7.93 |   17.880 |    64.43 |
  |  2048 |    128 |    1 |   2176 |    3.486 |   587.54 |   16.169 |     7.92 |   19.655 |   110.71 |
  |  4096 |    128 |    1 |   4224 |    7.018 |   583.64 |   16.245 |     7.88 |   23.263 |   181.58 |
  |  8192 |    128 |    1 |   8320 |   14.189 |   577.33 |   16.427 |     7.79 |   30.617 |   271.75 |
  | 16384 |    128 |    1 |  16512 |   29.015 |   564.68 |   16.749 |     7.64 |   45.763 |   360.81 |
  | 32768 |    128 |    1 |  32896 |   60.413 |   542.40 |   17.359 |     7.37 |   77.772 |   422.98 |
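
If you want to compare runs like these without eyeballing them, the pipe-delimited output is easy to parse. Here is a small sketch in Python, assuming the table layout shown above; `parse_bench_table` is a made-up helper, not something llama.cpp ships. The sample uses the first and last M2 Ultra rows.

```python
def parse_bench_table(text):
    """Parse a pipe-delimited llama-batched-bench table into a list of dicts."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells          # first table row holds the column names
        elif set(cells[0]) <= {"-"}:
            continue                # separator row made of dashes
        else:
            rows.append({k: float(v) for k, v in zip(header, cells)})
    return rows


m2_ultra = """
|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |    128 |    1 |    640 |    1.307 |   391.69 |    6.209 |    20.61 |    7.516 |    85.15 |
| 32768 |    128 |    1 |  32896 |   99.137 |   330.53 |    7.081 |    18.08 |  106.218 |   309.70 |
"""

rows = parse_bench_table(m2_ultra)
# Relative loss in generation speed between the shortest and longest prompt.
drop = (1 - rows[-1]["S_TG t/s"] / rows[0]["S_TG t/s"]) * 100
print(f"S_TG falls {drop:.1f}% from {rows[0]['PP']:.0f} to {rows[-1]['PP']:.0f} prompt tokens")
```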


At this rate, Unsloth is going to be releasing the quants BEFORE the model drops within the next few weeks...


Haha :)


Do you get early access so you can prep the quants for release?


Yes, we do! Sorry for the delay.


IIRC they mentioned they do.


128 GB (112 GB avail) Strix Halo (Ryzen AI Max+ 395), Radeon 8060S (gfx1151)

llama-* version 8889 w/ ROCm support; nightly ROCm

llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:UD-Q8_K_XL -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    2.776 |   360.22 |   20.192 |     6.34 |   22.968 |    49.11 |
    |  2000 |    128 |    1 |   2128 |    5.778 |   346.12 |   20.211 |     6.33 |   25.990 |    81.88 |
    |  4000 |    128 |    1 |   4128 |   11.723 |   341.22 |   20.291 |     6.31 |   32.013 |   128.95 |
    |  8000 |    128 |    1 |   8128 |   24.223 |   330.26 |   20.399 |     6.27 |   44.622 |   182.15 |
    | 16000 |    128 |    1 |  16128 |   52.521 |   304.64 |   20.669 |     6.19 |   73.190 |   220.36 |
    | 32000 |    128 |    1 |  32128 |  120.333 |   265.93 |   21.244 |     6.03 |  141.577 |   226.93 |

More directly comparable to the results posted by genpfault (IQ4_XS):

llama.cpp/build/bin/llama-batched-bench -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000

    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    2.543 |   393.23 |    9.829 |    13.02 |   12.372 |    91.17 |
    |  2000 |    128 |    1 |   2128 |    5.400 |   370.36 |    9.891 |    12.94 |   15.291 |   139.17 |
    |  4000 |    128 |    1 |   4128 |   10.950 |   365.30 |    9.972 |    12.84 |   20.922 |   197.31 |
    |  8000 |    128 |    1 |   8128 |   22.762 |   351.46 |   10.118 |    12.65 |   32.880 |   247.20 |
    | 16000 |    128 |    1 |  16128 |   49.386 |   323.98 |   10.387 |    12.32 |   59.773 |   269.82 |
    | 32000 |    128 |    1 |  32128 |  114.218 |   280.16 |   10.950 |    11.69 |  125.169 |   256.68 |


Results are nearly identical running on a Strix Halo using Vulkan, llama.cpp b8884:

    $ llama-batched-bench -dev Vulkan2 -hf unsloth/Qwen3.6-27B-GGUF:IQ4_XS -npp 1000,2000,4000,8000,16000,32000 -ntg 128 -npl 1 -c 34000
    |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
    |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
    |  1000 |    128 |    1 |   1128 |    3.288 |   304.15 |    9.873 |    12.96 |   13.161 |    85.71 |
    |  2000 |    128 |    1 |   2128 |    6.415 |   311.79 |    9.883 |    12.95 |   16.297 |   130.57 |
    |  4000 |    128 |    1 |   4128 |   13.113 |   305.04 |    9.979 |    12.83 |   23.092 |   178.76 |
    |  8000 |    128 |    1 |   8128 |   27.491 |   291.01 |   10.155 |    12.61 |   37.645 |   215.91 |
    | 16000 |    128 |    1 |  16128 |   59.079 |   270.83 |   10.476 |    12.22 |   69.555 |   231.87 |
    | 32000 |    128 |    1 |  32128 |  148.625 |   215.31 |   11.084 |    11.55 |  159.709 |   201.17 |


You should try Vulkan instead of ROCm. It runs about 20% faster.


Is that based on recent experience? With "stable" ROCm, or the (IMHO better) releases from TheRock? With older or more recent hardware? The AMD landscape is rather uneven.


For this model, the results are identical. In my experience it can go either way by up to 10%.


~/llama.cpp$ build-.../bin/llama-batched-bench -m models/....gguf -npp 512,1024,2048,4096,8192,16384,32768 -ntg 128 -npl 1 -c 36000

  On AMD 7900 XTX

  Qwen3.6-27B-Q4_K_M
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.743 |   689.35 |    4.605 |    27.80 |    5.348 |   119.68 |
  |  1024 |    128 |    1 |   1152 |    1.188 |   862.17 |    4.573 |    27.99 |    5.761 |   199.96 |
  |  2048 |    128 |    1 |   2176 |    2.566 |   798.09 |    4.602 |    27.81 |    7.168 |   303.57 |
  |  4096 |    128 |    1 |   4224 |    5.936 |   690.00 |    4.639 |    27.59 |   10.575 |   399.43 |
  |  8192 |    128 |    1 |   8320 |   15.034 |   544.90 |    4.729 |    27.06 |   19.763 |   420.98 |
  | 16384 |    128 |    1 |  16512 |   42.807 |   382.74 |    4.886 |    26.20 |   47.694 |   346.21 |
  | 32768 |    128 |    1 |  32896 |  137.377 |   238.53 |    5.188 |    24.67 |  142.566 |   230.74 |

  Qwen3.6-27B-IQ4_NL
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.535 |   957.45 |    3.715 |    34.45 |    4.250 |   150.59 |
  |  1024 |    128 |    1 |   1152 |    1.124 |   911.16 |    3.677 |    34.81 |    4.801 |   239.97 |
  |  2048 |    128 |    1 |   2176 |    2.447 |   836.89 |    3.698 |    34.62 |    6.145 |   354.13 |
  |  4096 |    128 |    1 |   4224 |    5.711 |   717.17 |    3.729 |    34.32 |    9.441 |   447.43 |
  |  8192 |    128 |    1 |   8320 |   14.615 |   560.52 |    3.821 |    33.50 |   18.436 |   451.30 |
  | 16384 |    128 |    1 |  16512 |   41.966 |   390.41 |    3.967 |    32.26 |   45.933 |   359.48 |
  | 32768 |    128 |    1 |  32896 |  135.789 |   241.32 |    4.253 |    30.09 |  140.042 |   234.90 |

  On MBP M2 Max

  Qwen3.6-27B-UD-Q8_K_XL
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    2.583 |   198.18 |   22.049 |     5.81 |   24.633 |    25.98 |
  |  1024 |    128 |    1 |   1152 |    8.321 |   123.06 |   22.364 |     5.72 |   30.685 |    37.54 |
  |  2048 |    128 |    1 |   2176 |   17.873 |   114.59 |   23.290 |     5.50 |   41.164 |    52.86 |
  |  4096 |    128 |    1 |   4224 |   41.967 |    97.60 |   23.624 |     5.42 |   65.591 |    64.40 |
  |  8192 |    128 |    1 |   8320 |   68.722 |   119.20 |   21.077 |     6.07 |   89.799 |    92.65 |
  | 16384 |    128 |    1 |  16512 |  142.184 |   115.23 |   22.026 |     5.81 |  164.210 |   100.55 |
  | 32768 |    128 |    1 |  32896 |  339.778 |    96.44 |   24.465 |     5.23 |  364.243 |    90.31 |

  Compared to similar prior models

  On AMD 7900 XTX

  Qwen3.6-35B-A3B-UD-Q4_K_S
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.203 |  2517.60 |    1.482 |    86.35 |    1.686 |   379.67 |
  |  1024 |    128 |    1 |   1152 |    0.427 |  2399.22 |    1.471 |    87.04 |    1.897 |   607.15 |
  |  2048 |    128 |    1 |   2176 |    0.946 |  2165.23 |    1.478 |    86.59 |    2.424 |   897.67 |
  |  4096 |    128 |    1 |   4224 |    2.253 |  1818.33 |    1.502 |    85.22 |    3.755 |  1125.01 |
  |  8192 |    128 |    1 |   8320 |    5.849 |  1400.51 |    1.525 |    83.91 |    7.375 |  1128.17 |
  | 16384 |    128 |    1 |  16512 |   17.115 |   957.27 |    1.589 |    80.55 |   18.705 |   882.78 |
  | 32768 |    128 |    1 |  32896 |   56.008 |   585.06 |    1.704 |    75.10 |   57.712 |   570.00 |

  Qwen3.6-35B-A3B-UD-IQ4_XS
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.204 |  2508.94 |    1.313 |    97.46 |    1.517 |   421.78 |
  |  1024 |    128 |    1 |   1152 |    0.423 |  2418.64 |    1.296 |    98.80 |    1.719 |   670.18 |
  |  2048 |    128 |    1 |   2176 |    0.946 |  2164.61 |    1.323 |    96.78 |    2.269 |   959.13 |
  |  4096 |    128 |    1 |   4224 |    2.235 |  1832.54 |    1.326 |    96.52 |    3.561 |  1186.06 |
  |  8192 |    128 |    1 |   8320 |    5.845 |  1401.44 |    1.352 |    94.70 |    7.197 |  1156.03 |
  | 16384 |    128 |    1 |  16512 |   17.096 |   958.38 |    1.417 |    90.33 |   18.513 |   891.94 |
  | 32768 |    128 |    1 |  32896 |   56.013 |   585.00 |    1.530 |    83.66 |   57.543 |   571.67 |

  Carnice-Qwen3.6-MoE-35B-A3B-Q4_K_S
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.205 |  2499.78 |    1.483 |    86.31 |    1.688 |   379.16 |
  |  1024 |    128 |    1 |   1152 |    0.434 |  2361.36 |    1.448 |    88.40 |    1.882 |   612.25 |
  |  2048 |    128 |    1 |   2176 |    0.947 |  2161.87 |    1.478 |    86.62 |    2.425 |   897.27 |
  |  4096 |    128 |    1 |   4224 |    2.259 |  1813.00 |    1.472 |    86.94 |    3.732 |  1131.98 |
  |  8192 |    128 |    1 |   8320 |    5.892 |  1390.42 |    1.505 |    85.06 |    7.397 |  1124.85 |
  | 16384 |    128 |    1 |  16512 |   17.397 |   941.77 |    1.568 |    81.61 |   18.965 |   870.63 |
  | 32768 |    128 |    1 |  32896 |   56.296 |   582.07 |    1.690 |    75.74 |   57.986 |   567.31 |

  Nemotron-Cascade-2-30B-A3B-IQ4_XS
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.195 |  2622.33 |    0.972 |   131.69 |    1.167 |   548.30 |
  |  1024 |    128 |    1 |   1152 |    0.407 |  2514.76 |    0.934 |   137.10 |    1.341 |   859.16 |
  |  2048 |    128 |    1 |   2176 |    0.854 |  2396.99 |    0.942 |   135.90 |    1.796 |  1211.42 |
  |  4096 |    128 |    1 |   4224 |    1.895 |  2161.89 |    0.953 |   134.36 |    2.847 |  1483.50 |
  |  8192 |    128 |    1 |   8320 |    4.593 |  1783.70 |    0.967 |   132.43 |    5.559 |  1496.60 |
  | 16384 |    128 |    1 |  16512 |   12.213 |  1341.53 |    0.996 |   128.56 |   13.209 |  1250.10 |
  | 32768 |    128 |    1 |  32896 |   36.998 |   885.66 |    1.059 |   120.89 |   38.057 |   864.39 |

  On MBP M2 Max

  Qwen3.6-35B-A3B-UD-Q6_K_XL
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.540 |   947.31 |    2.489 |    51.42 |    3.030 |   211.22 |
  |  1024 |    128 |    1 |   1152 |    0.951 |  1077.21 |    3.237 |    39.54 |    4.188 |   275.10 |
  |  2048 |    128 |    1 |   2176 |    2.994 |   684.10 |    3.139 |    40.77 |    6.133 |   354.80 |
  |  4096 |    128 |    1 |   4224 |    6.245 |   655.85 |    3.210 |    39.88 |    9.455 |   446.75 |
  |  8192 |    128 |    1 |   8320 |   12.411 |   660.08 |    3.284 |    38.98 |   15.694 |   530.13 |
  | 16384 |    128 |    1 |  16512 |   28.321 |   578.51 |    3.584 |    35.71 |   31.905 |   517.53 |
  | 32768 |    128 |    1 |  32896 |   65.725 |   498.56 |    4.029 |    31.77 |   69.754 |   471.60 |

  Nemotron-Cascade-2-30B-A3B-Q8_0
  |    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
  |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
  |   512 |    128 |    1 |    640 |    0.528 |   969.13 |    2.036 |    62.87 |    2.564 |   249.59 |
  |  1024 |    128 |    1 |   1152 |    1.079 |   948.84 |    3.201 |    39.99 |    4.280 |   269.15 |
  |  2048 |    128 |    1 |   2176 |    3.390 |   604.10 |    2.952 |    43.36 |    6.342 |   343.11 |
  |  4096 |    128 |    1 |   4224 |    6.756 |   606.28 |    2.991 |    42.79 |    9.747 |   433.35 |
  |  8192 |    128 |    1 |   8320 |   13.647 |   600.30 |    3.061 |    41.81 |   16.708 |   497.97 |
  | 16384 |    128 |    1 |  16512 |   29.491 |   555.56 |    3.414 |    37.50 |   32.905 |   501.81 |
  | 32768 |    128 |    1 |  32896 |   65.867 |   497.49 |    3.663 |    34.95 |   69.530 |   473.12 |

Dang, I saw some lowish numbers there for the Spark (and Strix). I was eyeing a Spark to get some CUDA exposure... :-O



