Hacker News

A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

529 points 234 comments cafkafk 2026-06-01 15:38

Comments (7)
cafkafk 2026-06-01 15:42
Hi HN. I wrote this post after getting frustrated by how few ways there are to run the new Gemma 4 Drafter models, with mainstream tools not prioritizing them and hiding all the performance levers. I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow. I've also linked the quants at the end, but they won't run unless you use the ik_llama-cpp fork I mention; see other posts for more details. I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any questions, I can try to answer on a best-effort basis.
fragmede 2026-06-01 16:07
(purple on black is really hard to read) You say it runs "at reading speed". Have you benchmarked it?
Eonexus 2026-06-01 16:18
I wonder what the tokens per second actually are. Yes, it does say "reading speed" but that varies for everyone, no?
potus_kushner 2026-06-01 16:20
@cafkafk got a recommendation for a good model that fits into 64GB and leaves a couple of GB free for other tasks?
cafkafk 2026-06-01 16:35
Honestly, at this point you're probably looking at a smaller model. For the Gemma series I'd go with Gemma 4 E4B with drafters, but that's just a hunch from using it on my laptop (where I do have an RTX 4060 M and 96GB of RAM). So you'd change the invocation slightly here, but there's a lot you can potentially reuse. That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.
cafkafk 2026-06-01 16:37
That is a very fair point! I just ran a not very scientific benchmark with the system under load, and posted the raw logs in a sibling comment above, but the short answer is that it's hitting 11.94 tokens per second for generation - while it's also being a binary cache and CI build server. Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
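For anyone wanting to sanity-check the reading-speed figure, here is a rough sketch of the arithmetic. The tokens-per-word ratios are an assumption on my part (they are not from the post, and the real value depends on the tokenizer and the text); the function name is just illustrative.

```python
def tokens_per_second(words_per_minute: float, tokens_per_word: float) -> float:
    """Convert a reading speed in words per minute to tokens per second."""
    return words_per_minute / 60.0 * tokens_per_word


# Assumed tokens-per-word ratios for English text (hypothetical, not measured).
ASSUMED_RATIOS = (1.2, 1.3, 1.4)

if __name__ == "__main__":
    for ratio in ASSUMED_RATIOS:
        rate = tokens_per_second(250, ratio)
        print(f"{ratio:.1f} tokens/word -> {rate:.2f} tokens/s")

    # Compare the generation speed reported under load with the reading rate.
    measured = 11.94
    headroom = measured / tokens_per_second(250, 1.3)
    print(f"headroom at 1.3 tokens/word: {headroom:.1f}x")
```

With ratios in that range, 250 wpm works out to roughly 5.0 to 5.8 tokens per second, so the 11.94 tokens per second measured under load is about twice reading speed.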
christkv 2026-06-01 17:04
Makes you wonder if it's possible to squeeze more tps out of a Strix Halo system using the 16 Zen 5 cores as well as the GPU.
