Jamesob's guide to running SOTA LLMs locally

(github.com)

171 points | by livestyle 5 hours ago

23 comments

Aurornis 3 hours ago
I play with local LLMs a lot. I've spent more on hardware than I should. I'm friends with a local group of people who have spent a lot more than I have.
The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.
Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware. You will read a lot of claims that 4-bit quantization is lossless, but those claims come from KL divergence measurements on a small corpus. Use one of these 4-bit models on long context coding tasks and the quality will be noticeably lower. Even for non-coding tasks like dataset analysis, I can measure a substantial quality difference between 4-bit models, 8-bit quants, and sometimes even the full 16-bit source.
This article is also encouraging the use of a REAP model, which means someone has cut out some of the weights to make it smaller. The idea is to remove weights that are less useful for certain tasks, but again this is going to reduce the overall quality of the output.
The trap is that people say "I'm running GLM-5.2 locally!" and it sounds amazing when you look at the GLM-5.2 benchmarks. However they're not actually running GLM-5.2, they're running a model derived from GLM-5.2 that discards most of the bits and drops some of the experts. It does not perform the same as what you see in the benchmarks. In my experience, the divergence between a quantized/REAP model and the parent model is unnoticeable when you try it on very small tasks or chat, but becomes painful when you start trying to use it on long-horizon tasks where little errors start compounding.
Then you get into the slippery slope of thinking you're $50K deep into this project, but what you really need is just one or two more of those $12K GPUs to use the next level of quantization that might improve the quality a little more and make your investment worthwhile...
- zozbot234 1 hour ago
  > The warning I would have for everyone is to temper your expectations and read the fine print carefully. The big build in the article starts off with a $40K budget and then includes 4 GPUs that are $12K each. For those doing the math, this build is going to cost more like $50-55K.
  > Local setups also often rely on quantization and techniques like REAP to fit the models on their hardware.
  This seems to ignore the very real possibility of running SOTA models at full precision on ordinary local hardware using SSD offload. Yes this will be slow and usually have very low throughput (even batched decode can only achieve so much before power and thermal limits become important, and that still leaves you with slow prefill as a major bottleneck) but that's OK if you aren't expecting a real-time response to begin with and your volumes as a single user are low enough.
  - Aurornis 1 hour ago
    SSD streaming throughput is too slow to be usable.
    GLM-5.2 has 40B active parameters at a time. At Q4 that's 20GB. The best PCIe 5 SSDs can get 15GB/sec when everything goes well. Every expert load would take more than a second.
    If you had enough RAM and enough SSDs in parallel you might get a couple tokens per second on a good day. If you left this machine running 24 hours straight, you might be able to get 200,000 tokens generated.
    So it can be done, but only if you interact with your LLM like you're e-mailing someone back and forth and you're okay waiting until tomorrow for a response.
    You would spend $50K to buy a machine that consumes 2000W and takes all day to produce as many tokens as I could buy on OpenRouter for $0.60. You would spend $5-15 on electricity depending on where you live.
    If you have no other option but to process data locally and you must use a very large model and you aren't in a rush, this can do it. I would not recommend it unless you're desperate and operating inside of rigid constraints.
    - CuriouslyC 59 minutes ago
      You can improve that with speculative preload. I'm sure models could be designed and tuned around efficient SSD offloading to keep throughput pretty high.
      - searealist 10 minutes ago
        It would apply equally to GPU or RAM inference as those are also bandwidth constrained on decode, so people already try to optimize for it.
      - rsalus 26 minutes ago
        surely the supply of unified memory will rise to meet demand before this is needed
- odo1242 1 hour ago
  This is similar to my experience with (8-bit quantized, non-MOE, 26b) Qwen locally on my computer. It’s really good for small tasks, but the first time I tried to do a major task with it, it straight up forgot what agent harness it was in and started using the wrong format for tool calls lol
  (If you’re curious, it was running in Pi, but somehow convinced itself it was running in Claude instead and started trying to call Claude tools that didn’t exist)
- FuckButtons 1 hour ago
  I’ve found ds4 on my mbp to be very useful, bought before ram prices became insane. It’s not writing entire applications on its own, but it has resolved annoying networking issues on my tailnet that I had neither the time nor inclination to figure out on my own, and I often find myself reaching for it for simple but annoyingly research-intensive tasks that I wouldn’t have otherwise gotten to. Is it opus? No, but is it useful? Absolutely, and I don’t have to worry about whether or not I’m getting value out of a subscription or the api cost of using it.
- vient 40 minutes ago
  Wonder if the AMD MI350P release will affect setups like this. From what I've heard, the price will be pretty similar to RTX PRO 6000 while having 50% more VRAM, which is additionally HBM3E instead of GDDR7.
- bloat 57 minutes ago
  They do say the cards were purchased when they were cheaper. They debuted at less than nine grand apparently.
- ttoinou 2 hours ago
  Well you could make a REAP with better input prompts on longer context then. It’ll improve the REAP quality
- CamperBob2 3 hours ago
  All very true. Right now, running GLM 5.2 at full BF16 precision needs 1.5 TB of VRAM. You can't run this locally at a usable speed for less than $250K or so, and frankly I'd be surprised if it could be done for less than $500K.
  The best NVFP4 quant for 5.2 appears to be lukealonso's at https://huggingface.co/lukealonso/GLM-5.2-NVFP4, and it is capable of good throughput (75-100 tps) without losing much reasoning performance. Allowing for overhead for the KV cache and other requirements, this quant will (barely) run in 8-way tensor-parallel mode on 8x RTX 6000 cards. Not too long ago it was possible to put an 8x machine together for less than $100K USD, but that's probably not true now, assuming you buy all-new components.
  It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers. If I hadn't already put a similar rig together, I'd be kicking myself. But getting it running well is by no means as simple as buying a bunch of RTX6K cards and calling it a day, and people need to know what they're getting into.
  Local AI is in its Altair and IMSAI days. There's no turnkey Apple II or C64 on the market yet, much less an IBM PC. Hardware, yes -- you can buy a capable box off the shelf from various vendors -- but you have to be prepared to take up a whole new hobby when it comes to getting a complete system working well.
  - Aurornis 3 hours ago
    > It'll almost certainly be worth it, given the abusive behavior we've seen and will continue to see from the major closed-model providers.
    The proper financial comparison for GLM-5.2 would be one of the providers on OpenRouter or renting a server as needed. Compare apples to apples.
    You will almost certainly never break even compared to paying per token.
    Local LLMs at this scale are only worth it if you have extremely strict requirements that data not leave the premises.
    - jobeirne 3 hours ago
      Or if you want to hedge against the various tail risks of third-party providers raising prices or denying you service or somehow abusing your data...
      - Aurornis 3 hours ago
        > hedge against the various tail risks of third-party providers raising prices
        They could 10X the prices and you’d still be better off. It’s also unlikely that prices go up enough to warrant a $100K local investment to prevent paying a couple bucks per million tokens.
        > or denying you service
        I guess you’re not familiar with OpenRouter? There are many providers there. There are providers outside of OpenRouter. There will always be someone to take your business.
        > or somehow abusing your data...
        If data security is your concern, then you’re still better off renting a server as needed.
        If you cannot tolerate any data leaving, then local models are the only way. You pay a high premium for it!
      - incrudible 2 hours ago
        Raising prices is not a tail risk. Anything a local LLM setup can do for you can be done by any cloud provider with the same capex as yours (or less). There is no moat here, so it is highly price competitive and will remain so. If you want to speculate on hardware shortages, that is a different business altogether and you need no janky garage setup to profit.
    - CamperBob2 3 hours ago
      Also agreed, it's definitely a sucker's game to run a high-end model locally, by any objective measure.
      Still... if it's not your weights, running on your box, you're always going to be behind somebody else's 8-ball. Everybody has to decide for themselves where their priorities lie.
GTP 18 minutes ago
There also exists an in-between possibility, that is, if you get 128GB of vram (there are now multiple options in the market to get that amount with a unified memory architecture) you can run DeepSeek V4 flash at good speed via DwarfStar. I'm not going to spend money on this, but my gut feeling is that this would be the right compromise for a lot of people.
datadrivenangel 4 hours ago
"A great way to go is 2x RTX 3090s for a total of 48GB VRAM total. You can then run Qwen3.6-27B, which is an awesome model."
Just want to note that for $3k you can get an M5 macbook pro with 48gb of shared memory, and it will not be a giant box. Also, consider committing to spending that money on a cloud hosting provider, which will be at least somewhat cheaper if not significantly cheaper. It is awesome being able to run models locally though.
- satvikpendem 8 minutes ago
  To summarize a video I saw recently [0] rebutting your arguments, MacBooks can get crazy slow when running local models or even just Claude Code and Codex due to their poor implementation, to the point that the laptop itself becomes unusable.
  There are other arguments for running an ssh-able box in a closet somewhere too. With KVMs you can give an agent remote control over the machine itself, so it has vastly more capabilities than if it were controlling the machine it's running on, and you don't need to keep the MacBook open all the time just to have the agent finish running.
  [0] https://youtu.be/9tGrhrVKCrE
- mips_avatar 36 minutes ago
  The cool thing about the 3090s is the RAM bandwidth. Token generation is mostly bottlenecked on memory bandwidth. Dual 3090s have 1.87 TB/s memory bandwidth (0.936 TB/s each), vs the M5 Macbook pro with only 0.3 TB/s (max chip has up to 0.63 TB/s but it's a $10k machine at that config).
  This translates to qwen 27b actually working fast enough for useful work on dual 3090s and being painfully slow on Macbook Pros. Also if you're running a big model on a macbook pro the UI gets laggy and the keyboard gets hot. Much better to run dual 3090s in your basement and connect to them from your Macbook.
  - CobaltFire 16 minutes ago
    $4.8k for 48GB Max (what the parent said). Half of your quote.
    Even a 128GB is $6.8k today. Still only 2/3 your quote.
    Bandwidth is relevant (I have both a 5090 and an M4 Max 128GB Studio, so have direct comparison right here), but quote the cost appropriately!
    - mips_avatar 4 minutes ago
      You need the 128gb ram config to get the 614 GB/s bandwidth (which is $6999). You could skip upgrading the storage to save money, but I think most people upgrade the storage too, at which point it's $8-10k + tax.
- LeBit 3 hours ago
  I’m an idiot who is unable to project myself into situations I’ve never experienced before.
  So, I always thought local LLMs were toys not worth pursuing.
  Only once I tried something decent like Gemma 4 31B and Qwen 3.6 27B did I realize how incredibly useful they are.
  You stop fearing you are sharing sensitive information.
  You stop fearing you will run out of tokens.
  You stop fearing for the availability of the remote AI.
  Local LLMs are extremely valuable.
  - bityard 3 hours ago
    *for many tasks
- WithinReason 2 hours ago
  I'm running Qwen3.6-27B on a single 24GB GPU at 80 tok/s, you don't even need 2 of them
  - npodbielski 23 minutes ago
    Yeah, but 4-bit very often loops needlessly. That is not that bad because you do not pay for tokens, but you paid for hardware and you want to use it for something useful. Q6 is better, but then you get like 40t/s prefill, which is really tiring. But at least it says sorry when you ask it what is wrong! I heard there is some extension for Pi preventing that. I need to look into it. Otherwise I am quite happy.
- jbellis 3 hours ago
  That's a reasonable option, just be aware that you get about 1/3 as much memory bandwidth with the M5 Pro, or 2/3 with the M5 Max [now you're at $4100 for the lowest-end]. So both your prefill (flops-bound, M5 has a lot less) and decode (bw-bound) will be slower.
- Aurornis 3 hours ago
  I have an M5 MacBook Pro and I also have a separate GPU setup for running models. The difference in speed is significant. It's not just token generation speed, but time to first token (prompt processing).
  The M5 hardware is amazing for what it is, but GPUs are still so much faster.
  Running the models on the GPU box also means I can use the laptop on my lap instead of turning it into a hot plate.
  - amelius 2 hours ago
    What is your GPU setup?
- amelius 2 hours ago
  You can also buy a Jetson Orin with 64GB of unified memory.
- boredatoms 3 hours ago
  The standalone mini/studio is better if you dont want to have a constantly hot laptop
  Get a regular laptop and use the network to access the LLM
rishabhaiover 18 minutes ago
This is a great guide. However, the economics just do not work in my favor at all. Even if I were to spend $2k, I get much more flexibility of model intelligence and choice from a provider for $20/month.
maxignol 4 minutes ago
I couldn't find how many tokens per second he achieved with this setup?
jacobgold 1 hour ago
> "~$40k At this price level, you get the next step up in model intelligence. Something pretty close to Claude Opus."
That is equivalent to roughly 16.7 years of Claude Opus 4.8 or Codex GPT 5.5 at $200/mo.
I'm a huge fan of running local models, but they're still wildly expensive, lower quality, and possibly dangerous (if backdoored). I sincerely wish this wasn't the case.
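For anyone who wants to redo this arithmetic with their own numbers, here is a minimal sketch. It assumes a flat monthly spend and ignores electricity, resale value, and depreciation, so treat it as a rough comparison only. The function name `breakeven_months` is made up for illustration.

```python
def breakeven_months(hardware_cost: float, monthly_spend: float) -> float:
    """Months of subscription or API spend that add up to the hardware price.

    ASSUMPTION: flat monthly spend; power, resale value and depreciation ignored.
    """
    if monthly_spend <= 0:
        raise ValueError("monthly_spend must be positive")
    return hardware_cost / monthly_spend


# Figures from the comment above: a ~$40k build vs. a $200/mo plan.
months = breakeven_months(40_000, 200)
years = months / 12
print(f"{months:.0f} months (~{years:.1f} years)")  # 200 months (~16.7 years)
```

Plugging in a higher monthly figure (for example, metered API pricing instead of a flat plan) shortens the break-even time proportionally.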
- simonw 1 hour ago
  That $200/month is already more like $4,000/month if you have to pay full API pricing - "enterprise" companies for example. That drops the equivalent to 10 months.
  (I'd be surprised if that local rig really can drive the equivalent of $4,000/month of API spend though, given that a local rig can run prompts in parallel a lot less effectively than Anthropic's many data centers.)
- verdverm 1 hour ago
  You can use a lot more tokens on hardware than you can spend on a $200/m plan.
  I went through 1B tokens my first month with an OEM spark. That's more than $1k of opus. Not a fair comparison, because token patterns are different, but since that time I have also seen a 2-3x improvement in the speeds from improvements in vllm (mainly MTP). DiffusionGemma is around 4x regular gemma.
- echelon 1 hour ago
  Stop trying to run them locally, folks.
  You don't own your fiber connection. So why try to own another rapidly depreciating, expensive, and annoying asset?
  Rent cloud GPUs!
  You get to participate in the ownership, data control, price control, and hacking culture without having to Frankenstein some hobbyist box that costs a ton, is distilled down to functional uselessness, and is a PITA to maintain.
  - satvikpendem 5 minutes ago
    If I'm gonna rent cloud GPUs I might as well just use a subsidized cloud agent like Claude or Codex. As for depreciation, that is true, but the bet is that models get better for a certain parameter count faster than your hardware becomes obsolete, such as Gemma models for example at the same 30 billion parameter count being much better than some years ago.
3eb7988a1663 1 hour ago
Related - what is the best isolation system available? Do I have to go full, fat VMs or can I get by with a Firecracker-like thing?
Seemingly every available option has some subtle gotchas about how easy it is to blow off your foot and effectively have no security at all. I use VMs because I actually trust that security is a foundational principle of the technology, not a well-if-you-use-these-20-flags-and-squint kind of deal.
- Catloafdev 58 minutes ago
  It depends - for what? If your security model is sandboxing an agent to ensure they don't nuke your PC, then there are a lot of options, you can use something like bubblewrap[1] or a microVM like libkrun[2] if your goal is light-weight, up to full Docker if you want the tooling that comes with that.
  [1] https://github.com/containers/bubblewrap
  [2] https://github.com/libkrun/libkrun
- ZiiS 1 hour ago
  Full fat VMs with GPU passthrough I trust a lot less than CPU ones.
  - elsombrero 1 hour ago
    from my understanding, you can run the inference server (llama.cpp/vllm/whatever) and the agent/harness in different contexts, even different machines.
    The risky part is in the agent/harness and what tools it has access to.
    You don't need to give GPU passthrough to the VM running the agent/harness.
    There is still a risk of a prompt messing with the inference server, but I think that's a much lower risk compared to an agent doing whatever on its own.
    - dofm 50 minutes ago
      Right. All my experiments are naïve, I am sure, but I run the LLM on the host and expose it via OpenAI API to the VMs.
      This approach requires that you trust the llama.cpp codebase, essentially. It might be reasonable not to.
      I suppose in principle there is the risk of a prompt exploit corrupting the inference server.
kgeist 3 hours ago
>$40k gets you almost-Opus
GLM 5.2 is "almost Opus," and it needs at least 8xH200s for comfortable inference (so it's closer to $400k than $40k).
They suggest using this modified model:
>A REAP-pruned (≈22% of experts removed), Int8-mix NVFP4 quantized version of GLM-5.2, ≈594B parameters.
I wonder how it behaves in practice outside of benchmarks. Qwen3.6, even at 6-bit quantization, often gets stuck in loops while reasoning. And here they've also removed some experts. I mean, sometimes an 8-bit or 16-bit small model can be smarter than a lobotomized large model. I heard the consensus is you shouldn't go below 8 bit for coding.
Also, it's not clear what is left of the available context when you try to fit a lobotomized model into 4 RTX 6000s. Anything below 100k is barely usable because it often hits compaction before it's able to gather the necessary context. P.S. Found in the repos: 240k context.
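A rough way to see how tight the fit is: estimate the weight footprint from the parameter count and bits per weight, then subtract it from total VRAM. This is a back-of-the-envelope sketch, not a measurement. The 594B figure comes from the quoted model description; the 4.5 bits per weight (4-bit values plus scale overhead, and the Int8 mix is not accounted for) and the 4 x 96 GB total are assumptions.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10^9 bytes)."""
    return params_billion * bits_per_weight / 8


PARAMS_B = 594          # from the quoted description: ~594B parameters after pruning
BITS_PER_WEIGHT = 4.5   # ASSUMPTION: 4-bit values plus per-block scale overhead
TOTAL_VRAM_GB = 4 * 96  # ASSUMPTION: four 96 GB cards (384 GB total)

weights_gb = weight_footprint_gb(PARAMS_B, BITS_PER_WEIGHT)
headroom_gb = TOTAL_VRAM_GB - weights_gb

print(f"weights:  ~{weights_gb:.0f} GB")
print(f"headroom: ~{headroom_gb:.0f} GB for KV cache, activations, and runtime overhead")
```

Under these assumptions the weights alone take most of the 384 GB, which is why the remaining context budget is worth checking before buying hardware.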
- amelius 3 hours ago
  How does this work with scaling?
  I assume you can then somehow run several hundreds of prompts concurrently?
- CamperBob2 3 hours ago
  You can get 1M context with the lukealonso NVFP4 quant on 8x RTX6000s, which remains coherent and useful through at least 400k. No real need to run 8x H200s unless you just want to. Or unless you need to serve many concurrent users or agents on a regular basis.
turova 3 hours ago
For qwen3.6-27b you can also run the q4 variant with full ~250K context on one 3090. It's fast enough to not be frustrating so the speed gains with 2x 3090s wouldn't be worth it to me. Running a q6 on 2x 3090s at half the speed with a smaller context is an option, but you're really not going to compete with SOTA models there anyway so unless you already have 2x 3090s, I would say 1 is the best investment given current prices. It's good enough to do a lot, especially with a well-configured harness.
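One thing worth checking before copying a setup like this is the KV cache precision, since it trades directly against context length. The sketch below assumes the KV cache grows linearly with both token count and bits per cached value, and that the memory left over after weights is fixed. It also assumes the ~250K figure was reached with a 4-bit KV cache, which the comment does not state. Quantization scale overhead is ignored, and `equivalent_context` is a hypothetical helper name.

```python
def equivalent_context(ref_tokens: int, ref_bits: float, new_bits: float) -> int:
    """Tokens that fit in the same KV-cache memory at a different cache precision.

    ASSUMPTION: cache size is proportional to tokens * bits per value.
    """
    return int(ref_tokens * ref_bits / new_bits)


REF_TOKENS = 250_000  # the ~250K context mentioned above
REF_BITS = 4          # ASSUMPTION: that context was reached with a 4-bit KV cache

for bits in (4, 8, 16):
    tokens = equivalent_context(REF_TOKENS, REF_BITS, bits)
    print(f"{bits:>2}-bit KV cache: ~{tokens:,} tokens in the same memory")
```

So if the KV cache is moved to 8-bit for better long-context recall, the same card holds roughly half the context.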
- nabakin 14 minutes ago
  Are you running qwen3.6-27b on one 3090 with your KV cache at q4? Ime there is significant long-context recall accuracy degradation at that precision. I prefer putting the KV cache at q8 and working with the 120k context
- hypfer 2 hours ago
  That math (250k context, Q4 model, 24GB VRAM) only checks out at q4 quant for the K/V cache, which is probably not the best idea.
QuantumNoodle 8 minutes ago
$2k or $40k? One of those is not "self host."
chompychop 2 hours ago
Is Whisper still considered SOTA for STT? Since it came out years ago, I'd have assumed there are better models by now.
- randomblock1 2 hours ago
  No, there are quite a few models which are smaller, more accurate, and faster. For example Parakeet TDT v3 is half the size, way faster, and lower WER. There's also Voxtral, which is much larger but also even more accurate.
  But the ecosystem isn't as mature, so Whisper is still a valid option, even now. For example, Parakeet uses the NeMo framework (made by Nvidia), which normally needs CUDA, so on AMD you need to use an ONNX version instead. Meanwhile Whisper has vLLM and desktop apps like Buzz.
  There aren't many benchmarks and they often don't have all the models, since STT doesn't get nearly as much attention as normal LLMs, but this is one of the more complete ones: https://artificialanalysis.ai/speech-to-text/non-streaming
- simonw 59 minutes ago
  I'm a big fan of Parakeet v3 - I run it using the MacWhisper app, it's a 494MB model and the quality is excellent.
- venusenvy47 2 hours ago
  I don't have anything to compare against, since I have just started using it. But I was fairly happy with it on my personal recordings from my phone. Also, I ran it on my CPU (Core i7) and it was perfectly usable, as something to run when not using the machine for anything else.
beardsciences 4 hours ago
I am somewhere in the middle, where I want something with more than 48GB/$2k of VRAM, but less than 384GB/$40k.
I'm curious if GMKtec's EVO-X2, with ~96GB of usable VRAM, is still a good solution for something like this for $3,399.
- sampullman 4 hours ago
  I picked up the 128gb version when it was $2,199 and it runs Qwen 3.6 reasonably well with a 128k context. Not very useful for complex tasks but it can handle some web stuff.
- mft_ 3 hours ago
  It has lower memory bandwidth than most comparable Macs.
- verdverm 53 minutes ago
  I've been happy with an OEM Spark (128G), enough so that I picked up a second one. Have 2x qwen and 1x gemma (both at 8bit and full context), plus embedding, Re-Ranker, and a 1.7B for little things. Running 6x models, probably going to add STT here soon, want to try talking more than typing.
  The caveat is that if you try to use multiple models on the same device at the same time, you thrash and destroy tok/s
c4pt0r 56 minutes ago
Local open weight models will definitely be a future trend. Imagine if an Opus-level model could run locally: many more latent use cases would likely emerge, since Opus is priced so high. Perhaps the future will be a multi-model architecture, where frontier models handle planning and local models carry out the concrete execution.
zackify 3 hours ago
You can get amazing local STT using parakeet which can use as little as 600mb of vram. As good as or better than whisper v3 large
- subhobroto 2 hours ago
  [flagged]
maxxxml 38 minutes ago
What harness is the best for local LLMs? I've been researching optimizing local LLM agent harness performance with context/ tools. Quite the endeavor and would love to learn what users prefer for this type of workflow.
- npodbielski 32 minutes ago
  I like vibe and pi. Vibe just looks nice and is good enough. But pi extensibility is just another level. There is also Dirac that is quite OK but seems to be full of bugs. Zerostack is the simplest harness I've seen. OpenCode is OK too. I have not tried the rest.
wxw 3 hours ago
I agree that local LLMs are the likely future and worth investing in… but at $40k for possible-SOTA right now, this isn’t worth it for the average consumer.
I’m pretty bullish that Apple will deliver something very competitive for the average consumer in the next couple years.
bcjdjsndon 58 minutes ago
If you can run sota on a 40k setup, why do openai etc spend maybe 100x that?
- dwroberts 57 minutes ago
  Obvious one: Because they are serving it to millions of people at the same time, not just one local user
Avicebron 59 minutes ago
Does anyone know any good data center to home conversion kits for gear?
bobkb 1 hour ago
Very useful. The whisper setup is something similar to what we have been using. The LLM setup though is outstanding.
api 3 hours ago
Apple M series chips deserve a mention as another option, especially since you get a whole Mac laptop or desktop workstation too.
They have unified memory and respectable inference performance, and for some variations can be cheaper than video cards, especially if you get an older-gen high-end M series with a lot of RAM used or refurbished.
I've read that Apple has plans once the RAM bottleneck passes to offer more RAM in all their models, and that future M series GPUs and NPUs will be even better for local inference, so in the future I expect Apple to be a serious offering for local inference and AI research workstations.
And what about AMD and Intel Arc GPUs? They don't get as much love but I've heard they can be compelling for certain shapes of a local LLM configuration.
At this point though, I think we may be in a "renter's market" for LLM compute. If you want privacy it might be better to rent GPU time in raw form or use spot pricing at various providers. It probably only makes sense to build if you have extreme privacy/security needs or just want to do it cause it's cool.
- mwcampbell 2 hours ago
  > once the RAM bottleneck passes
  Do we have evidence that this will actually happen? Maybe the belief that it won't pass is what requires evidence, but I think there's a widespread feeling right now that things are just getting permanently worse and this is one example.
- maxxxml 38 minutes ago
  MLX is super underrated right now, tons of performance unlocked recently. Love to see it!
whalesalad 1 hour ago
why in god's name is an RTX PRO 6000 $13,000? supply and demand?
xela79 3 hours ago
did he call Qwen a SOTA model?
- mft_ 3 hours ago
  No, he’s running GLM 5.2, which is closer to SOTA.
- verdverm 51 minutes ago
  It can be considered SOTA within its size category. Very useful for many things. You still want access to big models, I recommend OpenCode Go if you want to stay with open models.
maxothex 4 hours ago
[flagged]