Big GPUs don't need big PCs

(jeffgeerling.com)

62 points | by mikece 2 hours ago

9 comments

  • yjftsjthsd-h 1 hour ago
    I've been kicking this around in my head for a while. If I want to run LLMs locally, a decent GPU is really the only important thing. At that point, the question becomes, roughly: what is the cheapest computer to tack on the side of the GPU? Of course, that assumes everything does in fact work; unlike OP, I am barely in a position to understand e.g. BAR problems, let alone try to fix them, so what I actually did was build a cheap-ish x86 box with a half-decent GPU and call it a day :) But it's still stuck in my brain: there must be a more efficient way to do this, especially if all you need is just enough computer to shuffle data to and from the GPU and serve it over a network connection.
    • binsquare 6 minutes ago
      I run a crowd-sourced website that collects data on the best and cheapest hardware setups for local LLMs here: https://inferbench.com/

      Source code: https://github.com/BinSquare/inferbench

    • tcdent 41 minutes ago
      We're not yet to the point where a single PCIe device will get you anything meaningful; IMO, 128 GB of RAM available to the GPU is essential.

      So while you don't need a ton of compute on the CPU, you do need the ability to address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.
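
      For reference, on Linux you can check how many lanes each device actually negotiated straight from sysfs. A rough sketch (the sysfs attribute names are standard; the script itself is just illustrative):

          import glob, pathlib

          # print negotiated vs. maximum PCIe link width for every device that reports one
          for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
              p = pathlib.Path(dev)
              cur, mx = p / "current_link_width", p / "max_link_width"
              if cur.exists() and mx.exists():
                  print(p.name, "x" + cur.read_text().strip(), "of x" + mx.read_text().strip())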

      • skhameneh 18 minutes ago
        There is plenty that can run within 32/64/96 GB of VRAM. IMO models like Phi-4 are underrated for many simple tasks. Some quantized Gemma 3 variants are quite good as well.

        There are larger/better models as well, but those tend to really push the limits of 96 GB.

        FWIW, when you start pushing into 128 GB+, the ~500 GB models really start to become attractive, because at that point you're probably wanting just a bit more out of everything.
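
        As a rough back-of-the-envelope (all numbers here are assumptions, not benchmarks), weights take about params × bits ÷ 8 bytes, plus KV cache and runtime overhead:

            # crude VRAM estimate: weights + KV cache + runtime overhead (assumed figures)
            def vram_gb(params_billions, bits_per_weight, kv_cache_gb=2.0, overhead_gb=1.5):
                weights_gb = params_billions * bits_per_weight / 8  # 1B params at 8-bit ~ 1 GB
                return weights_gb + kv_cache_gb + overhead_gb

            print(vram_gb(14, 4))   # ~10.5 GB -- a Phi-4-class model at 4-bit
            print(vram_gb(27, 4))   # ~17 GB   -- a Gemma 3 27B-class model at 4-bit
            print(vram_gb(70, 8))   # ~73.5 GB -- a 70B model at 8-bit pushes 96 GB territory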

    • seanmcdirmid 23 minutes ago
      And you don't want to go the M4 Max/M3 Ultra route? It works well enough for most mid-sized LLMs.
    • zeusk 55 minutes ago
      Get the DGX Spark computers? They’re exactly what you’re trying to build.
    • dist-epoch 35 minutes ago
      This problem was already solved 10 years ago: crypto-mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.

      > Asus made a crypto-mining motherboard that supports up to 20 GPUs

      https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...

      For LLMs you'll probably want a different setup, with more memory and some M.2 storage.

      • zozbot234 2 minutes ago
        M.2 is mostly just a different form factor for PCIe anyway.
      • jsheard 30 minutes ago
        Those only gave each GPU a single PCIe lane though, since crypto mining barely needed to move any data around. If your application doesn't fit that mould then you'll need a much, much more expensive platform.
        • dist-epoch 28 minutes ago
          After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.
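
          To put rough numbers on that one-time load over a narrow link (approximate theoretical link rates, model size assumed):

              # one-time cost of pushing model weights over the PCIe link (approximate rates)
              weights_gb = 40  # e.g. a ~70B model at ~4-bit quantization (assumed)
              links = [("PCIe 3.0 x1", 1.0), ("PCIe 3.0 x16", 16.0), ("PCIe 4.0 x16", 32.0)]
              for name, gb_per_s in links:
                  print(f"{name}: ~{weights_gb / gb_per_s:.1f} s to load {weights_gb} GB")
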
      • skhameneh 27 minutes ago
        In theory, it's only sufficient for pipeline parallelism, due to the limited lanes and interconnect bandwidth.

        Generally, scalability on consumer GPUs falls off somewhere between 4 and 8 GPUs for most workloads. Those running more GPUs are typically using a larger number of smaller GPUs for cost effectiveness.
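
        Rough intuition (the figures are illustrative): with pipeline parallelism only a single activation vector crosses a stage boundary per token, while tensor parallelism needs collective ops at every layer:

            # per-token inter-GPU traffic, pipeline vs. tensor parallel (very rough, fp16 activations)
            hidden_size, layers, bytes_per_val = 8192, 80, 2         # illustrative 70B-class shapes
            pipeline_bytes = hidden_size * bytes_per_val             # one activation handoff per stage boundary
            tensor_bytes = 2 * layers * hidden_size * bytes_per_val  # roughly two all-reduces per layer
            print(f"pipeline: ~{pipeline_bytes / 1e3:.0f} KB per token per stage boundary")
            print(f"tensor:   ~{tensor_bytes / 1e6:.1f} MB per token exchanged")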

  • 3eb7988a1663 1 hour ago
    Datapoints like this really make me reconsider my daily driver. I should be running one of those $300 mini PCs at <20 W. With ~flat CPU performance gains, it would be fine for the next 10 years. I can just remote into my beefy workstation when I actually need to do real work. Browsing the web, watching videos, and even playing some games are easily within their wheelhouse.
    • ekropotin 1 hour ago
      As an experiment, I decided to try using a Proxmox VM with an eGPU and the USB bus passed through to it as my main PC for browsing and working on hobby projects.

      It's just 1 vCPU with 4 GB of RAM, and you know what? It's more than enough for these needs. I think hardware manufacturers have falsely convinced us that every professional needs a beefy laptop to be productive.

    • reactordev 14 minutes ago
      I went with a Beelink for this purpose. Works great.

      Keeps the desk nice and tidy while “the beasts” roar in a soundproofed closet.

  • jonahbenton 2 hours ago
    So glad someone did this. I have been running big GPUs in eGPU enclosures connected to spare laptops and thinking, why not Pis?
  • Wowfunhappy 1 hour ago
    I really would have liked to see gaming performance, although I realize it might be difficult to find a AAA game that supports ARM. (Forcing the Pi to emulate x86 with FEX doesn't seem entirely fair.)
    • 3eb7988a1663 1 hour ago
      You might have to thread the needle to find a game which does not bottleneck on the CPU.
  • kristjansson 47 minutes ago
    Really why have the PCI/CPU artifice at all? Apple and Nvidia have the right idea: put the MPP on the same die/package as the CPU.
    • bigyabai 25 minutes ago
      > put the MPP on the same die/package as the CPU.

      That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.

  • lostmsu 27 minutes ago
    Now compare batched training performance. Or batched inference.

    Of course prefill is going to be GPU-bound. You only send a few thousand bytes to it, and don't really ask it to return much. But after prefill is done, unless you use batched mode, you aren't really using your GPU for anything more than its VRAM bandwidth.
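
    A crude way to see the single-stream ceiling (all numbers assumed, not measured): each generated token has to stream roughly the whole weight set out of VRAM, so bandwidth sets the upper bound.

        # single-stream decode is ~memory-bandwidth bound: every token re-reads the weights
        weights_gb = 20        # e.g. a ~40B model at 4-bit (assumed)
        vram_bw_gb_s = 1000    # e.g. a ~1 TB/s-class GPU (assumed)
        print(f"~{vram_bw_gb_s / weights_gb:.0f} tokens/s upper bound, single stream")
        # batching amortizes the weight reads across many requests, which is where batched mode wins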