GLM-4.7-Flash

(huggingface.co)

280 points | by scrlk 5 hours ago

25 comments

dajonker 4 hours ago
Great, I've been experimenting with OpenCode and running local 30B-A3B models on llama.cpp (4 bit) on a 32 GB GPU so there's plenty of VRAM left for 128k context. So far Qwen3-coder gives the me best results. Nemotron 3 Nano is supposed to benchmark better but it doesn't really show for the kind of work I throw at it, mostly "write tests for this and that method which are not covered yet". Will give this a try once someone has quantized it in ~4 bit GGUF.
Codex is notably higher quality but also has me waiting forever. Hopefully these small models get better and better, not just at benchmarks.
[-]
- latchkey 4 hours ago
  https://huggingface.co/unsloth/GLM-4.7-GGUF
  This user has also done a bunch of good quants:
  https://huggingface.co/0xSero
  [-]
  - WanderPanda 3 hours ago
    I find it hard to trust post training quantizations. Why don't they run benchmarks to see the degradation in performance? It sketches me out because it should be the easiest thing to automatically run a suite of benchmarks
    [-]
    - Miraste 2 hours ago
      Unsloth doesn't seem to do this for every new model, but they did publish a report on their quant methods and the performance loss it causes.
      https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs
      It isn't much until you get down to very small quants.
  - dajonker 3 hours ago
    Yes I usually run Unsloth models, however you are linking to the big model now (355B-A32B), which I can't run on my consumer hardware.
    The flash model in this thread is more than 10x smaller (30B).
    [-]
    - a_e_k 3 hours ago
      When the Unsloth quant of the flash model does appear, it should show up as unsloth/... on this page:
      https://huggingface.co/models?other=base_model:quantized:zai...
      Probably as:
      https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF
      [-]
      - homarp 3 hours ago
        it'a a new architecture. Not yet implemented in llama.cpp
        issue to follow: https://github.com/ggml-org/llama.cpp/issues/18931
      - dumbmrblah 3 hours ago
        One thing to consider is that this version is a new architecture, so it’ll take time for Llama CPP to get updated. Similar to how it was with Qwen Next.
        [-]
        cristoperb 45 minutes ago
        Apparently it is the same as the DeepseekV3 architecture and already supported by llama.cpp once the new name is added. Here's the PR: https://github.com/ggml-org/llama.cpp/pull/18936
    - latchkey 3 hours ago
      There are a bunch of 4bit quants in the GGUF link and the 0xSero has some smaller stuff too. Might still be too big and you'll need to ungpu poor yourself.
      [-]
      - disiplus 3 hours ago
        yeah there is no way to run 4.7 on a 32g vram this flash is something that im also waiting to try later tonight
        [-]
        omneity 2 hours ago
        Why not? Run it with vLLM latest and enable 4bit quantization with bnb, and it will quantize the original safetensors on the fly and fit your vram.
        [-]
        disiplus 1 hour ago
        because how huge glm 4.7 is https://huggingface.co/zai-org/GLM-4.7
        [-]
        omneity 46 minutes ago
        Except this is GLM 4.7 Flash which has 32B total params, 3B active. It should fit with a decent context window of 40k or so in 20GB of ram at 4b weights quantization and you can save even more by quantizing the activations and KV cache to 8bit.
        [-]
        disiplus 28 minutes ago
        yes, but the parrent link was to the big glm 4.7 that had a bunch of ggufs, the new one at the point of posting did not, nor does it now. im waiting for unsloth guys for the 4.7 flash
- behnamoh 3 hours ago
  > Codex is notably higher quality but also has me waiting forever.
  And while it usually leads to higher quality output, sometimes it doesn't, and I'm left with a bs AI slop that would have taken Opus just a couple of minutes to generate anyway.
polyrand 2 hours ago
I've been using z.ai models through their coding plan (incredible price/performance ratio), and since GLM-4.7 I'm even more confident with the results it gives me. I use it both with regular claude-code and opencode (more opencode lately, since claude-code is obviously designed to work much better with Anthropic models).
Also notice that this is the "-Flash" version. They were previously at 4.5-Flash (they skipped 4.6-Flash). This is supposed to be equivalent to Haiku. Even on their coding plan docs, they mention this model is supposed to be used for `ANTHROPIC_DEFAULT_HAIKU_MODEL`.
[-]
- RickHull 1 hour ago
  Same, I got 12 months of subscription for $28 total (promo offer), with 5x the usage limits of the $20/month Claude Pro plan. I have only used it with claude code so far.
  [-]
  - stogot 2 minutes ago
    Do they still have that promo offer?
vessenes 4 hours ago
Looks like solid incremental improvements. The UI oneshot demos are a big improvement over 4.6. Open models continue to lag roughly a year on benchmarks; pretty exciting over the long term. As always, GLM is really big - 355B parameters with 31B active, so it’s a tough one to self-host. It’s a good candidate for a cerebras endpoint in my mind - getting sonnet 4.x (x<5) quality with ultra low latency seems appealing.
[-]
- pseudony 1 hour ago
  I hear this said, but never substantiated. Indeed, I think our big issue right now is making actual benchmarks relevant to our own workloads.
  Due to US foreign policy, I quit claude yesterday and picked up minimax m2.1 We wrote a whole design spec for a project I’ve previously written a spec for with claude (but some changes to architecture this time, adjacent, not same).
  My gut feel ? I prefer minimax m2.1 with open code to claude. Easiest boycot ever.
  (I even picked the 10usd plan, it was fine for now).
- HumanOstrich 4 hours ago
  I tried Cerebras with GLM-4.7 (not Flash) yesterday using paid API credits ($10). They have rate limits per-minute and it counts cached tokens against it so you'll get limited in the first few seconds of every minute, then you have to wait the rest of the minute. So they're "fast" at 1000 tok/sec - but not really for practical usage. You effectively get <50 tok/sec with rate limits and being penalized for cached tokens.
  They also charge full price for the same cached tokens on every request/response, so I burned through $4 for 1 relatively simple coding task - would've cost <$0.50 using GPT-5.2-Codex or any other model besides Opus and maybe Sonnet that supports caching. And it would've been much faster.
  [-]
  - Miraste 2 hours ago
    I wonder why they chose per minute? That method of rate limiting would seem to defeat their entire value proposition.
  - mlyle 2 hours ago
    The pay-per-use API sucks. If you end up on the $50/mo plan, it's better, with caveats:
    1 million tokens per minute, 24 million tokens per day. BUT: cached tokens count full, so if you have 100,000 tokens of context you can burn a minute of tokens in a few requests.
  - twalla 3 hours ago
    I hope cerebras figures out a way to be worth the premium - seeing two pages of written content output in the literal blink of an eye is magical.
- Workaccount2 3 hours ago
  Unless one of the open model labs has a breakthrough, they will always lag. Their main trick is distilling the SOTA models.
  People talk about these models like they are "catching up", they don't see that they are just trailers hooked up to a truck, pulling them along.
  [-]
  - runako 2 hours ago
    FWIW this is what Linux and the early open-source databases (e.g. PostgreSQL and MySQL) did.
    They usually lagged for large sets of users: Linux was not as advanced as Solaris, PostgreSQL lacked important features contained in Oracle. The practical effect of this is that it puts the proprietary implementation on a treadmill of improvement where there are two likely outcomes: 1) the rate of improvement slows enough to let the OSS catch up or 2) improvement continues, but smaller subsets of people need the further improvements so the OSS becomes "good enough." (This is similar to how most people now do not pay attention to CPU speeds because they got "fast enough" for most people well over a decade ago.)
    [-]
    - weslleyskah 1 hour ago
      You know, this is also the case of Proxmox vs. VMWare.
      Proxmox became good and reliable enough as an open-source alternative for server management. Especially for the Linux enthusiasts out there.
  - irthomasthomas 1 hour ago
    Deepseek 3.2 scores gold at IMO and others. Google had to use parallel reasoning to do that with gemini, and the public version still only achieves silver.
  - skrebbel 2 hours ago
    How does this work? Do they buy lots of openai credits and then hit their api billions of times and somehow try to train on the results?
    [-]
    - g-mork 2 hours ago
      dont forget the plethora of middleman chat services with liberal logging policies. i've no doubt there is a whole subindustry lurking in here
- behnamoh 3 hours ago
  > The UI oneshot demos are a big improvement over 4.6.
  This is a terrible "test" of model quality. All these models fail when your UI is out of distribution; Codex gets close but still fails.
- mckirk 4 hours ago
  Note that this is the Flash variant, which is only 31B parameters in total.
  And yet, in terms of coding performance (at least as measured by SWE-Bench Verified), it seems to be roughly on par with o3/GPT-5 mini, which would be pretty impressive if it translated to real-world usage, for something you can realistically run at home.
  [-]
- ttoinou 3 hours ago
  Sonnet was already very good a year ago, do open weights model right are as good ?
  [-]
  - jasonjmcghee 3 hours ago
    Fwiw Sonnet 4.5 is very far ahead of where sonnet was a year ago
andhuman 19 minutes ago
Gave it four of my vibe questions around general knowledge and it didn’t do great. Maybe expected with a model as small as this one. Once support in llama.cpp is out I will take it for a spin.
jcuenod 48 minutes ago
Comparison to GPT-OSS-20B (irrespective of how you feel that model actually performs) doesn't fill me with confidence. Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5, I would have hoped that their flash model would run circles around GPT-OSS-120B. I do wish they would provide an Aider result for comparison. Aider may be saturated among SotA models, but it's not at this size.
[-]
- syntaxing 39 minutes ago
  Hoping a 30-A3B runs circles around a 117-A5.1B is a bit hopeful thinking, especially when you’re testing embedded knowledge. From the numbers, I think this model excels at agent calls compared to GPT-20B. The rest are about the same in terms of performance
- unsupp0rted 38 minutes ago
  > Given GLM 4.7 seems like it could be competitive with Sonnet 4/4.5
  Not for code. The quality is so low, it's roughly on par with Sonnet 3.5
baranmelik 3 hours ago
For anyone who’s already running this locally: what’s the simplest setup right now (tooling + quant format)? If you have a working command, would love to see it.
[-]
- johndough 2 hours ago
  I've been running it with llama-server from llama.cpp (compiled for CUDA backend, but there are also prebuilt binaries and instructions for other backends in the README) using the Q4_K_M quant from ngxson on Lubuntu with an RTX 3090:
  https://github.com/ggml-org/llama.cpp/releases
  https://huggingface.co/ngxson/GLM-4.7-Flash-GGUF/blob/main/G...
  https://github.com/ggml-org/llama.cpp?tab=readme-ov-file#sup...
```
    llama-server -ngl 999 --ctx-size 32768 -m GLM-4.7-Flash-Q4_K_M.gguf
```
  You can then chat with it at http://127.0.0.1:8080 or use the OpenAI-compatible API at http://127.0.0.1:8080/v1/chat/completions
  Seems to work okay, but there usually are subtle bugs in the implementation or chat template when a new model is released, so it might be worthwhile to update both model and server in a few days.
  [-]
  - mistercheph 2 hours ago
    I think the recently introduced -fit option which is on by default means it's no longer necesary to -ngl, can also probably drop -c which is "0" by default and reads metadata from the gguf to get the model's advertised context size
- zackify 1 hour ago
  LM Studio Search for 4.7-flash and install from mlx community
- ljouhet 1 hour ago
  Something like
```
    ollama run hf.co/ngxson/GLM-4.7-Flash-GGUF:Q4_K_M
```
  It's really fast! But, for now it outputs garbage because there is no (good) template. So I'll wait for a model/template on ollama.com
- pixelmelt 3 hours ago
  I would look into running a 4 bit quant using llama cpp (or any of its wrappers)

This is their blurb about the release:

    We’ve launched GLM-4.7-Flash, a lightweight and efficient model designed as the free-tier version of GLM-4.7, delivering strong performance across coding, reasoning, and generative tasks with low latency and high throughput.

    The update brings competitive coding capabilities at its scale, offering best-in-class general abilities in writing, translation, long-form content, role play, and aesthetic outputs for high-frequency and real-time use cases.

https://docs.z.ai/release-notes/new-released

[-]

z2 2 hours ago
The two notes from this year are accidentally marked as 2025, the website posts may actually be hand-crafted.

bilsbie 4 hours ago
What’s the significance of this for someone out of the loop?
[-]
- epolanski 3 hours ago
  You can run gpt 5 mini level ai on your MacBook with 32 gb ram.
  You can get LLM as a service for cheaper.
  E.g. This model costs less than a tenth of Haiku 4.5.
arbuge 1 hour ago
Perhaps somebody more familiar with HF can explain this to me... I'm not too sure what's going on here:
https://huggingface.co/inference/models?model=zai-org%2FGLM-...
[-]
- Mattwmaster58 1 hour ago
  I assume you're talking about 50t/s? My guess is that providers are poorly managing resources.
  Slow inference is also present on z.ai, eyeballing it the 4.7 flash model was twice as slow as regular 4.7 right now.
esafak 3 hours ago
When I want fast I reach for Gemini, or Cerebras: https://www.cerebras.ai/blog/glm-4-7
GLM 4.7 is good enough to be a daily driver but it does frustrate me at times with poor instruction following.
infocollector 3 hours ago
Maybe someone here has tackled this before. I’m trying to connect Antigravity or Cursor with GLM/Qwen coding models, but haven’t had any luck so far. I can easily run Open-WebUI + LLaMA on my 5090 Ubuntu box without issues. However, when I try to point Antigravity or Cursor to those models, they don’t seem to recognize or access them. Has anyone successfully set this up?
syntaxing 2 hours ago
I find GLM models so good. Better than Qwen IMO. I wish they released a new GLM air so I can run on my framework desktop
montroser 3 hours ago
> SWE-bench Verified 59.2
This seems pretty darn good for a 30B model. That's significantly better than the full Qwen3-Coder 480B model at 55.4.
[-]
- achierius 2 hours ago
  I think most have moved past SWE-Bench Verified as a benchmark worth tracking -- it only tracks a few repos, contains only a small number of languages, and probably more importantly papers have come out showing a significant degree of memorization in current models, e.g. models knowing the filepath of the file containing the bug when prompted only with the issue description and without having access to the actual filesystem. SWE-Bench Pro seems much more promising though doesn't avoid all of the problems with the above.
  [-]
  - robbies 2 hours ago
    What do you like to use instead? I’ve used the aider leaderboard a couple times, but it didn’t really stick with me
    [-]
    - NitpickLawyer 1 hour ago
      swe-REbench is interesting. The "RE" stands for re-testing after the models were launched. They periodically gather new issues from live repos on github, and have a slider where you can see the scores for all issues in a given interval. So if you wait ~2 months you can see how the models perform on new (to them) real-world issues.
      It's still not as accurate as benchmarks on your own workflows, but it's better than the original benchmark. Or any other public benchmarks.
dfajgljsldkjag 4 hours ago
Interesting they are releasing a tiny (30B) variant, unlike the 4.5-air distill which was 106B parameters. It must be competing with gpt mini and nano models, which personally I have found to be pretty weak. But this could be perfect for local LLM use cases.
In my ime small tier models are good for simple tasks like translation and trivia answering, but are useless for anything more complex. 70B class and above is where models really start to shine.
eurekin 4 hours ago
I'm trying to run it, but getting odd errors. Has anybody managed to run it locally and can share the command?
karmakaze 4 hours ago
Not much info than being a 31B model. Here's info on GLM-4.7[0] in general.
I suppose Flash is merely a distillation of that. Filed under mildly interesting for now.
[0] https://z.ai/blog/glm-4.7
[-]
- lordofgibbons 4 hours ago
  How interesting it is depends purely on your use-case. For me this is the perfect size for running fine-tuning experiments.
- redrove 4 hours ago
  A3.9B MoE apparently
XCSme 4 hours ago
Seems to be marginally better than gpt-20b, but this is 30b?
[-]
- strangescript 4 hours ago
  I find gpt-oss 20b very benchmaxxed and as soon as a solution isn't clear it will hallucinate.
  [-]
  - blurbleblurble 4 hours ago
    Every time I've tried to actually use gpt-oss 20b it's just gotten stuck in weird feedback loops reminiscent of the time when HAL got shut down back in the year 2001. And these are very simple tests e.g. I try and get it to check today's date from the time tool to get more recent search results from the arxiv tool.
- lostmsu 4 hours ago
  It actually seems worse. gpt-20b is only 11 GB because it is prequantized in mxfp4. GLM-4.7-Flash is 62 GB. In that sense GLM is closer to and actually is slightly larger than gpt-120b which is 59 GB.
  Also, according to the gpt-oss model card 20b is 60.7 (GLM claims they got 34 for that model) and 120b is 62.7 on SWE-Bench Verified vs GLM reports 59.7
pixelmelt 3 hours ago
I'm glad they're still releasing models dispite going public
twelvechess 4 hours ago
Excited to test this out. We need a SOTA 8B model bad though!
[-]
- piyh 3 hours ago
  https://docs.mistral.ai/models/ministral-3-8b-25-12
  [-]
  - twelvechess 2 hours ago
    thanks I will try this out
- cipehr 4 hours ago
  Is essentialai/rnj-1 not the latest attempt at that?
  https://huggingface.co/EssentialAI/rnj-1
epolanski 5 hours ago
Any cloud vendor offering this model? I would like to try it.
[-]
- PhilippGille 4 hours ago
  z.ai itself, or Novita fow now, but others will follow soon probably
  https://openrouter.ai/z-ai/glm-4.7-flash/providers
  [-]
  - sdrinf 1 minute ago
    Note: I strongly recommend against using Novita -their main gig is they're quantizing the model to offer it for cheaper / supposedly at better latency; but if you ran an eval against other providers vs novita, you can spot the quality degradation. This is nowhere marked, or displayed in their offering.
    Tolerating this is very bad form from openrouter, as they default-select lowest price -meaning people who just jump into using openrouter and do not know about this fuckery get facepalm'd by perceived model quality.
  - epolanski 4 hours ago
    Interesting, it costs less than a tenth than Haiku.
    [-]
    - saratogacx 4 hours ago
      GLM itself is quite inexpensive. A year sub to their coding plan is only $29 and works with a bunch of various tools. I use it heavily as a "I don't want to spend my anthropic credits" day-to-day model (mostly using Crush)
- dvs13 4 hours ago
  https://huggingface.co/inference/models?model=zai-org%2FGLM-... :)
- latchkey 4 hours ago
  We don't have lot of GPUs available right now, but it is not crazy hard to get it running on our MI300x. Depending on your quant, you probably want a 4x.
  ssh admin.hotaisle.app
  Yes, this should be made easier to just get a VM with it pre-installed. Working on that.
  [-]
  - omneity 4 hours ago
    Unless using docker, if vllm is not provided and built against ROCm dependencies it’s going to be time consuming.
    It took me quite some time to figure the magic combination of versions and commits, and to build each dependency successfully to run on an MI325x.
    [-]
    - latchkey 3 hours ago
      Agreed, the OOB experience kind of suck.
      Here is the magic (assuming a 4x)...
      docker run -it --rm \ --pull=always \ --ipc=host \ --network=host \ --privileged \ --cap-add=CAP_SYS_ADMIN \ --device=/dev/kfd \ --device=/dev/dri \ --device=/dev/mem \ --group-add render \ --cap-add=SYS_PTRACE \ --security-opt seccomp=unconfined \ -v /home/hotaisle:/mnt/data \ -v /root/.cache:/mnt/model \ rocm/vllm-dev:nightly mv /root/.cache /root/.cache.foo ln -s /mnt/model /root/.cache VLLM_ROCM_USE_AITER=1 vllm serve zai-org/GLM-4.7-FP8 \ --tensor-parallel-size 4 \ --kv-cache-dtype fp8 \ --quantization fp8 \ --enable-auto-tool-choice \ --tool-call-parser glm47 \ --reasoning-parser glm45 \ --load-format fastsafetensors \ --enable-expert-parallel \ --allowed-local-media-path / \ --speculative-config.method mtp \ --speculative-config.num_speculative_tokens 1 \ --mm-encoder-tp-mode data
- xena 4 hours ago
  The model literally came out less than a couple hours ago, it's going to take people a while in order to tool it for their inference platforms.
  [-]
  - idiliv 4 hours ago
    Sometimes model developers coordinate with inference platforms to time releases in sync.
kylehotchkiss 1 hour ago
What's the minimum hardware you need to run this at a reasonable speed?
My Mac Mini probably isn't up for the task, but in the future I might be interested in a Mac Studio just to churn at long-running data enrichment types of projects
Haris18 2 hours ago
[dead]
wotsdat 4 hours ago
[dead]