Mistral Small 4

(mistral.ai)

114 points | by pember 23 hours ago

6 comments

revolvingthrow 14 hours ago
I really wish the benchmarks were even slightly trustworthy for AI models. ~120B are the largest models I can run locally. Naturally I grabbed the 122B Qwen3.5, which had great benchmarks and… frankly, the model is garbage, worse than glm air 4.5 IMO. But then, qwen famously benchmaxxes.
And here we have another release. The benchmarks are just a tiny bit worse than qwen3.5 (for far less tokens). Am I to take it that the model is worse? Or does qwen’s benchmaxxing mean that slightly worse result of non-qwen models means a better model? I’d rather not spend hours testing things myself for every noteworthy release.
Ah well. Mistral has been fairly decent so worth taking a look. Obviously they’re behind the big 3, but in my experience their small models are probably the best you can get for several months after each release. I’m not sure how it works as a sales funnel for their paid models, same as with chinese models - people likely just go for google/openai/anthropic in this case - but I’m thankful for their existence.
[-]
- weitendorf 4 hours ago
  So far it's better for equivalent Qwen 3.5 workloads, and much less expensive. As you mention Qwen spends way too much time/tokens reasoning, so it ends up being more expensive than you'd think based on its model card (also IME, flaky).
  I actually think this model is a Big Deal because there's a whole world out there of people building on top of Qwen and other Chinese models, and now Mistral has just released one of the best generalist FOSS models in its price/size range at an excellent price ($0.60/1M output is a steal). Mistral could potentially grab a lot of that.
  Personally I am going to build off of it and invest in their ecosystem now, with this model, because it's definitely worth paying for at the current price. Whether Mistral or some other venture comes out with the next big thing in that category is anybody's guess, but I hope now that labs are starting to converge on more rapid release cycles, I'm hoping Mistral won't be far behind.
  The main thing for me though is that for small model use cases, it just doesn't make sense to pay a ton for Haiku/Gemini and other expensive small models that you can't self-host or finetune or generally build upon. They cost too much and can't be tinkered with. Also, the range in which you'd want the incrementally better performance of something like Haiku over Mistral, but not enough to think about the benefits of tuning or self-hosting inference, are few for me. But at the same time, if you're going to invest in building on top of someone else's product, you want them to be trustworthy and long-term partners.
  I'm excited to give them a shot
kristianp 16 hours ago
Interesting that they target around 120 billion parameters. Just enough to fit onto a single H100 with 4 bit quant. Or 128GB APU like apple silicon, AMD AI cpus or the GB spark.
Copying GPT-OSS-120b?
Available to try at https://build.nvidia.com/mistralai/mistral-small-4-119b-2603
[-]
- rurban 9 hours ago
  Hopefully better than gpt-oss-120b because this sucks big time. Completely unusable. gpt-5.3 and 4 are very fine though.
  Testing it tomorrow
zacksiri 15 hours ago
I tested the model in an agentic workflow. Here is the report:
https://upmaru.com/llm-tests/simple-tama-agentic-workflow-q1...
[-]
- Reubend 13 hours ago
  Seems like it does quite well on that particular benchmark?
  [-]
  - zacksiri 12 hours ago
    It's ok, it's not the best. There are models that do better, I'd use it for some basic tasks but not actual complex tasks like query generation and retrieval.
2001zhaozhao 21 hours ago
Which Haiku model are they comparing to? Is it 4.5? In which case it's absolutely wild that Qwen3.5 122B is shredding it in those graphs
adt 21 hours ago
https://lifearchitect.ai/models-table/
7777777phil 9 hours ago
Been spending a bunch of time lately trying to figure out why these ~120B MoE models keep beating much larger dense ones.
With Mistral it's 128 experts but only 4 active per token, so any given forward pass is like 6B params. That's a very different kind of model than scaling a dense transformer bigger. Also wrote a little post on where I think this is going: https://philippdubach.com/posts/the-last-architecture-design...