Leanstral 1.5: Proof abundance for all

(mistral.ai)

125 points | by programLyrique 5 hours ago

13 comments

boulos 4 hours ago
This is nice work, but I found the bug finding example to be weird:
> One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
In what way would this boundary condition case be considered something that "testing [...] would typically miss"? It's certainly something that bad tests would miss or not think about, but I find that (a) careful people and (b) ML coding systems are actually really good at "oh, I should test the extreme values". Especially for things that parse user input.
I'm curious if they found other bugs that were more interesting, but found them too hard to explain quickly.
[-]
- Groxx 2 hours ago
  particularly "and fuzzing", yea. fuzzing generally does intentionally explore boundary values, from what I've seen. for an encoding library like this, I think it's fair to say that fuzzing is a baseline expectation for any decent code, and it almost certainly would've caught this in seconds.
  --- edit
  concretely, I made a very simple round-trip test with proptest, and got dozens of failures and this in less than a second:
```
    thread 'signed_round_trip' (50528) panicked at tests/test.rs:72:1:
    Test failed: attempt to multiply with overflow.
    minimal failing input: value = 4611686018427387904
        successes: 2
        local rejects: 0
        global rejects: 0
```
- fjdjshsh 4 hours ago
  Maybe it's not something they would "typically miss", but, from proof by existence, it's something they sometimes miss.
  It does speak to the benefits of using lean in that you don't need to be clever about the different examples you test.
- Exoristos 3 hours ago
  Yes, it's basic QA. If tests missed this kind of thing, they would be of much more limited use than we generally expect them to be. It raises questions about the authors' background.
- Quekid5 1 hour ago
  Because this is garbage PR. That's it.
  Every property-based testing system (invented ca. 1980) will explore boundary values. The semantics (or lack thereof) of C and C++ can make this difficult to actually test for because the compiler is allowed to say "test passed" to any input leading to UB.
  [-]
  - vlovich123 2 minutes ago
    Property based testing is good at generating boundary values for inputs. But for any more complicated piece of code getting boundary value coverage of interior values is an open problem that requires instrumentation feedback to understand branch coverage and value coverage of the code that got tested. It’s not an easy thing at all.
- pierrefermat1 4 hours ago
  [flagged]
  [-]
  - fjdjshsh 4 hours ago
    I feel this is a lazy, straw-manning comment.
andai 2 hours ago
Halfway thru the article it shows a comparison with several frontier-ish LLMs. But they're all from half a year ago. "Our new model is better than all these Chinese models from 3 generations ago" is pretty funny to me.
[-]
- dannyw 3 minutes ago
  It’s a 6bn model. Totally different class. I’m more excited about “frontier small language models” tbh.
- rtaylorgarlock 2 hours ago
  Agreed, though open weights + relatively small is still headline worthy. This thing really cooks.
Groxx 3 hours ago
>One such bug was in the sign function for zigzag decoding of the datrs/varinteger library. On input Std.U64.MAX, the expression (value + 1) overflowed, causing crashes in debug mode and silent corruption in release mode—an edge case that testing and fuzzing would typically miss.
that library is: https://github.com/datrs/varinteger
it seems probably correct, as there's an identical issue filed on that repo a week before this was published: https://github.com/datrs/varinteger/issues/8 (is this a leanstral employee? they have almost no info and only very sparse activity. or did leanstral perhaps just pick up this issue?)
it's a tiny, surprisingly-poorly tested, long-untouched (8y) library: https://github.com/datrs/varinteger/blob/master/tests/test.r... that has about 1k downloads per day: https://crates.io/crates/varinteger [1] which seems rather low.
I don't think I'd consider that such a smashing success that it's worth bringing up as the sole example tbh. though automated detection is certainly useful. or is this a noteworthy accomplishment for this sub-field? I haven't played with proof-writing LLMs, but given the paucity of training data I wouldn't be surprised if they're a bit rough compared to general coding.
1: https://crates.io/crates/varinteger lists it as https://github.com/mafintosh/varinteger-rs which redirects to https://github.com/datrs/varinteger , so despite looking different at a glance it does appear to be the same library
[-]
- frostlynx 2 hours ago
  The problem with proof is that it’s a bit hard sometimes to convey the value. The point is not to find bugs, but to prove that there are none (of a certain class; under certain assumptions; etc). But it’s a hard story to sell, so often the marketing is around “look at this bug we found”.
  [-]
  - Eridrus 2 hours ago
    I would be much more interested in "here is a provably memory safe version of openssl with all its memory safety bugs fixed"
RossBencina 2 hours ago
Curious that they are pitching Lean 4 for formal verification. I thought that this was more the domain of Isabelle/HOL and TLA+. At least I would have expected a model trained at using all three. Maybe also Isabell/Isar, which seems preferable for forward derivations in linear algebra. Could anyone shed some light on this?
[-]
- nextos 2 hours ago
  It is true that Lean has seen relatively little adoption in software verification compared to e.g. Isabelle and Rocq (previously Coq). Even Agda has had more traction in that domain.
  However, Lean is currently gaining significant momentum as an alternative, particularly due to its capabilities as a general-purpose functional programming language.
  Personally, I think something based on Hoare or separation logic would be more practical as it'd be easier to align requirements with specifications. I like Dafny and F*.
andai 2 hours ago
Discussed the other day:
https://news.ycombinator.com/item?id=48738938
henryrobbins00 3 hours ago
Try out Leanstral 1.5 on the latest version of OpenATP! OpenATP is an open-source Python package and CLI for agentic automated theorem provers. It natively supports running provers locally in Docker or remotely in Modal sandboxes.
GitHub: https://github.com/henryrobbins/open-atp
Docs: https://open-atp.henryrobbins.com
[-]
- mathieudombrock 2 hours ago
  This is an ad.
  [-]
  - henryrobbins00 2 hours ago
    Earnest question: any recommendation to not come off this way in forums?
    I created this tool for my own research and have found it really helpful to benchmark different automated theorem provers (my experience so far has been that Claude Code + Codex still out-perform Leanstral). My genuine aim is to share that usefulness with others, not self promote!
    [-]
    - dannyw 1 minute ago
      Your comment is helpful, useful, and relevant. Please keep going.
      Just my 2c, but maybe your original post could’ve been written in a less promotional / less excited way. It looked a little like spam at _first_ glance.
    - lioeters 1 hour ago
      You know, I was going to reply to the person who called out your comment as an "ad", because I visited the site and it's actually relevant to the topic under discussion (Leanstral). And not even selling anything.
      My thought was: Good job, this is tasteful personable marketing for a product with genuine value. I wish more marketing were done this way. So I think it's totally fine to be talking about the cool thing you're working on. I for one found it interesting and added to the discussion.
    - zuzululu 1 hour ago
      I don't know why but sometime ago, HN started resembling reddit, and there seems to be just widespread fear/jealousy/cynicism towards anybody advertising their work or services even, I don't think there was anything wrong with your post, it was informative.
      Probably the most annoying part about Reddit and HN and X (although it let you mute people) is the abundance of "expert" opinions from people who aren't experts at all. You just end up with a bunch of false signals that you shouldn't even be listening to.
      all in all I say invest in spreading the words via other channels, maybe even X is better and even the right time zone (besides US working hours, I find European time the worst statistically for sharing your work).
satvikpendem 4 hours ago
I also submitted the HuggingFace link itself here: https://news.ycombinator.com/item?id=48779902
strongly-typed 3 hours ago
Lean is such a wonderful language. So hyped by these releases.
ChrisArchitect 2 hours ago
[dupe] https://news.ycombinator.com/item?id=48738938
[-]
- rtaylorgarlock 1 hour ago
  Have you ever been downvoted for calling 'dupe?' I once was downvoted after calling dupe on a link posted thrice. HN is an interesting place to hang out, that's for sure.
zuzululu 2 hours ago
I applaud mistral's efforts but reading this release made me realize that Europe is far far behind and that once the gap is solidified I don't think its recoverable in the same way Canada's brain drain had on its economy
The best and the brightest from Europe have no incentive to build in Europe when they can do it in America and be compensated and treated far better
nullc 4 hours ago
It would be nice if special purpose models provided a some diverse examples of exactly the input required to get its expected performance on a mix of problem types. Maybe also a document intended for LLMs to read that advises on prompt construction.
I've found that you can get wildly different quality results from these sorts of models due to seemingly insignificant differences in prompt construction. It would be much easier to guess at what it wants if I could just see some RL transcripts -- and so the model author is in a much better position to provide initial advice.
moonset 3 hours ago
I gave Codex with GPT-5.5 High this prompt:
```
    Identify bugs in [datrs/varinteger](https://github.com/datrs/varinteger) . Do NOT look at the GitHub issues, just inspect the source
```
It also found the bug that Leanstral 1.5 found and the authors highlighted. I think this bug wasn't especially tricky; it's just a case of too few eyeballs on this repo.
Congrats on the release regardless! Excited for the direction Lean + automated AI proofs are headed.
Disclosure: I work at OpenAI.
[-]
- adev_ 2 hours ago
  > It also found the bug that Leanstral 1.5 found and the authors highlighted
  This is a little bit like someone pointing the moon and you look at the finger.
  The formal proof domain goes way beyond just finding bugs.
  It has tons of usages in term of functional safety, protocol validation, cryptography, etc...
  The fact Mistral tackle this kind of problem is both smart and not so surprising.
  Smart because it is niche enough that they do not front face the big competitors (yet).
  No so surprising because the French labs have a well known and long time expertise with formal proof tools (Coq and all its Ocaml associated tools). It has been historically mainly pushed by the aerospace and train industries (Airbus, Dassault, Alsthom).
- noperator 2 hours ago
  Leanstral 1.5 has 6B active parameters. How many parameters does GPT-5.5 have?
  [-]
  - lioeters 1 hour ago
    We can run Leanstral locally on commodity/consumer hardware. Nuff said.
- ilc 2 hours ago
  Given that they directly compare to GPT-5.5 in their documentation. This comes off as puppy kicking to me. They state it is not SOTA, even IN its domain!
  Honestly: Think twice before dragging your firm into what you say.
  Disclaimer: I speak for myself. Not any firm I am associated with.
- 8note 2 hours ago
  the mechanism is whats interesting, rather than whether it could do it.
  this sounds like a great tool to add to the toolbelt, as part of the "how do we handle all the code output from LLMs" problem
yashthakker 2 hours ago
[flagged]