It would have been more interesting to test the LLMs on questions whose direct solutions are not publicly available (and so not in the training data). In that case I wonder how much hallucination would occur, or whether the model would try to connect the dots from what is publicly available and come up with a direct solution.
I don't understand why you expect that an answer known to the researcher but which has never been published should be in the training data. You possibly misunderstand what these problems look like -- we made them all publicly available on the website, so please have a look: https://math.sciencebench.ai/benchmarks/benchmarks-in-leipzi...
No, they only provided large-scale model runs for us (this is explained in the acknowledgements). These runs would have been too expensive to perform myself, so I am happy they offered to provide them.
Thanks for answering this random internet guy's question. It's a bit sad that a German math prof doesn't have sufficient funds to run a few prompts. I would have paid for them for this amount of advertising. I don't like that you gave them to a Silicon Valley company.
On that note, the tests are very US-centric. Only one Chinese model, and you unfairly nerfed it by limiting its context window, when the compressed context is DeepSeek V4's main innovation and even with full context it is much cheaper to run than all the others.
Please indicate which other models you would like to see included. (And I agree that the context window limitations were not reasonable to have.) Finally: running these few prompts would have cost $10-20k if I had run them myself via the API. (And the company didn't ask to contribute; I asked whether they would be willing to do so, just saying.)
I don't like that you've called these problems "research-level", or your description that they are something you might give to a second-year PhD student. Some examples:
- Question 093 is a word problem of the kind that I would imagine is commonly given to high school students. Maybe it is slightly more difficult, but it doesn't appear to have any mathematical relevance and nobody would ever give it to a second-year PhD student.
- Question 096 is something I would expect a computer to do easily by brute force, and has essentially no mathematical content other than doing a calculation. (Under what circumstance does one care about taking base 10 digits and interpreting them in base 11?). Again, nobody would ever assign this to a math PhD student, and I expect that any undergrad who knows how to code can give you this answer.
- Question 016 is the kind of combinatorial problem that one could expect to brute force with a computer (and some decently-written code) even before AI. Again nobody would give it to a 2nd year PhD student because it is too random and of no academic interest.
- There are questions like 026 and 014, about computing Hilbert series. Computing Hilbert series is a standard computer algebra task that nobody would want to do by hand before generative AI, and certainly not now.
Similar comments apply to many others. There are plenty of random-looking computational questions of exactly the type that one expects not only that computers can solve, but that computers should be used to solve, because nobody would ever do them by hand. None of them are research-level --- certainly not anything that would be considered publishable (before generative AI or after) --- despite the subtitle of the paper saying "research-level". And if you give them to a 2nd year PhD student I would imagine you would just be wasting their time.
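For readers unfamiliar with the operation mentioned for Question 096, here is a minimal Python sketch of "taking base 10 digits and interpreting them in base 11". The question itself is not reproduced in this thread, so this only illustrates the operation, not the problem; the function name is a hypothetical one chosen for illustration.

```python
def reinterpret_base10_digits_in_base11(n: int) -> int:
    """Read the decimal digits of n as if they were base-11 digits.

    Hypothetical helper for illustration; not taken from the benchmark.
    """
    # str(n) contains only the digits 0-9, all of which are valid in base 11.
    return int(str(n), 11)


# 123 read in base 11 is 1*121 + 2*11 + 3.
print(reinterpret_base10_digits_in_base11(123))  # 146
```

Whether a brute-force search built on this operation would actually settle Question 096 depends on the question's details, which are not given here.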
I also don't like your phrasing "much harder than any exam question in any exam". If I ask you to multiply two 1000 digit numbers, the question is "much harder" than any question that will ever appear on any exam. Everyone understands the computer will do it instantly, and it doesn't demonstrate anything relevant. There is a clear regime in which one expects AI-type methods to perform better (combinatorial, calculation-based questions which can be answered using standard methods), and other regimes where one expects worse performance (e.g., proofs of statements that use abstract concepts). Why is there nothing here of the second type?
I cannot keep answering everyone's comments of the type "Why did you consider / not consider?" or "Here are much better ideas". I promise you that we have thought quite a bit about the setup and have discussed it with many math researchers.
1. Why do you compare it to multiplying two 1000 digit numbers and not to factorizing a 4096-bit number into its two prime factors, when not knowing any details?
2. The questions are of theoretical nature, even if a little calculation is involved. This does not mean that the problems are not solvable using a computer program, but it means that they are not solvable with reasonable effort with a computer program.
3. And we do not ask for proofs because other projects already do that (IMProofBench, please have a look) and we cannot grade LLM answers as a human would need to understand the provided proof -- and this is not what I or we or actually most researchers are interested in doing.
> 1. Why do you compare it to multiplying two 1000 digit numbers and not to factorizing a 4096-bit number into its two prime factors, when not knowing any details?
The objection is to the phrasing "much harder". One should distinguish between something that is difficult for reasons stemming from a lack of computational power and something that is difficult for reasons stemming from a lack of relevant abstractions or the ability to grapple with them. If the reason that a particular problem is "hard" for a PhD student is that they have to do a long calculation, but not because of a lack of conceptual understanding, then it doesn't say much about the capabilities of generative AI if the computer solves it.
Hence the example: multiplying two large numbers is hard for the former reason, not the latter. Your example of factoring a 4096-bit semiprime is hard for both reasons (because the brute force method is too slow).
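To make the first kind of difficulty concrete, here is a minimal Python sketch (not from the paper; the seed and the helper name are illustrative assumptions) of multiplying two 1000-digit numbers. The task is out of reach for any exam, yet it is purely computational and needs no conceptual insight.

```python
import random


def random_n_digit(n: int, rng: random.Random) -> int:
    """Return a random integer with exactly n decimal digits (hypothetical helper)."""
    return rng.randrange(10 ** (n - 1), 10 ** n)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
a = random_n_digit(1000, rng)
b = random_n_digit(1000, rng)

# Python integers have arbitrary precision, so this is a single exact operation.
product = a * b

# The product of two n-digit numbers has either 2n-1 or 2n digits.
print(len(str(product)))
```

Factoring a large semiprime has no comparably short script, which is the asymmetry the comment above is pointing at.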
Well, you are correct that one should distinguish the two. But we give no indication that the questions are hard because of computational tasks and we give many indications that the problems are of theoretical nature and hard for theoretical reasons. There is not a single question where a PhD student would need to do a long calculation.
I trust the judgement of the respected researchers who submitted the questions. I know them personally, and they publish research under their full names (which are fully disclosed in the paper). You should trust them too.
Please consider disclosing your name and your field of expertise, pick a question in your own research area and explain to me why this question is not research-level. And, best of all, solve it yourself to clarify why it was too easy.
Haha, the classic “Why didn’t you do X?” comments always appear. I think a lot of people underestimate how deeply quality researchers think about such setups. My genuine standard reply to those folks is: do the research with your setup and publish it.
As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.
If you look at Table 3 you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 × 100 runs:
- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)
- Opus: 1306/2000 questions answered, of which 294 were correct (22%)
So while you can claim that Opus solved 40% of the problems it still had a failure rate of 78%. That means if you chose this model to answer your homework question, there is a good chance you would fail.
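The percentages above can be re-derived from the quoted counts with a short Python sketch. The function name is hypothetical, and the counts are taken from the comment, not checked against Table 3. Unrounded, the two accuracies come out at about 75.1% and 22.5%.

```python
def rates(answered: int, correct: int, total: int) -> tuple[float, float, float]:
    """Return (accuracy among answered, error rate among answered, correct share of all runs)."""
    accuracy = correct / answered
    return accuracy, 1 - accuracy, correct / total


# Counts as quoted in the comment above; 2000 = 20 runs x 100 questions.
for name, answered, correct in [("GPT 5.5", 1389, 1043), ("Opus", 1306, 294)]:
    acc, err, overall = rates(answered, correct, 2000)
    print(f"{name}: {acc:.1%} right when answering, {err:.1%} wrong, {overall:.1%} of all runs")
```

Note that the third number is a per-run share, which is a different quantity from a per-problem figure such as "solved 40% of the problems".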
Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot. I.e. how confident can you be when using them for real world tasks.
You are 100% correct with your assessment of the situation. But I do not agree with either of your conclusions:
1. These questions cannot and must not be compared as being similar to homework questions. These are different leagues and possibly even different sports.
2. The "more useful benchmark" that you suggest is already present in the data as we ran every model exactly once in Stage 1.
For some reason, perhaps some sort of Freudian self-defense mechanism, we tend to downplay how impressive it is to solve never-before-seen problems that require a deep understanding of the concepts at play.
Look at final exams of advanced courses in CS or math. It will clarify how close to them (or plainly harder) the questions from the study are, and so how impressive the capabilities of these models are becoming...
I know that people with strong feelings one way or the other will comment here, but note that this is specifically about problems with known answers that can be inferred from existing literature (e.g., training data).
This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exercises you can't cheat on using an LLM". "Blindly" in the sense that they can't adjust the problem ahead of time until they get a model to fail.
The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."
Let me also add: there is zero chance of the problems being included in the training data. The results are quite impressive: leading experts struggled to write questions with well-defined unique answers on existing research that the models were not able to solve.
This should not be interpreted as AI can solve mathematics: the ability to solve exercise-style questions based on existing research is vastly different from the creation of new mathematics.
But it is still impressive and not what we expected -- I rather expected that we would end up with 20-40 questions that no current publicly available model can solve.
I think most of the value LLMs provide comes from connecting the dots between unsolved questions and patterns or structures that have already been demonstrated, which accelerates research.
Now, reasoning in the sense of making truly original discoveries, as Einstein did with the field equations, is a different story for current LLMs.
I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember being in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.
It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?
If you're trying to compare what the models are good at, it is important to note that the different models did not run with the same settings. In one case they also retried with GPT until it answered all the problems but did not retry with the other models.
GPT has 5 effort settings and they picked the highest (xhigh). Claude has 5 and they picked the middle one to avoid having to retry when it timed out. Gemini has medium or high effort and they picked medium.
The difference between GPT and Gemini concerning the "retry until..." can almost be ignored. I did rerun GPT a few times, but that was still far fewer than the number of questions Gemini was not able to answer at all.
"...Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers... We present the resulting collection of 100 questions....We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs....we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive..."
Partially, 2.2 Submission workflow W2 deals with this:
> Stage W2 The five project-active models, see Table 2, attempted the question. Their answers were compared to the original answer by an LLM judge. If at most three models answered correctly, the contributor could proceed.
So "trivially contained in the training data" is excluded, as then all models could/should easily come up with the solution.
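The acceptance rule quoted from Stage W2 can be written as a one-line check. This is a minimal Python sketch of the rule as quoted, not the project's actual tooling; the function name, the placeholder model names and the verdicts are all hypothetical.

```python
def contributor_may_proceed(model_correct: dict[str, bool], max_correct: int = 3) -> bool:
    """Stage W2 rule as quoted: proceed if at most `max_correct` models were judged correct."""
    # True counts as 1 and False as 0, so the sum is the number of correct models.
    return sum(model_correct.values()) <= max_correct


# Hypothetical judge verdicts for five placeholder model names.
verdicts = {
    "model_a": True,
    "model_b": True,
    "model_c": False,
    "model_d": True,
    "model_e": False,
}
print(contributor_may_proceed(verdicts))  # True: three of five were correct
```

Under this rule, a question that four or all five models answer correctly is rejected, which is what filters out anything trivially recoverable from the training data.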
I had the same thought, because even if the exact solution doesn't appear there's a notable difference between performing a literature search versus solving something de novo. But I think perhaps this benchmark wasn't meant to exclude the former and that the point may have been to test the ability of the model to accurately interpret and synthesize relevant output for research level mathematical problems at all.
I think you are underestimating the complexity of such problems. A PhD in the exact field of research would need days to weeks to understand what the problem means and how to solve it. This is far beyond "throwing standard techniques" at a problem. (But, I keep emphasizing this, it is also far away from solving research mathematics.)
When you write "there's a notable difference between performing a literature search versus solving something de novo", you suggest that the questions we provided can be solved doing a literature search.
This is incorrect. What is correct is the following: When understanding the existing literature on a question in the dataset, one can derive the answer without creating new mathematics research.
So the difference is "searching the literature" vs "understanding the literature" that made me believe it. But if you didn't that's even better!
“In the training data” isn’t really relevant for a modern LLM. The better question would be: are they solvable using known techniques that have been fine-tuned in?
A simple example, as a non-mathematician: I’d expect a well-trained LLM to be able to solve any integral that can be solved with integration by parts. I would be much more interested to see it solve one with no known solution using some novel technique.
Obviously this doesn’t really lend itself to making a benchmark, but if something is solvable by a known technique, and the LLM has had some kind of RL training on using that technique, seeing a solution isn’t too surprising.
Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.
Nonetheless, these are questions about existing research, though much closer to a question given to a second-year PhD student than to an exam question.
https://arxiv.org/abs/2305.10160
https://math.sciencebench.ai/benchmarks
... that are therefore liable to be in the training data?
The goal was not to define unsolved problems.
But as such, the problems are also not previously published problems.
This seems quite reasonable IMHO.