17 comments

  • fsh 21 hours ago
    I believe that it may be misguided to focus on compute that much, and it would be more instructive to consider the effort that went into curating the training set. The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set. Many of the AI achievements would probably look a lot less miraculous if one could check the training data. The most crass example is OpenAI paying off the FrontierMath creators last year to get exclusive secret access to the problems before the evaluation [1]. Even without resorting to cheating, competition formats are vulnerable to this. It is extremely difficult to come up with truly original questions, so by spending significant resources on re-hashing all kinds of permutations of previous questions, one will probably end up very close to the actual competition set. The first rule I learned about training neural networks is to make damn sure there is no overlap between the training and validation sets. It is interesting that this rule has gone completely out of the window in the age of LLMs.

    [1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lesso...

    • OtherShrezzing 20 hours ago
      > The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set. Many of the AI achievements would probably look a lot less miraculous if one could check the training data

      I'm fairly certain this phenomenon is responsible for LLM capabilities on GeoGuesser type games. They have unreasonably good performance. For example, being able to identify obscure locations from featureless/foggy pictures of a bench. GeoGuesser's entire dataset, including GPS metadata, is definitely included in all of the frontier model training datasets - so it should be unsurprising that they have excellent performance in that domain.

      • ACCount36 13 hours ago
        People tried VLMs on "closed set" GeoGuessr-type tasks - i.e. non-Street View photos in similar style, not published anywhere.

        They still kicked ass.

        It seems like those AIs just have an awful lot of location familiarity. They've seen enough tagged photos to be able to pick up on the patterns, and generalize that to kicking ass at GeoGuessr.

      • YetAnotherNick 19 hours ago
        > GeoGuesser's entire dataset

        No, it is not included; however, there must be quite a lot of pictures on the internet for most cities. GeoGuessr's data is the same as Google's Street View data, and it probably contains billions of 360-degree photos.

        • suddenlybananas 19 hours ago
          Why do you say it's not included? Why wouldn't they include it?
          • sebzim4500 17 hours ago
            If every photo in streetview was included in the training data of a multimodal LLM it would be like 99.9999% of the training data/resource costs.

            It just isn't plausible that anyone has actually done that. I'm sure some people include a small sample of them, though.

            • clbrmbr 3 hours ago
              Yet.

              This is a good rebuttal when someone quips that we “are about to run out of data”. There’s oh so much more, just not in the form of books and blogs.

            • bluefirebrand 16 hours ago
              Why would every photo in streetview be required in order to have Geoguessr's dataset in the training data?
              • bee_rider 14 hours ago
                I’m pretty sure they are saying that GeoGuessr just pulls directly from Google Street View. There isn’t a separate GeoGuessr dataset; it just pulls from Google’s API (at least that’s what Wikipedia says).
                • bluefirebrand 14 hours ago
                  I suspect that Geoguessr's dataset is a subset of Google Streetview, but maybe it really is just pulling everything directly
                  • bee_rider 12 hours ago
                    My guess would be that they pull directly from street-view, maybe with some extra filtering for interesting locations.

                    Why bother to create a copy, if it can be avoided, right?

        • ivape 19 hours ago
          I just saw a video on Reddit where a woman still managed to take a selfie while being literally face to face with a black bear. There’s definitely way too much video training data out there for everything.
          • lutusp 12 hours ago
            > I just saw a video on Reddit where a woman still managed to take a selfie while being literally face to face with a black bear.

            This is not uncommon. Bears aren't always tearing people apart, that's a movie trope with little connection to reality. Black bears in particular are smart and social enough to befriend their food sources.

            But a hungry bear, or a bear with cubs, that's a different story. Even then bears may surprise you. Once in Alaska, a mama bear got me to babysit her cubs while she went fishing -- link: https://arachnoid.com/alaska2018/bears.html .

    • eru 1 hour ago
      > It is extremely difficult to come up with truly original questions, [...]

      No, that's actually really easy. What's hard is coming up with original questions of a specific level of difficulty. And that's what you need for a competition.

      To elaborate: it's really easy to find lots and lots of elementary, unsolved questions. But it's not clear whether you can actually solve them or how hard solving them is, so it's hard to judge the performance of LLMs on them.

      > It it interesting that this rule has gone completely out of the window in the age of LLMs.

      No, it hasn't.

    • astrange 20 hours ago
      > The easiest way of solving math problems with an LLM is to make sure that very similar problems are included in the training set.

      An irony here is that math blogs like Tao's might not be in LLM training data, for the same reason they aren't accessible to screen readers - they're full of math, and the math is rendered as images, so it's nonsense if you can't read the images.

      (The images on his blog do have alt text, but it's just the LaTeX code, which isn't much better.)

      • alansammarone 18 hours ago
        As others have pointed out, LLMs have no trouble with LaTeX. I can see why one might think otherwise - in fact, I made the same assumption myself some time ago. LLMs, via transformers, are exceptionally good at _any_ sequence or one-dimensional data. One very interesting (to me anyway) example is base64 - pick some not-huge sentence (say, 10 words), base64-encode it, and just paste it into any LLM you want, and it will be able to understand it. The same works with hex, ASCII representation, or binary. Here's a sample if you want to try: aWYgYWxsIEEncyBhcmUgQidzLCBidXQgb25seSBzb21lIEIncyBhcmUgQydzLCBhcmUgYWxsIEEncyBDJ3M/IEFuc3dlciBpbiBiYXNlNjQu
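
        If you want to set up the experiment yourself, here's a minimal Python sketch of the idea (the question and reply below are made-up examples of mine, not a decode of the sample above):

          import base64

          # Encode a plain-English question so the raw string never appears in the prompt.
          question = "Is the sum of two even numbers always even? Answer in base64."
          prompt = base64.b64encode(question.encode("utf-8")).decode("ascii")
          print(prompt)  # paste this into a fresh chat with no other context

          # If the model follows the instruction and answers in base64, decode its reply.
          reply = "WWVz"  # a hypothetical reply meaning "Yes"
          print(base64.b64decode(reply).decode("utf-8"))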

        I remember running this experiment some time ago in a context where I was certain there was no possibility of tool use to encode/decode. Nowadays, it can be hard to be certain whether there is any tool use or not; in some cases, such as Mistral, the response is quick enough to make tool use unlikely.

        • throwanem 17 hours ago
          I've just tried it, in the form of your base64 prompt and no other context, with a local Qwen-3 30b instance that I'm entirely certain is not actually performing tool use. It produced a correct answer ("Tm8="), which in a moment of accidental comedy it spontaneously formatted with LaTeX. But it did talk about invoking an online decoder, just before the first appearance of the (nearly) complete decoded string in its CoT.

          It "left out" the A in its decode and still correctly answered the proposition, either out of reflexive familiarity with the form or via metasyntactic reasoning over an implicit anaphor; I believe I recall this to be a formulation of one of the elementary axioms of set theory, though you will excuse me for omitting its name before coffee, which makes the pattern matching possibility seem somewhat more feasible. ('Seem' may work a little too hard there. But a minimally more novel challenge I think would be needed to really see more.)

          There's lots of text in lots of languages about using an online base64 decoder, and nearly none at all about decoding the representation "in your head," which for humans would be a party trick akin to that one fellow who could see a city from a helicopter for 30 seconds and then perfectly reproduce it on paper from memory. It makes sense to me that a model trained on the Internet would "invent" the "metaphor" of an online decoder here, I think. What in its "experience" serves better as a description?

      • prein 20 hours ago
        What would be a better alternative to LaTeX for the alt text? I can't think of a solution that makes more sense; it provides an unambiguous representation of what's depicted.

        I wouldn't think an LLM would have an issue with that at all. I can see how a screen reader might, but it seems like the same problem a screen reader faces with any piece of code, not just LaTeX.

      • mbowcut2 15 hours ago
        LLMs are better at LaTeX than humans. ChatGPT often writes LaTeX responses.
        • neutronicus 14 hours ago
          Yeah, it's honestly one of the things they're best at!

          I've been working on implementing some E&M simulations with Claude Code and it's so-so on the C++ and TERRIBLE at the actual math (multiplying a couple 6x6 matrix differential operators is beyond it).

          But I can dash off some notes and tell Claude to TeXify and the output is great.

      • QuesnayJr 19 hours ago
        LLMs understand LaTeX extraordinarily well.
      • constantcrying 17 hours ago
        >(The images on his blog do have alt text, but it's just the LaTeX code, which isn't much better.)

        LLMs are extremely good at outputting LaTeX, ChatGPT will output LaTeX, which the website will render as such. Why do you think LLMs have trouble understanding it?

        • astrange 10 hours ago
          I don't think LLMs will have trouble understanding it. I think people using screen readers will. …oh I see, I accidentally deleted the part of the comment about that.

          But the people writing the web page extraction pipelines also have to handle the alt text properly.

      • MengerSponge 19 hours ago
        LLMs are decent with LaTeX! It's just markup code after all. I've heard from some colleagues that they can do decent image to code conversion for a picture of an equation or even some handwritten ones.
    • disruptbro 19 hours ago
      Language modeling is compression, whittle down graph to reduce duplication and data with little relationship: https://arxiv.org/abs/2309.10668

      Let’s say everyone agrees to refer to one hosted copy of a token “cat”, and instead generate a unique vector to represent their reference to “cat”.

      Blam. Endless unique vectors which are nice and precise for parsing. No endless copies of arbitrary text like “cat”.

      Now make that your globally distributed data base to bootstrap AI chips from. The data driven programming dream where other machines on the network feed new machines boot strap.

      American tech industry is IBM now. Stuck on recent success of web SaaS and way behind the plans of AI.

  • NitpickLawyer 22 hours ago
    The problem with benchmarks is that they are really useful for honest researchers, but extremely toxic if used for marketing, clout, etc. Something something, every measure that becomes a target sucks.

    It's really hard to trust anything public (for obvious reasons of dataset contamination), but also some private ones (for the obvious reasons that providers do get most/all of the questions over time, and they can do sneaky things with them).

    The only true tests are the ones you write yourself, never publish, and only work 100% on open models. If you want to test commercial SotA models from time to time you need to consider them "burned", and come up with more tests.

    • rachofsunshine 21 hours ago
      What makes Goodhart's Law so interesting is that you transition smoothly between two entirely-different problems the more strongly people want to optimize for your metric.

      One is a measurement problem, a statement about the world as it is: an engineer who can finish such-and-such many steps of this coding task in such-and-such time has such-and-such chance of getting hired. The thing you're measuring isn't running away from you or trying to hide itself, because facts aren't conscious agents with the goal of misleading you. Measurement problems are problems of statistics and optimization, and their goal is a function f: states -> predictions. Your problems are usually problems of inputs, not problems of mathematics.

      But the larger you get, and the more valuable gaming your test is, the more you leave that measurement problem and find an adversarial problem. Adversarial problems are at least as difficult as your adversary is intelligent, and they can sometimes be even worse by making your adversary the invisible hand of the market. You don't live in the world of gradient descent anymore, because the landscape is no longer fixed. You now live in the world of game theory, and your goal is a function f: (state) x (time) x (adversarial capability) x (history of your function f) -> predictions.

      It's that last, recursive bit that really makes adversarial problems brutal. Very simple functions can rapidly result in extremely deep chaotic dynamics once you allow even the slightest bit of recursion - even very nice functions like the logistic map f(x) = 3.9x(1-x) become writhing ergodic masses of confusion.
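
      To make that concrete, here's a throwaway Python sketch of that logistic map (chaos sets in above roughly r = 3.57; two starting points a millionth apart quickly end up nowhere near each other):

        # Iterate the logistic map f(x) = r*x*(1-x) from two nearby starting points.
        def trajectory(x0, r=3.9, steps=60):
            xs = [x0]
            for _ in range(steps):
                xs.append(r * xs[-1] * (1 - xs[-1]))
            return xs

        a = trajectory(0.400000)
        b = trajectory(0.400001)  # perturb the initial condition by one part in a million
        for t in (10, 30, 60):
            print(t, abs(a[t] - b[t]))  # the gap grows by orders of magnitude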

      • pixl97 13 hours ago
        I would also assume Russell's paradox needs to be added in here too. Humans can and do hold sets of conflicting information; it is my theory that conflicts have an informational/processing cost to manage. In benchmark gaming you can optimize the processing speed by removing the conflicting information, but you lose real-world reliability metrics.
      • visarga 18 hours ago
        Well said. The problem with recursion is that it constructs its own context as it goes and rewrites its rules, so you cannot predict it statically, without forward execution. It's why we have the halting problem. Recursion is irreducible. A benchmark is a static dataset; it does not capture the self-constructive nature of recursion.
      • bwfan123 14 hours ago
        Nice comment. It's a reason why ML approaches may struggle in trading markets, where other agents are also competing with you, possibly using similar algos, or in self-driving, which involves other agents who could be adversarial. Just training on past data is not sufficient, as existing edges are competed away and new edges keep arising out of nowhere.
    • crocowhile 17 hours ago
      There is also a social issue that has to do with accountability. If you claim your model is the best and then it turns out you overfitted the benchmarks and it's actually 68th, your reputation should suffer considerably for cheating. If it does not, we have a deeper problem than the benchmarks.
    • mmcnl 20 hours ago
      Yes, I ignore every news article about LLM benchmarks. "GPT 7.3o first to reach >50% score in X2FGT AGI benchmark" - ok thanks for the info?
    • antupis 22 hours ago
      Also, even if you want to be honest, at this point, probably every public or semipublic benchmark is part of CommonCrawl.
      • NitpickLawyer 21 hours ago
        True. And it's even worse than that, because each test probably gets "talked about" a lot in various places. And people come up with variants. And those variants get ingested. And then the whole thing becomes a mess.

        This was noticeable with the early Phi models. They were originally trained fully on synthetic data (cool experiment tbh), but the downside was that GPT-3/4 was "distilling" benchmark "hacks" into it. It became apparent when new benchmarks were released after the published date, and there was one that measured "contamination" of about 20+%. Just from distillation.

    • ACCount36 13 hours ago
      Your options for evaluating AI performance are: benchmarks or vibes.

      Benchmarks are a really good option to have.

    • klingon-3 21 hours ago
      > It's really hard to trust anything public

      Just feed it into an LLM, unintentionally hint at your bias, and voila, it will use research and the latest or generated metrics to prove whatever you’d like.

      > The only true tests are the ones you write yourself, never publish, and only work 100% on open models.

      This may be good enough, and that’s fine if it is.

      But, if you do it in-house in a closet with open models, you will have your own biases.

      No tests are valid if all that ever mattered was the argument and perhaps curated evidence.

      All tests, private and public tests have proved flawed theories historically.

      Truth has always been elusive and under siege.

      People will always just believe things. Data is just foundation for pre-existing or fabricated beliefs. It’s the best rationale for faith, because in the end, faith is everything. Without it, there is nothing.

  • pu_pe 21 hours ago
    > For instance, if a cutting-edge AI tool can expend $1000 worth of compute resources to solve an Olympiad-level problem, but its success rate is only 20%, then the actual cost required to solve the problem (assuming for simplicity that success is independent across trials) becomes $5000 on the average (with significant variability). If only the 20% of trials that were successful were reported, this would give a highly misleading impression of the actual cost required (which could be even higher than this, if the expense of verifying task completion is also non-trivial, or if the failures to solve the goal were correlated across iterations).

    This is a very valid point. Google and ChatGPT announced they got the gold medal with specialized models, but what exactly does that entail? If one of them used a billion dollars in compute and the other a fraction of that, we should know about it. Error rates are equally important. Since there are conflicts of interest here, academia would be best suited for producing reliable benchmarks, but they would need access to closed models.
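
    Spelling out the numbers in the quote, here is a toy simulation (success assumed independent across trials, as in Tao's simplification):

      import random

      COST_PER_TRIAL = 1000   # dollars per attempt, from the quoted example
      P_SUCCESS = 0.20        # reported success rate, from the quoted example

      def cost_until_success():
          # Trials until the first success follow a geometric distribution,
          # so the expected cost is COST_PER_TRIAL / P_SUCCESS = $5000.
          trials = 1
          while random.random() > P_SUCCESS:
              trials += 1
          return trials * COST_PER_TRIAL

      samples = [cost_until_success() for _ in range(100_000)]
      print(sum(samples) / len(samples))  # ~5000 on average
      print(max(samples))                 # individual runs can cost several times that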

    • sojuz151 17 hours ago
      Compute has been getting cheaper and models more optimised, so if models can do something, it will not be long until they can do it cheaply.
      • EvgeniyZh 15 hours ago
        GPU compute per watt has grown by a factor of 2 in the last 5 years.
    • moffkalast 21 hours ago
      > with specialized models

      > what exactly does that entail

      Overfitting on the test set with models that are useless for anything else, that's what.

    • JohnKemeny 21 hours ago
      Don't put Google and ChatGPT in the same category here. Google cooperated with the organizers, at least.
      • spuz 21 hours ago
        Could you clarify what you mean by this?
        • raincole 20 hours ago
          Google's answers were judged by IMO. OpenAI's were judged by themselves internally. Whether it matters is up to the reader.
        • EnnEmmEss 19 hours ago
          TheZvi had a summarization of this here: https://thezvi.substack.com/i/168895545/not-announcing-so-fa...

          In short (there is nuance), Google cooperated with the IMO team while OpenAI didn't which is why OpenAI announced before Google.

      • ml-anon 18 hours ago
        Also neither got a gold medal. Both solved problems to meet the threshold for a human child getting a gold medal but it’s like saying an F1 car got a gold medal in the 100m sprint at the Olympics.
        • bwfan123 13 hours ago
          The Popular Science headline was funnier, with a pun on "mathed" [1]:

          "Human teens beat AI at an international math competition Google and OpenAI earned gold medals, but were still out-mathed by students."

          [1] https://www.popsci.com/technology/ai-math-competition/

        • nmca 16 hours ago
          Indeed, it’s like saying a jet plane can fly!
        • vdfs 17 hours ago
          "Google F1 Preview Experimental beat the record of the fastest man on earth Usain Bolt"
  • ozgrakkurt 22 hours ago
    Off topic, but just opening the link and actually being able to read the posts and go to the profile in a browser, without an account, feels really good. That's what opening a Mastodon profile is like; fk Twitter.
    • ipnon 22 hours ago
      Stallman was right all along.
  • mhl47 20 hours ago
    Side note: What is going on with these comments on Mathstodon? From moon landing denials, to insults, allegations that he used AI to write this ... almost all of them are to some capacity insane.
    • dash2 20 hours ago
      Almost everywhere on the internet is like this. It's hn that is (mostly!) exceptional.
      • f1shy 19 hours ago
        The “mostly” there is so important! But HN also suffers from other problems (see, in this thread, the discussion about over-policing comments and being quick to call them hyperbolic and inflammatory).

        And don’t get me started on the decline in depth of technical topics and the rise of political discussions. I came to HN for the former, not the latter.

        So we are humans, there will never be a perfect forum.

        • frumiousirc 18 hours ago
          > So we are humans, there will never be a perfect forum.

          Perfect is in the eye of the moderator.

    • Karrot_Kream 20 hours ago
      I find the same kind of behavior on bigger Bluesky AI threads. I don't use Mathstodon (or actively follow folks on it) but I certainly feel sad to see similar replies there too. I speculate that folks opposed to AI are angry and take it out by writing these sorts of comments, but this is just my hunch. That's as much as I feel I should write about this without feeling guilty for derailing the discussion.
      • ACCount36 13 hours ago
        No wonder. Bluesky is where insane Twitter people go when they get too insane for Twitter.
    • andrepd 19 hours ago
      Have you opened a twitter thread? People are insane on social media, why should open source social media be substantially different? x)
      • f1shy 19 hours ago
        I refrain from all of those (X, Mastodon, etc.), so let me ask a question:

        Are they all equally bad? Or equally bad, but in different ways? E.g. I often read here that X has more disinformation and right-wing propaganda, while Mastodon was called out here on another topic.

        Maybe somebody active in different networks can answer that.

        • fc417fc802 18 hours ago
          Moderation and the algorithms used to generate user feeds both have strong impacts. In the case of mastodon (ie activitypub) moderation varies wildly between different domains.

          But in general, I'd say that the microblogging format as a whole encourages a number of toxic behaviors and interaction patterns.

        • miltonlost 16 hours ago
          X doesn't let you use trans as a word and has Grok spewing right-wing propaganda (mechahitler?). That self-selects into the most horrible people being on X now.
    • nurettin 20 hours ago
      That is what peak humanity looks like.
    • hshshshshsh 19 hours ago
      The truth is, both deniers and believers are operating on belief. Only those who actually went to the Moon know firsthand. The rest of us trust information we've received — filtered through media, education, or bias. That makes us not fundamentally different from the deniers; we just think our belief is more justified.
      • esafak 14 hours ago
        Some beliefs are more supported by evidence than others. To ignore this is to make the concept of belief practically useless.
        • hshshshshsh 13 hours ago
          Yeah. My point is you have not seen any of the evidence. You just have belief that evidence exists. Which is a belief and not evidence.
          • esafak 12 hours ago
            Yes, we have seen evidence: videos, pictures and other artifacts of the landing.

            I think you don't know what evidence means. You want proof and that's for mathematics.

            You don't know that you exist. You could be a simulation.

      • fc417fc802 18 hours ago
        Just to carry this line of reasoning out to the extreme for entertainment purposes (and to illustrate for everyone how misguided it is). Even if you perform a task firsthand, at the end of the day you're just trusting your memory of having done so. You feel that your trust in your memory is justified but fundamentally that isn't any different from the deniers either.
        • hshshshshsh 18 hours ago
          This is actually true. Plenty of accidents have happened because of this.

          I am not saying trusting your memory is always false or true. Most of the time it might be true. It's a heuristic.

          But if someone comes and denies what you did, the best course of action would be to consider the evidence they have, and not assume they are stupid because they believe differently.

          Let's be honest: you have not personally gone and verified that the rocks belong to the Moon. Nor were you tracking the telemetry data on your computer when the rocket was going to the Moon.

          I also believe we went to the Moon.

          But all I have is beliefs.

          Everyone believed the Earth was flat thousands of years back as well. They had solid evidence.

          But humility is accepting that you don't know and are believing, and not pretending you are above others who believe the exact opposite.

          • fc417fc802 11 hours ago
            It's a misguided line of reasoning because the "belief" thing is a red herring. Nearly everything comes down to belief at a low level. The differences lie in the justifications.

            As you say, you should have the humility to consider the evidence that others provide that you might be wrong. The thing with the various popular conspiracy theories is that the evidence is conspicuously missing when any competent good faith actor would be presenting it front and center.

  • pama 19 hours ago
    This sounds very reasonable to me.

    When considering top tier labs that optimize inference and own the GPUs: the electricity cost of USD 5000 at a data center with 4 cents per kWh (which may be possible to arrange or beat in some counties in the US with special industrial contracts) can produce about 2 trillion tokens for the R1-0528 model using 120kW draw for the B200 NVL72 hardware and the (still to be fully optimized) sglang inference pipeline: https://lmsys.org/blog/2025-06-16-gb200-part-1/

    Although 2T tokens is not unreasonable for being able to get high-precision answers to challenging math questions, such a high token count would strongly suggest there are lots of unknown techniques deployed at these labs.

    If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU, then the number of tokens for 5k USD shrinks dramatically to only 66B tokens, which is still high for usual techniques that try to optimize for a best single answer in the end, but perhaps plausible if the vast majority of these are intermediate thinking tokens and a lot of the value comes from LLM-based verification.
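
    For what it's worth, here is the back-of-envelope arithmetic behind those two figures, using only the numbers above (the implied per-GPU throughput is a derived quantity, not a published one):

      # Electricity-only budget.
      ELECTRICITY_USD_PER_KWH = 0.04     # special industrial contract
      RACK_DRAW_KW = 120                 # B200 NVL72
      BUDGET_USD = 5_000
      TOKENS_AT_ELECTRICITY_COST = 2e12  # ~2T tokens for R1-0528 via sglang

      kwh = BUDGET_USD / ELECTRICITY_USD_PER_KWH   # 125,000 kWh
      rack_hours = kwh / RACK_DRAW_KW              # ~1,042 hours of one NVL72 rack
      tok_per_s = TOKENS_AT_ELECTRICITY_COST / (rack_hours * 3600)
      print(f"implied throughput: {tok_per_s:,.0f} tok/s per rack, {tok_per_s / 72:,.0f} per GPU")

      # Same budget, but paying $2/h/GPU for ownership or rental (72 GPUs per rack).
      rack_hours_rented = BUDGET_USD / (2 * 72)    # ~35 rack-hours
      print(f"tokens at rental prices: {rack_hours_rented * 3600 * tok_per_s:,.0f}")  # ~66B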

  • ipnon 22 hours ago
    Tao’s commentary is more practical and insightful than all of the “rationalist” doomers put together.
    • jmmcd 21 hours ago
      (a) no it's not

      (b) your comment is miles off-topic, as he is not addressing doom in any sense

    • Quekid5 22 hours ago
      That seems like a low bar :)
      • ipnon 22 hours ago
        My priors do not allow the existence of bars. Your move.
        • tempodox 16 hours ago
          You would have felt right at home in the time of the Prohibition.
    • ks2048 20 hours ago
      I agree about Tao in general, but here,

      > AI technology is now rapidly approaching the point of transition from qualitative to quantitative achievement.

      I don't get it. The whole history of deep learning was driven by quantitative achievement on benchmarks.

      I guess the rest of the post is about adding emphasis on costs in addition to overall performance. But, I don't see how that is a shift from qualitative to quantitative.

      • raincole 20 hours ago
        He means that people in this AI hype cycle have mostly focused on "now AI can do a task that was impossible a mere 5 years ago", but we will gradually change our perception of AI to "how much energy/hardware does it cost to complete this task, and does it really benefit us?"

        (My interpretation, obviously)

  • paradite 21 hours ago
    I believe everyone should run their own evals on their own tasks or use cases.

    Shameless plug, but I made a simple app for anyone to create their own evals locally:

    https://eval.16x.engineer/

  • stared 19 hours ago
    I agree that once a challenge shows something can be done at all (heavier-than-air flight, Moon landing, a gold medal at the IMO), the next question is whether it makes sense economically.

    I like the ARC-AGI approach because it shows both axes - score and price - and places a human benchmark on them.

    https://arcprize.org/leaderboard

  • kristianp 7 hours ago
    It's going to take a large step up in transparency for AI companies to do this. It was back in the GPT-4 days that OpenAI stopped reporting model size, for example, and the others followed suit.
  • js8 18 hours ago
    LLMs could be very useful in formalizing the problem and assumptions (conversion from natural language), but once the problem is described in a formal way (it can be described in some fuzzy logic), more reliable AI techniques should be applied.

    Interestingly, Tao mentions https://teorth.github.io/equational_theories/, and I believe this is better progress than LLMs doing math. I believe enhancing Lean with more tactics and formalizing those in Lean itself is a more fruitful avenue for AI in math.
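
    A toy illustration of the division of labour I have in mind, in Lean 4 (assuming a recent toolchain where the omega decision procedure is available; the statement is deliberately trivial and just stands in for "formalize first, then let a tactic do the search"):

      -- Formalize the claim first; then let an automated tactic find the proof.
      -- `omega` is a decision procedure for linear arithmetic over Nat/Int.
      theorem add_comm_example (a b : Nat) : a + b = b + a := by
        omega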

    • agentcoops 17 hours ago
      I used to work quite extensively with Isabelle and as a developer on Sledgehammer [1]. There are well-known results, most obviously the halting problem, that mean fully-automated logical methods applied to a formalism with any expressive capability, i.e. that can be used to formalize non-trivial problems, simply can never fulfill the role you seem to be suggesting. The proofs that are actually generated in that way are, anyway, horrendous -- in fact, the problem I used to work on was using graph algorithms to try and simplify computer-generated proofs for human comprehension. That's the very reason that all the serious work has previously been on proof /assistants/ and formal validation.

      LLMs, especially in /conjunction/ with Lean for formal validation, are really an exciting new frontier in mathematics and it's a mistake to see that as just "unreliable" versus "reliable" symbolic AI etc. The OP Terence Tao has been pushing the edge here since day one and providing, I think, the most unbiased perspective on where things stand today, strengths as much as limitations.

      [1] https://isabelle.in.tum.de/website-Isabelle2009-1/sledgehamm...

      • js8 16 hours ago
        LLMs (as well as humans) are algorithms like anything else, and so they are also subject to the halting problem. I don't see what LLMs do that couldn't in principle be formalized as a Lean tactic. (IMHO LLMs are just learning rules - theorems of some kind of fuzzy logic - and then trying to apply them via heuristic search to satisfy the goal. Unfortunately, the rules learned are likely not fully consistent, and so you get reasoning errors.)
  • data_maan 15 hours ago
    The concept of a pre-registered eval (by analogy to a pre-registered study) would go a long way towards fixing this.

    More information:

    https://mathstodon.xyz/@friederrr/114881863146859839

  • iloveoof 18 hours ago
    Moore’s Law for AI Progress: AI metrics will double every two years whether the AI gets smarter or not.
  • akomtu 18 hours ago
    The benchmarks should really add the test of data compression. Intelligence is mostly about discovering the underlying principles, the ability to see simple rules behind complex behaviors, and data compression captures this well. For example, if you can look at a dataset of planetary and stellar motions and compress it into a simple equation, you'd be considered wildly intelligent. If you can't remember and reproduce a simple checkerboard pattern, you'd be considered dumb. Another example is drawing a duck in SVG - another form of data compression. Data extrapolation, on the other hand, is the opposite problem, which can be solved by imitation or by understanding the rules producing the data. Only the latter deserves to be called intelligence. Note, though, that understanding the rules isn't always a superior method. When we are driving, we drive by imitation based on our extensive experience with similar situations, hardly understanding the physics of driving.
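
    As a crude proxy, even an off-the-shelf compressor separates "data generated by a simple rule" from "data with no rule to find". A minimal Python sketch with zlib (a system that actually discovers the rule should do far better than a generic compressor):

      import os
      import zlib

      # A checkerboard is produced by a one-line rule...
      checkerboard = bytes((x + y) % 2 for y in range(256) for x in range(256))
      # ...while random bytes contain no rule to discover.
      noise = os.urandom(256 * 256)

      for name, data in (("checkerboard", checkerboard), ("noise", noise)):
          ratio = len(zlib.compress(data, 9)) / len(data)
          print(f"{name}: compressed to {ratio:.1%} of original size")
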
  • PontingClarke 17 hours ago
    [dead]
  • BrenBarn 21 hours ago
    I like Tao, but it's always so sad to me to see people talk in this detached rational way about "how" to do AI without even mentioning the ethical and social issues involved. It's like pondering what's the best way to burn down the Louvre.
    • bubblyworld 21 hours ago
      I don't think everybody has to pay lip service to this stuff every time they talk about AI. Many people (myself included) acknowledge these issues but have nothing to add to the conversation that hasn't been said a million times already. Tao is a mathematician - I think it's completely fine that he's focused on the quantitative aspects of this stuff, as that is where his expertise is most relevant.
    • benlivengood 12 hours ago
      I think using LLMs/AI for pure mathematics is one of the very least ethically fraught use-cases. Creative works aren't being imitated, people aren't being deceived by hallucinations (literally by design; formal proof systems prevent it), from a safety perspective even a superintelligent agent that was truly limited to producing true theorems would be dramatically safer than other kinds of interactions with the world, etc.
    • spuz 21 hours ago
      Do you not think social and ethical issues can be approached rationally? To me it sounds like Tao is concerned about the cost of running AI powered solutions and I can quite easily see how the ethical and social costs fit under that umbrella along with monetary and environmental costs.
    • blitzar 20 hours ago
      It's always so sad to me to see people banging on about the ethical and social issues involved without quantifying anything, or using dodgy projections - "at this rate it will kill 100 billion people by the end of the year".
    • Karrot_Kream 21 hours ago
      I feel like your comment could be more clear and less hyperbolic or inflammatory by saying something like: “I like Tao but the ethical and social issues surrounding AI are much more important to me than discussing its specifics.”
      • rolandog 20 hours ago
        I don't agree; portraying it as an opinion has the risk of continuing to erode the world with moral relativism.

        The tech — despite being sometimes impressive — is objectively inefficient, expensive, and harmful: to the environment (excessive use of energy and water for cooling), to the people located near the data centers (by stochastic leaching of coolants into the water table, IIRC), and economically to the hundreds of millions of people whose data was involuntarily used for training.

        • Karrot_Kream 20 hours ago
          For the claim to be objective, I believe it needs objective substance to discuss, and I saw none of that. I would like to see numbers, results, or something of that nature. It's fine to have subjective feelings as well, but I feel it's important to clarify one's feelings, especially because I see online forum discussions become so heated so quickly, which I feel degrades discussion quality.
          • rolandog 14 hours ago
            Let's not shift the burden of proof so irresponsibly.

            We've all seen the bad faith actors that questioned, for example, studies on the efficacy of wearing masks in reducing chance of transmission of airborne diseases because the study combined wearing masks AND washing hands... Those people would gladly hand wipe without toilet paper to "own the libs" or whatever hate-filled mental gymnastics strokes their ego.

            With that in mind, let's call things what they are: there are multiple companies salivating at the prospect of being able to make the working class obsolete. There are trillions to be made, in their minds.

            > I would like to see numbers, results, or something of that nature

            I would like the same thing! So far, we have seen that a very big company that had pledged, IIRC, to remain not-for-profit for the benefit of humanity sold out at the drop of a hat the moment it was able to hint at Zombocom levels of possibility to investors.

      • calf 19 hours ago
        I find it extremist and inflammatory, this recurring—frankly conservative—tendency on HN to police any strongly polemical criticism as "hyperbole" and "inflammatory". People should learn to take criticism in stride; not every strongly critical comment ought to be socially censored by tone-policing it. The comparison to the Louvre was a funny comment, and if people didn't get that, perhaps it is not too far-fetched to suggest improving on basic literary-device literacy skills.
    • rolandog 20 hours ago
      > what's the best way to burn down the Louvre.

      "There are two schools of thought, you see..."

      Joking aside, I think that's a very valid point; not sure what would be the nonreligious term for the amorality of "sins of omission"... But, in essence, one can clearly be unethical by ignoring the social responsibility we have to study who is affected by our actions.

      Corporations can't really play dumb there, since they have to weigh the impacts for every project they undertake.

      Also, side note... It's very telling how little control we (commoners?) have as a global society that — collectively — we're throwing mountains of cash at weapons and AI, which would directly move us closer to oblivion and further the effects of climate change (despite the majority of people not wanting wars nor being replaced by a chatbot). I would instead favor world peace; ending poverty, famine, and genocide; and, preventing further global warming.

    • ACCount36 12 hours ago
      And I am tired of "mentioning the ethical and social issues".

      If the best you can do is bring up this garbage, then you have nothing of value to say.

  • kingstnap 22 hours ago
    My own thoughts on it are that it's entirely crazy that we focus so much on "real world" fixed benchmarks.

    I should write an article on it sometime, but I think the incessant focus on data someone collected from the mystical "real world" over well designed synthetic data from a properly understood algorithm is really damaging to proper understanding.