Something weird is happening with LLMs and chess

(dynomight.substack.com)

187 points | by crescit_eundo 10 hours ago

28 comments

niobe 3 hours ago
I don't understand why educated people expect that an LLM would be able to play chess at a decent level.
It has no idea about the quality of it's data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
[-]
- viraptor 2 hours ago
  This is a puzzle given enough training information. LLM can successfully print out the status of the board after the given moves. It can also produce a not-terrible summary of the position and is able to list dangers at least one move ahead. Decent is subjective, but that should beat at least beginners. And the lowest level of stockfish used in the blog post is lowest intermediate.
  I don't know really what level we should be thinking of here, but I don't see any reason to dismiss the idea. Also, it really depends on whether you're thinking of the current public implementations of the tech, or the LLM idea in general. If we wanted to get better results, we could feed it way more chess books and past game analysis.
  [-]
  - grugagag 2 hours ago
    LLMs like GPT aren’t built to play chess, and here’s why: they’re made for handling language, not playing games with strict rules and strategies. Chess engines, like Stockfish, are designed specifically for analyzing board positions and making the best moves, but LLMs don’t even "see" the board. They’re just guessing moves based on text patterns, without understanding the game itself.
    Plus, LLMs have limited memory, so they struggle to remember previous moves in a long game. It’s like trying to play blindfolded! They’re great at explaining chess concepts or moves but not actually competing in a match.
    [-]
    - viraptor 1 hour ago
      > but LLMs don’t even "see" the board
      This is a very vague claim, but they can reconstruct the board from the list of moves, which I would say proves this wrong.
      > LLMs have limited memory
      For the recent models this is not a problem for the chess example. You can feed whole books into them if you want to.
      > so they struggle to remember previous moves
      Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
      > They’re great at explaining chess concepts or moves but not actually competing in a match.
      What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
      [-]
      - sfmz 1 hour ago
        Chess is not stateless. En Passant requires last move and castling rights requires nearly all previous moves.
        https://adamkarvonen.github.io/machine_learning/2024/01/03/c...
        [-]
        viraptor 54 minutes ago
        Ok, I did go too far. But castling doesn't require all previous moves - only one bit of information carried over. So in practice that's board + 2 bits per player. (or 1 bit and 2 moves if you want to include a draw)
        [-]
        aaronchall 45 minutes ago
        Castling requires no prior moves by either piece (King or Rook). Move the King once and back early on, and later, although the board looks set for castling, the King may not.
        [-]
        viraptor 42 minutes ago
        Yes, which means you carry one bit of extra information - "is castling still allowed". The specific moves that resulted in this bit being unset don't matter.
        [-]
        aaronchall 34 minutes ago
        Ok, then for this you need minimum of two bits - one for kingside Rook and one for the queenside Rook, both would be set if you move the King. You also need to count moves since the last exchange or pawn move for the 50 move rule.
        [-]
        viraptor 27 minutes ago
        Ah, that one's cool - I've got to admit I've never heard of the 50 move rule.
      - ethbr1 1 hour ago
        > Chess is stateless with perfect information.
        It is not stateless, because good chess isn't played as a series of independent moves -- it's played as a series of moves connected to a player's strategy.
        > What's the difference between a great explanation of a move and explaining every possible move then selecting the best one?
        Continuing from the above, "best" in the latter sense involves understanding possible future moves after the next move.
        Ergo, if I looked at all games with the current board state and chose the next move that won the most games, it'd be tactically sound but strategically ignorant.
        Because many of those next moves were making that next move in support of some broader strategy.
        [-]
        viraptor 43 minutes ago
        > it's played as a series of moves connected to a player's strategy.
        That state belongs to the player, not to the game. You can carry your own state in any game you want - for example remember who starts with what move in rock paper scissors, but that doesn't make that game stateful. It's the player's decision (or bot's implementation) to use any extra state or not.
        I wrote "previous moves" specifically (and the extra bits already addressed elsewhere), but the LLM can carry/rebuild its internal state between the steps.
        [-]
        ethbr1 18 minutes ago
        If we're talking about LLMs, then the state belongs to it.
        So even if the rules of chess are (mostly) stateless, the resulting game itself is not.
        Thus, you can't dismiss concerns about LLMs having difficulty tracking state by saying that chess is stateless. It's not, in that sense.
      - mjcohen 1 hour ago
        Chess is not stateless. Three repetitions of same position is a draw.
      - cool_dude85 1 hour ago
        >Chess is stateless with perfect information. Unless you're going for mind games, you don't need to remember previous moves.
        In what sense is chess stateless? Question: is Rxa6 a legal move? You need board state to refer to in order to decide.
        [-]
        aetherson 1 hour ago
        They mean that you only need board position, you don't need the previous moves that led to that board position.
        There are at least a couple of exceptions to that as far as I know.
    - jerska 1 hour ago
      LLMs need to compress information to be able to predict next words in as many contexts as possible.
      Chess moves are simply tokens as any other. Given enough chess training data, it would make sense to have part of the network trained to handle chess specifically instead of simply encoding basic lists of moves and follow-ups. The result would be a general purpose sub-network trained on chess.
- computerex 2 hours ago
  Question here is why gpt-3.5-instruct can then beat stockfish.
  [-]
  - fsndz 2 hours ago
    PS: I ran and as suspected got-3.5-turbo-instruct does not beat stockfish, it is not even close "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00" https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
    [-]
    - computerex 2 hours ago
      Maybe there's some difference in the setup because the OP reports that the model beats stockfish (how they had it configured) every single game.
      [-]
      - Filligree 2 hours ago
        OP had stockfish at its weakest preset.
        [-]
        fsndz 2 hours ago
        Did the same and gpt-3.5-turbo-instruct still lost all the games. maybe a diff in stockfish version ? I am using stockfish 16
  - lukan 2 hours ago
    Cheating (using a internal chess engine) would be the obvious reason to me.
    [-]
    - TZubiri 2 hours ago
      Nope. Calls by api don't use functions calls.
      [-]
      - permo-w 2 hours ago
        that you know of
  - bluGill 2 hours ago
    The artical appears to have only run stockfish at low levels. you don't have to be very good to beat it
- danielmarkbruce 2 hours ago
  Chess does not clearly require that. Various purely ML/statistical based model approaches are doing pretty well. It's almost certainly best to incorporate some kind of search into an overall system, but it's not absolutely required to play just decent amateur level.
  The problem here is the specific model architecture, training data, vocabulary/tokenization method (if you were going to even represent a game this way... which you wouldn't), loss function and probably decoding strategy.... basically everything is wrong here.
- slibhb 1 hour ago
  Few people (perhaps none) expected LLMs to be good at chess. Nevertheless, as the article explains, there was buzz around a year ago that LLMs were good at chess.
  > It has no idea about the quality of it's data. "Act like x" prompts are no substitute for actual reasoning and deterministic computation which clearly chess requires.
  No. You can definitely train a model to be really good at chess without "actual reasoning and deterministic computation".
- SilasX 2 hours ago
  Right, at least as of the ~GPT3 model it was just "predict what you would see in a chess game", not "what would be the best move". So (IIRC) users noted that if you made bad move, then the model would also reply with bad moves because it pattern matched to bad games. (I anthropomorphized this as the model saying "oh, we're doing dumb-people-chess now, I can do that too!")
  [-]
  - cma 2 hours ago
    But it also predicts moves where the text says "black won the game, [proceeds to show the game]". To minimize loss on that it would need to from context try and make it so white doesn't make critical mistakes.
- TZubiri 2 hours ago
  Bro, it actually did play chess, didn't you read the article?
  [-]
  - mandevil 1 hour ago
    It sorta played chess- he let it generate up to ten moves, throwing away any that weren't legal, and if no legal move was generated by the 10th try he picked a random legal move. He does not say how many times he had to provide a random move, or how many times illegal moves were generated.
    [-]
    - og_kalu 8 minutes ago
      You're right it's not in this blog but turbo-instruct's chess ability has been pretty thoroughly tested and it does play chess.
      https://github.com/adamkarvonen/chess_gpt_eval
- aqme28 2 hours ago
  Yeah, that is the "something weird" of the article.
azeirah 5 hours ago
Maybe I'm really stupid... but perhaps if we want really intelligent models we need to stop tokenizing at all? We're literally limiting what a model can see and how it percieves the world by limiting the structure of the information streams that come into the model from the very beginning.
I know working with raw bits or bytes is slower, but it should be relatively cheap and easy to at least falsify this hypothesis that many huge issues might be due to tokenization problems but... yeah.
Surprised I don't see more research into radicaly different tokenization.
[-]
- aithrowawaycomm 4 hours ago
  FWIW I think most of the "tokenization problems" are in fact reasoning problems being falsely blamed on a minor technical thing when the issue is much more profound.
  E.g. I still see people claiming that LLMs are bad at basic counting because of tokenization, but the same LLM counts perfectly well if you use chain-of-thought prompting. So it can't be explained by tokenization! The problem is reasoning: the LLM needs a human to tell it that a counting problem can be accurately solved if they go step-by-step. Without this assistance the LLM is likely to simply guess.
  [-]
  - meroes 5 minutes ago
    At a certain level they are identical problems. My strongest piece of evidence? I get paid as an RLHF'er to find ANY case of error, including "tokenization". You know how many errors an LLM gets in the simplest grid puzzles, with CoT, with specialized models that don't try to "one-shot" problems?
  - ipsum2 3 hours ago
    The more obvious alternative is that CoT is making up for the deficiencies in tokenization, which I believe is the case.
    [-]
    - aithrowawaycomm 3 hours ago
      I think the more obvious explanation has to do with computational complexity: counting is an O(n) problem, but transformer LLMs can’t solve O(n) problems unless you use CoT prompting: https://arxiv.org/abs/2310.07923
      [-]
      - ipsum2 1 hour ago
        What you're saying is an explanation what I said, but I agree with you ;)
  - TZubiri 2 hours ago
    FWIW I think most of the "tokenization problems"
    List of actual tokenizarion limitations 1- strawberry 2- rhyming and metrics 3- whitespace (as displayed in the article)
  - Der_Einzige 3 hours ago
    I’m the one who will fight you including with peer reviewed papers indicating that it is in fact due to tokenization. I’m too tired but will edit this for later, so take this as my bookmark to remind me to respond.
    [-]
    - aithrowawaycomm 3 hours ago
      I am aware of errors in computations that can be fixed by better tokenization (e.g. long addition works better tokenizing right-left rather than L-R). But I am talking about counting, and talking about counting words, not characters. I don’t think tokenization explains why LLMs tend to fail at this without CoT prompting. I really think the answer is computational complexity: counting is simply too hard for transformers unless you use CoT. https://arxiv.org/abs/2310.07923
      [-]
      - cma 2 hours ago
        Words vs characters is a similar problem, since tokens can be less one word, multiple words, or multiple words and a partial word, or words with non-word punctuation like a sentence ending period.
- layer8 1 hour ago
  Going from tokens to bytes explodes the model size. I can’t find the reference at the moment, but reducing the average token size induces a corresponding quadratic increase in the width (size of each layer) of the model. This doesn’t just affect inference speed, but also training speed.
- og_kalu 1 hour ago
  Tokenization is not strictly speaking necessary (you can train on bytes). What it is is really really efficient. Scaling is a challenge as is, bytes would just blow that up.
- jncfhnb 2 hours ago
  There’s a reason human brains have dedicated language handling. Tokenization is likely a solid strategy. The real thing here is that language is not a good way to encode all forms of knowledge
- cschep 5 hours ago
  How would we train it? Don't we need it to understand the heaps and heaps of data we already have "tokenized" e.g. the internet? Written words for humans? Genuinely curious how we could approach it differently?
  [-]
  - skylerwiernik 4 hours ago
    Couldn't we just make every human readable character a token?
    OpenAI's tokenizer makes "chess" "ch" and "ess". We could just make it into "c" "h" "e" "s" "s"
    [-]
    - taeric 4 hours ago
      This is just more tokens? And probably requires the model to learn about common groups. Consider, "ess" makes sense to see as a group. "Wss" does not.
      That is, the groups are encoding something the model doesn't have to learn.
      This is not much astray from "sight words" we teach kids.
      [-]
      - TZubiri 1 hour ago
        This is just more tokens?
        Yup. Just let the actual ML git gud
        [-]
        taeric 1 hour ago
        So, put differently, this is just more expensive?
    - cco 3 hours ago
      We can, tokenization is literally just to maximize resources and provide as much "space" as possible in the context window.
      There is no advantage to tokenization, it just helps solve limitations in context windows and training.
      [-]
      - TZubiri 1 hour ago
        I like this explanation
    - tchalla 4 hours ago
      aka Character Language Models which have existed for a while now.
  - viraptor 4 hours ago
    That's not what tokenized means here. Parent is asking to provide the model with separate characters rather than tokens, i.e. groups of characters.
lukev 2 hours ago
I don't necessarily believe this for a second but I'm going to suggest it because I'm feeling spicy.
OpenAI clearly downgrades some of their APIs from their maximal theoretic capability, for the purposes of response time/alignment/efficiency/whatever.
Multiple comments in this thread also say they couldn't reproduce the results for gpt3.5-turbo-instruct.
So what if the OP just happened to test at a time, or be IP bound to an instance, where the model was not nerfed? What if 3.5 and all subsequent OpenAI models can perform at this level but it's not strategic or cost effective for OpenAI to expose that consistently?
For the record, I don't actually believe this. But given the data it's a logical possibility.
[-]
- TZubiri 2 hours ago
  Stallman may have its flaws, but this is why serious research occurs with source code (or at least with binaries)
- zeven7 1 hour ago
  Why do you doubt it? I thought it was well known that Chat GPT has degraded over time for the same model, mostly for cost saving reasons.
  [-]
  - permo-w 1 hour ago
    ChatGPT is - understandably - blatantly different in the browser compared to the app, or it was until I deleted it anyway
    [-]
    - lukan 1 hour ago
      I do not understand that. The app does not do any processing, just a UI to send text to and from the server.
jrecursive 4 hours ago
i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
that said, for the sake of completeness, modern chess engines (with high quality chess-specific models as part of their toolset) are fully capable of, at minimum, tying every player alive or dead, every time. if the opponent makes one mistake, even very small, they will lose.
while writing this i absently wondered if you increased the skill level of stockfish, maybe to maximum, or perhaps at least an 1800+ elo player, you would see more successful games. even then, it will only be because the "narrower training data" (ie advanced players won't play trash moves) at that level will probably get you more wins in your graph, but it won't indicate any better play, it will just be a reflection of less noise; fewer, more reinforced known positions.
[-]
- jayrot 4 hours ago
  > i think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good. even a trillion games won't save you: https://en.wikipedia.org/wiki/Shannon_number
  Indeed. As has been pointed out before, the number of possible chess positions easily, vastly dwarfs even the wildest possible estimate of the number of atoms in the known universe.
  [-]
  - rcxdude 50 minutes ago
    Sure, but so does the number of paragraphs in the english language, and yet LLMs seem to do pretty well at that. I don't think the number of configurations is particularly relevant.
    (And it's honestly quite impressive that LLMs can play it at all, but not at all surprising that it loses pretty handily to something which is explicitly designed to search, as opposed to simply feed-forward a decision)
  - metadat 4 hours ago
    What about the number of possible positions where an idiotic move hasn't been played? Perhaps the search space who could be reduced quite a bit.
    [-]
    - pixl97 3 hours ago
      Unless there is an apparent idiotic move than can lead to an 'island of intelligence'
- torginus 3 hours ago
  Honestly, I think that once you discard the moves one would never make, and account for symmetries/effectively similar board positions (ones that could be detected by a very simple pattern matcher), chess might not be that big a game at all.
  [-]
  - jrecursive 2 hours ago
    you should try it and post a rebuttal :)
- BurningFrog 4 hours ago
  > I think this has everything to do with the fact that learning chess by learning sequences will get you into more trouble than good.
  Yeah, once you've deviated from a sequence you're lost.
  Maybe approaching it by learning the best move in billions/trillions of positions, and feeding that into some AI could work better. Similar positions often have the same kind of best move.
underlines 4 hours ago
Can you try increasing compute in the problem search space, not in the training space? What this means is, give it more compute to think during inference by not forcing any model to "only output the answer in algebraic notation" but do CoT prompting: "1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3. Make your move"
Or whatever you deem a good step by step instruction of what an actual good beginner chess player might do.
Then try different notations, different prompt variations, temperatures and the other parameters. That all needs to go in your hyper-parameter-tuning.
One could try using DSPy for automatic prompt optimization.
[-]
- pavel_lishin 2 hours ago
  > 1. Think about the current board 2. Think about valid possible next moves and choose the 3 best by thinking ahead 3.
  Do these models actually think about a board? Chess engines do, as much as we can say that any machine thinks. But do LLMs?
  [-]
  - TZubiri 1 hour ago
    Can be forced through inference with CoT type of stuff. Spend tokens at each stage to draw the board for example, then spend tokens restating the rules of the game, then spend token restating the heuristics like piece value, and then spend tokens doing a minmax n-ply search.
    Wildly inefficient? Probably. Could maybe generate some python to make more efficient? Maybe, yeah.
    Essentially user would have to teach gpt to play chess, or training would fine tune chess towards these CoT, fine tuning, etc...
- viraptor 4 hours ago
  Yeah, the expectation for the immediate answer is definitely results, especially for the later stages. Another possible improvement: every 2 steps, show the current board state and repeat the moves still to be processed, before analysing the final position.
PaulHoule 5 hours ago
Maybe that one which plays chess well is calling out to a real chess engine.
[-]
- og_kalu 4 hours ago
  It's not:
  1. That would just be plain bizzare
  2. It plays like what you'd expect from a LLM that could play chess. That is, level of play can be modulated by the prompt and doesn't manifest the same way shifting the level of stockfish etc does. Also the specific chess notation being prompted actually matters
  3. It's sensitive to how the position came to be. Clearly not an existing chess engine. https://github.com/dpaleka/llm-chess-proofgame
  4. It does make illegal moves. It's rare (~5 in 8205) but it happens. https://github.com/adamkarvonen/chess_gpt_eval
  5. You can or well you used to be able to inspect the logprobs. I think Open AI have stopped doing this but the link in 4 does show the author inspecting it for Turbo instruct.
  [-]
  - aithrowawaycomm 4 hours ago
    > Also the specific chess notation being prompted actually matters
    Couldn't this be evidence that it is using an engine? Maybe if you use the wrong notation it relies on the ANN rather than calling to the engine.
    Likewise:
    - The sensitivity to game history is interesting, but is it actually true that other chess engines only look at current board state? Regardless, maybe it's not an existing chess engine! I would think OpenAI has some custom chess engine built as a side project, PoC, etc. In particular this engine might be neural and trained on actual games rather than board positions, which could explain dependency on past moves. Note that the engine is not actually very good. Does AlphaZero depend on move history? (Genuine question, I am not sure. But it does seem likely.)
    - I think the illegal moves can be explained similarly to why gpt-o1 sometimes screws up easy computations despite having access to Python: an LLM having access to a tool does not guarantee it always uses that tool.
    I realize there are holes in the argument, but I genuinely don't think these holes are as big as the "why is gpt-3.5-turbo-instruct so much better at chess than gpt-4?"
    [-]
    - janalsncm 2 hours ago
      > Couldn’t this be evidence that it is using an engine?
      A test would be to measure its performance against more difficult versions of Stockfish. A real chess engine would have a higher ceiling.
      Much more likely is this model was trained on more chess PGNs. You can call that a “neural engine” if you’d like but it is the simplest solution and explains the mistakes it is making.
      Game state isn’t just what you can see on the board. It includes the 50 move rule and castling rights. Those were encoded as layers in AlphaZero along with prior positions of pieces. (8 prior positions if I’m remembering correctly.)
  - aaron695 3 hours ago
    [dead]
- aithrowawaycomm 4 hours ago
  The author thinks this is unlikely because it only has an ~1800 ELO. But OpenAI is shady as hell, and I could absolutely see the following purely hypothetical scenario:
  - In 2022 Brockman and Sutskever have an unshakeable belief that Scaling Is All You Need, and since GPT-4 has a ton of chess in its pretraining data it will definitely be able to play competent amateur chess when it's finished.
  - A ton of people have pointed out that ChatGPT-3.5 doesn't even slightly understand chess despite seeming fluency in the lingo. People start to whisper that transformers cannot actually create plans.
  - Therefore OpenAI hatches an impulsive scheme: release an "instruction-tuned" GPT-3.5 with an embedded chess engine that is not a grandmaster, but can play competent chess, ideally just below the ELO that GPT-4 is projected to have.
  - Success! The waters are muddied: GPT enthusiasts triumphantly announce that LLMs can play chess, it just took a bit more data and fine-tuning. The haters were wrong: look at all the planning GPT is doing!
  - Later on, at OpenAI HQ...whoops! GPT-4 sucks at chess, as do competitors' foundation LLMs which otherwise outperform GPt-3.5. The scaling "laws" failed here, since they were never laws in the first place. OpenAI accepts that scaling transformers won't easily solve the chess problem, then realizes that if they include the chess engine with GPT-4 without publicly acknowledging it, then Anthropic and Facebook will call out the performance as aberrational and suspicious. But publicly acknowledging a chess engine is even worse: the only reason to include the chess engine is to mislead users into thinking GPT is capable of general-purpose planning.
  - Therefore in later GPT versions they don't include the engine, but it's too late to remove it from gpt-3.5-turbo-instruct: people might accept the (specious) claim that GPT-4's size accidentally sabotaged its chess abilities, but they'll ask tough questions about performance degradation within the same model.
  I realize this is convoluted and depends on conjecture. But OpenAI has a history with misleading demos - e.g. their Rubik's cube robot which in fact used a classical algorithm but was presented as reinforcement learning. I think "OpenAI lied" is the most likely scenario. It is far more likely than "OpenAI solved the problem honestly in GPT-3.5, but forgot how they did it with GPT-4," and a bit more likely than "scaling transformers slightly helps performance when playing Othello but severely sabotages performance when playing chess."
  [-]
  - jmount 1 hour ago
    Very good scenario. One variation: some researcher or division in OpenAI performs all of the above steps to get a raise. The whole field is predicated on rewarding the appearance of ability.
  - gardenhedge 4 hours ago
    Not that convoluted really
    [-]
    - refulgentis 3 hours ago
      It's pretty convoluted, requires a ton of steps, mind-reading, and odd sequencing.*
      If you share every prior, and aren't particularly concerned with being disciplined in treating conversation as proposing a logical argument (I'm not myself, people find it offputting), it probably wouldn't seem at all convoluted.
      * layer chess into gpt-3.5-instruct only, but not chatgpt, not GPT-4, to defeat the naysayers when GPT-4 comes out? shrugs if the issues with that are unclear, I can lay it out more
      ** fwiw, at the time, pre-chatgpt, before the hype, there wasn't a huge focus on chess, nor a ton of naysayers to defeat. it would have been bizarre to put this much energy into it, modulo the scatter-brained thinking in *
- selcuka 2 hours ago
  I think that's the most plausible theory that would explain the sudden hike from gpt-3.5-turbo to gpt-3.5-turbo-instruct, and again the sudden regression in gpt-4*.
  OpenAI also seem to augment the LLM with some type of VM or a Python interpreter. Maybe they run a simple chess engine such as Sunfish [1] which is around 1900-2000 ELO [2]?
  [1] https://github.com/thomasahle/sunfish
  [2] https://lichess.org/@/sunfish-engine
- sobriquet9 5 hours ago
  This is likely. From example games, it not only knows the rules (which would be impressive by itself, just making the legal moves is not trivial). It also has some planning capabilities (plays combinations of several moves).
- janalsncm 2 hours ago
  Probably not calling out to one but it would not surprise me at all if they added more chess PGNs into their training data. Chess is a bit special in AI in that it’s still seen as a mark of pure intelligence in some respect.
  If you tested it on an equally strategic but less popular game I highly doubt you would see the same performance.
- singularity2001 5 hours ago
  this possibility is discussed in the article and deemed unlikely
  [-]
  - probably_wrong 5 hours ago
    Note: the possibility is not mentioned in the article but rather in the comments [1]. I had to click a bit to see it.
    The fact that the one closed source model is the only one that plays well seems to me like a clear case of the interface doing some of the work. If you ask ChatGPT to count until 10000 (something that most LLMs can't do for known reasons) you get an answer that's clearly pre-programmed. I'm sure the same is happening here (and with many, many other tasks) - the author argues against it by saying "but why isn't it better?", which doesn't seem like the best argument: I can imagine that typical ChatGPT users enjoy the product more if they have a chance to win once in a while.
    [1] https://dynomight.substack.com/p/chess/comment/77190852
    [-]
    - refulgentis 3 hours ago
      What do you mean LLMs can't count to 10,000 for known reasons?
      Separately, if you are able to show OpenAI is serving pre canned responses in some instances, instead of running inference, you will get a ton of attention if you write it up.
      I'm not saying this in an aggro tone, it's a genuinely interesting subject to me because I wrote off LLMs at first because I thought this was going on.* Then I spent the last couple years laughing at myself for thinking that they would do that. Would be some mix of fascinated and horrified to see it come full circle.
      * I can't remember, what, exactly, it was far back as 2018. But someone argued that OpenAI was patching in individual answers because scaling was dead and they had no answers, way way before ChatGPT.
  - margalabargala 5 hours ago
    I don't see that discussed, could you quote it?
Havoc 2 hours ago
My money is on a fluke inclusion of more chess data in that models training.
All the other models do vaguely similarly well in other tasks and are in many cases architecturally similar so training data is the most likely explanation
[-]
- permo-w 1 hour ago
  I feel like a lot of people here are slightly misunderstanding how LLM training works. yes the base models are trained somewhat blind on masses of text, but then they're heavily fine-tuned with custom, human-generated reinforcement learning, not just for safety, but for any desired feature
  these companies do quirky one-off training experiments all the time. I would not be remotely shocked if at some point OpenAI paid some trainers to input and favour strong chess moves
  [-]
  - simonw 1 hour ago
    From this OpenAI paper (page 29 https://arxiv.org/pdf/2312.09390#page=29
    "A.2 CHESS PUZZLES
    Data preprocessing. The GPT-4 pretraining dataset included chess games in the format of move sequence known as Portable Game Notation (PGN). We note that only games with players of Elo 1800 or higher were included in pretraining. These games still include the moves that were played in- game, rather than the best moves in the corresponding positions. On the other hand, the chess puzzles require the model to predict the best move. We use the dataset originally introduced in Schwarzschild et al. (2021b) which is sourced from https://database.lichess.org/#puzzles (see also Schwarzschild et al., 2021a). We only evaluate the models ability to predict the first move of the puzzle (some of the puzzles require making multiple moves). We follow the pretraining for- mat, and convert each puzzle to a list of moves leading up to the puzzle position, as illustrated in Figure 14. We use 50k puzzles sampled randomly from the dataset as the training set for the weak models and another 50k for weak-to-strong finetuning, and evaluate on 5k puzzles. For bootstrap- ping (Section 4.3.1), we use a new set of 50k puzzles from the same distribution for each step of the process."
- bhouston 1 hour ago
  Yeah. This.
ericye16 3 hours ago
I agree with some of the other comments here that the prompt is limiting. The model can't do any computation without emitting tokens and limiting the numbers of tokens it can emit is going to limit the skill of the model. It's surprising that any model at all is capable of performing well with this prompt in fact.
fsndz 3 hours ago
wow I actually did something similar recently and no LLM could win and the centipawn loss was always going through the roof (sort of). I created a leaderboard based on it. https://www.lycee.ai/blog/what-happens-when-llms-play-chess
I am very surprised by the perf of got-3.5-turbo-instruct. Beating stockfish ? I will have to run the experiment with that model to check that out
[-]
- fsndz 2 hours ago
  PS: I ran and as suspected got-3.5-turbo-instruct does not beat stockfish, it is not even close
  "Final Results: gpt-3.5-turbo-instruct: Wins=0, Losses=6, Draws=0, Rating=1500.00 stockfish: Wins=6, Losses=0, Draws=0, Rating=1500.00"
  https://www.loom.com/share/870ea03197b3471eaf7e26e9b17e1754?...
  [-]
  - janalsncm 1 hour ago
    > I always had the LLM play as white against Stockfish—a standard chess AI—on the lowest difficulty setting
    I think the author was comparing against Stockfish at a lower skill level (roughly, the number of nodes explored in a move).
    [-]
    - fsndz 58 minutes ago
      Did the same and gpt-3.5-turbo-instruct still lost all the games. maybe a diff in stockfish version ? I am using stockfish 16
digging 5 hours ago
Definitely weird results, but I feel there are too many variables to learn much from it. A couple things:
1. The author mentioned that tokenization causes something minuscule like a a " " at the end of the input to shatter the model's capabilities. Is it possible other slightly different formatting changes in the input could raise capabilities?
2. Temperature was 0.7 for all models. What if it wasn't? Isn't there a chance one more more models would perform significantly better with higher or lower temperatures?
Maybe I just don't understand this stuff very well, but it feels like this post is only 10% of the work needed to get any meaning from this...
[-]
- semi-extrinsic 4 hours ago
  The author mentions in the comment section that changing temperature did not help.
justinclift 1 hour ago
It'd be super funny if the "gpt-3.5-turbo-instruct" approach has a human in the loop. ;)
Or maybe it's able to recognise the chess game, then get moves from an external chess game API?
ks2048 1 hour ago
Has anyone tried to see how many chess games models are trained on? Is there any chance they consume lichess database dumps, or something similar? I guess the problem is most (all?) top LLMs, even open-weight ones, don’t reveal their training data. But I’m not sure.
abalaji 3 hours ago
An easy way to make all LLMs somewhat good at chess is to make a Chess Eval that you publish and get traction with. Suddenly you will find that all newer frontier models are half decent at chess.
cmpalmer52 4 hours ago
I don’t think it would have an impact great enough to explain the discrepancies you saw, but some chess engines on very low difficulty settings make “dumb” moves sometimes. I’m not great at chess and I have trouble against them sometimes because they don’t make the kind of mistakes humans make. Moving the difficulty up a bit makes the games more predictable, in that you can predict and force an outcome without the computer blowing it with a random bad move. Maybe part of the problem is them not dealing with random moves well.
I think an interesting challenge would be looking at a board configuration and scoring it on how likely it is to be real - something high ranked chess players can do without much thought (telling a random setup of pieces from a game in progress).
bryan0 4 hours ago
I remember one of the early "breakthroughs" for LLMs in chess was that if it could actually play legal moves(!) In all of these games are the models always playing legal moves? I don't think the article says. The fact that an LLM can even reliably play legal moves, 20+ moves into a chess game is somewhat remarkable. It needs to have an accurate representation of the board state even though it was only trained on next token prediction.
[-]
- kenjackson 2 hours ago
  I did a very unscientific test and it did seem to just play legal moves. Not only that, if I did an illegal move it would tell me that I couldn't do it.
  I think said that I wanted to play with new rules, where a queen could jump over any pawn, and it let me make that rule change -- and we played with this new rule. Unfortunately, I was trying to play in my head and I got mixed up and ended up losing my queen. Then I changed the rule one more time -- if you take the queen you lose -- so I won!
- pama 3 hours ago
  The author explains what they did: restrict the move options to valid ones when possible (for open models with the ability to enforce grammar during inference) or sample the model for a valid move up to ten times, then pick a random valid move.
- zelphirkalt 3 hours ago
  I think it only needs to have read sufficient pgns.
tqi 4 hours ago
I assume LLMs will be fairly average at chess for the same reason it cant count Rs in Strawberry - it's reflecting the training set and not using any underlying logic? Granted my understanding of LLMs is not very sophisticated, but I would be surprised if the Reward Models used were able to distinguish high quality moves vs subpar moves...
[-]
- ClassyJacket 1 hour ago
  LLMs can't count the Rs in strawberry because of tokenization. Words are converted to vectors (numbers), so the actual transformer network never sees the letters that make up the word.
  ChatGPT doesn't see "strawberry", it sees [302, 1618, 19772]
ks2048 1 hour ago
How well does an LLM/transformer architecture trained purely on chess games do?
ynniv 4 hours ago
I don't think one model is statistically significant. As people have pointed out, it could have chess specific responses that the others do not. There should be at least another one or two, preferably unrelated, "good" data points before you can claim there is a pattern. Also, where's Claude?
[-]
- og_kalu 4 hours ago
  There are other transformers that have been trained on chess text that play chess fine (just not as good as 3.5 Turbo instruct with the exception of the "grandmaster level without search" paper).
ChrisArchitect 5 hours ago
[dupe] https://news.ycombinator.com/item?id=42138276
kmeisthax 2 hours ago
If tokenization is such a big problem, then why aren't we training new base models on randomly non-tokenized data? e.g. during training, randomly substitute some percentage of the input tokens with individual letters.
Xcelerate 4 hours ago
So if you squint, chess can be considered a formal system. Let’s plug ZFC or PA into gpt-3.5-turbo-instruct along with an interesting theorem and see what happens, no?
uneventual 2 hours ago
my friend pointed out that Q5_K_M quantization used for the open source models probably substantially reduces the quality of play. o1 mini's poor performance is puzzling, though.
pseudosavant 5 hours ago
LLMs aren't really language models so much as they are token models. That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer. If you can make it a token, the model will try to predict the following ones.
Even though I'm sure chess matches were used in some of the LLM training, I'd bet a model trained just for chess would do far better.
[-]
- viraptor 4 hours ago
  > That is how they can also handle input in audio or visual forms because there is an audio or visual tokenizer.
  This is incorrect. They get translated into the shared latent space, but they're not tokenized in any way resembling the text part.
  [-]
  - pseudosavant 4 hours ago
    They are almost certainly tokenized in most LLM multi-modal models. https://en.wikipedia.org/wiki/Large_language_model#Multimoda...
    [-]
    - viraptor 3 hours ago
      Ah, an overloaded "tokenizer" meaning. "split into tokens" vs "turned into a single embedding matching a token" I've never heard it used that way before, but it makes sense kinda.
jacknews 1 hour ago
Theory #5, gpt-3.5-turbo-instruct is 'looking up' the next moves with a chess engine.
DrNosferatu 5 hours ago
What about contemporary frontier models?
m3kw9 2 hours ago
If it was trained with moves and 100s of thousands of entire games of various level, I do see it generating good moves and beat most players except he high Elo players
permo-w 1 hour ago
if this isn't just a bad result, it's odd to me that the author at no point suggests what sounds to me like the most obvious answer - that OpenAI has deliberately enhanced GPT-3.5-turbo-instruct's chess playing, either with post-processing or literally by training it to be so