AI Meets WinDBG

(svnscha.de)

290 points | by thunderbong 57 days ago

17 comments

emeryberger 57 days ago
You should check out ChatDBG project - which AFAICT goes much further than this work, though in a different direction, and which, among other things, lets the LLM drive the debugging process - has been out since early 2023. We initially did a WinDBG integration but have since focused on lldb/gdb and pdb (the Python debugger), especially for Python notebooks. In particular, for native code, it integrates a language server to let the LLM easily find declarations and references to variables, for example. We spent considerable time developing an API that enabled the LLM to make the best use of the debugger’s capabilities. (It also is not limited to post mortem debugging). ChatDBG’s been out since 2023, though it has of course evolved since that time. Code is here [1] with some videos; it’s been downloaded north of 80K times to date. Our technical paper [2] will be presented at FSE (top software engineering conference) in June. Our evaluation shows that ChatDBG is on its own able to resolve many issues, and that with some slight nudging from humans, it is even more effective.
[1] https://github.com/plasma-umass/ChatDBG (north of 75K downloads to date) [2] https://arxiv.org/abs/2403.16354
[-]
- Everdred2dx 57 days ago
  Is the benefit of using a language server as opposed to just giving access to the codebase simply a reduction in the amount of tokens used? Or are there other benefits?
  [-]
  - nicovank 57 days ago
    Beyond saving tokens, this greatly improved the quality and speed of answers: the language server (most notably used to find the declaration/definition of an identifier) gives the LLM
    1. a shorter path to relevant information by querying for specific variables or functions rather than longer investigation of source code. LLMs are typically trained/instructed to keep their answers within a range of tokens, so keeping shorter conversations when possible extends the search space the LLM will be "willing" to explore before outputting a final answer.
    2. a good starting point in some cases by immediately inspecting suspicious variables or function calls. In my experience this happens a lot in our Python implementation, where the first function calls are typically `info` calls to gather background on the variables and functions in frame.
  - emeryberger 57 days ago
    Yes. It lets the LLM immediately obtain precise information rather than having to reason across the entire source code of the code base (which ChatDBG also enables). For example (from the paper, Section 4.6):
```
  The second command, `definition`, prints the location and source
  code for the definition corresponding to the first occurrence of a symbol on a
  given line of code. For example, `definition polymorph.c:118` target prints the
  location and source for the declaration of target corresponding to its use on
  that line. The definition implementation
  leverages the `clangd` language server, which supports source code queries via
  JSON-RPC and Microsoft’s Language Server Protocol.
```
- layer8 57 days ago
  > We initially did a WinDBG integration but have since focused on lldb/gdb and pdb (the Python debugger), especially for Python notebooks.
  That’s kinda beside the point then, if you want to do Windows debugging. Or am I missing something?
  [-]
  - guipsp 57 days ago
    You can do windows debugging with lldb and gdb.
- 3abiton 55 days ago
  While not explicitly mentioned, why only openai api is supported? Folks with local LLMs feels left out.
  [-]
  - nicovank 55 days ago
    This is in the works :) In the past and still today, OpenAI has much better and easier function call support, something we rely on.
    We currently are running through LiteLLM, so while undocumented in theory other LLMs could work (in my experience they don't). I’m working on updating and fixing this.
lowleveldesign 57 days ago
I do a lot of Windows troubleshooting and still thinking about incorporating AI in my work. The posted project looks interesting and it's impressive how fast it was created. Since it's using MCP it should be possible to bind it with local models. I wonder how performant and effective it would be. When working in the debugger, you should be careful with what you send to the external servers (for example, Copilot). Process memory may contain unencrypted passwords, usernames, domain configuration, IP addresses, etc. Also, I don’t think that vibe-debugging will work without knowing what eax registry is or how to navigate stack/heap. It will solve some obvious problems, such as most exceptions, but for anything more demanding (bugs in application logic, race conditions, etc.), you will still need to get your hands dirty.
I am actually more interested in improving the debugger interface. For example, AI assistant could help me create breakpoint commands that nicely print function parameters when you only partly know the function signature and do not have symbols. I used Claude/Gemini for such tasks and they were pretty good at it.
As a side note, I recall Kevin Gosse also implemented a WinDbg extension [1][2] which used OpenAI API to interpret the debugger command output.
[1] https://x.com/KooKiz/status/1641565024765214720
[2] https://github.com/kevingosse/windbg-extensions
anougaret 57 days ago
this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic or happening because of long chains of events across multiple services/layers of the stack
imo what AI needs to debug is either:
- train with RL to use breakpoints + debugger or to do print debugging, but that'll suck because chains of action are super freaking long and also we know how it goes with AI memory currently, it's not great
- a sort of omniscient debugger always on that can inform the AI of all that the program/services did (sentry-like observability but on steroids). And then the AI would just search within that and find the root cause
none of the two approaches are going to be easy to make happen but imo if we all spend 10+ hours every week debugging that's worth the shot
that's why currently I'm working on approach 2. I made a time travel debugger/observability engine for JS/Python and I'm currently working on plugging it into AI context the most efficiently possible so it debugs even super long sequences of actions in dev & prod hopefully one day
it's super WIP and not self-hostable yet but if you want to check it out: https://ariana.dev/
[-]
- ehnto 57 days ago
  I think you hit the nail on the head, especially for deeply embedded enterprise software. The long action chains/time taken to set up debugging scenarios is what makes debugging time consuming. Solving the inference side of things would be great, but I feel it takes too much knowledge not in the codebase OR the LLM to actually make an LLM useful once you are set up with a debugging state.
  Like you said, running over a stream of events, states and data for that debugging scenario is probably way more helpful. It would also be great to prime the context with business rules and history for the company. Otherwise LLMs will make the same mistake devs make, not knowing the "why" something is and thinking the "what" is most important.
  [-]
  - anougaret 57 days ago
    thanks couldn't agree more :)
- indymike 57 days ago
  > this is pretty cool but ultimately it won't be enough to debug real bugs that are nested deep within business logic
  I'm looking at this as a better way to get the humans pointed in the right direction. Ariana.dev looks interesting!
  [-]
  - Narishma 57 days ago
    It's more likely to waste your time by pointing you in the wrong direction.
    [-]
    - anougaret 57 days ago
      hahaha yeah, even real developers cannot anticipate too well what the direction of a bug is on the first try
  - anougaret 57 days ago
    yes can be a nice lightweight way to debug with a bit of AI other tools in that space will pbly be higher involvement
- rafaelmn 57 days ago
  Frankly this kind of stuff getting upvoted kind of makes HN less and less valuable as a news source - this is yet another "hey I trivially exposed something to the LLM and I got some funny results on a toy example".
  These kind of demos were cool 2 years ago - then we got function calling in the API, it became super easy to build this stuff - and the reality hit that LLMs were kind of shit and unreliable at using even the most basic tools. Like oh woow you can get a toy example working on it and suddenly it's a "natural language interface to WinDBG".
  I am excited about progress in this front in any domain - but FFS show actual progress or something interesting. Show me an article like this [1] where the LLM did anything useful. Or just show what you did that's not "oh I built a wrapper on a CLI" - did you fine tune the model to get better performance ? Did you compare which model performs better by setting up some benchmark and found one to be impressive ?
  I am not shitting on OP here because it's fine to share what you're doing and get excited about it - maybe this is step one, but why the f** is this a front page article ?
  [1]https://cookieplmonster.github.io/2025/04/23/gta-san-andreas...
  [-]
  - anougaret 57 days ago
    yeah it is still truly hard and rewarding to do deep, innovative software but everyone is regressing to the mean, rushing to low hanging fruits, and just plugging old A with new B in the hopes it makes them VC money or something
    real, quality AI breakthrough in software creation & maintenance will require deep rework of many layers in the software stack, low and high level.
- kevingadd 57 days ago
  fwiw, WinDBG actually has support for time-travel debugging. I've used it before quite successfully, it's neat.
  [-]
  - anougaret 57 days ago
    usual limits of debuggers = barely usable to debug real scenarios
    [-]
    - pjmlp 57 days ago
      Since Borland days on MS-DOS they have served me pretty well in many real scenarios.
      Usually what I keep bumping into, are people that never bothered to learn how to use their debuggers beyond the "introduction to debuggers" class, if any.
danielovichdk 57 days ago
Claiming to use WinDBG for debugging a crash dump and the only commands I can find in the MCP code are these ? I am not trying to be a dick here, but how does this really work under the covers ? Is the MCP learning windbg ? Is there a model that knows windbg ? I am asking becuase I have no idea.
```
        results["info"] = session.send_command(".lastevent")
        results["exception"] = session.send_command("!analyze -v")
        results["modules"] = session.send_command("lm")
        results["threads"] = session.send_command("~")
```
You cannot debug a crash dump only with these 4 commands, all the time.
[-]
- psanchez 57 days ago
  It looks like it is using "Microsoft Console Debugger (CDB)" as the interface to windbg.
  Just had a quick look at the code: https://github.com/svnscha/mcp-windbg/blob/main/src/mcp_serv...
  I might be wrong, but at first glance I don't think it is only using those 4 commands. It might be using them internally to get context to pass to the AI agent, but it looks like it exposes:
```
    - open_windbg_dump
    - run_windbg_cmd
    - close_windbg_dump
    - list_windbg_dumps
```
  The most interesting one is "run_windbg_cmd" because it might allow the MCP server to send whatever the AI agent wants. E.g:
```
    elif name == "run_windbg_cmd":
        args = RunWindbgCmdParams(**arguments)
        session = get_or_create_session(
            args.dump_path, cdb_path, symbols_path, timeout, verbose
        )
        output = session.send_command(args.command)
        return [TextContent(
            type="text",
            text=f"Command: {args.command}\n\nOutput:\n```\n" + "\n".join(output) + "\n```"
        )]
```
  (edit: formatting)
  [-]
  - svnscha 57 days ago
    Yes, that's exactly the point. LLMs "know" about WinDBG and its commands. So if you ask to switch the stack frame, inspect structs, memory or heap - it will do so and give contextual answers. Trivial crashes are almost analyzed fully autonomous whereas for challenging ones you can get quite a cool assistant on your side, helping you to analyze data, patterns, structs - you name it.
- gustavoaca1997 57 days ago
  I think the magic happens in the function "run_windbg_cmd". AFAIK, the agent will use that function to pass any WinDBG command that the model thinks will be useful. The implementation basically includes the interface between the model and actually calling CDB through CDBSession.
  [-]
  - eknkc 57 days ago
    Yeah that seems correct. It's like creating an SQLite MCP server with single tool "run_sql". Which is just fine I guess as long as the LLM knows how to write SQL (or WinDBG commands). And they definitely do know that. I'd even say this is better because this shifts the capability to LLM instead of the MCP.
- dark-star 57 days ago
  The magic happens in the "analyze -v" part. This does quite a long analysis of a crashdump (https://learn.microsoft.com/en-us/windows-hardware/drivers/d...)
  After that, all that is required is interpreting the results and connecting it with the source code.
  Still impressive at first glance, but I wonder how well it works with a more complex example (like a crash in the Windows kernel due to a broken driver, for example)
JanneVee 57 days ago
> Crash dump analysis has traditionally been one of the most technically demanding and least enjoyable parts of software development.
I for one enjoy crashdump analysis because it is a technically demanding rare skill. I know I'm an exception but I enjoy actually learning the stuff so I can deterministically produce the desired result! I even apply it to other parts of the job, like learning to currently used programming language and actually reading the documentation libraries/frameworks, instead of copy pasting solutions from the "shortcut du jour" like stack overflow yesterday and LLMs of today!
[-]
- criddell 57 days ago
  Are you using WinDbg? What resources did you use to get really good at it?
  Analyzing crash dumps is a small part of my job. I know enough to examine exception context records and associated stack traces and 80% of the time, that’s enough. Bruce Dawson’s blog has a lot of great stuff but it’s pretty advanced.
  I’m looking for material to help me jump that gap.
  [-]
  - muststopmyths 57 days ago
    so, debuggers are really just tools. To get "good" at analyzing crashdumps, you have to understand the OS and its process/threading model, the ABI of the platform, a little (to a lot) of assembler etc.
    There's no magic to getting good at it. Like anything else, it's mostly about practice.
    People like Bruce and Raymond Chen had a little bit of a leg up over people outside Microsoft in that if you worked in the Windows division, you got to look at more dumps than you'd have wanted to in your life. That plus being immersed in the knowledge pool and having access to Windows source code helps to speed up learning.
    Which is to say, you will eventually "bridge the gap" with them with experience. Just keep plugging at it and eventually you'll understand what to look for and how to find it.
    It helps that in a given application domain the nature of crashes will generally be repeated patterns. So after a while you start saying "oh, I bet this is a version of that other thing I've seen devs stumble over all the time".
    A bit of a rambling comment to say don't worry. you'll "get really good at it" with experience.
  - JanneVee 57 days ago
    I didn't say that I was any good, just that I enjoyed it.
    I have a dog eared copy of Advanced Windows Debugging that I've used, but I've also have books around reverse engineering, disassembly and a little bit of curiosity and practice. I have also the .NET version which I haven't used as much. I also enjoyed the Vostokov books, even though there is a lack of editing in them.
    Edit to add: It is not as much focus on usage of the tool as it is about understanding what is going on in the dump file, you are ahead in knowledge if you can do stack traces and look at exception records.
the_duke 57 days ago
I feel like current top models (Gemini Pro 2.5 etc) would already be good developers if they had the feedback cycle and capabilities that real developers have:
* reading the whole source code
* looking up dependency documentation and code, search related blog posts
* getting compilation/linter warnings ands errors
* Running tests
* Running the application and validating output (eg, for a webserver, start the server, send requests, get the response)
The tooling is slowly catching up, and you can enable a bunch of this already with MCP servers, but we are nowhere near the optimum yet.
Expect significant improvements in the near future, even if the models don't get better.
[-]
- thegeomaster 57 days ago
  This is exactly what frameworks like Claude Code, OpenAI Codex, Cursor agent mode, OpenHands, SWE-Agent, Devin, and others do.
  It definitely does allow models to do more.
  However, the high-level planning, reflection and executive function still aren't there. LLMs can nowadays navigate very complex tasks using "intuition": just ask them to do the task, give them tools, and they do a good job. But if the task is too long or requires too much information, the context length deteriorates the performance significantly, so you have to switch to a multi-step pipeline with multiple levels of execution.
  This is, perhaps unexpectedly, where things start breaking down. Having the LLM write down a plan lossily compresses the "intuition", and LLMs (yes, even Gemini 2.5 Pro) cannot understand what's important to include in such a grand plan, how to predict possible externalities, etc. This is a managerial skill and seems distinct from closed-form coding, which you can always RL towards.
  Errors, omissions, and assumptions baked into the plan get multiplied many times over by the subsequent steps that follow the plan. Sometimes, the plan heavily depends on the outcome of some of the execution steps ("investigate if we can..."). Allowing the "execution" LLM to go back and alter the plan results in total chaos, but following the plan rigidly leads to unexpectedly stupid issues, where the execution LLM is trying to follow flawed steps, sometimes even recognizing that they are flawed and trying to self-correct inappropriately.
  In short, we're still waiting for an LLM which can keep track of high-level task context and effectively steer and schedule lower-level agents to complete a general task on a larger time horizon.
  For a more intuitive example, see how current agentic browser use tools break down when they need to complete a complex, multi-step task. Or just ask Claude Code to do a feature in your existing codebase (that is not simple CRUD) the way you'd tell a junior dev.
  [-]
  - pjmlp 57 days ago
    I expect that if I use the way I would tell an offshoring junior dev, to the way that I actually get a swing instead of a tire, then it will get quite close to the desired outcome.
    However, this usually takes much more effort than just doing the damm thing myself.
- demarq 57 days ago
  It’s now a matter of when, and I’m working on that problem.
JanSchu 57 days ago
This is one of the most exciting and practical applications of AI tooling I've seen in a long time. Crash dump analysis has always felt like the kind of task that time forgot—vital, intricate, and utterly user-hostile. Your approach bridges a massive usability gap with the exact right philosophy: augment, don't replace.
A few things that stand out:
The use of MCP to connect CDB with Copilot is genius. Too often, AI tooling is skin-deep—just a chat overlay that guesses at output. You've gone much deeper by wiring actual tool invocations to AI cognition. This feels like the future of all expert tooling.
You nailed the problem framing. It’s not about eliminating expertise—it’s about letting the expert focus on analysis instead of syntax and byte-counting. Having AI interpret crash dumps is like going from raw SQL to a BI dashboard—with the option to drop down if needed.
Releasing it open-source is a huge move. You just laid the groundwork for a whole new ecosystem. I wouldn’t be surprised if this becomes a standard debug layer for large codebases, much like Sentry or Crashlytics became for telemetry.
If Microsoft is smart, they should be building this into VS proper—or at least hiring you to do it.
Curious: have you thought about extending this beyond crash dumps? I could imagine similar integrations for static analysis, exploit triage, or even live kernel debugging with conversational AI support.
Amazing work. Bookmarked, starred, and vibed.
[-]
- svnscha 57 days ago
  Yes, I've thought about this already! Right now I'm exploring crash dump analysis, but static analysis and reverse engineering are definitely areas where such assistants can help. LLMs are surprisingly good at understanding disassembly, which makes this really exciting, beyond crash dump analysis. Besides that, I think, assisted perf trace analysis may be another cool area to explore.
  Domain expertise remains crucial though. As complexity increases, you need to provide guidance to the LLM. However, when the model understands specialized tools well - like WinDBG in my experience - it can propose valuable next steps. Even when it slightly misses the mark, course correction is quick.
  I've invested quite some time using WinDBG alongside Copilot (specifically Claude in my configuration), analyzing memory dumps, stack frames, variables, and inspect third-party structures in memory. While not solving everything automatically, it substantially enhances productivity.
  Consider this as another valuable instrument in your toolkit. I hope tool vendors like Microsoft continue integrating these capabilities directly into IDEs rather than requiring external solutions. This approach to debugging and analysis tools is highly effective, and many already incorporate AI capabilities.
  What Copilot currently lacks is the ability to configure custom Agents with specific System Prompts. This would advance these capabilities significantly - though .github/copilot-instructions.md does help somewhat, it's not equivalent to defining custom system prompts or creating a chart participant enabling Agent mode. This functionality will likely arrive eventually.
  Other tools already allowing system prompt customization might yield even more interesting results. Reducing how often I need to redirect the LLM could further enhance productivity in this area.
  The whole point of this was me chatting with Copilot about a crash dump and I asked him about what the command for some specific task is, because I didn't remember and it suggested me which commands I could further try to investigate something and I was like - wait, what if I let him do this automatically?
  That's basically the whole idea behind. Me being too lazy to copy-paste Copilot's suggestions to my WinDBG and while this was just a test at first, becoming a proof of concept and now, almost overnight got quite a lot of attention. I am probably excited the same way as you are.
- Helmut10001 57 days ago
  I have noticed a lot of improvements in this area too. I recently had a problem with my site-to-site IPSEC connection. I had an LLM explain the logs from both sides and together we came to a conclusion. Distilling the problematic part from the huge logs was a significant effort and time saver.
- tyg13 57 days ago
  Why does this read like an LLM produced it?
lgiordano_notte 57 days ago
Curious how you're handling multi-step flows or follow-ups, seems like thats where MCP could really shine especially compared to brittle CLI scripts. We've seen similar wins with browser agents once structured actions and context are in place.
cadamsdotcom 57 days ago
Author built an MCP server for windbg: https://github.com/svnscha/mcp-windbg
Knows plenty of arcane commands in addition to the common ones, which is really cool & lets it do amazing things for you, the user.
To the author: most of your audience knows what MCP is, may I suggest adding a tl;dr to help people quickly understand what you've done?
codepathfinder 57 days ago
Built this around 2023 mid and found interesting results!
Tepix 57 days ago
Sounds really neat!
How does it compare to using the Ghidra MCP server?
[-]
- mahmoudimus 57 days ago
  Ghidra is actually a suite of reverse engineering toolkits, including, but not limited to a disassembler, a decompiler and a debugger interface that interfaces with many debuggers, among other neat things.
  A disassembler takes compiled binaries and displays the assembly code the machine executes.
  A decompiler translates the disassembled code back to pseudocode (e.g. disassembly -> C).
  A debugger lets you step through the disassembly. Windbg is a debugger which is pretty powerful, but has the downside of a pretty unintuitive syntax (but I'm biased coming from gdb/llvm debuggers).
  Both the MCP servers can probably be used together, but they both do different things. A neat experiment would be to see if they're aware of each other and can utilize each other to "vibe reverse"
- cjbprime 57 days ago
  Ghidra's a decompiler and WinDBG is a debugger, so they'd be complementary.
  [-]
  - Tepix 53 days ago
    Ghidra also has a debugger, see https://github.com/NationalSecurityAgency/ghidra/blob/master...
- trealira 57 days ago
  This isn't a decompiler, but there are LLM tools for decompilation, like LLM4Decompile.
alexvitkov 57 days ago
Watching a guy type at 30 WPM in a chatbox reminds me of those old YouTube tutorials where some dude is typing into into a notepad window, and showing you how to make a shortcut to "shutdown -s -t 0" on your school computer and give it the Internet Explorer icon. It's only missing Linkin Park blasting in the background.
If you're debugging from a crash dump you probably have a large, real world program, that actual people have reviewed, deemed correct and released in the wild.
Current LLMs can't produce a sane program over 500 lines, the idea that they can understand a correct looking program several orders of magnitude larger, well enough to diagnose and fix a subtle issue that the people who wrote the it missed, is absurd.
[-]
indigodaddy 57 days ago
My word, that's one of the most beautiful sites I've ever encountered on mobile.
[-]
- svnscha 57 days ago
  Thank you - using Zola and Apollo Theme (slightly modified) - https://www.getzola.org/themes/apollo/
Zebfross 57 days ago
Considering AI is trained on the average human experience, I have a hard time believing it would be able to make any significant difference in this area. The best experience I’ve had debugging at this level was using Microsoft’s time travel debugger which allows stepping forward and back.
[-]
- cjbprime 57 days ago
  You should try AI sometime. It's quite good, and can do things (like "analyze these 10000 functions and summarize what you found out about how this binary works, including adding comments everywhere) that individual humans do not scale to.
- s3cfast 57 days ago
  George Hotz's Qira is a timeless debugger; worth checking out too!
- voidspark 57 days ago
  It can analyze a crash dump in 2 seconds, that could take hours for an experienced developer, or impossible for the "average human".
- userbinator 57 days ago
  [flagged]
  [-]
  - danb1974 57 days ago
    That's not how averages work :)
    [-]
    - FirmwareBurner 57 days ago
      Yes, but we all understood the essence of what he meant and he's right. Why be a stickler about it.
      [-]
      - userbinator 57 days ago
        [flagged]
        [-]
        hhh 57 days ago
        Is it brigading if people just have the opinion that AI can do well at software?
        [-]
        userbinator 57 days ago
        Only half of them.
  - posnet 57 days ago
    Median
    [-]
    - spacechild1 57 days ago
      Human intelligence roughly follows a normal distribution where the median is the same as the mean. In that sense OP was correct that half of the population are below average.
dogleash 57 days ago
[flagged]
revskill 57 days ago
[flagged]
[-]