Scaling long-running autonomous coding

(cursor.com)

61 points | by samwillis 1 hour ago

13 comments

  • micimize 22 minutes ago
    > While it might seem like a simple screenshot, building a browser from scratch is extremely difficult.

    > Another experiment was doing an in-place migration of Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started to test the changes, we do believe it's possible to merge this change.

    In my view, this post does not go into sufficient detail or nuance to warrant any serious discussion, and the sparseness of info mostly implies failure, especially in the browser case.

    It _is_ impressive that the browser repo can do _anything at all_, but if there was anything more noteworthy than that, I feel they'd go into more detail than volume metrics like 30K commits, 1M LoC. For instance, the entire capability on display could be constrained to a handful of lines that delegate to other libs.

    And, it "is possible" to merge any change that avoids regressions, but the majority of our craft asks the question "Is it possible to merge _the next_ change? And the next, and the 100th?"

    If they merge the MR they're walking the walk.

    If they present more analysis of the browser it's worth the talk (not that useful a test if they didn't scrutinize it beyond "it renders")

    Until then, it's a mountain of inscrutable agent output that manages to compile, and that contains an execution pathway which can screenshot apple.com by some undiscovered mechanism.

    • embedding-shape 21 minutes ago
      > it's a mountain of inscrutable agent output that manages to compile

      But is this actually true? They don't say that as far as I can tell, and it doesn't compile for me, nor in their own CI, it seems.

  • simonw 1 hour ago
    "To test this system, we pointed it at an ambitious goal: building a web browser from scratch."

    I shared my LLM predictions last week, and one of them was that by 2029 "Someone will build a new browser using mainly AI-assisted coding and it won’t even be a surprise" https://simonwillison.net/2026/Jan/8/llm-predictions-for-202... and https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3913s

    This project from Cursor is the second attempt I've seen at this now! The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...

    • mrefish 40 minutes ago
      Time to raise the bar. By 2029 someone will build a new browser using mainly AI-assisted coding and the surprise is that it was designed to be used by pelicans.
    • bob1029 26 minutes ago
      The goal I am currently using for long horizon coding experiments is implementation of a PDF rasterizer given an ISO32000 specification document.
    • cheevly 1 hour ago
      2029? I have no idea why you would think this is so far off. More like Q2 2026.
      • xmprt 51 minutes ago
        You're either overestimating the capabilities of current AI models or underestimating the complexity of building a web browser. There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.
      • gordonhart 22 minutes ago
        Web browsers are insanely hard to get right; that’s why there are only ~3 decent implementations out there currently.
      • geeunits 43 minutes ago
        because it makes him look smart when inevitably he's 'right'
  • jphelan 47 minutes ago
    This looks like extremely brittle code to my eyes. Look at https://github.com/wilsonzlin/fastrender/blob/main/crates/fa...

    What is `FrameState::render_placeholder`?

    ```
    impl FrameState {
      pub fn render_placeholder(&self, frame_id: FrameId) -> Result<FrameBuffer, String> {
        let (width, height) = self.viewport_css;
        let len = (width as usize)
          .checked_mul(height as usize)
          .and_then(|px| px.checked_mul(4))
          .ok_or_else(|| "viewport size overflow".to_string())?;

        if len > MAX_FRAME_BYTES {
          return Err(format!(
            "requested frame buffer too large: {width}x{height} => {len} bytes"
          ));
        }

        // Deterministic per-frame fill color to help catch cross-talk in tests/debugging.
        let id = frame_id.0;
        let url_hash = match self.navigation.as_ref() {
          Some(IframeNavigation::Url(url)) => Self::url_hash(url),
          Some(IframeNavigation::AboutBlank) => Self::url_hash("about:blank"),
          Some(IframeNavigation::Srcdoc { content_hash }) => {
            let folded = (*content_hash as u32) ^ ((*content_hash >> 32) as u32);
            Self::url_hash("about:srcdoc") ^ folded
          }
          None => 0,
        };
        let r = (id as u8) ^ (url_hash as u8);
        let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
        let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
        let a = 0xFF;

        let mut rgba8 = vec![0u8; len];
        for px in rgba8.chunks_exact_mut(4) {
          px[0] = r;
          px[1] = g;
          px[2] = b;
          px[3] = a;
        }

        Ok(FrameBuffer {
          width,
          height,
          rgba8,
        })
      }
    }
    ```
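
    To make the quoted snippet easier to poke at, here is a minimal standalone sketch of the same size check and fill-color derivation. It relies on stand-ins that are not from the repository: `placeholder_len`, `fill_color`, and `fill_rgba8` are hypothetical free functions, the frame id and URL hash are assumed to be plain `u32` values, and the size cap is passed in as `max_bytes` rather than read from the real `MAX_FRAME_BYTES`.

    ```rust
    /// Byte length of a width x height RGBA8 buffer, or an error if the size
    /// overflows `usize` or exceeds `max_bytes` (a stand-in for MAX_FRAME_BYTES).
    pub fn placeholder_len(width: u32, height: u32, max_bytes: usize) -> Result<usize, String> {
        let len = (width as usize)
            .checked_mul(height as usize)
            .and_then(|px| px.checked_mul(4))
            .ok_or_else(|| "viewport size overflow".to_string())?;
        if len > max_bytes {
            return Err(format!(
                "requested frame buffer too large: {width}x{height} => {len} bytes"
            ));
        }
        Ok(len)
    }

    /// XOR the low three bytes of the frame id with those of the URL hash;
    /// alpha is always fully opaque.
    pub fn fill_color(id: u32, url_hash: u32) -> [u8; 4] {
        let r = (id as u8) ^ (url_hash as u8);
        let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
        let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
        [r, g, b, 0xFF]
    }

    /// Fill a buffer of `len` bytes with one repeated RGBA pixel.
    pub fn fill_rgba8(len: usize, color: [u8; 4]) -> Vec<u8> {
        let mut rgba8 = vec![0u8; len];
        for px in rgba8.chunks_exact_mut(4) {
            px.copy_from_slice(&color);
        }
        rgba8
    }
    ```

    Under these assumptions, `fill_color(0x010203, 0)` gives `[0x03, 0x02, 0x01, 0xFF]`, and every pixel in the buffer gets that one color. So, as quoted, the function appears to return a solid-color frame rather than any rendered content.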

    What is it doing in these diffs?

    https://github.com/wilsonzlin/fastrender/commit/f4a0974594e3...

    I'd be really curious to see the amount of work/rework over time, and the token/time cost for each additional actual completed test case.

  • ZitchDog 1 hour ago
    I used similar techniques to build tjs [1] - the world's fastest and most accurate JSON Schema validator, with magical TypeScript types. I learned a lot about autonomous programming. I found a similar "planner/delegate" pattern to work really well, with the use of git subtrees to fan out work [2].

    I think any large piece of software with well established standards and test suites will be able to be quickly rewritten and optimized by coding agents.

    [1] https://github.com/sberan/tjs

    [2] /spawn-perf-agents claude command: https://github.com/sberan/tjs/blob/main/.claude/commands/spa...

  • tired_and_awake 15 minutes ago
    The moment all code is interacted with through agents, I cease to care about code quality. The only things that matter are the quality of the product, the cost of maintenance, etc.: exactly the things we measure software development orgs against. It could be handy to have these projects deployed to demonstrate their utility and efficacy. Looking at agents' PRs feels wrong-headed: who cares if agents' code is hard to read if agents are managing the codebase?
  • trjordan 1 hour ago
    This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.

    The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex projects.

    But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase ... but it's not merged yet.

    I don't want to say, call me when it's merged. But I'm not worried about agents' ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and developers who want to build on top of it.

    • risyachka 4 minutes ago
      >> why haven't they merged that PR.

      Because it is absolutely impossible to review that code, and there are a gazillion issues in it.

      The only way it can get merged is to YOLO it and then fix issues in prod for months, which kinda defeats the purpose and brings the gains close to zero.

    • dist-epoch 54 minutes ago
      Pretty much everything exists in the training sets. All non-research software is just a mishmash of various standard modules and algorithms.
      • galaxyLogic 23 minutes ago
        Not everything, only code-bases of existing (open-source?) applications.

        But what would be the point of re-creating existing applications? It would be useful if you can produce a better version of those applications. But the point in this experiment was to produce something "from scratch" I think. Impressive yes, but is it useful?

        A more practically useful task would be for Mozilla Foundation and others to ask AI to fix all bugs in their application(s). And perhaps they are trying to do that, let's wait and see.

  • embedding-shape 45 minutes ago
    Did anyone manage to run the tests from the repository itself? The code seems filled with errors and warnings, and as far as I can tell none of them are caused by the platform I'm on (Linux). I went and looked through a few pages of the Actions workflow history, and it seems CI has been failing for a while; the PRs have all been failing CI too, but were merged anyway. How exactly was this verified as something to be used as a successful example, or am I misunderstanding the point they are trying to make? They mention a screenshot, but they never actually say whether their goal was successfully met, do they?

    I'm not sure the approach of "completely autonomous coding" is the right way to go. I feel like we'll be able to use these agents more effectively if we think of them as tools a human uses to accomplish something, and lean into letting the human drive, because quality spirals out of control so quickly.

  • jphoward 1 hour ago
    For the browser it built, the context needed to cover the entire project is obviously huge. They mention loads of parallel agents in the blog post, so I guess each agent is given a module to work on, and some tests? And then a 'manager' agent plugs this in without reading the code? I can't see how you could do this otherwise, even with ChatGPT 5.2/Gemini 3. In retrospect it seems an obvious approach, akin to how humans work in teams, but it's still interesting.
    • simonw 1 hour ago
      GPT-5.2-Codex has a 400,000 token window. Claude 4.5 Opus is half of that, 200,000 tokens.

      It turns out to matter a whole lot less than you would expect. Coding Agents are really good at using grep and writing out plans to files, which means they can operate successfully against way more code than fits in their context at a single time.
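
      As a rough illustration of the grep point, here is a minimal sketch of a search helper that returns only matching lines, capped at a fixed byte budget, so a caller never has to load whole files. Everything in it is hypothetical (the `grep_snippets` name, the in-memory `(path, contents)` pairs, the `budget_bytes` parameter); it is not how any particular agent is implemented.

      ```rust
      /// Return "path:line_no: text" hits for `needle`, stopping as soon as
      /// adding another hit would exceed `budget_bytes`.
      pub fn grep_snippets(files: &[(&str, &str)], needle: &str, budget_bytes: usize) -> Vec<String> {
          let mut hits = Vec::new();
          let mut used = 0usize;
          for (path, contents) in files {
              for (idx, line) in contents.lines().enumerate() {
                  if !line.contains(needle) {
                      continue;
                  }
                  // Line numbers are 1-based, as in grep -n.
                  let hit = format!("{}:{}: {}", path, idx + 1, line.trim());
                  if used + hit.len() > budget_bytes {
                      return hits; // budget exhausted: stop searching
                  }
                  used += hit.len();
                  hits.push(hit);
              }
          }
          hits
      }
      ```

      The cost of a call like this is bounded by `budget_bytes`, not by the size of the repository, which is one way to read the claim that window size matters less than expected.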

    • galaxyLogic 19 minutes ago
      > so I guess each agent is given a module to work on, and some tests?

      Who created those agents and gives them the tasks to work on? Who created the tests: AI, or the humans?

    • observationist 1 hour ago
      Get a good "project manager" agents.md and it changes the whole approach of vibe coding. For a professional environment, with each person given a little domain, arranged in the usual hierarchy of your coding team, truly amazing things can get done.

      Presumably the security and validation of code still need work. I haven't read anything that indicates those are solved yet, so people still need to read and understand the code, but we're at the "can do massive projects that work" stage.

      Division of labor, planning, and hierarchy are all rapidly advancing; the orchestration and coordination capabilities are going to explode in '26.

  • mccoyb 50 minutes ago
    Supposing agents and their organization improve, it seems like we’re approaching a point where the cost of a piece of software will be driven down to the cost of running the hardware, and the cost of the tokens required to replicate it.

    The tokens were “expensive” from the minds of humans …

    • Daishiman 32 minutes ago
      It will be driven down to the cost of having a good project and product manager effectively understanding what the customer wants, which has been the main barrier to excellent software for a good long time.
      • galaxyLogic 12 minutes ago
        And not only understanding what the customer wants, but communicating that unambiguously to the AI. And note who is the "customer" here? Is it the end-users, or is it a client-company which contracts the project-manager for this task? But then the issue is still there, who in the client-company decides exactly what is needed and what the (potential) users want?

        I think this situation emphasizes the importance of (something like) Agile. Producing something useful can only happen via experimentation, getting feedback from actual users, and iterating relentlessly.

  • sashank_1509 1 hour ago
    Can a browser expert please go through the code the agent wrote (skim it) and let us know how it is? Is it comparable to Ladybird or Servo, or can it reach that capability any time soon?
  • mk599 47 minutes ago
    Define "from scratch" in "building a web browser from scratch". This thing has over 100 crates as dependencies... To implement CSS layout, it uses Taffy, a crate used by existing browser implementations...
  • dist-epoch 56 minutes ago
    So, who is going to compile the browser and post the binaries so we can check it out? (in a sandbox/VM obviously)