I'd like to highlight a different part of the article:
> In general, when I talk to software folks about testing, I'm coming from such a different place that they immediately look at me like I'm an alien, so let's talk about how we tested at this hardware company I worked for, Centaur, which informs my biases about how I like to work. Some of the things that we did that were or are unorthodox in the software world are:
> - Hired dedicated QA / test engineers, with testing being a first-class career path on par with being a developer
> - No code review by default
> - Virtually no hand-written tests
> - Constant testing via what programmers sometimes called property based testing, randomized testing, fuzzing, etc., although we just called those tests (hand-written tests were called "hand tests").
> - Large regression test suite (3 months wall clock to execute on compute farm)
> - No unit tests
Anybody here tried that (or a similar) approach? Especially going all-in on property based testing and fuzzing with no unit tests.
I tried that approach somewhere before and the initial results were promising, but ran into political issues so the idea was canned.
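For anyone who hasn't seen the style being asked about, here is a minimal sketch of a randomized property check with no hand-written cases. It is not the Centaur setup from the article; the run-length codec, the function names, and the trial count are all made up for illustration, and it uses only the standard library rather than a dedicated property-testing framework.

```python
import random


def rle_encode(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string into (char, count) pairs."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)  # extend the current run
        else:
            out.append((ch, 1))  # start a new run
    return out


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Inverse of rle_encode."""
    return "".join(ch * n for ch, n in pairs)


def random_input(rng: random.Random, max_len: int = 50) -> str:
    # A two-letter alphabet makes long runs likely, which is where bugs hide.
    return "".join(rng.choice("ab") for _ in range(rng.randint(0, max_len)))


def check_properties(trials: int = 10_000, seed: int = 0) -> int:
    """Generate random inputs and assert properties instead of fixed examples."""
    rng = random.Random(seed)  # fixed seed so a failure can be replayed
    for _ in range(trials):
        s = random_input(rng)
        encoded = rle_encode(s)
        # Property 1: decoding the encoding gives back the input.
        assert rle_decode(encoded) == s, f"round trip failed for {s!r}"
        # Property 2: adjacent runs never share a character.
        assert all(a[0] != b[0] for a, b in zip(encoded, encoded[1:])), s
        # Property 3: run lengths add up to the input length.
        assert sum(n for _, n in encoded) == len(s), s
    return trials


if __name__ == "__main__":
    print(f"{check_properties()} random cases passed")
```

The properties describe what must hold for every input, so the generator can keep producing new cases on a compute farm for as long as you let it run.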
OP's alt text makes it clear that by "Galapagos Island" they mean Vancouver. I assumed that this was some sort of local nickname, but all of the references to "Galápagos of Canada" I could find are talking about Haida Gwaii instead.
Realizing he's just using it to mean a place remote from the AI bubble (Vancouver! What does that make all the other places that are not major tech hubs?) was a bummer.
Who cares about AI? I wanted to read about living in the Galapagos.
A lot of the crazy ideas seem to have melted away in the face of massive context sizes. Today, I can put roughly a megabyte of UTF-8 text into my system prompt before things start to get weird.
That is a massive amount of information even if we are being sloppy with it. You can read The Hobbit and the first Harry Potter book cover-to-cover and still have room to spare. I would deeply struggle to develop a world model this detailed for any business. Anything that needs to get more specific than these narratives can be a SQL query tool into the data warehouse, grep over the codebase, MS Graph API lookup, etc.
Giving the business a balanced way to collaborate over this one shared model of the world is a new challenge I am beginning to engage with. I've also noticed that the world model will compound on itself in terms of self-detection of update opportunities. The more constraints there are, the more likely it is that we appear to violate one.
> melted away in the face of massive context sizes
If only. There is a huge difference between "Gives good responses/can easily spot things within N context size" and "Technically works but sucks within N context size". Almost all models basically become cave-people once you go beyond 50% of the "supported" context size, meaning that while they may technically work with 1 million output tokens, those last 500k tokens are gonna be massively "dumber" than the first 500k tokens.
Fable changes the game yet again, because it's API-only.
You're not likely to want to run Fable in a loop any more than you want to take a bunch of dollar bills and light them on fire. Every invocation of Fable has to be intentional, its context carefully managed. I feel like a babysitter.
I make Claude, Codex and Gemini review each other's design plans and implementations. Each always found a lot of things the others missed...until Fable 5 came out. Whatever plan or code Fable 5 comes up with, it's now very hard for Codex and Gemini to find any serious hole in it.
Anthropic says the change is about capacity and is temporary. In its launch announcement on June 9, 2026, it says:
"After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can."
They can’t keep their current models working on subscriptions[1], so we’ll see if this is marketing or not in the future.
It’s smart to tease it no matter what, "insert classic first-hit-is-free drug reference".
I agree with you that you don't need Fable for everything, and you have to be careful about what you run it on. CRUD stuff, sure, even the small models can do it. But there certainly are tasks that are very much suited for the absolute SotA and you'd leave money on the table by not using it. And how much a task is worth is dependent on how much it improves your bottom line. So the cost/token becomes largely irrelevant.
Let's take this [1] benchmark. A bit more context here [2].
Here models are asked to create kernels for running inference on models. This is a benchmark perfectly suited and highly relevant right now. It's easily verifiable, an active area of research, and the results are immediately useful.
Say you have 1 unit of compute, it costs $300k and serves 1x users. In comes Fable and after one session it gives you a 30% speed-up on your 1 unit of compute. It can now serve 1.3x users. How much is that one session worth for you? How much is it worth for a company using 10 units? 100 units? How much is it worth for a hyper-scaler running 10,000 units? How much is it worth for a lab that trains the next frontier model and then serves it from 100,000 units? 30% is relative. And the cost for one session is really meaningless. It can cost $1M / session and it would still be worth it for someone.
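To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes (my assumption, not something the benchmark states) that a 30% speed-up is worth the same as buying 30% more hardware at $300k per unit, and it ignores power, hosting, and whether the speed-up actually carries over to your workload. The function names are made up for illustration.

```python
UNIT_COST_USD = 300_000       # price of one unit of compute, from the example above
SPEEDUP_PCT = 30              # speed-up from one session, from the example above
SESSION_COST_USD = 1_000_000  # the deliberately extreme session price


def capacity_value_usd(units: int) -> int:
    """Hardware you would otherwise have to buy to get the same extra capacity.

    Assumption: extra throughput is valued at hardware replacement cost.
    Integer arithmetic keeps the dollar figures exact.
    """
    return units * UNIT_COST_USD * SPEEDUP_PCT // 100


def break_even_units(session_cost_usd: int = SESSION_COST_USD) -> int:
    """Smallest fleet size at which one session pays for itself."""
    gain_per_unit_x100 = UNIT_COST_USD * SPEEDUP_PCT
    # Ceiling division using only integers.
    return -(-session_cost_usd * 100 // gain_per_unit_x100)


if __name__ == "__main__":
    for units in (1, 10, 100, 10_000, 100_000):
        print(f"{units:>7} units: ${capacity_value_usd(units):,} of equivalent hardware")
    print(f"break-even for a $1M session: {break_even_units()} units")
```

Under those assumptions one unit gains $90,000 of equivalent capacity, and a $1M session breaks even at 12 units, so anyone running more than a dozen units comes out ahead.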
> You're not likely to want to run Fable in a loop any more than you want to take a bunch of dollar bills and light them on fire. Every invocation of Fable has to be intentional, its context carefully managed.
Eh, that's just because it's the current frontier model. Give it a few weeks, and prices will drop.
API prices are the new normal. I doubt that prices will drop to the level of the subsidized subscriptions any time soon. Usage is growing exponentially but capacity cannot. There is no reason for them to waste their capacity on subscription users if they can sell that same capacity to API users.
Like with Uber and Lyft, the low prices were a fight for market share, but now that they have successfully captured that market share, the focus changes to balancing their books.
You should talk to https://www.mechanize.work/ for sponsorship/credits and about environments.
[1] https://status.claude.com
[1] - https://kernelbench.com/mega
[2] - https://x.com/elliotarledge/status/2072814573753975266
This blog is quite unreadable on 27"/32" monitors.