I don't think it's particularly effective to create a new coding agent when there's existing open-source agents (especially extremely extensible ones like Pi) that already optimize for cache hits, have far larger communities, and work for providers other than Deepseek.
I specifically use multiple different models and providers, so this wouldn't be useful for me.
And it contributes to the problem of each person vibe-coding their own, incompatible, half-baked tool in a space, instead of contributing to a small set of tools and expanding them.
I'm not sure you need a "DeepSeek native coding agent" to take advantage of DeepSeeks cache, yesterday as the Codex quota usage issue still wasn't solved for me, I wrote a tiny little bridge so I could use DeepSeek V4 Pro via Codex, and seems most of everything I did was basically cached as far as I can tell: https://i.imgur.com/7eKn6wN.png (2026-05-23 Input (Cache hit): 39,123,200 tokens, Input (Cache miss) 1,692,286), and the bridge is doing not special, just massage the DeepSeek API shape into what Codex expects, nothing particular about caching at all.
Besides being even better at the caching, I'm not sure what benefits you'd get compared to just firing up OpenCode with the DeepSeek API yourself, it'll similarly do caching for sure and also "talks directly to api.deepseek.com" if that matters, and you'll get a much more mature harness.
I can't confirm this. Having utilized Opencode for a large project over the past 10 months, with multiple models and agents, we've never run into such 'cache stability issues'."
> Ah, reminds me of good old "There are only 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
You quip, but LLM KV caching (from the harness side) is quite easy: You get a cache hit on stable prompt prefixes, period. That means you want to keep the prefix stable, and only append at the end of the conversation.
Made up example: Don't put the git branch name into the system prompt part (that comes first), as whenever the branch name changes, that'd trigger a cache invalidation of the entire prompt.
Getting this right requires some care to not by accident modify the prefix, basically, and some design on communicating the things that can change (user configuration, working dir, git information, ...).
> I wrote a tiny little bridge so I could use DeepSeek V4 Pro via Codex
Can you share the bridge. DeepSeek v4 is awesome paired with claude-code or opencode. I found that claude code costs me less than opencode and I am presuming this is due to a better engineered harness.
Sure, keep in mind it's a steaming pile of hacked together hacks, probably won't work in every case, doesn't support every feature that should be supported (like parallel tool calling, both Codex + DeepSeek API support it), and it might make your computer catch on fire: https://gist.github.com/embedding-shapes/eab3e63e5a95d3d78a2...
I only used it for a few hours to play around with stuff before the quota issue was fixed and I could resume using GPT models, and the bridge was coded by DeepSeek-V4-Flash-IQ2XXS + DwarfStar4 locally, I take no responsibility for what might happen with your computer or you, during usage or just reading the code.
Edit: heh, like don't look at line 117 for example where seemingly it likes to handle misspellings in the .env file which totally wasn't my fault for typo'ing the API key in that file... I'm sure there are tons of sharp edges and dumb stuff in there.
Not everyone is working with state secrets or user personal data (or even more closely guarded, company secrets) on a daily basis, most of what I hack on is either FOSS already, or will be, not much to keep secret here.
Obviously, if you do deal with any sort of secrets, then using local LLMs over OpenAI, Anthropic, DeepSeek or whoever is obviously preferred, and in the case of personal data of users, probably a requirement.
You’re not a novice, there are a lot of us who know exactly what we are doing and see this as a huge downside. We are just being told to go faster, faster, faster lest we miss out on… something?
there's laws on the books in China that says that every company operating in China must aid and abet the Chinese government in espionage against the rest of the world. given those facts, I find it deeply troubling to be using anything coming out of China, especially a program that runs in the context of a Linux terminal on a machine that might have something important on it. I'd argue it's a back door waiting to happen, if not sooner than obviously later.
this appears to be native to the terminal, as in, there's no special application that runs or wraps an agent inside a tui. So basically instead of commands you type plain english?
> this appears to be native to the terminal, as in, there's no special application that runs or wraps an agent inside a tui
Same with codex? codex-rs at least, is a TUI as well, it does run a "app-server" in the background, that the TUI actually interacts with, but that's just an implementation detail. Also makes it easy to hook in your own programs to fire of codex "headless" sessions even without the TUI.
Not a fan of that page. The animated typing and resulting continuous resize of the example keeps moving the content beneath it down and up. Such bad UX.
Agents or no agents, people still need to test their websites on different resolutions or at least window width, but seems this is becoming a lost art.
> The loop is append-only, engineered around DeepSeek's byte-stable prefix cache — long sessions hold 90%+ cache hit and input-token cost collapses to ~1/5. Terminal-first, leave it running.
This is how all AI coding harnesses work, isn't it?
The author claims (in another AI-written post):
> LangChain — along with every generic agent framework I checked — rebuilds the prompt every turn. Timestamps get injected. History gets reordered. Tool schemas re-serialize with different whitespace.
I haven't touched LangChain in a long, long time, but don't think any of the current harnesses, Claude Code, Pi, Crush, OpenCode etc do any of that. Keeping the context stable for caching is a very basic principle and not a wild innovation. Also curious why DeepSeek would be different? All providers behave like this.
It's pretty funny, i'm a $200/m Claude subscriber and i've had little need to use anything else. However the more Claude has been restricting my workflow (notably around the recent IDE/-p usage change) the more i've been wanting to go elsehwere.
I'm concerned since i really want SOTA reasoning, but DeepSeek still has me interested.
>Can I point it at a self-hosted / private DeepSeek endpoint?
>Yes. Since 0.30 we accept non-standard key prefixes for self-hosted DeepSeek endpoints. Just point `baseUrl` at your internal address — the loop, cache strategy, and tool protocol are unchanged.
But my question is:
If I use Reasonix to talk to a deepseek endpoint through openrouter, am I still getting the cache-hit benifits of this agent harness?
Yes*. At least from my limited usage of deepseek-flash for a few billion tokens on openrouter, the cache-hit rate is >95%. And I simply used the claude code harness pointed at the openrouter anthropic compatible endpoint with no fluff.
> Hats off to the deekseek team for creating a great product
I have been using it for a while, and I wholeheartedly agree. imo, it is as good as codex or claude which I also use. It is a winner in the cost-sensitive tier, and if some startup could put it together with data-retention in mind, it could be a great product sold to the enterprise, as data-retention and privacy are the main issues for the coding-assistant usecase.
Deepseek v4 pro is definitely my preferred cheap model, it's very good, and I use it all the time for my personal projects (opencode go plan), but I also use Claude Opus all the time at work and Deepseek is not as good as that, but it does compete with Sonnet for capability, and beats it on price.
How can you have cache hit efficiency? Isn't it just a matter of not changing the previous context? I don't understand what knobs there are to tweak on this.
> Isn't it just a matter of not changing the previous context?
Yes, but a lot of harnesses change previous context. E.g. the system prompt injects the current time/date, working directory, files in the working directory, etc. Compaction also changes the whole previous context. I _think_ changing the list of tools also invalidates cache, so invoking a subagent with different tools would invalidate the cache.
My vague impression is that it's in a similar vein to functional programming languages. It generally disallows doing things that lead to bugs (cache misses in this case), and presumably allows you to do those things in a way that makes it much clearer that this is likely to cause cache misses. I would guess that in this paradigm, you don't mutate your existing session, you derive a new session by mutating the prior context into a new context.
Cache is always there, it’s just that it only caches up to the point where an input token changes. So if the tools list is early in the prompt, changing it would limit cache for most of the prompt. If the tools list is the last thing, you could still get 99% cache hits even if it changes every turn.
Click on the download page, it's hilarious. It has a lot of information about the "smart probe" on the download and it's a realtime probe you can rerun.
That's the pinnacle of AI slop over engineered garbage in my opinion. All of that information is noise.
Say you put the current time down to the second in the system prompt, which is the message that goes in front of the entire conversation, then basically nothing will be cached, every agent turn needs to ingest the entire session over and over. Contrast to not doing that, and the backend can leverage caching all the way up to the latest message, as nothing until then changed.
Probably not that exactly, but there is a tradeoff between effectiveness of the prompt and cache hit rate. If putting the user’s datetime in the middle of the prompt scores higher on evals but worsens cache hits, versus at the end of the prompt where it’s cache friendly but may not be as effective, what do you do?
This is still art as much as science and the different harnesses take different approaches.
Obviously not, most agents properly keep previous messages unchanged, at least the major ones I've been digging into the source off. Also, everything would get so much slower, that even developers creating their own agents would notice quickly how much slower theirs is, if they fuck this up.
I don't think any the agents breaks caching on every turn, but they might do things like current list of files, or available tools depending upon plan/build mode... or lots of other things that breaks caching multiple times during a session.
Is this really the behavior you want? Yes, doing tool-result clearing and such will blow your cache, but if you do it only occasionally, it's still likely a win. Yes, cache hits are good, but not so good that it's okay to be profligate with context to preserve those precious, precious KVs.
What AI model did you use for the website design? This is the second one I see with the exact same font and color scheme. Just curious because Claude models lean towards purples for example. Thank you!
This design still screams Claude to me, but a newer version than what you're thinking of. At some point they added a markdown file that tells it to use obviously AI designs like lots of blue/purple and gradients. Since then, this is its new style.
In my experience, it is claude-code paired with deepseek-v4. For penny-pinchers like me, I can have long coding sessions with it with no anxiety about the cost. Also, prompting it to what you want and verifying the outputs is more important than the quality of the model. So, I am better off with a cheaper model and taking the responsibility for prompting it and verifying the results.
Although I have little interest in agentic coding, when I do use it, I have found Kimi K2.6 to give Opus-quality output, and have switched entirely to it for pretty much everything.
I've used Opus extensively and tried K2.6 on a few projects, and the gap is huge. K2.6 is nowhere near the performance of Opus. That's fine because it's also far cheaper, but public benchmarks line up with my own personal experience that they aren't comparable in terms of intelligence.
(that is, different places on the Pareto efficiency graph)
I've gone through ~600m tokens in Xiaomi Mimo though Claude, and it's been the most effective use of an agent I've had yet. It's very capable, but generally not ambitious, picking simple but effective solutions to most problems I give it.
Going to write something longer about the experience when I get to a billion tokens.
I'd generally agree about Deepseek being as good as Sonnet - but I have extreme trouble with prompt compliance with V4 Pro in a way that I've never had with Sonnet. I'll tell it "find the bug, but don't fix it" or "please use this tool I just developed" and it'll ignore me a high fraction of the time.
It's bad enough that I'm working on guardrails at the harness level because prompting appears to be useless.
I have Opus make a fairly detailed plan, then Deepseek implements, and GPT reviews. With that setup, I have zero issues, probably because what you mention is handled (the plan keeps it on track and the reviewer catches any issues).
Now that you mention it, though, I have seen it do a few things that weren't in the plan. The reviewer caught them, though, so they didn't cause a problem, and it's so cheap that overall it's a massive improvement.
I specifically use multiple different models and providers, so this wouldn't be useful for me.
And it contributes to the problem of each person vibe-coding their own, incompatible, half-baked tool in a space, instead of contributing to a small set of tools and expanding them.
It'd be better to just extend an existing tool.
Besides being even better at the caching, I'm not sure what benefits you'd get compared to just firing up OpenCode with the DeepSeek API yourself, it'll similarly do caching for sure and also "talks directly to api.deepseek.com" if that matters, and you'll get a much more mature harness.
Ah, reminds me of good old "There are only 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors."
You quip, but LLM KV caching (from the harness side) is quite easy: You get a cache hit on stable prompt prefixes, period. That means you want to keep the prefix stable, and only append at the end of the conversation. Made up example: Don't put the git branch name into the system prompt part (that comes first), as whenever the branch name changes, that'd trigger a cache invalidation of the entire prompt.
Getting this right requires some care to not by accident modify the prefix, basically, and some design on communicating the things that can change (user configuration, working dir, git information, ...).
Can you share the bridge. DeepSeek v4 is awesome paired with claude-code or opencode. I found that claude code costs me less than opencode and I am presuming this is due to a better engineered harness.
I only used it for a few hours to play around with stuff before the quota issue was fixed and I could resume using GPT models, and the bridge was coded by DeepSeek-V4-Flash-IQ2XXS + DwarfStar4 locally, I take no responsibility for what might happen with your computer or you, during usage or just reading the code.
Edit: heh, like don't look at line 117 for example where seemingly it likes to handle misspellings in the .env file which totally wasn't my fault for typo'ing the API key in that file... I'm sure there are tons of sharp edges and dumb stuff in there.
Obviously, if you do deal with any sort of secrets, then using local LLMs over OpenAI, Anthropic, DeepSeek or whoever is obviously preferred, and in the case of personal data of users, probably a requirement.
Same with codex? codex-rs at least, is a TUI as well, it does run a "app-server" in the background, that the TUI actually interacts with, but that's just an implementation detail. Also makes it easy to hook in your own programs to fire of codex "headless" sessions even without the TUI.
This is how all AI coding harnesses work, isn't it?
The author claims (in another AI-written post):
> LangChain — along with every generic agent framework I checked — rebuilds the prompt every turn. Timestamps get injected. History gets reordered. Tool schemas re-serialize with different whitespace.
I haven't touched LangChain in a long, long time, but don't think any of the current harnesses, Claude Code, Pi, Crush, OpenCode etc do any of that. Keeping the context stable for caching is a very basic principle and not a wild innovation. Also curious why DeepSeek would be different? All providers behave like this.
I'm concerned since i really want SOTA reasoning, but DeepSeek still has me interested.
From the FAQ, I see:
>Can I point it at a self-hosted / private DeepSeek endpoint?
>Yes. Since 0.30 we accept non-standard key prefixes for self-hosted DeepSeek endpoints. Just point `baseUrl` at your internal address — the loop, cache strategy, and tool protocol are unchanged.
But my question is: If I use Reasonix to talk to a deepseek endpoint through openrouter, am I still getting the cache-hit benifits of this agent harness?
I have been using it for a while, and I wholeheartedly agree. imo, it is as good as codex or claude which I also use. It is a winner in the cost-sensitive tier, and if some startup could put it together with data-retention in mind, it could be a great product sold to the enterprise, as data-retention and privacy are the main issues for the coding-assistant usecase.
> Independent open-source project · not affiliated with DeepSeek
Yes, but a lot of harnesses change previous context. E.g. the system prompt injects the current time/date, working directory, files in the working directory, etc. Compaction also changes the whole previous context. I _think_ changing the list of tools also invalidates cache, so invoking a subagent with different tools would invalidate the cache.
My vague impression is that it's in a similar vein to functional programming languages. It generally disallows doing things that lead to bugs (cache misses in this case), and presumably allows you to do those things in a way that makes it much clearer that this is likely to cause cache misses. I would guess that in this paradigm, you don't mutate your existing session, you derive a new session by mutating the prior context into a new context.
Is this improving the cache hit and hence overall efficiency of coding workflows?
Does it also let me host a local llm (deepseek)? What are model min requirements for this?
That's the pinnacle of AI slop over engineered garbage in my opinion. All of that information is noise.
This is still art as much as science and the different harnesses take different approaches.
Is this really the behavior you want? Yes, doing tool-result clearing and such will blow your cache, but if you do it only occasionally, it's still likely a win. Yes, cache hits are good, but not so good that it's okay to be profligate with context to preserve those precious, precious KVs.
(that is, different places on the Pareto efficiency graph)
https://mimo.mi.com
https://news.ycombinator.com/item?id=48237663
It's bad enough that I'm working on guardrails at the harness level because prompting appears to be useless.
Do you have the same issue?
Now that you mention it, though, I have seen it do a few things that weren't in the plan. The reviewer caught them, though, so they didn't cause a problem, and it's so cheap that overall it's a massive improvement.