The finding that self-generated skills provide negative benefit (-1.3pp) while curated skills give +16.2pp is the most interesting result here. It's a clean empirical demonstration of something many of us have noticed anecdotally: LLMs are better consumers of procedural knowledge than producers of it.
This has real architectural implications. If you're building agent systems with skill/tool registries, the naive approach of "have the agent write its own skills as it learns" is a dead end. You need human-curated skills, which means you need a distribution and curation story — essentially a package manager for procedural knowledge.
The "2-3 focused modules beat comprehensive docs" finding also maps to what I've seen in practice. Agents choke on large reference documents not because of context limits but because they can't prioritize. A tight SKILL.md that says "here's the workflow, here are the three commands, here's the one gotcha" outperforms dumping the full docs every time.
The part I'd push back on: +4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.
Curious whether anyone's seen the "smaller model + skills ≈ larger model without" finding hold in production. The cost implications are significant if Sonnet + curated skill genuinely matches bare Opus on domain-specific tasks.
"Self-Generated Skills: No Skills provided, but the agent is prompted to generate relevant procedural knowledge before solving the task. This isolates the impact of LLMs’ latent domain knowledge"
This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.
I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.
It's even worse than this: the "tasks" that are evaluated are limited to a single markdown file of instructions, plus an opaque verifier (page 13-14). No problems involving existing codebases, refactors, or anything of the like, where the key constraint is that the "problem definition" in the broadest sense doesn't fit in context.
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First
Before attempting to solve this task, please follow these steps:
1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library,
API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be
reusable for similar tasks.
3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
4. Then solve the task using the skills you created as reference.
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.
It also seems they're not even restarting the session after skills are generated, from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.
So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.
I don't see how "create an abstraction before attempting to solve the problem" will ever work as a decent prompt when you are not even steering it towards specifics.
If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
Yeah some of my most useful AI tooling are skills created via a “role play session”. Basically brain dumping to the agent and telling it to ask questions and figure out how to accomplish a task, then distilling it into a skill at the end which is much tighter and evidence based from the actual problem solving session
> Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to
Just as of last week I had Claude build me a skill when I ask it to help me troubleshoot issues, and it came out quite good.
It did had some issues (Claude tends to o er specify over anecdotal data) but it's a strong step in the right direction.
Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.
I think there's ample room for self-generated skills when you use a rather long exchange on a domain you plan to revisit, _specially_ when it comes to telling Claude what not to do.
The point of so-called 'skills' is to be short how-to reminders that the agent can pull into its context and then act upon. If the knowledge is already in the model, it will most likely be surfaced in reasoning phase anyway, so there's little benefit to writing it up as a skill, unless perhaps it's extremely relevant and hard to surface, and you want the model to skip that part of the reasoning.
There is a benefit of a skill though. If an AI keeps encoding common tasks as skills and scripts, the LLM eventually just becomes a dumb routing mechanism for ambiguous user requests, which ultimately drives down token usage.
If everything you want an LLM do is already captured as code or simple skills, you can switch to dumber models which know enough about selecting the appropriate skill for a given user input, and not much else. You would only have to tap into more expensive heavy duty LLMs when you are trying to do something that hasn’t been done before.
Naturally, AI companies with vested interest in making sure you use as many tokens as possible will do everything they can to steer you away from this type of architecture. It’s a cache for LLM reasoning.
AI companies don't want you to waste tokens, they benefit when you use them efficiently because they can serve more users on the infra that's the main bottleneck for them. It's Jevons' paradox in action.
Yeah I care about LLM's generating skills after attempting tasks and learning lessons from those attempts, not before attempting a task for the first time. This result seems a little silly and detached from the reality of how skills are "auto-generated" in the real world.
I have a custom skill-creator skill that contains this:
> A common pitfall is for Claude to create skills and fill them up with generated information about how to complete a task. The problem with this is that the generated content is all content that's already inside Claude's probability space. Claude is effectively telling itself information that it already knows!
> Instead, Claude should strive to document in SKILL.md only information that:
> 1. Is outside of Claude's training data (information that Claude had to learn through research, experimentation, or experience)
> 2. Is context specific (something that Claude knows now, but won't know later after its context window is cleared)
> 3. Aligns future Claude with current Claude (information that will guide future Claude in acting how we want it to act)
> Claude should also avoid recording derived data. Lead a horse to water, don't teach it how to drink. If there's an easily available source that will tell Claude all it needs to know, point Claude at that source. If the information Claude needs can be trivially derived from information Claude already knows or has already been provided, don't provide the derived data.
Sincerely, perhaps you should publish on arxiv before a researcher reads it to run it and write a study.
It's fairly common we notice these types of threads where one thing is being postulated and then there's comments upon comments of doer's showing what they have done.
The AI world moves at a blistering pace. Academic publishing does not. In this particular case the "random dude on HN" is probably six to nine months ahead of the academic publication, not in the sense of being that much smarter but literally just being that much further progressed through time relative to the academic publication pipeline.
I only generate skills _after_ I've worked through a problem with the model - usually by asking it "what have you learned in this session?". I have no idea why people would think it can zero-shot a problem space without any guidance or actual experience...
There is almost no point in telling an agent to build a skill without augmenting it's knowledge on the thing it's writing about as you're just piping output to input without expanding the information in the system. If you get an agent to perform a bunch of research online, distil that down to information that the models tend not to get right or is newer than what is in their training data or simply better aligns with your desired workflow than what they generate out of the box - that's going to create a far more useful skill. I use a skill that gets activated when creating a skill to help guide this approach: https://github.com/sammcj/agentic-coding/blob/main/Skills/sk...
I find it useful for it to automatically update skills after trying them out in the wild. It can then improve the skills with real feedback. Seems to work well but I didn't do real research on it.
This has been my observation with self-generated docs as well.
I have seen some devs pull out absolutely bad guidance by introspecting the code with the LLM to define "best practices" and docs because it introduces its own encoded biases in there. The devs are so lazy that they can't be bothered to simply type the bullet points that define "good".
One example is that we had some extracted snippet for C#/.NET that was sprinkling in `ConfigureAwait(false)` which should not be in application code and generally not needed for ASP.NET. But the coding agent saw some code that looked like "library" code and decided to apply it and then someone ran the LLM against that and pulled out "best practices" and placed them into the repo and started to pollute the rest of the context.
I caught this when I found the code in a PR and then found the source and zeroed it out. We've also had to untangle some egregious use of `Task.Run` (again, not best practice in C# and you really want to know what you're doing with it).
At the end of it, we are building a new system that is meant to compose and serve curated, best practice guidance to coding agents to get better consistency and quality. The usage of self-generated skills and knowledge seems like those experiments where people feed in an image and ask the LLM to give back the image without changing it. After n cycles, it is invariably deeply mutated from the original.
Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript; each step adding more abstractions. The next abstraction is Markdown and I think that teams that invest their time in writing and curating markdown will create better guardrails within which agents can operate without sacrificing quality, security, performance, maintainability, and other non-functional aspects of software system.
> Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript; each step adding more abstractions.
I don't completely disagree (I've argued the same point myself). But one critical difference between the LLM layer and all of those others you listed, is that LLMs are non-deterministic and all those other layers are. I'm not sure how that changes the dynamic, but surely it does.
The LLM can be non-deterministic, but in the end, as long as we have compilers and integration tests, isn't it the same? You go from non-deterministic human interpretation of requirements and specs into a compiled, deterministic state machine. Now you have a non-deterministic coding agent doing the same and simply replacing the typing portion of that work.
So long as you supply the agent well-curated set of guidance, it should ultimately produce more consistent code with higher quality than if the same task were given to a team of random humans of varying skill and experience levels.
The key now is how much a team invests in writing the high quality guidance in the first place.
The unspoken truth is that tests were never meant to cover all aspects of a piece of software running and doing its thing, that's where the "human mind(s)" that had actually built the system and brought it to life was supposed to come in and add the real layer of veracity. In other words, "if it walks like a duck and quacks like duck" was never enough, no matter how much duck-related testing was in place.
The general rule seems to be, the more layers you automate with LLMs, the worse each successive layer gets. Piping LLM output as input into new LLM calls, you're already starting to notice how things fall apart and get lost quickly.
If you have the idea, more or less the implementation plan, let the LLM do the coding, you can end up with something maintainable and nice, it's basically up to you.
Strip away one layer, so you have the idea, but let the LLM come up with the implementation plan, then also the implementation, and things end up a lot less than ideal.
Remove another layer, let the LLM do it all, and it's all a mess.
It's like those sequences of images where we ask the LLM to reproduce the same image exactly, and we just get some kind of grotesque collapse after a few dozen iterations. The same happens with text and code. I call this "semantic collapse".
I conjecture that after some years of LLMs reading a SharePoint site, producing summaries, then summaries of those summaries, etc... We will end up with a grotesque slurry.
At some point, fresh human input is needed to inject something meaningful into the process.
People like to make the comparison between zip file compressions, where you can degrade something by continually compressing. Same with using jpeg or mp3. But I like to use the analogy of the game "Telephone" (also called "Chinese Whispers"). I think it also highlights how fraught natural language is and just how quickly it can degrade. I think a lot of people are insufficiently impressed with how good we are at communicating at all.
I think this principle applies only if you lack feedbacks. Yes, when you go through multiple layers of open loop control, you're going to get less precise at each layer. It's less clear that the situation is as dire if each level has metrics and can self-adjust to optimize its performance.
But these are inherently subjective things, what the "right idea" is, or the "right implementation" is all up in our head that we can try to verbalize, but I don't think you can come up with an objective score for it, ask 100 programmers you'll get 100 different answers what "clean design" is.
And that's why my whole schtick when it comes to agent design is that agents need to learn online, continuously, and in adapter space via some PEFT mechanism (I like soft prompts and prefix tuning), because it's really hard to ascend gradients in discrete domains like tokens.
The model knows damn well when it's written ugly code. You can just ask it. The problem is what the Greeks called "akrasia", a wonderfully precise word that refers to knowing that you should do X yet not doing X.
But I've been beating this drum for years and we keep playing around with trying to fine-tune via fine prose. At the very least, we should be making these skills/prompts/whatever using a principled approach like https://arxiv.org/abs/2502.07978, not... whatever people are vibing.
> The model knows damn well when it's written ugly code. You can just ask it.
That's not been my experience at all, what model and prompt would you use for that? Every single one I've tried is oblivious to if a design makes sense or not unless explicitly prompted for it with constraints, future ideas and so on.
The derivative of a LLM agent's capabilities (on its own) is negative. It's not that they can't do useful work -- it means that (for now) they require some level of input or steering.
If that were to change -- if an agent could consistently get better at what it does without intervention -- that would represent a true paradigm shift. An accelerating curve, rather than one trending back towards linearity.
This represents a necessary inflection point for any sort of AI "takeoff" scenario.
So this study is actually kind of important, even though it's a null result. Because the contra view would be immensely significant.
Isn't the title editorialised? Probably for clicks?
I think that most of the adoption around Agent Skills would have a focus on ease of use, standarization and context management and not correctness.
My own thoughts on how to approach skill building target people who are adopting LLM development now more than ever although this was definitely possible (in a non standard way before) [1]
"Less is best" is not a new realization. The concept exists across contexts. Music described as "overplayed". Prose described as verbose.
We just went through an era of compute that chanted "break down your monoliths". NPM ecosystem being lots of small little packages to compose together. Unix philosophy of small composable utilities is another example.
So models will improve as they are compressed, skeletonized down to opcodes, geometric models to render, including geometry for text as the bytecode patterns for such will provide the simplest model for recreating the most outputs. Compressing out useless semantics from the state of the machines operations and leaving the user to apply labels at the presentation layer.
When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge. Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make not mistakes'.
But that's what the paper's authors did!
When they say 'self-generated' they don't allow the model any tool access at all, not even web search.
It would be much more interesting if they had tested skills that were created in one of these ways:
A) The model interviews a human and then creates the skill, or
B) The model executes one or more deep research tasks in order to gather information, or
Despite skills being just a new form of memory and context engineering for an agent, I think the framework is still great for agents to self-develop, given a good prompt to regularly review their own sessions and pick learning points to save as skills. In fact, I think the "craft" of prompt engineering has been lost somewhat - I still enjoy puzzling out and iterating over the best possible starting prompt for a conversation to get the best result I can for a one-shot
FWIW I didn't read the paper and am judging it based on its title, which I think is fair because "self-generated agent skills" is a pretty loose definition.
Skills seem to be a crutch until we get continual learning. Imagine you've been running an instance for 6 months and it still remembers when you told it was running on your linux server over ssh and not on your Mac.
That's what I mean by continual learning, skills, memory are a crutch until real learning can happen, which could be weights changing in the local instance.
It seems intuitive that a naive self-generated Skill would be low-value, since the model already knows whatever it's telling itself.
However, I've found them to be useful for capturing instructions on how to use other tools (e.g. hints on how to use command-line tools or APIs). I treat them like mini CLAUDE.mds that are specific only to certain workflows.
When Claude isn't able to use a Skill well, I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.
With these Skills in place, the agent is able to do things it would really struggle with otherwise, having to consume a lot of tokens failing to use the tools and looking up documentation, etc.
> I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.
We are probably undervaluing the human part of the feedback loop in this discussion. Claude is able to solve the problem given the appropriate human feedback — many then jump to the conclusion that well, if Claude is capable of doing it under some circumstances, we just need to figure out how to remove the human part so that Claude can eventually figure it out itself.
Humans are still serving a very crucial role in disambiguation, and in centering the most salient information. We do this based on our situational context, which comes from hands-on knowledge of the problem space. I'm hesitant to assume that because Claude CAN bootstrap skills (which is damn impressive!), it would somehow eventually do so entirely on its own, devoid of any situational context beyond a natural language spec.
A pattern I use a lot is after working with the LLM on a problem, directing it, providing additional context and information, ask it to summarize its learning into a skill. Then the next session that has a similar theme can start with that knowledge.
I am lucky to count friends who are academics engaged in research, and one topic of discussion I notice around AI is researchers with a non-tech background and/or a lack of implementation / operationalization / commercialization in applying technology to Business, which can also cloud these kidns of results.
I have systemized and automated businesses for a long time before LLMs came out, which generally wasn't very popular.
It is really weird to see everyone get excited about this kind of automation and then try to jump to the end points with something that's non-deterministic and wonder why it doesn't work like every other computer they've used (all or none).
Agents can self generate skills, maybe not effortlessly, or with psychic skills of reading between the lines (special exception for Claude), it's also about the framework and scaffolding in which to create skills that work, and what can be brought back to the "self-generation".
Without experience in creating computer skills in general, attempts for self-generating agent skills is kind of trying to use AI to autocomplete a sentence and then not like how it went. To a fair degree it can be lined up to improve considerably.
Right now there seems to be a 6-12 month lag between studies like these and it being shared/reported in the wild.
Too often, they are researching something reported in the wild and trying to study it, and it very well may work for some cases, but not all cases, and the research kind of entirely misses it.
With AI, it's incredibly important to follow show and not tell.
Sharing this from genuine curiousity if this resonates with anyone, and if so, how/where.
This has real architectural implications. If you're building agent systems with skill/tool registries, the naive approach of "have the agent write its own skills as it learns" is a dead end. You need human-curated skills, which means you need a distribution and curation story — essentially a package manager for procedural knowledge. The "2-3 focused modules beat comprehensive docs" finding also maps to what I've seen in practice. Agents choke on large reference documents not because of context limits but because they can't prioritize. A tight SKILL.md that says "here's the workflow, here are the three commands, here's the one gotcha" outperforms dumping the full docs every time.
The part I'd push back on: +4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.
Curious whether anyone's seen the "smaller model + skills ≈ larger model without" finding hold in production. The cost implications are significant if Sonnet + curated skill genuinely matches bare Opus on domain-specific tasks.
To me, author reads like an articulate native English speaker, but typing on their phone.
This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.
I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First Before attempting to solve this task, please follow these steps: 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed. 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks. 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name. 4. Then solve the task using the skills you created as reference.
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.
It also seems they're not even restarting the session after skills are generated, from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.
So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.
If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
LLMs are not mind readers.
Just as of last week I had Claude build me a skill when I ask it to help me troubleshoot issues, and it came out quite good.
It did had some issues (Claude tends to o er specify over anecdotal data) but it's a strong step in the right direction.
Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.
I think there's ample room for self-generated skills when you use a rather long exchange on a domain you plan to revisit, _specially_ when it comes to telling Claude what not to do.
If everything you want an LLM do is already captured as code or simple skills, you can switch to dumber models which know enough about selecting the appropriate skill for a given user input, and not much else. You would only have to tap into more expensive heavy duty LLMs when you are trying to do something that hasn’t been done before.
Naturally, AI companies with vested interest in making sure you use as many tokens as possible will do everything they can to steer you away from this type of architecture. It’s a cache for LLM reasoning.
> A common pitfall is for Claude to create skills and fill them up with generated information about how to complete a task. The problem with this is that the generated content is all content that's already inside Claude's probability space. Claude is effectively telling itself information that it already knows!
> Instead, Claude should strive to document in SKILL.md only information that:
> 1. Is outside of Claude's training data (information that Claude had to learn through research, experimentation, or experience) > 2. Is context specific (something that Claude knows now, but won't know later after its context window is cleared) > 3. Aligns future Claude with current Claude (information that will guide future Claude in acting how we want it to act)
> Claude should also avoid recording derived data. Lead a horse to water, don't teach it how to drink. If there's an easily available source that will tell Claude all it needs to know, point Claude at that source. If the information Claude needs can be trivially derived from information Claude already knows or has already been provided, don't provide the derived data.
For those interested the full skill is here: https://github.com/j-r-beckett/SpeedReader/blob/main/.claude...
It's fairly common we notice these types of threads where one thing is being postulated and then there's comments upon comments of doer's showing what they have done.
I have seen some devs pull out absolutely bad guidance by introspecting the code with the LLM to define "best practices" and docs because it introduces its own encoded biases in there. The devs are so lazy that they can't be bothered to simply type the bullet points that define "good".
One example is that we had some extracted snippet for C#/.NET that was sprinkling in `ConfigureAwait(false)` which should not be in application code and generally not needed for ASP.NET. But the coding agent saw some code that looked like "library" code and decided to apply it and then someone ran the LLM against that and pulled out "best practices" and placed them into the repo and started to pollute the rest of the context.
I caught this when I found the code in a PR and then found the source and zeroed it out. We've also had to untangle some egregious use of `Task.Run` (again, not best practice in C# and you really want to know what you're doing with it).
At the end of it, we are building a new system that is meant to compose and serve curated, best practice guidance to coding agents to get better consistency and quality. The usage of self-generated skills and knowledge seems like those experiments where people feed in an image and ask the LLM to give back the image without changing it. After n cycles, it is invariably deeply mutated from the original.
Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript; each step adding more abstractions. The next abstraction is Markdown and I think that teams that invest their time in writing and curating markdown will create better guardrails within which agents can operate without sacrificing quality, security, performance, maintainability, and other non-functional aspects of software system.
I don't completely disagree (I've argued the same point myself). But one critical difference between the LLM layer and all of those others you listed, is that LLMs are non-deterministic and all those other layers are. I'm not sure how that changes the dynamic, but surely it does.
So long as you supply the agent well-curated set of guidance, it should ultimately produce more consistent code with higher quality than if the same task were given to a team of random humans of varying skill and experience levels.
The key now is how much a team invests in writing the high quality guidance in the first place.
If you have the idea, more or less the implementation plan, let the LLM do the coding, you can end up with something maintainable and nice, it's basically up to you.
Strip away one layer, so you have the idea, but let the LLM come up with the implementation plan, then also the implementation, and things end up a lot less than ideal.
Remove another layer, let the LLM do it all, and it's all a mess.
I conjecture that after some years of LLMs reading a SharePoint site, producing summaries, then summaries of those summaries, etc... We will end up with a grotesque slurry.
At some point, fresh human input is needed to inject something meaningful into the process.
Reading this on HN... Sic transit gloria mundi!
The model knows damn well when it's written ugly code. You can just ask it. The problem is what the Greeks called "akrasia", a wonderfully precise word that refers to knowing that you should do X yet not doing X.
But I've been beating this drum for years and we keep playing around with trying to fine-tune via fine prose. At the very least, we should be making these skills/prompts/whatever using a principled approach like https://arxiv.org/abs/2502.07978, not... whatever people are vibing.
That's not been my experience at all, what model and prompt would you use for that? Every single one I've tried is oblivious to if a design makes sense or not unless explicitly prompted for it with constraints, future ideas and so on.
The derivative of a LLM agent's capabilities (on its own) is negative. It's not that they can't do useful work -- it means that (for now) they require some level of input or steering.
If that were to change -- if an agent could consistently get better at what it does without intervention -- that would represent a true paradigm shift. An accelerating curve, rather than one trending back towards linearity.
This represents a necessary inflection point for any sort of AI "takeoff" scenario.
So this study is actually kind of important, even though it's a null result. Because the contra view would be immensely significant.
I think that most of the adoption around Agent Skills would have a focus on ease of use, standarization and context management and not correctness.
My own thoughts on how to approach skill building target people who are adopting LLM development now more than ever although this was definitely possible (in a non standard way before) [1]
[1] https://alexhans.github.io/posts/series/evals/building-agent...
This was realized in 2023 already: https://newsletter.semianalysis.com/p/google-we-have-no-moat...
"Less is best" is not a new realization. The concept exists across contexts. Music described as "overplayed". Prose described as verbose.
We just went through an era of compute that chanted "break down your monoliths". NPM ecosystem being lots of small little packages to compose together. Unix philosophy of small composable utilities is another example.
So models will improve as they are compressed, skeletonized down to opcodes, geometric models to render, including geometry for text as the bytecode patterns for such will provide the simplest model for recreating the most outputs. Compressing out useless semantics from the state of the machines operations and leaving the user to apply labels at the presentation layer.
When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge. Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make not mistakes'.
But that's what the paper's authors did!
When they say 'self-generated' they don't allow the model any tool access at all, not even web search.
It would be much more interesting if they had tested skills that were created in one of these ways:
A) The model interviews a human and then creates the skill, or
B) The model executes one or more deep research tasks in order to gather information, or
C) Some combo of the above.
FWIW I didn't read the paper and am judging it based on its title, which I think is fair because "self-generated agent skills" is a pretty loose definition.
Not even sure how you envision continuous learning, but if you mean model updates, I'm not sure the economics work out
What Ai's get is a cheat sheet for the session
However, I've found them to be useful for capturing instructions on how to use other tools (e.g. hints on how to use command-line tools or APIs). I treat them like mini CLAUDE.mds that are specific only to certain workflows.
When Claude isn't able to use a Skill well, I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.
With these Skills in place, the agent is able to do things it would really struggle with otherwise, having to consume a lot of tokens failing to use the tools and looking up documentation, etc.
We are probably undervaluing the human part of the feedback loop in this discussion. Claude is able to solve the problem given the appropriate human feedback — many then jump to the conclusion that well, if Claude is capable of doing it under some circumstances, we just need to figure out how to remove the human part so that Claude can eventually figure it out itself.
Humans are still serving a very crucial role in disambiguation, and in centering the most salient information. We do this based on our situational context, which comes from hands-on knowledge of the problem space. I'm hesitant to assume that because Claude CAN bootstrap skills (which is damn impressive!), it would somehow eventually do so entirely on its own, devoid of any situational context beyond a natural language spec.
1. You MUST review and correct them
2. Embrace minimalism, they are spark notes and an index, not comprehensive
3. Force them into context
I imagine similar concepts hold for skills
I have systemized and automated businesses for a long time before LLMs came out, which generally wasn't very popular.
It is really weird to see everyone get excited about this kind of automation and then try to jump to the end points with something that's non-deterministic and wonder why it doesn't work like every other computer they've used (all or none).
Agents can self generate skills, maybe not effortlessly, or with psychic skills of reading between the lines (special exception for Claude), it's also about the framework and scaffolding in which to create skills that work, and what can be brought back to the "self-generation".
Without experience in creating computer skills in general, attempts for self-generating agent skills is kind of trying to use AI to autocomplete a sentence and then not like how it went. To a fair degree it can be lined up to improve considerably.
Right now there seems to be a 6-12 month lag between studies like these and it being shared/reported in the wild.
Too often, they are researching something reported in the wild and trying to study it, and it very well may work for some cases, but not all cases, and the research kind of entirely misses it.
With AI, it's incredibly important to follow show and not tell.
Sharing this from genuine curiousity if this resonates with anyone, and if so, how/where.