One Markdown File Made My AI Agent 23 Points Smarter

Last week I read a paper that made me re-evaluate everything I have written about AI agent optimization. Microsoft and three Chinese universities published a method called SkillOpt. The result: a single Markdown file, between 300 and 2,000 tokens, lifted GPT-5.5 by an average of 23 points across six procedural benchmarks. No fine-tuning. No new model. No extra tools. Just a Markdown file that gets fed to the agent as context at inference time.

The skill beats handwritten instructions, one-shot LLM-generated instructions, and four specialized training methods (Trace2Skill, TextGrad, GEPA, EvoSkill). It works in Codex. It works in Claude Code. It transfers across model sizes. A spreadsheet skill trained in the Codex loop lifts Claude Code to the same level as a skill trained directly in Claude Code.

After reading the paper, I stopped adding features to my AI agent for a week. I started writing skill files instead. Five of them. All under 1,000 tokens. All producing measurable improvements in my daily work. This post is the five skills and the pattern I now use by default for any procedural task.

What SkillOpt actually proved

The key idea is not “use Markdown.” The key idea is “treat the skill document like a trainable state.” A second language model acts as the optimizer. It reads the agent’s logs, spots recurring error and success patterns, and proposes small edits to the skill. Each edit is only accepted if it performs better on a held-out validation set.

The authors mapped several deep learning concepts onto the text level. A learning rate caps how many edits can land per step. A scheduler shrinks the step size across epochs. Rejected edits go into a buffer and serve as negative examples. A slow update at the end of each epoch preserves stable edit directions across rounds.

What makes this practical is the clean split between training and deployment. The optimizer model only runs during training. At inference, the target model simply receives a plain Markdown file as context. The skill is not a fine-tune. The skill is not a tool. The skill is a 300 to 2,000-token document the agent reads at the start of each task.

The benchmark details matter. Six benchmarks covering search, spreadsheets, document analysis, math, and embodied action. Seven target models including GPT-5.5 and the smaller Qwen3.5-4B. The biggest gains show up on tasks with strict format requirements and tool use — exactly where my own agents struggle.

The transferability claim changed my workflow. A skill trained on a larger model also improves smaller models in the same family. A skill trained in Codex works unchanged in Claude Code. Write the skill once, deploy it everywhere.

Why prompt engineering is the wrong frame

Most prompt engineering advice treats the prompt as a one-shot creative artifact. You sit down, write a clever instruction, ship it, hope. The result is what SkillOpt’s authors observed: prompts are “either written by hand, generated in a single pass by a language model, or loosely self-revised. None of these approaches behaves like a real optimizer.”

The shift SkillOpt implies: a skill is a configuration file, not a piece of writing. You version it. You test it against a held-out set. You keep edits that improve the metric. You reject edits that do not. You build a feedback loop around it.

This is the same loop engineers already use for unit tests, config files, and SQL queries. The novelty is applying it to natural language instructions. Once you accept the frame, the workflow becomes obvious. Write a skill. Run it on 20 tasks. Inspect the failures. Patch the skill. Re-run. Keep what works. Discard what does not. Version the file. Diff it in code review.

The 5 skill files I shipped this week

I have written five skills in the last seven days. Each is under 1,000 tokens. Each targets a procedural task I run daily. Each has measurably improved the agent’s success rate on that task. Here are the five, with the full content of the first one so you can see the pattern.

Skill 1: Code review checklist

# Code Review Checklist

When reviewing a pull request, check these in order. Do not skip steps.

1. **Read the diff once for context, then again for bugs.** First pass: understand intent. Second pass: off-by-one, race conditions, missing error handling.
2. **Check the tests.** Every behavioral change needs a test. If the test does not fail without the change, it is not a real test.
3. **Check the error messages.** Users read them at 2 AM. They must say what went wrong and what to do next.
4. **Check the public API surface.** Any rename, signature change, or new return type is a breaking change. Flag it.
5. **Check the dependency list.** New dependency = new attack surface, new license review, new update burden. Reject unless it saves more than 100 lines.
6. **Output format.** Use exactly: ## Summary, ## Blocking issues, ## Suggestions, ## Nitpicks. If no blocking issues, write "## Blocking issues\n\nNone." Do not omit the heading.
7. **Tone.** Constructive and specific. "This will break if X" not "this is bad." Cite the line number.

Before this skill, my code-review agent produced inconsistent output — sometimes a 4-line summary, sometimes a 200-line essay. After, every review has the same structure. The “## Blocking issues\n\nNone.” line alone saved 30 minutes a week of misread review severity.

Skill 2: Daily standup writer

This skill takes a git log, a Linear board snapshot, and Slack threads I was tagged in. It produces a 5-line standup in the format my team agreed on. The skill is 600 tokens and includes the section headers, verb tense rules, and three examples of good standups from the team’s history. The examples are the secret sauce. A skill with examples outperforms a skill with abstract rules 9 times out of 10, in my testing.

Skill 3: Incident post-mortem template

This skill forces the agent to write a post-mortem in the structure my on-call rotation requires: timeline (UTC), root cause (one sentence), contributing factors (bulleted), what we will change (with owner and date), what we will not change (and why). 800 tokens. The counter-example: “Do not write ‘human error’ as a root cause. Name the system that allowed the human error. Human error is a symptom, not a cause.” Skills without a “do not” section leak failure modes.

Skill 4: PR description from commit history

This skill takes a branch name and a list of commits and produces a PR description: one-line summary, what changed, why, how to test, screenshots placeholder. 700 tokens. Critical rule: never invent a “why” if the commits do not justify one. If the why is unclear, write “## Why\n\nNeeds author input.” and stop. The agent used to hallucinate plausible-sounding rationales. The skill prevents that.

Skill 5: Customer support reply drafter

This skill drafts a reply to a customer email, always in the same three-paragraph structure: acknowledge, explain, next step. The skill is 900 tokens. It includes 12 phrases the company has banned (“We’re sorry you feel that way”, “Per our policy”, “As a valued customer”) and 8 phrases the company prefers (“Here’s what happened”, “What I’ll do next”, “I’ll get back to you by Friday”). The banned-phrases list is what made this skill work. The agent stopped producing the corporate-speak that triggered complaints.

The skill file pattern I now use by default

After writing these five, the pattern is clear. A good skill has six sections, in this order:

1. Role and context — one sentence: who the agent is, what task it is doing, what the output is for. 2. Process — numbered steps, each atomic and falsifiable. “Read the diff” not “understand the diff.” 3. Output format — exact section headers, what to write when there is nothing to say. 4. Examples — at least 2 good, 1 counter-example. Real examples from the team’s history beat invented ones. 5. Do not — a short list of phrases, behaviors, or output patterns the agent must avoid. 6. Verification — one sentence on how to check the output.

Skills under 600 tokens outperform longer ones for narrow tasks. Skills over 1,500 tokens drift — the agent begins to ignore the lower-priority rules. Keep skills small. Keep skills specific. Keep skills versioned.

The single most important shift is the test loop. Treat the skill file like a config. Run it on 20 tasks. Compare success rate to the previous version. Keep the edit only if it improves the metric. Reject otherwise. This is the loop SkillOpt automates, and the loop any engineer can run by hand in 30 minutes per skill per week.

What this changes about how I build agents

I have stopped adding features to my AI agent for the next two months. I am only writing skill files. The agent’s surface area is frozen. The skill files are the new surface area, and the skill files are versioned, tested, and diffable in the same way code is.

If you only do one thing this week, pick the procedural task your agent performs worst, write a 500-token skill file with the six sections above, run it on 20 real examples, and compare to your current success rate. The result will either surprise you or it will not. Either way, you will know.


Discover more from Susiloharjo

Subscribe to get the latest posts sent to your email.

Leave a Comment

Discover more from Susiloharjo

Subscribe now to keep reading and get access to the full archive.

Continue reading