Opus 4.8 Plans, Gemini 3.5 Executes

For the last six weeks I have been running my project work through a two-agent loop, and it has changed how I think about AI assistants. Opus 4.8 plans. Gemini 3.5 executes. I sit between them as the human in the loop, and the work gets faster and cleaner than any single-agent setup I have run before.

This is what the flow looks like, what each model is actually good at, and where the loop breaks when I push it too hard.

The flow

I open a new project in Claude Code with Opus 4.8 as the planner. The first thing I do is not write code. It is a brainstorming session. I describe what I want to build, the constraints I am working under, the things I have already tried. Opus 4.8 pushes back, asks clarifying questions, surfaces edge cases I had not thought about, and proposes an architecture. I keep iterating until the plan is something I would actually defend in a design review.

Once the plan is solid, I ask Opus 4.8 to convert it into a checklist. Not a high-level outline. A numbered checklist where every item is something a different agent can pick up and execute without needing to understand the rest of the system. “Task 1: scaffold the Postgres schema for the events table.” “Task 2: implement the webhook receiver with idempotency keys.” Each task has a clear input, a clear output, and a definition of done. I hand the checklist to Gemini 3.5 with the instruction “execute tasks 1 through N one at a time, and write a short report after each one.”
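As a rough sketch of what "a clear input, a clear output, and a definition of done" could look like as data, here is one possible shape for a checklist item. The `Task` class, its field names, and the `render_checklist` helper are hypothetical, and the inputs, outputs, and done criteria are invented for illustration. Only the two task titles come from the checklist above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    number: int
    title: str
    task_input: str          # what the executor starts from
    task_output: str         # what must exist when the task ends
    definition_of_done: str  # how the reviewer decides the task is finished


def render_checklist(tasks):
    """Render tasks as a numbered plain-text checklist, ordered by task number."""
    lines = []
    for task in sorted(tasks, key=lambda t: t.number):
        lines.append(f"Task {task.number}: {task.title}")
        lines.append(f"  Input: {task.task_input}")
        lines.append(f"  Output: {task.task_output}")
        lines.append(f"  Done when: {task.definition_of_done}")
    return "\n".join(lines)


# Titles come from the post; every other field is an invented example value.
checklist = [
    Task(2, "implement the webhook receiver with idempotency keys",
         "events table from task 1", "webhook handler module",
         "a repeated delivery with the same key creates one row"),
    Task(1, "scaffold the Postgres schema for the events table",
         "approved plan", "migration file for the events table",
         "migration applies cleanly to an empty database"),
]

print(render_checklist(checklist))
```

Forcing every item to fill in all five fields is one way to notice a task that is still too vague to hand off.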

Gemini works through the checklist. After each task it returns a short report: what it built, where the files live, what it tested. I do not read the code in real time. I read the reports. When the report says the task is done, I check it against the plan. If it matches, I move to the next task. If it does not match, I take the report back to Opus 4.8 and ask Claude to update the plan or write a fix task. Gemini then executes the new task. The planner owns the plan. The executor owns the execution. The bug does not live in the same place as the spec.
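The control flow of that handoff can be sketched in a few lines. This is an illustration under assumptions, not a real integration. `execute`, `accept`, and `plan_fix` are hypothetical callables standing in for the executor model, the human review, and the planner. The `max_fix_rounds` cap is an added safeguard that the workflow above does not mention.

```python
def run_loop(tasks, execute, accept, plan_fix, max_fix_rounds=2):
    """Run tasks one at a time; rejected reports go back to the planner for a fix task."""
    accepted_reports = []
    for task in tasks:
        report = execute(task)
        rounds = 0
        while not accept(task, report):
            if rounds == max_fix_rounds:
                raise RuntimeError(f"still rejected after {rounds} fix rounds: {task}")
            task = plan_fix(task, report)  # planner writes the fix task
            report = execute(task)         # executor runs the fix task
            rounds += 1
        accepted_reports.append(report)
    return accepted_reports


# Stub callables so the sketch runs without any model behind it.
def execute(task):
    return f"report for {task}"


def accept(task, report):
    return "broken" not in task


def plan_fix(task, report):
    return task.replace("broken ", "fixed ")


print(run_loop(["schema", "broken webhook"], execute, accept, plan_fix))
```

The point of the structure is that `plan_fix` and `execute` never share a function: the fix is specified in one place and built in another.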

The loop closes when every task is done. Then I open Opus 4.8 again, give it the reports, and ask it to do the integration check. Does the system work end to end? Are there gaps between tasks that the executor missed? Opus 4.8 writes a list of follow-up tasks. I push them back to Gemini. The cycle repeats until the work is actually done.

This is the loop. Two models with two different jobs, and a human in the middle deciding what is good enough to move on.

Why Opus 4.8 for planning

The reason I do not let Gemini do the planning is not that Gemini is bad at planning. It is that planning is the step where I want maximum reasoning effort, and Opus 4.8 is the model that gives me the most thinking per prompt. According to Anthropic, Opus 4.8 has adaptive thinking that lets it spend more time on harder problems and respond quickly to simpler ones. That maps directly onto what planning needs. The first pass at a plan is fast and shallow. The second pass catches the gaps. The third pass catches the gaps in the second pass. Opus 4.8 does all three passes in one prompt if I let it run with maximum effort.

Gemini is also a strong planner. But when I use the same model for both planning and execution, I lose the role separation. The model starts optimizing for execution speed during planning. It skips over edge cases because it knows it can patch them later. Splitting the roles across two models forces me to write down what the planner actually intends, and that written plan is what I check the executor against.

There is also a cost reason. Opus 4.8 pricing starts at $5 per million input tokens and $25 per million output tokens. That is the most expensive seat in my workflow. But I only use it for the parts that need its full reasoning budget: the brainstorming, the plan, and the integration check. The bulk execution goes to Gemini at a much lower cost per token.
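To make the planner's share of the bill concrete, the arithmetic for those two rates is small enough to write out. The rates are the ones quoted above. The token counts in the example are invented, and the function name is hypothetical.

```python
OPUS_INPUT_USD_PER_MTOK = 5.0    # rate quoted above
OPUS_OUTPUT_USD_PER_MTOK = 25.0  # rate quoted above


def opus_cost_usd(input_tokens, output_tokens):
    """Planner cost in USD for a given number of input and output tokens."""
    input_cost = input_tokens / 1_000_000 * OPUS_INPUT_USD_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OPUS_OUTPUT_USD_PER_MTOK
    return input_cost + output_cost


# Invented example: 2M input tokens and 400k output tokens across a project.
print(round(opus_cost_usd(2_000_000, 400_000), 2))
```

With those assumed counts, input and output each contribute about $10, which shows how quickly output tokens dominate at a 5x rate.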

Why Gemini 3.5 for execution

Gemini 3.5 Flash is the current Gemini model I use for execution. It is fast, it follows checklists well, and it produces reports that are detailed enough to verify without reading every line of code. The thing Gemini does best in this flow is the boring part. Scaffold a schema. Write the boilerplate. Run the test suite. Fix the linting errors. None of those tasks need Opus-level reasoning. They need a model that will not get bored halfway through and start skipping items.

When I give Gemini a 20-item checklist, it does all 20 items. When I gave Opus 4.8 a 20-item checklist, it would pick the three most interesting items and try to do them at a depth that was overkill for what was actually needed. The other 17 items got a sentence each. That is fine for the interesting items. It is not fine for the boring items, which is most of the work.

Gemini also reports back in a format I can verify without re-reading the code. It tells me which files it changed, which tests it ran, which tests it skipped and why, and what assumptions it made. That is the right level of detail for execution. If the report says “implemented task 3 per the spec, all 14 tests pass,” I check the test output and move on. I do not need to read the implementation unless the report says something is off.

What I do as the human in the loop

The middle role is the part I underestimated when I started this workflow. I thought the human job was to read the final output and ship it. It is not. The human job is to make three decisions per loop.

The first decision is whether the plan is good. Opus 4.8 will keep generating more detailed plans if I let it. The question is not “is this plan complete” — it never is — but “is this plan good enough to execute against.” If the plan is 80% right, the executor will fill in the 20%. If the plan is 50% right, the executor will fill in a different 50% and the system will not work end to end. I push back on the plan until the shape is right, then let go.

The second decision is whether each task report is good enough to accept. “Implemented per the spec” is not the same as “implemented correctly.” I check the test output. I skim the diff. I look for the patterns that mean trouble: a test that is passing for the wrong reason, a function that is doing too much, a TODO left in the code. If the pattern looks right, I accept. If the pattern looks wrong, I go back to Opus 4.8 and ask Claude to either update the plan with a fix task or rewrite the spec for the next loop.
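One of those trouble patterns, a TODO left in the code, is easy to check mechanically before skimming the diff by hand. A minimal sketch, assuming the executor's changes are available as unified diff text. The function name and the sample diff are invented for illustration.

```python
def added_todos(diff_text):
    """Return added lines in a unified diff that contain a TODO marker."""
    hits = []
    for line in diff_text.splitlines():
        # "+" marks an added line; "+++" is the file header, not content.
        is_added = line.startswith("+") and not line.startswith("+++")
        if is_added and "TODO" in line:
            hits.append(line[1:].strip())
    return hits


sample_diff = """\
--- a/receiver.py
+++ b/receiver.py
@@ -1,2 +1,3 @@
 def handle(event):
+    # TODO: check the idempotency key
+    store(event)
-    pass
"""

print(added_todos(sample_diff))
```

This only covers the cheapest of the three patterns. A test passing for the wrong reason or a function doing too much still needs a human read.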

The third decision is whether the integration check surfaces something real. The integration check is where Opus 4.8 catches the gaps between tasks. Some of those gaps are real. Some of them are over-engineering. I have to tell the difference. If the gap is real, I add it to the follow-up checklist. If the gap is over-engineering, I push back on Opus 4.8 and tell it why we are not solving that problem in this iteration.

Three decisions per loop. That is the job.

Where the loop breaks

The loop does not work for everything. Three things have bitten me.

First, when the plan is too vague. If I let Opus 4.8 generate a plan with high-level items like “implement the auth flow,” the executor has to fill in too much. Different tasks end up with different interpretations, and the integration check finds a mess. The fix is to push the planner to break things down further. “Implement the auth flow” becomes 6 items: schema, signup, login, password reset, session check, logout. The more granular the checklist, the cleaner the execution.

Second, when the reports get hand-wavy. When Gemini is unsure about a task, it sometimes writes reports that sound confident but do not actually say what changed. “The feature is now working as expected” with no file path and no test output is a red flag. The fix is to require a specific report format: file path, what changed, what tests ran, what passed, what did not. If the report does not match the format, I do not accept the task.
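A format requirement like this can be enforced with a small check before the report is even read. The sketch below assumes reports are plain text with one `Field: value` line per required item. The field labels are a hypothetical encoding of the five items listed above, and the file path in the example is invented.

```python
REQUIRED_FIELDS = ("files", "changed", "tests run", "passed", "failed")


def missing_fields(report):
    """Return required fields that are absent or empty in a 'Field: value' report."""
    present = set()
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            present.add(key.strip().lower())
    return [field for field in REQUIRED_FIELDS if field not in present]


vague = "The feature is now working as expected"
specific = "\n".join([
    "Files: src/webhook.py",
    "Changed: added idempotency key check",
    "Tests run: 14",
    "Passed: 14",
    "Failed: 0",
])

print(missing_fields(vague))     # every field is missing
print(missing_fields(specific))  # nothing is missing
```

A non-empty result is a reason to bounce the report back without accepting the task.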

Third, when I skip the integration check. On a good day, the integration check catches the cross-task gaps. On a bad day, when I am tired or rushing, I skip it and ship directly. The system usually works for a week and then breaks in production on an edge case the integration check would have caught. The fix is to never skip the integration check, even when I think the work is done. The 10 minutes I save by skipping it are not worth the 4 hours I spend debugging later.

The numbers

Across the last 6 projects I shipped with this workflow:

– Average task-completion rate on the first pass: 78%. About 1 in 4 tasks comes back with a report that needs a fix before I accept it. The fix is usually one round-trip back to Claude for a spec update, then a single Gemini execution pass, not a full re-do.
– Average project time end to end: about 60% of what it was when I was using a single agent for both planning and execution. The savings are bigger on larger projects. On small projects the overhead of running two models and managing the loop eats most of the savings.
– Average Opus 4.8 token spend per project: about 30% of total. Gemini takes the other 70%. The total token spend is similar to a single-agent setup, but the expensive model spends its budget on the reasoning-heavy work.

Project time dropping to about 60% of the single-agent baseline is the headline number. The reason for the saving is not that the models are faster. It is that the role separation forces me to write down what I want before I let the executor build it, and the written plan is what I check the work against. Most of the time I used to spend debugging was debugging things I never actually specified. The plan makes the specification explicit. The executor fills in the gaps. The integration check catches what was filled in wrong.

The takeaway

Two models with two different jobs and a human in the middle is the workflow I am running now. Opus 4.8 does the work that needs its full reasoning budget: planning and integration checking. Gemini 3.5 does the work that needs speed and checklist discipline: bulk execution. I make three decisions per loop: is the plan good enough, is each task report good enough, is the integration check surfacing something real.

The loop is not magic. It is just role separation. The planner plans. The executor executes. The human decides. The reason it works is the same reason any good team works — clear roles, clear handoffs, and someone in the middle who can say “this is not done yet” without slowing down.

If you are running a single-agent setup and feeling the friction, try splitting the work. Put your expensive model on planning and integration. Put your cheap model on execution. Run the loop. Write down what you want before the executor starts. Check the reports before you accept the work. The savings show up in the first project.
