Why a Simple If-Else Can Beat an LLM

TL;DR: When you can describe the inputs and the expected outputs in advance, you don’t need a model — you need a function. Here’s the principle, the proof, and the one case where the principle breaks.

A teammate burned $47 of API credits last quarter on a “smart” classifier. The job: sort incoming support emails into four buckets (billing, technical, account, other) and route them to the right Slack channel. The model nailed it about 91% of the time. For the remaining 9%, it was confidently, hilariously wrong: sending a billing dispute to the technical channel, an outage report to “other.”

I replaced it with a 40-line Python script built from if statements and a handful of keyword checks. It runs in 12 milliseconds per email, costs $0, and gets the same 91%. The difference is that the 9% it gets wrong are predictably wrong, so we know to watch them. The classifier used to hallucinate categories that didn’t exist. The script never invents a fifth bucket.
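The original script isn’t shown here, so the following is only a minimal sketch of the shape such a router might take. The keyword lists and the function name `classify` are assumptions made up for illustration, not the rules that actually ran in production.

```python
# Hypothetical keyword lists; the real script's rules are not shown in this post.
KEYWORDS = {
    "billing": ("invoice", "refund", "charge", "payment"),
    "technical": ("error", "outage", "crash", "bug"),
    "account": ("password", "login", "username", "two-factor"),
}


def classify(email_text: str) -> str:
    """Return one of exactly four buckets; anything unmatched falls to 'other'."""
    text = email_text.lower()
    # Buckets are checked in dictionary order, so the first match wins.
    for bucket, words in KEYWORDS.items():
        if any(word in text for word in words):
            return bucket
    return "other"


print(classify("I was charged twice, please refund me"))
print(classify("Just wanted to say thanks"))
```

Because the function can only return a key of `KEYWORDS` or the literal "other", a fifth bucket is impossible by construction. The misses are also traceable: an email containing words from two buckets always goes to whichever bucket is listed first.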

That’s not an edge case. That’s the principle: when the parameters are already known, deterministic code is the right answer. The question is why this works, and when it stops working.

The Principle: Why a 40-Line Script Beats a 70B-Parameter Model

Three forces are at work, and all three push in the same direction.

Models guess; code knows. A language model produces outputs by sampling from a learned probability distribution over next tokens. Even at temperature zero, those probabilities reflect the model’s learned sense of “what would plausibly come next,” not what is correct for the problem at hand. A keyword check, by contrast, knows exactly what it’s looking for. There’s no approximation, no inference, no probability mass leaking into wrong answers. The 9% the script gets wrong follows directly from the rule set, so every miss traces back to a missing or overly broad rule. The 9% the model gets wrong is a measured average, and the remaining 91% isn’t all “right” in the same sense; it’s “probably right given the distribution of training text.” A meaningful chunk of the LLM’s correct answers are right for reasons you can’t inspect. With code, you can always read off the reason.

Latency compounds. A 40-line Python script runs in 12ms. Even a 4B-parameter model takes 800ms minimum, even on a good day, even with caching. For 200 emails processed one after another, that’s 160 seconds of wall time against 2.4 seconds for the script. When the workflow is interactive, this difference is the difference between “feels instant” and “I have time to context-switch and forget what I was doing.” Latency isn’t a cosmetic issue. It’s a UX issue. It’s a throughput issue. It compounds.
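The arithmetic behind those totals is worth writing down, since it scales linearly with volume. This sketch uses only the figures quoted above and assumes emails are processed one at a time with no batching or parallelism; the names are illustrative.

```python
EMAILS = 200
SCRIPT_MS = 12    # per-email latency quoted for the script
MODEL_MS = 800    # per-email latency quoted for the model


def batch_seconds(count: int, per_item_ms: int) -> float:
    """Total wall time in seconds for sequential processing."""
    return count * per_item_ms / 1000


script_total = batch_seconds(EMAILS, SCRIPT_MS)
model_total = batch_seconds(EMAILS, MODEL_MS)

print(f"script: {script_total} s, model: {model_total} s")
print(f"slowdown: about {MODEL_MS / SCRIPT_MS:.0f}x")
```

The ratio stays the same at any volume under these assumptions, so the absolute gap grows with every additional email.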

Reproducibility is a feature. The script’s output for any given input is deterministic: same input, same output, every time, forever. The model’s output is only approximately deterministic. Across three runs of the same prompt with the same input, a lead-scoring model I tried gave me three different scores. Three. The deterministic version with a weighted-average formula gives the same number every time, and the weights are explicit so I can defend them in a sales call. In production, “approximately correct” is often indistinguishable from “broken.” Code gives you the audit trail. Code gives you the test case. Code gives you the answer you can show your customer when they ask why their ticket was routed there.

These three forces — correctness, latency, reproducibility — are not independent. They reinforce each other. When you stack them against the LLM on a clear-flow problem, the script wins on every axis. It wins by less in some contexts (long tail of edge cases) and by more in others (high volume, low tolerance for surprise). But it always wins somewhere. And the larger the deployment, the larger the win.

The Three Questions I Ask Before Reaching for an LLM

The principle only fails when the parameters can’t be enumerated. The job is to recognize those cases. Before I open an LLM client — whether that’s the API, Claude Code, or a local Ollama endpoint — I run through three questions:

1. Can I describe the inputs and expected outputs in a list? If yes, that’s a function, not a prompt.

2. Are the decision boundaries static? If “billing email” is always a billing email regardless of context, I don’t need language understanding — I need substring matching.

3. Will the cost of being wrong be the same for every approach? LLMs hallucinate. Code throws errors. When a wrong answer is acceptable either way, the tiebreaker is speed and cost, and code wins on both.

If all three answers are “yes,” I write the script. I don’t even consider the LLM. The threshold for invoking AI in my stack is higher than it was 18 months ago, and the bar keeps rising as models get better — because models get better at the genuinely hard problems, leaving the easy ones looking more obviously like things code should handle.

Three Examples That Hold Up Under Scrutiny

Newsletter triage. I get 47 newsletters a day. For 40 of them, the decision is a lookup: did this come from a sender I’ve flagged as “always read” (4 senders), “scan subject only” (12 senders), or “archive” (the rest)? A YAML config and a single Python loop do it. I never ask the LLM to “decide which newsletter is interesting,” because that’s a category error. The interesting/uninteresting decision is mine, encoded in my config. The model would be guessing at my preferences with no more knowledge than the config already holds, only slower, at a cost, and with no way to inspect why it labeled something the way it did.
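As a rough illustration of that config-plus-loop pattern, here is a sketch in which a plain dictionary stands in for the YAML file so the example runs with the standard library alone. The sender addresses and the names `CONFIG`, `triage`, and `triage_inbox` are hypothetical.

```python
# Stand-in for the YAML config; the addresses are made up for illustration.
CONFIG = {
    "always_read": {"editor@example-weekly.com"},
    "scan_subject": {"digest@example-news.com"},
}


def triage(sender: str) -> str:
    """Map a sender address to one of three actions; unknown senders are archived."""
    address = sender.strip().lower()
    if address in CONFIG["always_read"]:
        return "always_read"
    if address in CONFIG["scan_subject"]:
        return "scan_subject"
    return "archive"


def triage_inbox(senders):
    """The 'single loop': one lookup per incoming newsletter."""
    decisions = {}
    for sender in senders:
        decisions[sender] = triage(sender)
    return decisions


print(triage_inbox(["editor@example-weekly.com", "promo@example-shop.com"]))
```

Changing a preference means editing one line of config, and the reason for any decision is simply which set the sender appears in.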

Log anomaly detection. When a service is misbehaving, the first thing I want is “show me lines that match known-bad patterns.” Regex, not AI. The AI comes in later, after the regex filter, to summarize the 30 lines I narrowed down. That’s a 95% reduction in tokens spent on log analysis. The LLM never sees the 99% of logs that are noise. The principle is the same: filter first, model second. The model isn’t bad at log analysis; it’s just expensive at it relative to regex.
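A filter-first step along those lines might look like the sketch below. The patterns and sample log lines are assumptions chosen for illustration; a real service would have its own list of known-bad signatures. Only the lines returned by the filter would be passed on to a model for summarizing.

```python
import re

# Hypothetical known-bad patterns; replace with the signatures your service emits.
KNOWN_BAD = re.compile(r"\b(?:ERROR|FATAL|Traceback|OOMKilled|timed out)\b")


def filter_suspicious(lines):
    """Keep only lines matching a known-bad pattern, preserving order."""
    return [line for line in lines if KNOWN_BAD.search(line)]


sample = [
    "INFO request served",
    "ERROR upstream connection timed out",
    "INFO healthcheck ok",
    "FATAL out of disk space",
]

kept = filter_suspicious(sample)
print(f"{len(kept)} of {len(sample)} lines kept for the model")
```

The regex pass is cheap enough to run over the full log, so the expensive step only ever sees the narrowed-down subset.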

Lead scoring for a side project. Inbound form submissions get scored 0–100 based on company size, role title, and a handful of signal keywords. I tried the LLM first. It gave wildly inconsistent scores — same company, same input, three different numbers across three runs. The deterministic version with a weighted-average formula gives the same number every time, and the weights are explicit so I can defend them in a sales call. The win wasn’t the speed or the cost. The win was the defensibility. The customer can ask “why did this lead get a 73?” and I can answer with the formula. With the LLM, the answer is “the model said so.” That’s not an answer.
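To make the “answer with the formula” point concrete, here is a sketch of a weighted-average scorer. The weights, size tiers, role table, and signal keywords are all invented for illustration; the actual formula is not given in this post. Integer weights that sum to 100 keep the arithmetic exact.

```python
# All weights and lookup tables below are hypothetical.
WEIGHTS = {"company_size": 50, "role": 30, "keywords": 20}  # must sum to 100
ROLE_SCORES = {"cto": 100, "vp engineering": 90, "engineering manager": 70, "developer": 40}
SIGNAL_KEYWORDS = ("budget", "timeline", "pilot", "migration")


def size_score(employees: int) -> int:
    if employees >= 1000:
        return 100
    if employees >= 100:
        return 70
    if employees >= 10:
        return 40
    return 10


def keyword_score(message: str) -> int:
    text = message.lower()
    hits = sum(1 for word in SIGNAL_KEYWORDS if word in text)
    return min(100, hits * 50)  # two or more signals max out this component


def score_lead(employees: int, role: str, message: str):
    """Return (score 0-100, per-component breakdown) so the score can be explained."""
    parts = {
        "company_size": size_score(employees),
        "role": ROLE_SCORES.get(role.strip().lower(), 20),  # unknown roles get a low default
        "keywords": keyword_score(message),
    }
    total = sum(WEIGHTS[name] * value for name, value in parts.items())
    return round(total / 100), parts


score, parts = score_lead(250, "CTO", "We have budget for a pilot")
print(score, parts)
```

Returning the breakdown alongside the score is a deliberate choice: the same call that produces the number also produces the explanation for it.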

When the Principle Breaks: The Subtext Problem

A year ago I would have said: “Routing customer support tickets is a clear if-else problem.” I was half right. The first layer — what type of ticket — is deterministic. But the second layer — what’s the actual issue, and what’s the customer’s emotional state — turned out to need a model. We tried hardcoding it. We missed sarcasm. We missed urgency that hid behind polite language. We missed the difference between “I’m canceling” and “I’m considering canceling and wanted to give you a chance to fix it.” The model reads the subtext. The script doesn’t.

The principle breaks when either or both of the following hold:

  • The input space is open-ended (subtext, sarcasm, context-dependent meaning)
  • The output space is open-ended (subjective judgment, summarization, generation)

When input is bounded and output is bounded, code. When one or both is open-ended, model. The trap I see most often — including in my own past work — is using AI on the bounded layer because it’s easier than thinking through the rules. “I’ll just have the LLM classify this” is faster to write than “let me list the four cases.” But “faster to write” is exactly the wrong optimization when the same script will run 10,000 times a day for the next three years. The 40 minutes you save writing the prompt cost you $47 in API fees, plus the 9% error rate you’ll spend weeks debugging.
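One way to keep the two layers separate is sketched below: rules decide the ticket type, and only the open-ended question is handed to a model. The model call is passed in as a plain function, here named `read_subtext`, which is a hypothetical placeholder rather than any real client API. The keyword lists are likewise made up.

```python
from typing import Callable

# Hypothetical rules for the bounded layer.
TYPE_KEYWORDS = {
    "billing": ("invoice", "refund", "charge"),
    "technical": ("error", "outage", "crash"),
    "account": ("password", "login", "cancel"),
}


def ticket_type(text: str) -> str:
    """Bounded input, bounded output: plain rules."""
    lowered = text.lower()
    for name, words in TYPE_KEYWORDS.items():
        if any(word in lowered for word in words):
            return name
    return "other"


def route_ticket(text: str, read_subtext: Callable[[str], str]) -> dict:
    """Rules pick the type; the injected callable handles the open-ended part."""
    return {
        "type": ticket_type(text),
        "subtext": read_subtext(text),
    }


# A stub stands in for the model so the sketch runs offline.
result = route_ticket(
    "I'm considering canceling and wanted to give you a chance to fix it.",
    read_subtext=lambda text: "at_risk",
)
print(result)
```

Injecting the model as a function keeps the deterministic layer testable on its own, and makes it obvious which part of each routing decision came from rules and which from the model.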

The Practical Heuristic: The 80/20 Test

When I’m deciding, I use a simple test: could a junior engineer with domain knowledge write the rules in an afternoon? If yes, that’s not AI’s job. If it would take a domain expert two weeks of discovery interviews to even start writing the rules, that’s a model problem.

Most of what gets routed to AI today falls in the first category. We just don’t notice because the model’s failure mode is silent: it returns a confident-sounding answer that is, on inspection, completely wrong. The script fails loudly. The model fails quietly. Loud failures get fixed. Quiet failures compound into technical debt that someone has to pay down later, usually at 2 AM during an incident over something the model “handled correctly” three weeks ago that is only now surfacing as user complaints.

The next time you find yourself reaching for an LLM client, pause for ten seconds. Write the rules in a comment block. If you can, you just wrote a spec for a function. Build the function. The token bill is the smallest of the three costs — correctness and reproducibility are the real prizes.

Think smart, not expensive.
