How to Catch AI Mistakes In Your Code

6 Deterministic Ways To Prevent AI Mistakes

May 27, 2026


Motivation

Stripe’s internal AI agents ship 1,300 pull requests every week. The world’s biggest payment processor is moving faster than human review teams can keep up.

AI made code cheap. Writing code is no longer the bottleneck. Reviewing it is.

Human reviews do not scale to AI speed.

That’s where deterministic guardrails matter.

Compilers, tests, linters, analyzers, and CI should enforce rules. Humans should focus on intent, tradeoffs, and product decisions.

Guardrails work because they exist outside the agent’s control.

An AI can ignore a style guide or rationalize past review comments.

It cannot ignore a failing compile, a red test, or a blocked CI pipeline.

Here are 6 deterministic ways to stop AI from shipping bad code:

Static Analysis and Linting

Agents leave a recognizable trail: “@ts-ignore” suppressions, leftover “console.log” calls, hand-rolled retries next to a util that already exists.

Every codebase ends up with its own “slop register”: the place where low-quality code piles up.

Static analysis and linting catch dangerous patterns automatically: dead code, unused dependencies, unsafe async flows, code smells, complexity, security issues, and all the repetitive nonsense humans keep pointing out in reviews.

TypeScript with ESLint is good at catching problems at the code-pattern level. Tools like SonarQube, Semgrep, or CodeQL go wider and catch issues at scale.

The rule is simple: if your team keeps repeating the same PR comment, turn it into automation.

Example of bad usage of async:

Example of good usage of async:
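A minimal sketch of both versions, assuming a hypothetical `saveOrder` helper that stands in for any slow write. The first function drops the promise, the second awaits it.

```ts
// Hypothetical stand-in for a real database write.
async function saveOrder(id: string, store: Set<string>): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 10));
  store.add(id);
}

// Bad: the promise is dropped, so the function reports a result
// before the write has finished, and any rejection goes unhandled.
async function checkoutBad(id: string, store: Set<string>): Promise<boolean> {
  saveOrder(id, store); // floating promise
  return store.has(id);
}

// Good: awaiting the write means the result reflects what actually happened.
async function checkoutGood(id: string, store: Set<string>): Promise<boolean> {
  await saveOrder(id, store);
  return store.has(id);
}
```

Both functions compile cleanly, which is why this kind of mistake tends to slip past a quick review.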

Now turn the rule into automation. One line in your ESLint config makes this a build failure, not a review comment.
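As a sketch, and assuming the rule in question is typescript-eslint's `no-floating-promises` (the text does not name it), the config could look like this. The `projectService` setting assumes a recent typescript-eslint version; older setups may need a different parser option.

```ts
// eslint.config.js
import tseslint from "typescript-eslint";

export default tseslint.config(
  ...tseslint.configs.recommended,
  {
    languageOptions: {
      // This rule needs type information, so the parser must load the TS project.
      parserOptions: { projectService: true },
    },
    rules: {
      // Assumed rule: reports promises that are neither awaited nor handled.
      "@typescript-eslint/no-floating-promises": "error",
    },
  },
);
```

With linting wired into CI, a dropped promise fails the build instead of waiting for a reviewer to spot it.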

Don’t forget: A repeated review comment is just an unimplemented rule.

Architecture Tests

Agents are great at producing code that looks fine locally while quietly breaking global structure.

The compiler catches some structural mistakes for free. Architecture tests catch the rest: naming conventions, forbidden placements, feature coupling, and rules like “controllers only live in the API layer” or “handlers must depend only on abstractions.”

Tools like ArchUnitNET make these tests very readable. You can assert that controllers never appear outside the API namespace, or that classes ending with Service only live where you expect them.

That turns architecture into a contract that guards your system against mistakes AI agents might make.
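To make the idea concrete, here is a hand-rolled sketch in TypeScript. It is not ArchUnitNET's API. The `ModuleInfo` shape, the folder names, and both rules are assumptions chosen for illustration.

```ts
// A hypothetical, simplified view of a codebase: each module's path and its imports.
interface ModuleInfo {
  path: string;
  imports: string[];
}

// Rule 1: files named *Controller.ts may only live under src/api/.
// Rule 2: modules under src/domain/ must not import from src/api/.
function findArchitectureViolations(modules: ModuleInfo[]): string[] {
  const violations: string[] = [];
  for (const mod of modules) {
    if (mod.path.endsWith("Controller.ts") && !mod.path.startsWith("src/api/")) {
      violations.push(`${mod.path}: controllers must live in src/api/`);
    }
    if (mod.path.startsWith("src/domain/")) {
      for (const imported of mod.imports) {
        if (imported.startsWith("src/api/")) {
          violations.push(`${mod.path}: domain must not import ${imported}`);
        }
      }
    }
  }
  return violations;
}

const sampleModules: ModuleInfo[] = [
  { path: "src/api/OrderController.ts", imports: ["src/domain/Order.ts"] },
  { path: "src/domain/Order.ts", imports: ["src/api/OrderController.ts"] },
  { path: "src/billing/InvoiceController.ts", imports: [] },
];

// Two violations: the domain import and the misplaced controller.
const violations = findArchitectureViolations(sampleModules);
```

A test that asserts this list is empty would fail the build the moment an agent puts a controller in the wrong folder.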

Type System Guardrails

Agents often loosen types just to make code compile.

An “any” here.

An “as unknown as X” there.

An “@ts-ignore” to silence the last error.

Before long, your compiler accepts anything.

There are 3 main ways to lock it down in TypeScript:

1. Turn on strict TypeScript settings

Use:

{
  "compilerOptions": {
    "strict": true,
    "noUncheckedIndexedAccess": true,
    "exactOptionalPropertyTypes": true
  }
}

Enable them for new code first. Then improve the existing code gradually.

2. Block the easy shortcuts

Add ESLint rules like:

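// eslint.config.js
import tseslint from "typescript-eslint";

export default tseslint.config(
  // Registers the parser and the @typescript-eslint plugin the rules below need.
  ...tseslint.configs.recommended,
  {
    rules: {
      "@typescript-eslint/no-explicit-any": "error",
      "@typescript-eslint/ban-ts-comment": "error"
    }
  }
);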

These block the first escape routes agents usually reach for.

3. Model state with discriminated unions, not boolean flags

Optional booleans create impossible states: a payment that is both "processing" and "paid", or a failed payment with a receipt ID.

❌ Bad:

✅ Good:
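A minimal sketch of the two models side by side. The field names (`isProcessing`, `receiptId`, `errorMessage`) and the three statuses are assumptions made for illustration.

```ts
// Bad: optional flags and fields allow contradictory combinations.
interface PaymentFlags {
  isProcessing?: boolean;
  isPaid?: boolean;
  isFailed?: boolean;
  receiptId?: string;
  errorMessage?: string;
}

// This compiles, yet it describes a payment that cannot exist.
const impossible: PaymentFlags = { isProcessing: true, isPaid: true, isFailed: true, receiptId: "r_1" };

// Good: each state carries only the data that is valid for it.
type Payment =
  | { status: "processing" }
  | { status: "paid"; receiptId: string }
  | { status: "failed"; errorMessage: string };

function describePayment(payment: Payment): string {
  switch (payment.status) {
    case "processing":
      return "Payment is processing";
    case "paid":
      return `Paid, receipt ${payment.receiptId}`;
    case "failed":
      return `Failed: ${payment.errorMessage}`;
    default: {
      // Adding a new status without handling it here becomes a compile error.
      const unhandled: never = payment;
      return unhandled;
    }
  }
}
```
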

The first model lets agents create invalid states. The second makes invalid states impossible.

Type system guardrails prevent AI from adding bad code.

Test Guardrails

Good tests do more than prove code works. They protect the system from lying to you.

That matters even more with AI. Generated tests often look clean and complete while missing the one thing that matters: would they catch a real bug?

There are 3 testing practices that help you protect your systems:

Mutation Testing

Code coverage is not a quality metric. You can have 100% coverage without a single test assertion. Coverage can help you find untested areas of the code, but it tells you nothing about the quality of the tests for the areas it does cover.

The solution is to use mutation testing. It changes your code in small ways and checks whether your tests fail. If they still pass, your safety net is fake.

Coverage tells you what code ran. Mutation testing tells you whether your tests would catch a bug.
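A hand-written sketch of the idea, without any mutation framework. The `isAdult` function and the single mutant are assumptions for illustration; a real tool generates and runs many mutants automatically.

```ts
// Code under test.
const isAdult = (age: number): boolean => age >= 18;

// A mutant of the kind a mutation tool might generate: ">=" became ">".
const isAdultMutant = (age: number): boolean => age > 18;

type AgeCheck = (age: number) => boolean;

// Weak test: runs the code (full coverage) but never probes the boundary.
const weakTest = (impl: AgeCheck): boolean => impl(30) === true && impl(5) === false;

// Strong test: also pins down the boundary values.
const strongTest = (impl: AgeCheck): boolean =>
  weakTest(impl) && impl(18) === true && impl(17) === false;

// A mutant is "killed" when a test that passes on the original fails on the mutant.
const weakKillsMutant = weakTest(isAdult) && !weakTest(isAdultMutant); // false: mutant survives
const strongKillsMutant = strongTest(isAdult) && !strongTest(isAdultMutant); // true: mutant killed
```

Both tests give the same coverage number. Only the second one notices when the comparison operator changes.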

The Stryker mutation testing framework works well for JavaScript, TypeScript, C#, and Scala. If you use another language, ask your LLM for mutation testing framework options.

To learn more about mutation testing, check out my previous article here.

Contract Testing

Architecture tests protect boundaries inside your codebase. Contract tests protect boundaries between systems.

They catch integration bugs early: an external API changes, your application still expects the old response, and production would fail.

API docs are not enough. Executable tests against real test environments verify the contract still holds.
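A minimal sketch of the consumer side of such a check. The endpoint, the three fields, and the sample payloads are hypothetical; in a real contract test the body would come from the provider's test environment instead of a literal.

```ts
// Expected shape for a hypothetical GET /users/:id response:
// id (string), email (string), createdAt (ISO 8601 string).
const requiredStringFields = ["id", "email", "createdAt"] as const;

// Returns a list of contract violations; an empty list means the contract holds.
function checkUserContract(body: unknown): string[] {
  if (typeof body !== "object" || body === null) {
    return ["response body is not an object"];
  }
  const record = body as Record<string, unknown>;
  const problems: string[] = [];
  for (const field of requiredStringFields) {
    if (typeof record[field] !== "string") {
      problems.push(`"${field}" should be a string`);
    }
  }
  const createdAt = record["createdAt"];
  if (typeof createdAt === "string" && Number.isNaN(Date.parse(createdAt))) {
    problems.push('"createdAt" should be a parseable date');
  }
  return problems;
}

// The response our code was written against.
const currentResponse: unknown = { id: "u_1", email: "a@example.com", createdAt: "2026-05-27T10:00:00Z" };
// The provider changed the id type and renamed a field.
const changedResponse: unknown = { id: 42, email: "a@example.com", created_at: "2026-05-27T10:00:00Z" };
```

Running `checkUserContract` on the changed response reports two problems, so the break shows up in CI and not in production.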

Invariant & Property-Based Testing

Contract tests prove the API shape matches. The system can still be wrong.

That is where invariants come in: rules that must always stay true.

For a REST API, examples might be:

createdAt <= updatedAt
totalCount is never smaller than the number of returned items
private fields are never exposed in public responses

Property-based testing checks these invariants across hundreds of generated inputs, including edge cases humans and agents often miss.

Agents tend to write happy-path tests for happy-path code. Tools like fast-check (TypeScript), Hypothesis (Python), and jqwik (JVM) generate unexpected inputs and search for cases where your rules break.

The goal is to catch violated business rules, broken assumptions, or edge cases.
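A self-contained sketch of the technique with a hand-rolled generator, so it runs without dependencies. A library such as fast-check would replace the generator and add shrinking. The `paginate` helper and the input ranges are assumptions for illustration.

```ts
interface Page<T> {
  items: T[];
  totalCount: number;
}

// Code under test: a hypothetical pagination helper.
function paginate<T>(all: T[], offset: number, limit: number): Page<T> {
  const start = Math.max(0, offset);
  const size = Math.max(0, limit);
  return { items: all.slice(start, start + size), totalCount: all.length };
}

// Small deterministic pseudo-random generator so failures are reproducible.
function createRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32;
  };
}

// Checks the invariants against many generated inputs; returns the first counterexample.
function findCounterexample(runs: number, seed: number): string | null {
  const random = createRandom(seed);
  for (let i = 0; i < runs; i++) {
    const all = Array.from({ length: Math.floor(random() * 50) }, (_, n) => n);
    const offset = Math.floor(random() * 120) - 60; // includes negative and out-of-range values
    const limit = Math.floor(random() * 120) - 60;
    const page = paginate(all, offset, limit);
    const holds =
      page.totalCount >= page.items.length && page.items.length <= Math.max(0, limit);
    if (!holds) {
      return `length=${all.length} offset=${offset} limit=${limit}`;
    }
  }
  return null;
}

// null means no generated input broke the invariants.
const counterexample = findCounterexample(500, 42);
```

The test states the rule once and lets the generator look for inputs that break it, including negative offsets and limits that a happy-path test would never try.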

To learn more about property-based testing, check out my previous article here.

Supply Chain Guardrails

AI loves dependencies a little too much.

Give an agent a task and there is a good chance it will install two packages, one abandoned library, and a helper utility nobody on your team has ever heard of. That is supply-chain chaos.

Supply chain guardrails stop that early. They block vulnerable, outdated, or unapproved libraries from quietly entering the system. They also make updates more reliable by automating review and enforcement around package health.

One of the simplest defenses is to reject packages that were published too recently. If a package was uploaded 20 minutes ago, your CI should not trust it just because the registry serves it. A short age gate gives the ecosystem time to catch obvious compromises before your build pulls them in.

Setting it to 3 days is a practical default. Examples:
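The gate itself can be sketched in a few lines of TypeScript. This assumes the publish timestamp has already been read from registry metadata; the function name and the sample dates are hypothetical.

```ts
const DAY_IN_MS = 24 * 60 * 60 * 1000;

// Returns true when a package version has been public for at least `minAgeDays`.
// `publishedAt` is assumed to come from registry metadata read elsewhere.
function isOldEnough(publishedAt: Date, now: Date, minAgeDays = 3): boolean {
  return now.getTime() - publishedAt.getTime() >= minAgeDays * DAY_IN_MS;
}

const now = new Date("2026-05-27T12:00:00Z");
const publishedTwentyMinutesAgo = new Date("2026-05-27T11:40:00Z");
const publishedLastWeek = new Date("2026-05-20T12:00:00Z");

const rejected = !isOldEnough(publishedTwentyMinutesAgo, now); // true: too new
const accepted = isOldEnough(publishedLastWeek, now); // true: 7 days old
```

In practice you would rarely write this by hand. Renovate, for instance, has a `minimumReleaseAge` option that applies the same idea to dependency updates.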

Audit checks, Renovate, lockfile discipline, and package age gates are all part of the same idea: your system should be selective about what it allows in.

Don’t forget: Working code is not the same as safe code.

Migration Guardrails

AI-generated migrations are where dev-time speed turns into prod-time disasters.

An agent writes:

ALTER TABLE users
  ADD COLUMN email_verified BOOLEAN NOT NULL DEFAULT false;

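Every test passes. Then, on a database engine that rewrites the table for this change (PostgreSQL before version 11, for example), it locks a 50-million-row table in production for 40 minutes.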

The agent did not know your table size. It did not know your write throughput. It only knew the syntax.

Migration linters catch this before merge.

For example, Squawk for Postgres flags non-CONCURRENTLY indexes, dangerous renames, missing defaults, and 30+ other footguns.

strong_migrations does the same for Rails.

Run them in CI on every PR that touches the migrations/ folder.
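To show the shape of such a check, here is a toy linter in TypeScript. It is not how Squawk or strong_migrations work internally, the rule names are made up, and real linters parse SQL instead of pattern-matching it.

```ts
interface MigrationWarning {
  rule: string;
  statement: string;
}

// Toy checks only: two patterns, matched with regular expressions.
function lintMigration(sql: string): MigrationWarning[] {
  const warnings: MigrationWarning[] = [];
  const statements = sql
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  for (const statement of statements) {
    const isCreateIndex = /^create\s+(unique\s+)?index\b/i.test(statement);
    if (isCreateIndex && !/\bconcurrently\b/i.test(statement)) {
      warnings.push({ rule: "index-without-concurrently", statement });
    }
    if (/\brename\s+(column|to)\b/i.test(statement)) {
      warnings.push({ rule: "rename-breaks-old-code", statement });
    }
  }
  return warnings;
}

const migration = `
  CREATE INDEX idx_users_email ON users (email);
  CREATE INDEX CONCURRENTLY idx_users_name ON users (name);
  ALTER TABLE users RENAME COLUMN name TO full_name;
`;

// Two warnings: the first index and the rename. A CI step would fail on a non-empty list.
const warnings = lintMigration(migration);
```
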

Humans Review Intent. Machines Enforce Rules

In the AI era, quality will not come from reading more diffs. It will come from building better systems of enforcement.

Human review still matters. But it should not be spent catching forgotten awaits, broken boundaries, weak models, or sketchy packages.

Machines can do that faster and more consistently.

Human review should move up the stack. It should focus on whether the feature solves the right problem, whether the business rule is correct, whether the tradeoff makes sense, and whether the design fits the product.

Stop relying on hero code reviewers. Start adding deterministic guardrails to Craft Better Software.

