AI Code Audit Findings: 6 Problems in Almost Every AI-Built Codebase

June 18, 2026

TL;DR. We run a lot of audits on apps built with Cursor, Claude Code, Bolt, Lovable, and long ChatGPT sessions. The codebases differ wildly, but the findings almost never do. Six problems show up again and again: no test coverage (which makes every deployment a gamble), duplicated variables and functions (the same logic copy-pasted with slight drift), no consistent architecture (two screens in the same project written like two different apps), stream-based state avoided (so async logic collapses into callback hell and complex forms never quite work), no awareness of the deployment environment (in-memory state and local files that break the moment the app runs as more than one instance), and security left wide open (.env files committed, tokens hardcoded in source). None of these are exotic. All of them are predictable — and all of them are fixable without a rewrite. This is the engineering companion to our founder's guide to shipping an AI prototype; if you want it handled, that is what an AI code audit does.


Why the findings repeat

AI coding tools optimize for one thing: the shortest path to code that runs. That is genuinely useful — you get an idea into a working state in hours. But "runs in the demo" and "maintainable in production" are different targets, and the gap between them is remarkably consistent across projects.

The reason is structural. A model generating code has a narrow window of context and no memory of the decisions it made three files ago. Unless it is explicitly asked to, it does not test the thing it just wrote; it has no sense of your runway or your security posture, and it has no incentive to keep the codebase coherent over time. So it makes the locally optimal choice every time, and the sum of locally optimal choices is a codebase that works today and resists every change tomorrow.

After enough audits, the failures cluster into the same six buckets. Here they are, in the order they tend to hurt.


1. No tests — every deployment is a gamble

The single most common finding: there are no tests at all. Not a thin suite, not flaky tests — zero. The app was validated by clicking through it, and that is the entire safety net.

This is invisible right up until the moment it is catastrophic. With no tests, there is no way to know whether a change broke something other than shipping it and waiting for a user to complain. Every deployment becomes a manual regression pass that nobody actually performs, so refactoring becomes terrifying, dependency updates get skipped, and the codebase calcifies — not because the code is good, but because no one dares touch it.

Why AI does this. Generating a feature and generating tests for that feature are two separate requests, and nobody made the second one. The model will happily write tests if asked, but left to its own devices it ships the happy path and stops.

How we fix it. We do not aim for 100% coverage on day one. We add a thin layer where it pays off most: a smoke test that the app boots, tests around the money-handling and auth logic, and a regression test for every bug we fix during the audit. That alone turns deployments from a gamble into a routine. From there, coverage grows with the codebase instead of being a wall to climb.
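
As a rough illustration of that thin layer, here is a minimal sketch using Python's built-in unittest module. The names create_app and apply_discount are hypothetical stand-ins for an app entry point and a piece of money-handling logic; they are assumptions for the example, not code from any audited project.

```python
import unittest
from decimal import Decimal, ROUND_HALF_UP


def create_app() -> dict:
    """Hypothetical app factory; a real one would wire routes and config."""
    return {"routes": ["/health", "/checkout"], "ready": True}


def apply_discount(price: Decimal, percent: int) -> Decimal:
    """Hypothetical money logic: percentage discount, rounded to cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    discounted = price * (Decimal(100) - Decimal(percent)) / Decimal(100)
    return discounted.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


class SmokeTest(unittest.TestCase):
    def test_app_boots(self):
        # Smoke test: the app can be constructed and exposes a health route.
        app = create_app()
        self.assertTrue(app["ready"])
        self.assertIn("/health", app["routes"])


class MoneyTests(unittest.TestCase):
    def test_discount_rounds_to_cents(self):
        self.assertEqual(apply_discount(Decimal("19.99"), 15), Decimal("16.99"))

    def test_regression_rejects_out_of_range_percent(self):
        # Regression test for a hypothetical bug where 150% gave a negative price.
        with self.assertRaises(ValueError):
            apply_discount(Decimal("10.00"), 150)
```

Saved as a test module, this runs with python -m unittest. The point is the shape rather than the specifics: one boot check, a few tests on the logic that costs money when it is wrong, and one test per fixed bug.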


2. Duplicated variables and functions

Open an AI-built codebase and search for a date-formatting helper. You will often find three or four copies, each slightly different because it was generated in isolation for the screen that needed it. The same goes for validation rules, API clients, currency math, and configuration constants.

Duplication is not just ugly; it is a correctness time bomb. When the logic needs to change — a new tax rule, a fixed rounding bug, an updated endpoint — you have to find every copy. You will miss one. Now two parts of the app disagree about something they should agree on, and that disagreement is the next production incident.

Why AI does this. The model rarely searches the existing codebase for a helper it could reuse. It is cheaper, from its perspective, to regenerate the function inline than to discover and import the one that already exists. Each generation is locally reasonable; the aggregate is drift.

How we fix it. We find the clusters of near-identical code, extract a single source of truth, and route every call site through it. This is the highest-leverage cleanup in most audits: it shrinks the codebase, removes whole categories of "fixed here but not there" bugs, and makes the next change a one-line edit instead of a scavenger hunt.
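
A small sketch of what "a single source of truth" can look like, using currency rounding and display as the example. All function names here (round_money, format_price, and the two call sites) are hypothetical, and the half-up rounding rule is an assumption chosen for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

CENTS = Decimal("0.01")


def round_money(amount: Decimal) -> Decimal:
    """The one place that decides how money is rounded (assumed: half-up)."""
    return amount.quantize(CENTS, rounding=ROUND_HALF_UP)


def format_price(amount: Decimal, currency: str = "USD") -> str:
    """The one place that decides how a price is displayed."""
    return f"{round_money(amount):,.2f} {currency}"


# Hypothetical call sites: each used to carry its own copy of the logic
# above; now they only delegate to the shared helpers.
def cart_total_label(items: list[Decimal]) -> str:
    return format_price(sum(items, Decimal("0")))


def invoice_line_label(unit_price: Decimal, quantity: int) -> str:
    return format_price(unit_price * quantity)
```

If the rounding rule changes, the edit happens in round_money and every screen picks it up, which is the "one-line edit instead of a scavenger hunt" in practice.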


3. No consistent architecture

This one is jarring to see for the first time. Two screens in the same project will be written as if by two different teams: one fetches data in the component, the other through a service layer; one holds state one way, the next does it completely differently; naming, folder structure, and error handling change from feature to feature. There is no spine.

A codebase with no consistent architecture is one where every file you open is a surprise. Onboarding a developer takes weeks because there is no pattern to learn — only a hundred special cases to memorize. Worse, when patterns conflict, the seams between them are exactly where bugs breed: data fetched two different ways, errors handled in one place and swallowed in another.

Why AI does this. The model has no persistent picture of "how this app is built." Each prompt is a fresh start, so it reaches for whatever pattern fits that one request. Over a project's life that produces a patchwork — every piece sensible alone, the whole thing incoherent.

How we fix it. We pick one architecture that fits the project — not a dogmatic one, a fitting one — and converge the codebase onto it incrementally. Data access, state, error handling, and naming get a single agreed shape. The goal is not purity; it is that a developer who learns one feature can predict how the next one works.
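
One possible way to give data access and error handling "a single agreed shape" is sketched below. The Result type, the UserRepository protocol, and profile_title are all hypothetical names; a real project would pick whatever shape fits its stack, and this is only one option.

```python
from dataclasses import dataclass
from typing import Generic, Optional, Protocol, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Result(Generic[T]):
    """One agreed shape for every data-access outcome."""
    value: Optional[T] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


class UserRepository(Protocol):
    """Every feature reads users through this interface, never directly."""
    def get_name(self, user_id: int) -> Result[str]: ...


class InMemoryUserRepository:
    """Stand-in implementation; a real one would call a database or an API."""
    def __init__(self, users: dict[int, str]) -> None:
        self._users = users

    def get_name(self, user_id: int) -> Result[str]:
        if user_id not in self._users:
            return Result(error=f"user {user_id} not found")
        return Result(value=self._users[user_id])


def profile_title(repo: UserRepository, user_id: int) -> str:
    """Screen-level code: no direct fetching, no ad hoc error handling."""
    result = repo.get_name(user_id)
    return f"Profile: {result.value}" if result.ok else f"Error: {result.error}"
```

Once every feature fetches through a repository and every failure arrives as a Result, a developer who has read one screen can predict the next.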


4. Stream-based state avoided — straight into callback hell

This is the most technically interesting failure, and the one that quietly breaks the hardest features. AI-generated code tends to avoid stream- and reactive-state models in favor of imperative callbacks. Instead of modeling "this value changes over time and the UI reacts," it wires up a callback, which triggers another callback, which sets a flag, which fires a third — and the result is callback hell.

For simple screens you barely notice. But the moment the state is genuinely complex — a multi-step form with cross-field validation, a live-updating dashboard, anything with debouncing, retries, cancellation, or optimistic updates — the callback approach falls apart. The classic symptom is the form that almost works: it validates, but the error clears at the wrong moment; it submits, but a double-tap fires it twice; a field updates, but a dependent field lags one keystroke behind. These are not random bugs. They are the inevitable result of coordinating complex async state by hand instead of modeling it as a stream.

Why AI does this. Imperative callbacks are the most common pattern in its training data and the easiest to generate one piece at a time. Reactive and stream-based models require holding the whole state machine in mind at once — exactly what a context-limited generator is worst at.

How we fix it. We identify the complex-state features and rebuild their state layer properly — as streams or a reactive state model appropriate to the stack — so the UI is a function of state rather than a pile of callbacks racing each other. The callback-hell features that "mostly worked" start working completely, and they become changeable without fear.
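
To make "the UI is a function of state" concrete, here is a deliberately tiny sketch of an observable store for a sign-up form. It is a stand-in for whatever reactive library the stack actually provides, and FormState, errors, and Store are hypothetical names. Validation is derived from the state on every change, and the in-flight guard lives in the state too.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class FormState:
    email: str = ""
    password: str = ""
    confirm: str = ""
    submitting: bool = False


def errors(state: FormState) -> dict[str, str]:
    """Derived from state alone, so it cannot lag behind a field."""
    found = {}
    if "@" not in state.email:
        found["email"] = "enter a valid email"
    if len(state.password) < 8:
        found["password"] = "use at least 8 characters"
    if state.confirm != state.password:
        found["confirm"] = "passwords do not match"
    return found


class Store:
    """Tiny observable store: every change flows through one update path."""

    def __init__(self, state: FormState) -> None:
        self.state = state
        self._subscribers: list[Callable[[FormState], None]] = []

    def subscribe(self, fn: Callable[[FormState], None]) -> None:
        self._subscribers.append(fn)
        fn(self.state)  # emit the current value immediately

    def update(self, **changes) -> None:
        self.state = replace(self.state, **changes)
        for fn in self._subscribers:
            fn(self.state)

    def submit(self, send: Callable[[FormState], None]) -> bool:
        # A second submit while one is in flight, or with errors, is rejected.
        if self.state.submitting or errors(self.state):
            return False
        self.update(submitting=True)
        try:
            send(self.state)
        finally:
            self.update(submitting=False)
        return True
```

A view subscribes once and re-renders from the state it receives. There is no separate "clear the error" callback to fire at the wrong moment, because the error set is recomputed from the current fields every time.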


5. No awareness of the deployment environment

The model writes code as if it will run as a single process on one machine — because from inside the prompt, that is the only environment it can see. It has no idea how many instances will run, what managed services already exist, how traffic is routed, or which work runs in parallel. So it defaults to the simplest possible topology, and that default quietly breaks the moment the app is deployed for real.

The symptoms are always the same. State that lives in process memory — a cache, sessions, rate-limit counters, an in-flight lock — works perfectly on one instance and silently diverges the moment a second replica comes up behind the load balancer. Background jobs and scheduled tasks fire on every instance instead of once, so the email goes out three times and the nightly cleanup runs in triplicate. Uploaded files get written to local disk that vanishes on the next deploy and was never shared between instances anyway. The code is correct for a single box and wrong for the cluster it actually runs on.

Why AI does this. It has no picture of your infrastructure. It does not know you already have Redis, a message queue, object storage, and a managed database — so it reimplements them in memory. It does not know you run three replicas, so it assumes one. The deployment topology is exactly the context a prompt cannot contain, so the model fills the gap with the simplest assumption and moves on.

How we fix it. We map the actual deployment (how many instances, which managed services exist, how work is parallelized) and move shared state to where it belongs: cache, sessions, and locks into Redis or the database, files into object storage, recurring work onto a real scheduler or queue that runs once instead of firing on every node. The result is code that scales horizontally: adding an instance adds capacity instead of corrupting state.
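
The "runs once instead of firing on every node" part can be sketched with a claim row in a shared database. The example below uses Python's built-in sqlite3 with an in-memory database purely as a stand-in for the shared database all replicas would connect to; the table name job_runs, the function names, and the replica names are hypothetical.

```python
import sqlite3


def init_schema(db: sqlite3.Connection) -> None:
    # The primary key guarantees at most one claim per job and run.
    db.execute(
        "CREATE TABLE IF NOT EXISTS job_runs ("
        " job_name TEXT NOT NULL,"
        " run_key TEXT NOT NULL,"
        " claimed_by TEXT NOT NULL,"
        " PRIMARY KEY (job_name, run_key))"
    )


def try_claim(db: sqlite3.Connection, job_name: str, run_key: str,
              instance: str) -> bool:
    """Return True only for the one instance that wins this run."""
    with db:  # commit on success, roll back on error
        cursor = db.execute(
            "INSERT OR IGNORE INTO job_runs (job_name, run_key, claimed_by)"
            " VALUES (?, ?, ?)",
            (job_name, run_key, instance),
        )
    return cursor.rowcount == 1  # 0 means another instance already claimed it


def run_nightly_cleanup(db: sqlite3.Connection, run_key: str,
                        instance: str, work_log: list[str]) -> None:
    if try_claim(db, "nightly-cleanup", run_key, instance):
        work_log.append(instance)  # stand-in for the real cleanup work
```

Every replica can fire the schedule; only the one whose insert succeeds does the work. A dedicated scheduler or queue is often the better long-term answer, but the same idea applies: the decision about who runs lives in shared storage, not in process memory.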


6. Security left wide open

Every audit finds the same security smells, and they are exactly the ones that cost real money: .env files committed to the repository, API keys and tokens hardcoded directly in source, credentials baked into client bundles. One leaked third-party key can run up thousands in unauthorized charges within hours — or hand an attacker your data store.

Alongside the secrets, the usual companions: authentication with no real authorization behind it (a login screen, but endpoints that never check who is asking), user input concatenated straight into queries and prompts, and sensitive data sitting in plain text or in logs. We covered the security side in depth for founders in what breaks when an AI prototype hits production, and the right way to handle data in our guide to encryption at rest and in transit.

Why AI does this. The shortest path to a working feature is almost never the secure one. Hardcoding a key works immediately; wiring up a secrets manager does not. The model picks the path that runs.

How we fix it. Secrets come out of source and into proper environment management, and keys get rotated: anything committed to git must be treated as compromised, because deleting the file later does not remove it from the repository history. Authorization checks go on every endpoint that needs them, and input gets validated and parameterized. This is the non-negotiable part of any audit. A leaked key or an auth bypass is an emergency, and it gets fixed first.
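
Three of those fixes fit in a short sketch: reading a secret from the environment and failing fast when it is missing, using a parameterized query instead of string concatenation, and checking ownership before returning data. The function names, the invoices table, and its columns are hypothetical, and sqlite3 is used only because it ships with Python.

```python
import os
import sqlite3


def require_secret(name: str) -> str:
    """Read a secret from the environment; fail fast if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def get_invoice(db: sqlite3.Connection, requester_id: int, invoice_id: int):
    """Parameterized query plus an ownership check on every read."""
    row = db.execute(
        # The ? placeholder keeps user input out of the SQL text itself.
        "SELECT id, owner_id, total FROM invoices WHERE id = ?",
        (invoice_id,),
    ).fetchone()
    if row is None:
        return None
    if row[1] != requester_id:
        # Authentication says who is asking; this check says whether they may.
        raise PermissionError("requester does not own this invoice")
    return {"id": row[0], "total": row[2]}
```

In a real service the ownership check usually lives in shared middleware or a policy layer rather than in each handler, so that no endpoint can forget it.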


The pattern behind the pattern

Step back and the six findings share one root cause: AI optimizes each generation locally, and nobody is optimizing the codebase globally. Tests, deduplication, architecture, state modeling, deployment awareness, and security are all whole-system properties. They cannot emerge one prompt at a time, because no single prompt can see the whole. That is precisely the gap a human review closes.

The reassuring part is that none of this means the AI-built foundation is wasted. The features work; the product is real. What is missing is the connective tissue — and adding it is far faster than rebuilding from scratch.


Frequently asked questions

What problems show up most often in AI-generated codebases?

Across dozens of AI-generated codebases, six engineering problems recur almost every time: no automated tests, heavily duplicated variables and functions, no consistent architecture across the project, complex async state implemented as callback hell instead of streams, code written with no awareness of the deployment environment, and security gaps such as committed .env files and hardcoded tokens. The specific code differs from project to project, but these six categories show up again and again.

Get the findings for your codebase

If you have an AI-built app and you recognize any of these six, you are not behind — you are exactly where almost every AI-generated codebase lands. The fix is not a rewrite; it is a focused audit that adds the connective tissue the AI could not.

Run your codebase through an AI code audit, or book a free assessment and we will tell you which of the six is your biggest risk, what it takes to fix, and give you a fixed-scope quote — not a guess.

Ilya Nixan

Founder & Lead Developer
