When I wrote about why I built Mneme HQ, the argument was theoretical. Architectural drift exists; AI agents have no memory; we need to fix that. Reasonable people could agree with the framing and still ask the harder questions: how often does drift actually happen, and what does Mneme actually catch?
Ninety days of telemetry across a handful of real projects gives me something more concrete to point at. The numbers below come from Mneme’s own dogfooding plus a small set of teams who opted in to anonymised reporting. They are not benchmarks. They are observations — a first attempt to put weight behind the claim that decision infrastructure matters.
“You cannot improve what you do not measure. The same is true of the governance layer itself.”
The shape of the data
Across the period, the instrumented projects produced roughly 11,400 Mneme check runs. Each run evaluates the current diff against the decisions stored in the repo and returns a PASS, WARN, or FAIL per decision. A single AI-assisted commit typically triggers between three and twelve evaluations, depending on how many decisions touch the changed files.
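To make the shape of a check run concrete, here is a minimal sketch of the loop described above. It is not Mneme's implementation: the `Decision` fields, the glob-based scoping, and the token matching are all assumptions chosen for illustration. It shows only the two ideas from the text, that a decision is evaluated when it touches a changed file, and that each evaluation returns PASS, WARN, or FAIL.

```python
from dataclasses import dataclass, field
from enum import Enum
from fnmatch import fnmatch


class Outcome(Enum):
    PASS = "PASS"
    WARN = "WARN"
    FAIL = "FAIL"


@dataclass
class Decision:
    """Hypothetical stand-in for a stored decision; not Mneme's real schema."""
    name: str
    scope: list[str]                                     # globs for files the decision governs
    banned: set[str] = field(default_factory=set)        # tokens that produce FAIL
    discouraged: set[str] = field(default_factory=set)   # tokens that produce WARN


def evaluate(decision: Decision, added_lines: list[str]) -> Outcome:
    """Naive substring check, standing in for whatever the real evaluator does."""
    text = "\n".join(added_lines)
    if any(token in text for token in decision.banned):
        return Outcome.FAIL
    if any(token in text for token in decision.discouraged):
        return Outcome.WARN
    return Outcome.PASS


def check_run(decisions: list[Decision], diff: dict[str, list[str]]) -> dict[str, Outcome]:
    """diff maps each changed file path to the lines added in that file."""
    results = {}
    for decision in decisions:
        touched = [path for path in diff
                   if any(fnmatch(path, pattern) for pattern in decision.scope)]
        if not touched:
            continue  # decision does not touch the changed files, so no evaluation
        added = [line for path in touched for line in diff[path]]
        results[decision.name] = evaluate(decision, added)
    return results


# Illustrative data only; decision names and tokens are made up.
decisions = [
    Decision("approved-http-client", ["src/*"],
             banned={"import requests"}, discouraged={"import httpx"}),
    Decision("controllers-use-services", ["src/api/*"], banned={"Repository("}),
    Decision("single-database", ["src/storage/*"], banned={"import pymongo"}),
]
diff = {"src/api/users.py": ["import httpx", "user = user_service.get(user_id)"]}

results = check_run(decisions, diff)
for name, outcome in results.items():
    print(name, outcome.value)
```

In this toy diff, two of the three decisions touch the changed file, so the commit triggers two evaluations; the storage decision is skipped entirely. That scoping is why the number of evaluations per commit varies with the files changed.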
The headline distribution looked like this:
| Outcome | Count | Share |
|---|---|---|
| PASS | 9,612 | 84.3% |
| WARN | 1,341 | 11.8% |
| FAIL | 447 | 3.9% |
A 3.9% hard-fail rate is small in isolation but consequential in aggregate. Across the cohort it meant 447 evaluations in which an AI-assisted change violated a recorded architectural decision before review, nearly five per day over the ninety days. Without the gate, those changes would have reached human reviewers carrying a class of mistake that humans are demonstrably bad at catching: pattern-level violations buried in plausible-looking code.
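The shares and the per-day figure follow directly from the counts in the table and the ninety-day window. A short check, using only numbers stated above (the variable names are mine):

```python
# Outcome counts from the table above.
counts = {"PASS": 9_612, "WARN": 1_341, "FAIL": 447}
PERIOD_DAYS = 90  # length of the observation window

total = sum(counts.values())
shares = {outcome: round(100 * n / total, 1) for outcome, n in counts.items()}
fails_per_day = counts["FAIL"] / PERIOD_DAYS

print(total)                    # 11400
print(shares)                   # {'PASS': 84.3, 'WARN': 11.8, 'FAIL': 3.9}
print(round(fails_per_day, 2))  # 4.97
```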
What gets violated most
The failures clustered around three categories. Together they account for 78% of all FAILs:
| Category | FAIL share | Typical example |
|---|---|---|
| Dependency policy | 34% | Adding a library that an ADR explicitly excluded (often a near-equivalent of one already approved) |
| Layering / boundary | 27% | Calling a repository directly from a controller, bypassing a service layer locked in months earlier |
| Storage / persistence | 17% | Reaching for a second database, adding an ORM alongside an existing query builder, or introducing a redundant cache |
None of these are exotic. They are exactly the kinds of decisions that get made in a Slack thread on a Tuesday and forgotten by the following month. The agent has no way to know about the Slack thread. The agent has every way to know about the library it has seen ten thousand times in training.
Where the gate pays for itself
The cheapest place to catch an architectural violation is before it gets reviewed by a person. The next cheapest is in code review. The most expensive is after merge, when the violation is now load-bearing and unwinding it means coordinating across the team.
Of the 447 hard fails, 412 (92%) were resolved by the original author inside the same session, usually by adjusting the prompt, choosing a different approach, or accepting the rule and reworking the change. Twenty-six were resolved by editing the ADR itself (the rule was wrong; this is healthy). Nine were waived with an inline override and a comment, and all nine went to human review.
The interesting number is the 92%. In a process without the gate, almost all of those would have shipped to review carrying the violation. Reviewers would have caught some, missed others, and asked questions about a third group that wasted everyone’s time. The gate compresses that whole loop into a CI run.
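The three resolution paths partition the hard fails, which is easy to verify. The bucket labels below are my own shorthand; the counts are the ones reported above.

```python
TOTAL_FAILS = 447

# Resolution buckets for the hard fails; labels are shorthand, counts are from the text.
resolutions = {
    "fixed in session": 412,
    "ADR edited": 26,
    "waived with override": 9,
}

# The buckets should account for every hard fail exactly once.
assert sum(resolutions.values()) == TOTAL_FAILS

resolution_shares = {
    label: round(100 * n / TOTAL_FAILS, 1) for label, n in resolutions.items()
}
print(resolution_shares)
# {'fixed in session': 92.2, 'ADR edited': 5.8, 'waived with override': 2.0}
```

Read this way, only the nine waived changes reached a reviewer with the violation still present, and each of those carried an explicit override and comment.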
What surprised me
Three things did not match my prior beliefs.
WARN volume is the real signal, not FAIL. Warnings are where the conversation lives: a new dependency that isn’t banned but isn’t approved, a pattern that is allowed but discouraged. Teams that engaged with WARNs (either by promoting them to FAILs or by approving the new pattern as a decision) ended up with sharper decision sets. Teams that ignored them drifted more slowly than they would have without Mneme, but they still drifted.
Decisions decay. About 18% of ADRs created at the start of the period had been edited, retired, or replaced by day 90. This is fine. The point of writing them down is that you can change them deliberately. The problem with prompt-only governance is that you cannot.
Most violations were not in the agent’s output. They were in human edits made in response to agent output. The agent suggested something near a boundary; the human accepted it and then extended it past the boundary. Governance has to bind both sides of the interaction or it binds neither.
What this does not show
The data above is not a controlled study. It does not tell you whether Mneme makes teams faster, whether it reduces bugs in production, or whether the results would replicate at a Fortune 500 company. It tells you that on the projects where it ran, the system caught a non-trivial volume of pattern violations that would otherwise have reached review or merge, and that the violations clustered in predictable, fixable categories.
That is the case for governance infrastructure stated as observation rather than argument. The next ninety days will tell a longer story; I will write that one too.

© Theo Valmis