Summary

Prompt engineering and governance are not the same thing. Prompting an AI to follow your architecture is a request. Governance enforces it. Prompts are session-scoped, manually maintained, inconsistently applied, and produce no machine-verifiable output. Real governance requires durable, machine-readable constraints that can be validated programmatically — not instructions that the model may or may not follow in any given session.

There is a category confusion running through most engineering teams’ approach to AI governance. The confusion is between making information available to a model and enforcing a constraint on its output. These are related but not equivalent. The first is what prompt engineering does. The second is what governance requires. Conflating them produces a governance model that looks functional until it is tested under real conditions, at which point it fails in ways that are invisible until they compound.

The prompt engineering approach to AI governance looks roughly like this: you write a detailed system prompt or a CLAUDE.md or a Cursor rules file that describes your architectural decisions. You tell the model what patterns to use, what dependencies are approved, what structures to follow. The model sees this context at the start of a session and generates code with it in view. Some percentage of the time — a high percentage, in favourable conditions — the model follows the guidance. You have solved the problem, or so it appears.

“A decision that can be ignored isn’t a decision. It’s a suggestion.”

What you have actually built is a context injection mechanism with no enforcement layer. The difference matters because the failure modes are different. A system with no enforcement layer does not produce clear failures. It produces drift — gradual, invisible, compounding — that is only visible in aggregate, long after the individual decisions that caused it.

The Four Limits of Prompt-Based Constraint Enforcement

Prompt engineering as governance fails at scale for four distinct reasons. Each is independent. Solving one does not solve the others.

Prompts are session-scoped. A constraint in a system prompt exists in that session. If the session ends and a new one begins without the same context, the constraint does not exist. In practice, sessions restart, context is not carried over, and developers working across multiple projects forget to apply the same configuration. The constraint is applied inconsistently — enforced in the sessions where someone remembered to include it, absent in the sessions where they did not. That inconsistency is invisible until you audit.

Prompts are not version-controlled as decisions. A CLAUDE.md file is a text file in your repository. It can be version-controlled. But there is a difference between version-controlling a text file and version-controlling an architectural decision with a structured schema, a severity level, a scope, and an enforcement mechanism. The CLAUDE.md can grow stale, contradict itself, be modified without the same review rigour as a decision record, and contain instructions the model may or may not follow. It is a documentation artefact, not a governance layer.

Prompts degrade as context fills. During a long coding session, context windows fill. As they fill, the model’s attention to instructions earlier in the context decreases. Constraints specified in a system prompt at the beginning of a session receive less weight by the hundredth generation turn than they did at the first. This is a known behaviour of transformer-based models, not a speculative risk. The governance that appeared to be working during a short session degrades during a long one.

Governance must be verifiable. A prompt is not verifiable.

Prompts cannot be validated against output. This is the most fundamental limit. When you specify a constraint in a prompt, there is no mechanism to determine, after generation, whether the model complied. You can read the generated code and judge manually — but that is not verification, it is review, and it has all the throughput limitations of human review. You cannot run a prompt constraint check and receive a PASS or FAIL. You cannot fail a build because a prompt constraint was violated. You cannot audit compliance across a codebase. The feedback loop does not close.

What These Limits Produce at Scale

In a small team working on a single project with a disciplined prompt practice, the limits described above are manageable. Someone is responsible for maintaining the context configuration. Sessions are short enough that context degradation is not severe. The same person who wrote the constraint reviews the generated code and can apply judgement about whether it was respected.

That scenario dissolves at scale. When ten developers are generating code independently across multiple projects, using different AI tools with different context configurations, the consistency of prompt-based constraint enforcement approaches zero. Not because developers are careless. Because the mechanism was never designed to work consistently across independent actors. It was designed for individual sessions, not organisational governance.

The failure mode at scale is not a visible breakage. Tests still pass. Features still ship. The codebase continues to function. What accumulates is a set of architectural violations that individually look like acceptable implementation choices and collectively represent a divergence from the decisions the team made about how the system should be structured. A second ORM adapter introduced because the model defaulted to it and nobody caught it in review. A security-adjacent pattern modified because the constraint was not in context during a long session. A structural rule violated because the developer joining the project six months in never saw the original decision and the prompt they were given did not mention it.

By the time the drift is visible, its causes are distributed across hundreds of sessions and dozens of decisions. The remediation is expensive. The post-mortem is inconclusive. Nobody made a bad decision. The infrastructure was inadequate.

What Governance Actually Requires

The properties that prompt engineering cannot deliver are exactly the properties that governance requires. They are not aspirational. They are definitional.

Durability. A governance constraint must persist. It cannot be session-scoped. It must exist in a form that survives session boundaries, context resets, and developer turnover. The appropriate form is a version-controlled record in the repository itself, co-located with the code it governs, subject to the same review process as any other architectural artefact.
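To make “structured schema, severity level, scope” concrete, here is a minimal sketch of what such a version-controlled decision record could look like, expressed in Python. The field names and the example rule are illustrative assumptions, not any particular tool’s format.

```python
# A minimal sketch of a machine-readable architectural decision record.
# Field names (id, severity, scope, forbidden_patterns) are illustrative
# assumptions, not a reference to any specific tool's schema.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DecisionRecord:
    id: str                # e.g. "ADR-0042"
    title: str             # human-readable summary of the decision
    severity: str          # "error" fails a build, "warning" only reports
    scope: str             # glob for the files the decision governs
    forbidden_patterns: list[str] = field(default_factory=list)  # regexes that must not appear


# Example: the team decided on a single ORM; introducing a second adapter is a violation.
SINGLE_ORM = DecisionRecord(
    id="ADR-0042",
    title="SQLAlchemy is the only approved ORM",
    severity="error",
    scope="src/**/*.py",
    forbidden_patterns=[r"^\s*(from|import)\s+peewee", r"^\s*(from|import)\s+tortoise"],
)
```

Because the record lives in the repository as structured data rather than prose, it survives session boundaries and can be reviewed, versioned, and evaluated like any other artefact.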

Consistency. A governance constraint must be applied uniformly. It cannot depend on individual developers remembering to include it. It must be applied to every relevant generation, by every developer, regardless of which tool they are using or how they configured their session. This requires a mechanism that is independent of individual session configuration — tooling-level enforcement, not prompt-level guidance.

Verifiability. A governance constraint must be verifiable against output. The fundamental test of a governance layer is whether it can produce a machine-readable PASS or FAIL. Constraints that can only be evaluated by human judgement are not governance constraints; they are review criteria. Review criteria have a role, but they are not the same thing as governance constraints and do not substitute for them.
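As a sketch of what a machine-readable PASS or FAIL can mean in practice, the toy check below evaluates a record like the one sketched above against a piece of generated code. It is an illustrative regex check under that assumed schema, not a production validator.

```python
# Toy verifiability check: does a piece of code comply with a decision record?
# Assumes the illustrative DecisionRecord schema sketched earlier.
import re


def check_compliance(record, source_text: str) -> tuple[str, list[str]]:
    """Return ("PASS" | "FAIL", list of violation messages)."""
    violations = []
    for pattern in record.forbidden_patterns:
        for lineno, line in enumerate(source_text.splitlines(), start=1):
            if re.search(pattern, line):
                violations.append(f"{record.id}: line {lineno} violates '{record.title}'")
    return ("FAIL" if violations else "PASS", violations)


status, messages = check_compliance(SINGLE_ORM, "import peewee\n")
print(status)  # FAIL, deterministically, regardless of what the model was told
```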

Independence from model compliance. A governance layer must not depend on the model choosing to comply. Models are probabilistic. They follow instructions most of the time, in favourable conditions, with the right context. “Most of the time” is not an acceptable reliability level for architectural constraints. Governance must be enforced independent of whether the model followed the guidance it was given.

The Relationship Between Prompting and Governance

None of this is an argument that prompt engineering has no role in AI-assisted development governance. It has a significant role, in the right layer.

Context injection — surfacing the relevant architectural decisions in the model’s context before generation begins — reduces the probability of violations. A model that knows about a constraint is less likely to violate it than a model that does not. That probability reduction is valuable. It reduces the number of violations that reach the enforcement layer, which reduces the friction of working with the governance system and makes it easier to maintain.

But reducing the probability of violations is not the same as enforcing a constraint. Probability reduction belongs in the generation layer. Enforcement belongs in a separate layer, downstream, that validates generated output against the constraints regardless of what the model was told. The two layers work together: context injection reduces the rate of violations, enforcement catches the ones that occur anyway.
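A minimal sketch of how the two layers could fit together, reusing the illustrative pieces above. Here generate() is a placeholder for whichever model API is in use; the shape of the flow is an assumption, not a prescribed implementation.

```python
# Two layers: context injection before generation, enforcement after.
# generate() is a placeholder for whatever model API the team uses.

def inject_context(task_prompt: str, records) -> str:
    """Layer 1: surface the decisions in the prompt. Reduces violation probability only."""
    rules = "\n".join(f"- [{r.id}] {r.title}" for r in records)
    return f"Architectural decisions you must respect:\n{rules}\n\nTask:\n{task_prompt}"


def governed_generation(task_prompt: str, records, generate) -> str:
    """Layer 2: validate the output regardless of whether the model complied."""
    output = generate(inject_context(task_prompt, records))
    failures = []
    for record in records:
        status, messages = check_compliance(record, output)
        if status == "FAIL":
            failures.extend(messages)
    if failures:
        raise ValueError("Generated code violates decisions:\n" + "\n".join(failures))
    return output
```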

The mistake is building only the first layer and treating it as governance. The result is a system that performs well under test conditions — short sessions, consistent context, cooperative generation tasks — and degrades invisibly under real conditions, where sessions are long, context is inconsistent, and the volume of generation is high enough to produce violations at a meaningful rate even when the probability per generation is low.

The Cost of Conflating the Two

The cost of treating prompt engineering as governance is not primarily the violations it fails to catch. It is the false confidence it produces that violations are being caught at all. A team that has invested in a careful prompt practice believes it has solved its AI governance problem. The codebase continues to drift in ways the team’s existing tooling cannot surface, because the tooling was designed to reduce the probability of drift, not to detect and flag it.

That false confidence is more expensive than acknowledged ignorance. A team that knows it has no governance layer will notice the symptoms — inconsistency, architectural violations in review, dependencies that should not have been introduced — and attribute them correctly. A team that believes its prompt practice is sufficient will attribute the same symptoms to other causes: developer carelessness, inadequate review, insufficient documentation.

The right frame is one that distinguishes clearly between what prompting can do (increase the probability that the model follows guidance) and what governance requires (enforce constraints independent of model compliance). Prompting is a valuable input to the generation layer. Governance is a separate layer that the generation layer feeds into. Building only one and expecting both is the category confusion that most teams are currently making.

Key Takeaways

Prompting a model to follow your architecture is a request; governance is enforcement of that request.
Prompt-based constraints are session-scoped, inconsistently applied, degrade as context fills, and cannot be validated against output.
Governance requires constraints that are durable, consistently applied, machine-verifiable, and independent of model compliance.
Context injection belongs in the generation layer; enforcement belongs in a separate layer that validates output regardless of what the model was told.

FAQ

What’s wrong with a detailed CLAUDE.md for governance?
A CLAUDE.md is a context injection tool, not an enforcement mechanism. It increases the probability that the model follows guidance in a given session. It cannot be validated against output, does not apply consistently across developers, degrades as context fills, and produces no machine-readable compliance signal. It is a useful input to the generation layer, not a substitute for a governance layer.
If prompting can’t enforce constraints, what can?
Programmatic validation: structured, machine-readable decision records that a CLI or CI pipeline can evaluate against generated code. The check produces a deterministic PASS or FAIL independent of what the model was told and independent of whether it followed the guidance.
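For illustration, a check like the toy one sketched earlier could be wired into a build roughly as follows. The record list and file glob are placeholders for a real repository’s configuration, not a specific CLI.

```python
# Sketch of a CI entry point: exit non-zero if any governed file violates any record.
# RECORDS and the scope glob are placeholders for a real repository's configuration.
import sys
from pathlib import Path

RECORDS = [SINGLE_ORM]  # in practice, loaded from version-controlled decision files


def main() -> int:
    failures = []
    for record in RECORDS:
        for path in Path(".").glob(record.scope):
            status, messages = check_compliance(record, path.read_text())
            if status == "FAIL":
                failures.extend(f"{path}: {m}" for m in messages)
    for message in failures:
        print(message)
    return 1 if failures else 0  # non-zero exit fails the build


if __name__ == "__main__":
    sys.exit(main())
```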
Does this mean teams should stop writing detailed system prompts?
No. Context injection remains valuable as one layer of a governance approach. The point is to add enforcement as a second, independent layer — not to remove the first. Both layers are needed; neither substitutes for the other.
Is this problem specific to Claude and Cursor, or does it apply to all AI coding tools?
All AI coding tools. The failure mode — session-scoped, model-dependent, unverifiable constraint enforcement — is a property of how large language models process context, not a specific product limitation. The enforcement layer solution is similarly tool-independent.