Summary

Data lakes were designed for cheap storage and batch analytics. AI has structurally different requirements: data freshness, schema consistency, retrieval-oriented design, and governance strong enough to prevent model bias. Most enterprises discover the mismatch only after attempting to build AI features on lake data and finding the outputs stale, inconsistent, and difficult to audit. The lake does not need to be replaced. But it needs to be extended with layers it was never designed to include — and understanding which layers are missing is the prerequisite for building AI infrastructure that actually works.

The data lake was one of the more consequential ideas in enterprise data infrastructure of the last decade. The proposition was direct: store everything cheaply, and figure out the structure later. Schema-on-read rather than schema-on-write. Raw data preserved in its original form, transformations applied at query time. For the analytics problems that dominated in the Hadoop and early Spark era — batch processing, ad-hoc exploration, historical analysis — this was a defensible architectural choice. It solved the problem of the time.

AI has a different set of problems. And most data lakes are not equipped to solve them.

This is not a criticism of the organisations that built lakes. It is a statement about architectural fit. A lake built to answer “what happened last quarter?” is not the same infrastructure as what is needed to answer “what is true about this entity right now, and how confident are we?” The workloads are different. The requirements are different. The failure modes are different.

What Data Lakes Were Actually Built For

The data lake emerged from a specific set of constraints. Storage had become cheap enough that retaining raw data indefinitely was economically viable. Distributed processing frameworks made it possible to run large-scale transformations over that data without the schema-up-front cost that data warehouses required. The dominant question organisations wanted to answer was retrospective: what did our customers do, how did our campaigns perform, what trends are present in the last six months of data.

The schema-on-read design choice was the key enabler. Rather than imposing a schema at ingestion — which required agreeing on structure before you knew what questions you would want to ask — the lake let data land in its raw form. The analyst applied schema at query time, fitting the data to the question rather than fitting the question to the data. This flexibility was genuinely useful for exploratory analytics.

It was also the origin of most of the problems that followed.

Schema flexibility at ingestion means schema inconsistency at scale. Tables accumulate without owners. Column definitions drift as upstream producers change formats. Multiple tables represent the same concept with subtly different logic. Analysts who built their workflows on these tables carry the reconciliation knowledge in their heads, not in documentation. The lake becomes, over time, a data swamp: vast, full of information, and navigable only by those who know which paths to avoid.
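As a rough illustration of how this drift can be surfaced rather than carried in analysts' heads, the sketch below infers the schema of two raw batches and reports what changed between them. The batch contents, field names, and helper functions are hypothetical, chosen only to show the idea.

```python
import json

# Hypothetical raw batches from the same upstream producer, months apart.
BATCH_EARLY = [
    '{"customer_id": 1, "amount": 19.99}',
    '{"customer_id": 2, "amount": 5.0}',
]
BATCH_LATE = [
    '{"customer_id": "3", "amount_cents": 1999}',
    '{"customer_id": "4", "amount_cents": 500}',
]


def infer_schema(raw_lines):
    """Map each field name to the set of Python type names seen in a batch."""
    schema = {}
    for line in raw_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def schema_drift(old, new):
    """Report fields that were added, removed, or changed type between batches."""
    shared = old.keys() & new.keys()
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(f for f in shared if old[f] != new[f]),
    }


drift = schema_drift(infer_schema(BATCH_EARLY), infer_schema(BATCH_LATE))
print(drift)
```

Both batches parse without error under schema-on-read, which is exactly why the change goes unnoticed unless something compares them.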

What AI Actually Needs

AI workloads — whether training models, running inference, or powering retrieval-augmented generation pipelines that surface organisational knowledge — have requirements that the data lake was not designed to meet.

Freshness. A recommendation engine running on yesterday’s batch ingestion does not know what the user did this morning. A RAG pipeline retrieving from a lake updated twelve hours ago answers questions about a state of the world that no longer exists. For many AI features, that staleness is not a minor inaccuracy. It is the difference between a useful output and a misleading one. Batch-first architecture introduces systematic latency that analytics workloads could tolerate and AI workloads often cannot.
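One way to make that tolerance explicit is to attach a maximum age to each AI feature and filter on it before records reach a model or retriever. The following is a minimal sketch under that assumption; the `updated_at` field and the one-hour threshold are illustrative, not taken from any particular system.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, max_age, now=None):
    """Return True when a record is older than the feature can tolerate."""
    now = now or datetime.now(timezone.utc)
    return (now - last_updated) > max_age


def fresh_only(records, max_age, now=None):
    """Keep only records recent enough to hand to a model or retriever."""
    return [r for r in records if not is_stale(r["updated_at"], max_age, now)]


# Fixed clock so the example is reproducible.
now = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
records = [
    {"id": "a", "updated_at": datetime(2024, 1, 2, 8, 30, tzinfo=timezone.utc)},
    {"id": "b", "updated_at": datetime(2024, 1, 1, 21, 0, tzinfo=timezone.utc)},  # twelve hours old
]
usable = fresh_only(records, max_age=timedelta(hours=1), now=now)
print([r["id"] for r in usable])
```

A filter like this does not make the data fresher. It only turns silent staleness into a visible gap, which is usually the more honest failure.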

Schema consistency. Embedding-based retrieval — one of the foundational techniques in AI data pipelines — requires knowing what a field means before you embed it. Schema-on-read, where structure varies between records and is applied at query time, is architecturally incompatible with this. To build a reliable vector index over your product catalogue, customer records, or support history, those records need consistent field definitions. A column that means one thing in one batch and something slightly different in the next cannot be reliably embedded. The inconsistency does not announce itself. It distributes into retrieval errors.

Lineage and provenance. AI outputs need to be auditable. When a model produces a recommendation or a RAG pipeline surfaces an answer, the question “where did this data come from?” needs a traceable answer — both for debugging and for compliance. Data lakes, which often accumulate data without strong lineage tracking, make this difficult. The transformation history from raw data to model input is frequently undocumented, reconstructed from code rather than recorded as metadata.

Governance. The consequences of bad data differ between analytics and AI in a way that changes how seriously governance needs to be taken. A wrong number in a BI dashboard is a visible, correctable error. A bad training example biases a model in ways that may not surface in any individual output — the model learns from it, and the effect distributes across its behaviour. A misdefined metric retrieved by a RAG pipeline produces a confident, plausible wrong answer.

Bad data in a BI pipeline produces a wrong number. Bad data in a training pipeline produces a biased model. The stakes are structurally different.

The Four Structural Mismatches

The incompatibility between data lake architecture and AI requirements can be reduced to four mismatches. Each is independent. Each has a different fix. And most organisations that discover they have a problem discover all four at once.

Batch versus freshness. Lakes are designed around batch ingestion: data arrives on a schedule, the lake is updated periodically, and queries run against a snapshot. For analytics, this is fine. For AI features that need to reflect current state, it introduces latency that the architecture does not provide a natural way to reduce. Incremental or streaming pipelines are an addition to the lake, not a feature of it.

Schema flexibility versus schema consistency. The lake’s tolerance for heterogeneous schema was a feature for exploratory analytics. For AI pipelines, it is a liability. Embedding requires consistent field definitions. Retrieval requires stable schema. Schema-on-read environments, where structure varies by record or by batch, cannot guarantee either. The fix requires schema contracts: upstream producers commit to a structure; downstream consumers are protected from breaking changes.

Storage-first versus retrieval-first. Data lakes optimise for write and storage costs: data is cheap to ingest, and retrieval is a secondary concern. AI retrieval pipelines optimise for relevance and latency. The index structures, caching strategies, and data layout required for fast, accurate retrieval are largely orthogonal to the organising principles of a cost-optimised storage system.
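The difference can be sketched with a toy example: records laid out by ingestion date, as a batch job would write them, compared with a lookup structure keyed by entity. The partition layout, field names, and functions are hypothetical, and a production retrieval layer would use a real index or vector store rather than an in-memory dictionary.

```python
from collections import defaultdict

# Hypothetical lake layout: records grouped by ingestion date.
PARTITIONS = {
    "2024-01-01": [
        {"entity_id": "cust-1", "event": "signup"},
        {"entity_id": "cust-2", "event": "signup"},
    ],
    "2024-01-02": [
        {"entity_id": "cust-1", "event": "purchase"},
    ],
}


def scan_for_entity(partitions, entity_id):
    """Storage-first access: touch every partition to find one entity."""
    return [
        record
        for date in sorted(partitions)
        for record in partitions[date]
        if record["entity_id"] == entity_id
    ]


def build_entity_index(partitions):
    """Retrieval-first access: pay once to organise records by lookup key."""
    index = defaultdict(list)
    for date in sorted(partitions):
        for record in partitions[date]:
            index[record["entity_id"]].append(record)
    return index


index = build_entity_index(PARTITIONS)
print(index["cust-1"])
```

Both paths return the same records. The scan costs a pass over the whole dataset on every request, while the index moves that cost to build time, which is the trade a retrieval-oriented layer makes.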

Governance-optional versus governance-required. Many data lakes accumulated over years without strong governance. Table ownership was unclear. Metric definitions were undocumented. Multiple versions of the same business concept existed with different logic. For BI workloads, this was a manageable nuisance — analysts learned the landscape. For AI workloads, ungoverned data is a structural risk. Models learn from it. RAG pipelines retrieve from it. The inconsistency enters the output layer.

Data lakes optimised for storage cost. AI infrastructure must optimise for data quality, schema consistency, and retrieval precision. These are different cost functions from the ground up.

The Data Swamp, Accelerated

The “data swamp” problem — a lake that has accumulated without governance until it is full of undocumented, poorly owned, inconsistently defined data — is not new. Most mature data lakes have some degree of it. It was, for analytics teams, a manageable problem: the engineers who built the tables knew their quirks, and institutional knowledge filled the governance gap.

AI workloads accelerate the problem in two ways. The first is scale of consumption. A BI analyst querying a single table encounters the governance problems of that table. A model training on the entire lake encounters the governance problems of every table at once. Inconsistencies that were previously localised and known all become active inputs to a system that cannot distinguish a misdefined column from a correctly defined one.

The second is output visibility. BI errors are visible inside the organisation, to the analysts who produced them. AI errors surface in user-facing features, in model outputs, in answers given to customers. The governance problem that was previously an internal nuisance becomes an external reliability question.

The Lakehouse Is a Step, Not a Solution

The lakehouse model — implemented through Delta Lake, Apache Iceberg, or Apache Hudi — addresses the most acute structural problems of the raw lake by adding ACID transaction support, schema enforcement at the storage layer, and time travel for data versioning. For organisations already heavily invested in lake infrastructure, migrating to a lakehouse format is a sensible step and genuinely improves AI readiness.

What the lakehouse does not provide is the semantic layer: agreed definitions for business entities and metrics, enforced consistently across all consuming systems. Without it, the tables in your lakehouse may have consistent schema within themselves but inconsistent meaning across the organisation. One team’s “customer” is another team’s “account.” One team’s “revenue” excludes refunds; another’s includes them. A RAG pipeline that retrieves across both produces answers calibrated to whichever definition happens to be in the retrieved context.

A data lake without a semantic layer is infrastructure without a contract. AI makes the cost of that missing contract visible at every retrieval.

The lakehouse also does not automatically provide freshness-aware pipelines for the data AI features will consume. It provides better tooling for managing incremental updates, but the pipelines themselves still need to be built. And it does not provide the retrieval-oriented data organisation that AI retrieval systems benefit from — the partitioning strategies, the vector index structures, the caching layers that reduce retrieval latency to something workloads can depend on.

What AI-Ready Data Infrastructure Actually Requires

The practical response to this is not to tear out the lake and start again. It is to identify what the lake is missing and add it incrementally, starting with the data that AI workloads will actually consume first.

The first priority is schema contracts for AI-consumed data. Upstream producers of the tables that AI features will read should commit to a schema: field names, types, and definitions that downstream consumers can depend on. Violations should be caught at ingestion rather than discovered at query time. This does not require migrating the entire lake — it requires applying schema governance to the specific tables that matter first.
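A schema contract of this kind can be sketched in a few lines: a declared set of fields and types, checked at ingestion, with violating rows set aside together with the reason. The contract, table, and function names below are hypothetical, and a real deployment would more likely rely on the enforcement features of its table format or a validation library.

```python
# Hypothetical contract for a customer table that an AI feature reads.
CUSTOMER_CONTRACT = {"customer_id": int, "email": str, "is_active": bool}


def contract_violations(record, contract):
    """List every way a record departs from the agreed schema."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif type(record[field]) is not expected:  # exact match, so True is not accepted as an int
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field in sorted(record.keys() - contract.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


def ingest(records, contract):
    """Split a batch into accepted rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = contract_violations(record, contract)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"customer_id": 1, "email": "a@example.com", "is_active": True},
    {"customer_id": "2", "email": "b@example.com", "is_active": True},
    {"customer_id": 3, "email": "c@example.com"},
]
accepted, rejected = ingest(batch, CUSTOMER_CONTRACT)
for record, problems in rejected:
    print(record["customer_id"], problems)
```

The point of the sketch is where the failure happens: at the boundary, with a reason attached, rather than downstream inside an embedding or a prompt.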

The second priority is a semantic layer: agreed definitions for the business entities and metrics that AI outputs will reference. What counts as an active customer, how churn is defined, what the canonical revenue figure is. These definitions need to be enforced at the data layer, not left to each consuming model or pipeline to interpret independently. The semantic layer is the part of data infrastructure that organisations most frequently defer and most frequently regret deferring when AI workloads land on inconsistent definitions.
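In its simplest form, a semantic layer is a registry where each business metric has one owned, documented definition that every consumer resolves by name. The sketch below assumes a hypothetical `revenue` metric defined net of refunds; the registry, the owner name, and the field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    description: str
    owner: str
    compute: Callable[[list], int]


def net_revenue_cents(orders):
    """Canonical revenue in this sketch: order value minus refunds, in cents."""
    return sum(o["amount_cents"] - o.get("refunded_cents", 0) for o in orders)


# Hypothetical registry; every consumer resolves "revenue" here rather than re-deriving it.
SEMANTIC_LAYER = {
    "revenue": MetricDefinition(
        name="revenue",
        description="Order value net of refunds, in cents",
        owner="finance-data",
        compute=net_revenue_cents,
    ),
}


def metric(name, rows):
    """Compute a metric through its single agreed definition."""
    return SEMANTIC_LAYER[name].compute(rows)


orders = [
    {"amount_cents": 10000, "refunded_cents": 2500},
    {"amount_cents": 5000},
]
gross = sum(o["amount_cents"] for o in orders)  # what a team ignoring refunds would report
net = metric("revenue", orders)
print(gross, net)
```

The gap between the two figures is the disagreement a RAG pipeline would otherwise inherit, depending on which team's table it happened to retrieve from.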

The third priority is freshness-aware pipelines for the data that AI features need to be current. Not a full streaming migration, but incremental processing for the specific entities where staleness produces meaningfully worse AI outputs. The scope is usually smaller than it first appears: most AI features are sensitive to staleness in a small number of high-priority entities, not across the entire data estate.
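A common pattern for incremental processing is a high-water mark: each run picks up only the rows changed since the previous run and then advances the mark. The sketch below assumes each row carries an `updated_at` timestamp; that field and the helper names are hypothetical.

```python
from datetime import datetime, timezone


def incremental_batch(rows, watermark):
    """Select rows changed since the last run and advance the watermark."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark


def ts(hour):
    """Helper for readable timestamps on a fixed example day."""
    return datetime(2024, 1, 2, hour, 0, tzinfo=timezone.utc)


source = [
    {"entity_id": "cust-1", "updated_at": ts(6)},
    {"entity_id": "cust-2", "updated_at": ts(8)},
    {"entity_id": "cust-3", "updated_at": ts(9)},
]

# First run starts from a watermark of 07:00, so only later changes are processed.
first_run, watermark = incremental_batch(source, watermark=ts(7))
# A second run with no new changes does no work and leaves the watermark alone.
second_run, watermark_after = incremental_batch(source, watermark)
print([r["entity_id"] for r in first_run], second_run)
```

Run frequently against a handful of high-priority entities, a loop like this narrows the staleness window without touching the rest of the batch estate.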

The fourth priority is data lineage for AI-consumed tables: documentation of the transformation path from raw source to the table the model or pipeline reads. Not comprehensive retroactive lineage across the whole lake, but prospective lineage built into the pipelines producing AI-relevant data from this point forward. When an AI output is wrong, lineage is how you find out why.
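Prospective lineage can start as a small record written by each pipeline run, noting what it read, what it produced, and which version of the code did the work. The sketch below assumes that shape and an acyclic graph of tables; all table names, job names, and version strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    output_table: str
    input_tables: tuple
    transform: str       # name of the job or query that produced the table
    code_version: str    # e.g. a commit hash, so the logic can be recovered later


def raw_sources(table, lineage):
    """Walk lineage upstream and return the tables that have no recorded parent.

    Assumes the lineage graph is acyclic.
    """
    record = lineage.get(table)
    if record is None:
        return {table}
    sources = set()
    for parent in record.input_tables:
        sources |= raw_sources(parent, lineage)
    return sources


# Hypothetical table and job names.
LINEAGE = {
    r.output_table: r
    for r in [
        LineageRecord("customer_features", ("customers_clean", "orders_clean"), "build_features", "abc123"),
        LineageRecord("customers_clean", ("raw_crm_export",), "clean_customers", "abc123"),
        LineageRecord("orders_clean", ("raw_orders",), "clean_orders", "def456"),
    ]
}
sources = raw_sources("customer_features", LINEAGE)
print(sorted(sources))
```

With records like these in place, the question of where a model input came from becomes a lookup rather than an exercise in reading pipeline code.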

None of this is a rewrite. It is a prioritised extension of existing infrastructure, applied to the data that matters for AI first. The governance coverage grows as AI workloads expand. The lake remains; the layers it was never designed to include get built on top of it.

FAQ

We have a well-maintained data warehouse, not a lake. Does this argument apply to us?

Partly. A well-maintained data warehouse typically has better schema consistency and governance than a data lake, which addresses two of the four mismatches. The freshness and retrieval-orientation gaps often still apply: warehouses are also batch-oriented, and the schema design optimised for aggregation queries is not the same as schema optimised for record-level retrieval. The semantic layer question (are your metric definitions enforced and consistent?) applies regardless of whether you are on a lake or a warehouse.

What should we do first if we want to start building AI features on our lake today?

Identify the specific tables your AI feature will consume and characterise them: how fresh is this data, are the field definitions consistent across historical batches, who owns this table, and where does it come from? If those questions have clear answers, you have enough to start with reasonable confidence. If they do not, closing those gaps is the prerequisite. Building AI features on tables whose ownership, freshness guarantees, and schema consistency are unknown is building on an undefined foundation.

How does RAG interact with data lake limitations?

RAG pipelines retrieve records from a data store and include them in the context window of a language model. If the retrieved records come from a lake with inconsistent schema, they may describe the same concept differently depending on when they were ingested or which pipeline produced them. If the lake has staleness, the retrieved records may not reflect current state. If there is no semantic layer, the same entity may appear under different identifiers in different retrieved chunks. Each of these failures produces confident-sounding wrong answers: the model reasons correctly over the retrieved context, but the context is wrong.

Is the lakehouse migration worth the cost?

For organisations planning significant AI workloads on existing lake data, yes. The schema enforcement and ACID transaction support that Delta Lake and Apache Iceberg provide make a meaningful practical difference to pipeline reliability and data quality. The migration cost varies significantly with how well-organised the existing lake is. The more valuable question to answer first is whether the semantic layer and freshness pipeline work has been scoped, because migrating to a lakehouse format without those layers still leaves the most consequential AI readiness gaps unaddressed.