2024-08-27

Schema changes that don't wake anyone up

A column gets renamed upstream. Somewhere downstream a join silently returns nulls, a metric drops, and three days later someone asks why the number looks wrong. Schema changes are among the most common quiet breaks there are.

Source schemas change — that's not avoidable. What's avoidable is finding out about it from a confused stakeholder instead of from your pipeline. The goal is to make schema drift loud at the boundary, before it propagates into wrong numbers.

Pin the schema you depend on

Where you ingest a source, declare the columns and types you rely on and check them on every run. If a column you need disappears or changes type, the ingestion fails immediately, naming the source — rather than passing nulls downstream where the damage is hard to trace.
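As a rough illustration, a pinned-schema check can be a few lines of plain Python. This is a minimal sketch, not a definitive implementation: the names `SchemaDriftError`, `check_pinned_schema`, and `ORDERS_SCHEMA` are hypothetical, and it assumes you can already read the source's current columns and types into a dict of plain strings.

```python
class SchemaDriftError(Exception):
    """Raised when a source no longer matches the pinned schema."""


def check_pinned_schema(source_name, expected, actual):
    """Fail fast if any pinned column is missing or has changed type.

    expected and actual map column name -> type name (plain strings here,
    an assumption of this sketch). Columns present in `actual` but not
    pinned in `expected` are ignored.
    """
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column '{column}'")
        elif actual[column] != expected_type:
            problems.append(
                f"column '{column}' is {actual[column]}, expected {expected_type}"
            )
    if problems:
        # Name the source so the failure points at the boundary.
        raise SchemaDriftError(f"{source_name}: " + "; ".join(problems))


# Hypothetical pinned schema for a hypothetical "orders" source.
ORDERS_SCHEMA = {"order_id": "string", "amount_cents": "int64"}
```

Calling something like this at the top of every ingestion run, before any rows are written, means the failure carries the source name and the offending column instead of surfacing later as nulls.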

Additive changes are safe; the rest aren't

A new column upstream shouldn't break you, and your checks shouldn't be so rigid that they fail on harmless additions. But a removed, renamed, or retyped column should stop the line. Distinguish the two: tolerate additions, reject removals and type changes. A rename needs no special case, since it shows up as the removal of the old name plus the addition of a new one.
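One way to draw that line is to classify the difference between the expected and observed schema before deciding whether to fail. The sketch below is an assumption-laden illustration: `SchemaDiff` and `diff_schema` are hypothetical names, and schemas are again plain dicts of column name to type name.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    added: list = field(default_factory=list)
    removed: list = field(default_factory=list)
    retyped: list = field(default_factory=list)

    @property
    def breaking(self):
        # Additions alone are tolerated; removals and type changes are not.
        return bool(self.removed or self.retyped)


def diff_schema(expected, actual):
    """Compare two column-name -> type-name mappings."""
    diff = SchemaDiff()
    diff.added = sorted(actual.keys() - expected.keys())
    diff.removed = sorted(expected.keys() - actual.keys())
    diff.retyped = sorted(
        col
        for col in expected.keys() & actual.keys()
        if expected[col] != actual[col]
    )
    return diff
```

With this shape, a renamed column appears in both `removed` (old name) and `added` (new name), so `breaking` is true without any rename detection. The `added` list is still worth logging, since it often tells you what the column was renamed to.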

You want to be the first to know a column was renamed — not the last, after it's already corrupted a week of reports.

Version the contract with the consumer

Keep the schema expectation next to the code that consumes the source, in the same repository, reviewed in the same pull request. When a deliberate change is needed, it's a visible diff a human approved — not a surprise that arrived through the data at midnight.
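In practice this can be as simple as a constant that lives in the same module as the consumer. The following is a sketch under stated assumptions: `ORDERS_CONTRACT` and `total_revenue_cents` are hypothetical, rows are plain dicts, and only column presence is checked here, not types.

```python
# Hypothetical contract: the columns and types this consumer relies on.
# It sits beside the consuming code, so changing it means editing this
# file and the change shows up as a diff in review.
ORDERS_CONTRACT = {
    "order_id": "string",
    "amount_cents": "int64",
    "placed_at": "timestamp",
}


def total_revenue_cents(rows):
    """Sum amount_cents, refusing any row that lacks a contracted column."""
    total = 0
    for row in rows:
        missing = ORDERS_CONTRACT.keys() - row.keys()
        if missing:
            raise KeyError(
                f"orders row missing contracted columns: {sorted(missing)}"
            )
        total += row["amount_cents"]
    return total
```

Because the contract and the code that reads `amount_cents` sit in one file, a pull request that touches one without the other is easy for a reviewer to spot.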
