Data contracts that survive contact with reality
Most data contract projects start with a manifesto and end in a wiki nobody reads. The version that works is smaller, less ambitious, and shipped on a Tuesday.
A data contract is an agreement about the shape and meaning of a dataset, with an owner attached. That's the whole idea. The trouble starts when teams treat it as a platform initiative instead of a habit, standing up a registry, a governance council, and a roadmap before a single check runs in production.
Why the big version stalls
The ambitious rollout asks every producer to declare a formal schema for every table before anyone sees a benefit. The cost is paid up front by the people writing the contracts; the payoff lands much later, with downstream consumers on other teams. That asymmetry is fatal. Engagement drops, the registry goes stale, and within two quarters the contracts describe a system that no longer exists.
A contract that documents last quarter's schema is worse than no contract — it tells people something false with the authority of a spec.
The version that ships
Start with one table. Specifically, the table whose breakage causes the most pain — usually the one feeding a board-level metric or a customer-facing number. Write down four things:
- Schema: column names, types, and which are non-null.
- Semantics: what one row means, in a sentence.
- Owner: the team that gets paged when it breaks.
- Guarantees: freshness and a few invariants that should always hold.
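Those four items fit in a small data structure that can live in the repository. The following is a minimal sketch in Python, not a standard format: the class names, the "events" table, the "team-events" owner, and the column list are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    nullable: bool = False


@dataclass(frozen=True)
class DataContract:
    table: str
    columns: tuple[Column, ...]        # schema: names, types, nullability
    row_meaning: str                   # semantics: what one row means
    owner: str                         # team that gets paged when it breaks
    max_staleness: timedelta           # freshness guarantee
    invariants: tuple[str, ...] = ()   # invariants, stated in plain language

    def schema_violations(self, row: dict) -> list[str]:
        """Return the schema problems found in one row (empty list if none)."""
        problems = []
        for col in self.columns:
            if col.name not in row:
                problems.append(f"missing column: {col.name}")
            elif row[col.name] is None:
                if not col.nullable:
                    problems.append(f"null in non-null column: {col.name}")
            elif not isinstance(row[col.name], col.dtype):
                found = type(row[col.name]).__name__
                problems.append(f"wrong type for {col.name}: {found}")
        return problems


# Hypothetical contract for an illustrative "events" table.
EVENTS_CONTRACT = DataContract(
    table="events",
    columns=(
        Column("user_id", str),
        Column("status", str),
        Column("referrer", str, nullable=True),
    ),
    row_meaning="One row is one user action recorded by the product.",
    owner="team-events",
    max_staleness=timedelta(hours=2),
    invariants=("status is one of the allowed values",),
)
```

Calling EVENTS_CONTRACT.schema_violations(row) on a row where user_id has been renamed returns a "missing column: user_id" entry, which is the kind of breakage the contract exists to catch.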
Then — and this is the part teams skip — turn those guarantees into a test that runs in the pipeline, before the data is published. A contract without an executable check is a comment.
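# a contract as a check, not a document
# Python; assumes rows is the batch (a list of dicts) about to be published
# and ALLOWED is the set of permitted status values
from datetime import datetime, timedelta, timezone
assert len(rows) > 0, "table is empty"
assert all(r["user_id"] is not None for r in rows), "user_id must be present"
assert max(r["event_ts"] for r in rows) > datetime.now(timezone.utc) - timedelta(hours=2), "data is stale"
assert {r["status"] for r in rows} <= ALLOWED, "unexpected status value"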
When a check fails, it fails before the bad data reaches a dashboard, and it names the producer who can fix it. That's the entire value proposition, delivered on one table, in an afternoon.
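One way to get both properties, failing before publication and naming the owner, is to put the checks behind a small gate function. This is a sketch under stated assumptions rather than a prescribed design: OWNER, ALLOWED, the row layout, and the publish callback are hypothetical stand-ins for whatever the pipeline actually uses.

```python
from datetime import datetime, timedelta, timezone

OWNER = "team-events"                       # hypothetical owning team
ALLOWED = {"created", "paid", "refunded"}   # hypothetical allowed statuses
MAX_STALENESS = timedelta(hours=2)


class ContractViolation(Exception):
    """Raised when a batch breaks its contract; the message names the owner."""


def check_batch(rows, now):
    """Return a list of human-readable failures for this batch."""
    if not rows:
        return ["table is empty"]
    failures = []
    if any(r.get("user_id") is None for r in rows):
        failures.append("user_id must be present")
    if max(r["event_ts"] for r in rows) <= now - MAX_STALENESS:
        failures.append("data is stale")
    unexpected = {r["status"] for r in rows} - ALLOWED
    if unexpected:
        failures.append(f"unexpected status value: {sorted(unexpected)}")
    return failures


def publish_if_valid(rows, publish, now=None):
    """Run the checks first; call publish(rows) only if all of them pass."""
    now = now or datetime.now(timezone.utc)
    failures = check_batch(rows, now)
    if failures:
        raise ContractViolation(f"owner={OWNER}: " + "; ".join(failures))
    publish(rows)
```

Collecting every failure before raising, instead of stopping at the first one, means the owning team sees the full picture in a single alert.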
Let it spread by demand, not decree
Once one team has been saved from a 9pm incident by a contract that caught a renamed column upstream, the next team asks for one. Adoption pulled by relief beats adoption pushed by policy every time. You end up with the same coverage the big-bang plan wanted, minus the governance council and the dead wiki.
What we've seen
The teams that succeed treat contracts as a property of the pipeline, versioned with the code and reviewed in the same pull request as the transformations they guard. The teams that struggle keep the contract in a separate system that drifts. Keep the agreement next to the thing it describes, make it executable, and start with one table that matters.
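As a rough illustration of keeping the agreement next to the thing it describes, the sketch below puts a transformation, its expected output columns, and a test in one module. The module layout, the transform function, and the column names are hypothetical; the point is only that a change to the transformation and a change to the contract show up in the same diff.

```python
# Hypothetical module: the transformation and its contract live side by side.

# Contract: the columns every output row must have, no more and no fewer.
ORDERS_COLUMNS = {"order_id", "amount_cents", "coupon"}


def transform(raw):
    """Illustrative transformation: normalize raw order records."""
    return [
        {
            "order_id": str(r["id"]),
            "amount_cents": round(r["amount"] * 100),
            "coupon": r.get("coupon"),
        }
        for r in raw
    ]


def column_drift(rows, expected):
    """Return (missing, unexpected) column names seen across all rows."""
    missing, unexpected = set(), set()
    for row in rows:
        missing |= expected - set(row)
        unexpected |= set(row) - expected
    return sorted(missing), sorted(unexpected)


def test_transform_honors_contract():
    # Runs in CI, so renaming an output column without updating
    # ORDERS_COLUMNS fails in the pull request that introduces the change.
    sample = [{"id": 1, "amount": 12.5}, {"id": 2, "amount": 3, "coupon": "SPRING"}]
    assert column_drift(transform(sample), ORDERS_COLUMNS) == ([], [])
```

A test runner such as pytest would pick up the test function by name; the same check can also be called directly from a pipeline step.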