Insights
Field notes.
Short, practical writing on the parts of data engineering that don't make conference talks but do keep your numbers honest. 16 pieces and counting, since 2023.
Backfills without the reconciliation spreadsheet
If every reprocessing job ends with someone hand-building a spreadsheet to prove the numbers match, the proof is in the wrong place. It belongs in the pipeline.
Data contracts that survive contact with reality
Why most contract initiatives stall halfway, and the much smaller version that actually ships and keeps shipping.
Make your pipeline safe to run twice
Idempotency is the cheapest reliability you can buy. A concrete pattern for batch jobs, with the failure cases it prevents.
One definition per metric
"Active users" was 41,000 in the board deck, 38,500 in the growth dashboard, and 44,200 in the email report. None were wrong. That's the problem.
An alert nobody mutes
The three data-quality checks worth wiring up first, and how to tune their thresholds so the team keeps trusting them.
Designing for late-arriving data
Events don't always show up on time. A device was offline, a partner's batch was delayed, a retry landed a day later. If your pipeline assumes today's data is complete tonight, it's quietly wrong.
Slowly changing dimensions, without the dogma
A customer changes plan. A product changes category. Do your historical reports use the old value or the new one? That single question is what slowly changing dimensions are really about.
Partition for how you query, not how you load
Teams partition by load date because that's how the data arrives. Then every analytical query scans the whole table, because nobody queries by load date. Partition for the reader.
Tests you'll still trust in a year
Most data tests start strict and end ignored. The ones that survive are few, specific, and tied to something a human actually cares about.
Retries, timeouts, and the orchestrator's job
An orchestrator's main value isn't drawing a nice DAG. It's handling the moment a task fails — and that only works if your tasks are safe to retry.
Schema changes that don't wake anyone up
A column gets renamed upstream. Somewhere downstream a join silently returns nulls, a metric drops, and three days later someone asks why the number looks wrong. Schema changes are the most common quiet break there is.
The quiet cost of the nightly full refresh
It worked when the table had a million rows. Now it has eight hundred million, the nightly job rebuilds all of it, and the warehouse bill has a hockey-stick shape nobody can explain.
Where bad records should go
One malformed row shouldn't fail a million-row job. But it also shouldn't vanish. The records you can't process need a destination, not a silent drop.
Lineage you can actually use during an incident
Lineage diagrams look impressive in a slide. The test is whether, at 2am with a wrong dashboard, you can answer one question fast: what feeds this, and what does this feed?
Streaming is not a default
Streaming is exciting, and for a narrow set of problems it's the only right answer. For most analytics, it's a large bill and an operational burden bought to solve a latency problem nobody had.
Naming things in the warehouse
Two of the hard problems in computing are cache invalidation and naming things. In a data warehouse, the second one quietly costs you more than you'd think.