Insights / Modeling · 2023-02-28 · 4 min read

Naming things in the warehouse

Two of the hard problems in computing are cache invalidation and naming things. In a data warehouse, the second one quietly costs you more than you'd think.

Naming feels like the least important part of building a warehouse, right up until an analyst is staring at usr_dt_2 next to user_date next to signup_ts and has no idea which to trust. Good names are documentation that can't drift, because they're attached to the thing itself.

Conventions beat cleverness

Pick a small set of rules and hold to them everywhere: timestamps end in _ts, dates in _date, booleans start with is_ or has_, IDs end in _id. The specific convention matters less than its consistency. A reader who's learned the pattern once can predict the shape of a column from its name.
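Rules like these are easy to check mechanically. Here is a minimal sketch in Python, assuming you can get each column's name and a rough kind out of your catalog. The kind labels and the function names are made up for illustration, not part of any particular tool.

```python
import re

# One pattern per kind of column, mirroring the rules above.
# The kind labels ("timestamp", "date", ...) are hypothetical; use whatever
# your catalog or information schema reports.
NAMING_RULES = {
    "timestamp": re.compile(r".+_ts$"),
    "date": re.compile(r".+_date$"),
    "boolean": re.compile(r"^(is|has)_.+"),
    "id": re.compile(r".+_id$"),
}


def check_column(name: str, kind: str) -> bool:
    """Return True if the name follows the convention for its kind."""
    rule = NAMING_RULES.get(kind)
    # Kinds without a rule are left unconstrained in this sketch.
    return True if rule is None else bool(rule.match(name))


def violations(columns: dict[str, str]) -> list[str]:
    """List the column names that break the convention, in input order."""
    return [name for name, kind in columns.items() if not check_column(name, kind)]


example_columns = {
    "signup_ts": "timestamp",
    "usr_dt_2": "date",
    "is_active": "boolean",
    "customer": "id",
}
print(violations(example_columns))
```

With the example columns above, the check flags usr_dt_2 and customer and passes the other two. Run in CI against the modeled layer, a check like this keeps the convention from eroding one column at a time.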

A column name is read hundreds of times and written once. Optimize for the reader, every time.

Name for meaning, not for source quirks

Source systems hand you names that reflect their internals — cryptic codes, legacy abbreviations, fields called flag3. Don't propagate those into the modeled layer. Rename to what the column means to someone analyzing the business, and leave the source's vocabulary at the boundary.
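One way to hold that boundary is an explicit rename map applied where source data enters the modeled layer. The sketch below is illustrative only: the source field names, the meanings assigned to them, and the to_modeled helper are all assumptions, since only you know what flag3 means in your system.

```python
# Hypothetical mapping from source field names to modeled names.
# The meanings on the right are invented for this example.
RENAMES = {
    "flag3": "is_marketing_opt_in",
    "cust_no": "customer_id",
    "crt_dt": "created_date",
}


def to_modeled(row: dict, renames: dict[str, str]) -> dict:
    """Rename every source field, refusing fields that have no modeled name."""
    unmapped = sorted(set(row) - set(renames))
    if unmapped:
        # Fail loudly so source vocabulary cannot leak through unnoticed.
        raise KeyError(f"no modeled name for source fields: {unmapped}")
    return {renames[src]: value for src, value in row.items()}


source_row = {"flag3": True, "cust_no": 42}
print(to_modeled(source_row, RENAMES))
```

Raising on an unmapped field is a deliberate choice here: a new source column forces someone to decide what it means before it reaches analysts.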

Same concept, same name

If it's the customer identifier, it's customer_id in every table — not cust_id here and client_no there. Consistent names across tables make joins obvious and mistakes rare. It's unglamorous discipline, and it pays back every single day someone reads your warehouse.
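Drift between tables can be caught the same way. As a rough sketch, assuming you maintain a short list of known aliases for each canonical name, a scan over table schemas reports where a concept appears under the wrong name. The table names and the alias list below are hypothetical.

```python
# Known wrong names mapped to the canonical name they should be.
# This list is an assumption; it has to be curated by hand.
ALIASES = {
    "cust_id": "customer_id",
    "client_no": "customer_id",
}


def find_aliases(
    schemas: dict[str, list[str]], aliases: dict[str, str]
) -> list[tuple[str, str, str]]:
    """Return (table, column, canonical_name) for each aliased column found."""
    findings = []
    for table in sorted(schemas):  # sorted for a stable report order
        for column in schemas[table]:
            if column in aliases:
                findings.append((table, column, aliases[column]))
    return findings


example_schemas = {
    "orders": ["order_id", "cust_id"],
    "customers": ["customer_id"],
    "tickets": ["ticket_id", "client_no"],
}
for table, column, canonical in find_aliases(example_schemas, ALIASES):
    print(f"{table}.{column} should be {canonical}")
```

This only finds aliases someone has already listed, so it is a guard against regressions, not a way to discover new inconsistencies.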
