Quick one on data quality

I've worked in data for over 25 years, on projects covering billions in spend across many industries. Again and again, data quality remains one of the biggest issues.

Here's my view in 5 points:

  1. Data teams need to track and classify data quality issues throughout the pipeline, but issues introduced by pipelines that data or analytics teams themselves built are software bugs, not data quality issues. Data teams must fix them as they would any other bug. I will die on this hill.

  2. The person who reports a metric at the highest level must be accountable, to the person receiving the report, for the quality of the data behind that metric.

  3. Most data quality issues are operational, not analytic. The total cost of resolving operational data quality issues is in the region of 100x the cost of getting the data right the first time - remember, it's not just the data: it's the shipping, the product waste, and the time.

    I once bought a pair of trousers from an online retailer to wear to my wife’s birthday party in London. They were meant to ship them to the hotel I was staying at. However, they missed out a field, so the hotel’s actual address was never transferred across to the delivery service. I never got the shipment. Not only did they have to refund me, they had to pay the shipping twice. Easily 100x the cost of making sure the data was there.

  4. In my experience, differences in reference data are the number 1 issue and the easiest to fix. Stupid data validation rules are the number 2 issue; they are harder to fix, but they should be resolved and people trained. Have you heard of 'Null Island'? It's the spot in the Atlantic Ocean at 0°N, 0°E. No island exists there, but plenty of null-valued locations end up pointing at it.
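A check for Null Island is a classic data quality rule. Here's a minimal sketch, assuming records arrive as dicts with (possibly missing) `lat` and `lon` fields - the field names and tolerance are illustrative assumptions, not any particular system's schema:

```python
def is_null_island(record, tolerance=1e-6):
    """Flag coordinates at (0, 0) - almost always a default, not a real place."""
    lat = record.get("lat")
    lon = record.get("lon")
    if lat is None or lon is None:
        return True  # missing coordinates are a quality issue too
    return abs(lat) < tolerance and abs(lon) < tolerance

records = [
    {"id": 1, "lat": 51.5074, "lon": -0.1278},  # London - fine
    {"id": 2, "lat": 0.0, "lon": 0.0},          # Null Island - flag it
    {"id": 3, "lat": None, "lon": None},        # nulls coerced upstream
]

flagged = [r["id"] for r in records if is_null_island(r)]
print(flagged)  # [2, 3]
```

Flagging rather than dropping matters here: the flagged records go back to whoever owns the source system, which is the whole point of fixing quality at source.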

  5. Differences in metric and rule definitions massively affect the effective quality that stakeholders see in their data. Define these clearly, and shift the definitions and calculations as far left in the pipeline as possible.

Shift everything left - ‘left’ being where data is entered, not where it’s analysed and reported.

Spend the effort at source, with the people entering data (whether customers or staff), and you’ll improve overall quality massively. And make sure your testing, process control, and observability through the data pipes are picking things up so issues can be addressed at source.
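What "picking things up" can look like in practice: a required-field check that reports problems per record instead of silently dropping them, so they can be routed back to source. This is a minimal sketch assuming rows arrive as dicts; the field names (`order_id`, `ship_to_address`, echoing the trousers story) are illustrative assumptions:

```python
# Fields that must be present and non-empty before a row is allowed onward.
REQUIRED_FIELDS = ["order_id", "ship_to_address"]

def validate(row):
    """Return a list of problems so they can be reported back to source."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not row.get(field):
            problems.append(f"missing {field}")
    return problems

rows = [
    {"order_id": "A1", "ship_to_address": "Hotel X, London"},
    {"order_id": "A2", "ship_to_address": ""},  # the missed field
]

issues = {}
for row in rows:
    problems = validate(row)
    if problems:
        issues[row["order_id"]] = problems

print(issues)  # {'A2': ['missing ship_to_address']}
```

The key design choice is that the check produces an actionable report keyed by record, rather than a pass/fail count - the former tells the source team exactly what to fix.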
