You might also lose signals in data as trends change. When customer contact numbers shifted from landlines to mobile phones, organizations lost the ability to infer a customer's location from the number. "If you were using area codes to validate locality, you lost a lot of records," Kashalikar adds. Two companies you work with might also merge; whether to treat them as one entity or keep them separate in your golden master record of companies depends on the use case.
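To see why the signal disappears, consider a sketch of the kind of check Kashalikar describes. The area-code table, record fields, and validation logic here are hypothetical illustrations, not any real system's API:

```python
# Landline-era assumption: the first three digits of a phone number
# map to a fixed service area. (Hypothetical lookup table.)
AREA_CODE_TO_REGION = {
    "212": "New York, NY",
    "415": "San Francisco, CA",
}

def validate_locality(record: dict) -> bool:
    """Accept a record only if its area code matches the stated city."""
    region = AREA_CODE_TO_REGION.get(record["phone"][:3])
    # Mobile numbers keep their area code when the owner moves, so this
    # check now rejects perfectly valid records: the signal is gone.
    return region is not None and record["city"] in region

# A customer who moved from Manhattan to Austin but kept a mobile number:
print(validate_locality({"phone": "2125550123", "city": "Austin"}))  # False
```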
Even without major changes, the underlying data itself might have drifted. “The relationships between the outcome variables of interest and your features may have changed,” Friedman says. “You can’t simply lock in and say, ‘This dataset is absolutely perfect’ and lift it off the shelf to use for a problem a year from now.”
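One lightweight way to surface the drift Friedman describes is to compare the feature-outcome relationship on old versus recent data. The following sketch, using a simple correlation comparison and an illustrative threshold, is one possible approach rather than a prescribed method:

```python
import numpy as np

def relationship_drift(x_old, y_old, x_new, y_new, threshold=0.15):
    """Flag a shift in the feature-outcome correlation between two eras.

    The 0.15 threshold is an illustrative assumption; in practice it
    would be tuned to the problem and validated statistically.
    """
    corr_old = np.corrcoef(x_old, y_old)[0, 1]
    corr_new = np.corrcoef(x_new, y_new)[0, 1]
    return abs(corr_old - corr_new) > threshold

# Synthetic example: the link between feature and outcome has weakened.
rng = np.random.default_rng(0)
x_old = rng.normal(size=500)
y_old = 2.0 * x_old + rng.normal(size=500)   # strong positive relationship
x_new = rng.normal(size=500)
y_new = 0.2 * x_new + rng.normal(size=500)   # relationship has decayed
print(relationship_drift(x_old, y_old, x_new, y_new))  # True
```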
To avoid these problems, involve people with the expertise to distinguish genuine errors from meaningful signals, document the data cleaning decisions you make and the reasons behind them, and regularly review the impact of cleaning on both model performance and business outcomes.
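One way to make those decisions reviewable later is to record them as structured data rather than tribal knowledge. This sketch uses a hypothetical schema invented for illustration; the fields are not a standard:

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

@dataclass
class CleaningDecision:
    rule: str         # what transformation was applied
    rationale: str    # why it was judged an error, not a signal
    approved_by: str  # the domain expert who made the call
    decided_on: str
    review_by: str    # when to revisit the decision

log = [
    CleaningDecision(
        rule="Drop records whose area code mismatches the stated city",
        rationale="Billing addresses are authoritative for locality",
        approved_by="data steward, customer domain",
        decided_on=str(date(2023, 4, 1)),
        review_by=str(date(2024, 4, 1)),  # mobile numbers may invalidate this
    )
]
print(json.dumps([asdict(d) for d in log], indent=2))
```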
Rather than doing masses of data cleaning up front and only then starting development, take an iterative approach with incremental data cleaning and quick experiments.
“What we’ve seen to be successful is onboard data incrementally,” says Yahav. “There’s a huge temptation to say let’s connect everything and trust that it works. But then when it hits you, you don’t know what’s broken, and then you have to start disconnecting things.”
So start with a small amount of recent data, or data you trust, see how that works, and then add more sources or greater volumes of data from there to see where it breaks. "It's going to eventually break because something you forgot is going to reach the main pipeline, and something's going to surprise you," he says. "You want this process to be gradual enough for you to understand what caused that."
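A sketch of what that gradual process might look like in code, in the spirit of Yahav's advice: connect one source at a time and run sanity checks before adding the next, so a failure points directly at the source that caused it. The Pipeline class, sources, and checks are all hypothetical stand-ins for whatever your stack provides:

```python
class Pipeline:
    """Toy stand-in for a real data pipeline."""
    def __init__(self):
        self.sources, self.rows = [], []

    def connect(self, source):
        self.sources.append(source)
        self.rows.extend(source["rows"])

    def disconnect(self, source):
        self.sources.remove(source)
        self.rows = [r for r in self.rows if r not in source["rows"]]

def onboard_incrementally(sources, checks):
    pipeline = Pipeline()
    for source in sources:  # most recent / most trusted first
        pipeline.connect(source)
        failed = [name for name, check in checks.items()
                  if not check(pipeline.rows)]
        if failed:
            # Only one source changed since the last green run,
            # so the culprit is unambiguous.
            pipeline.disconnect(source)
            raise RuntimeError(f"{source['name']} broke checks: {failed}")
    return pipeline

checks = {"no_null_ids": lambda rows: all(r.get("id") is not None for r in rows)}
good = {"name": "crm_2024", "rows": [{"id": 1}, {"id": 2}]}
bad = {"name": "legacy_export", "rows": [{"id": None}]}
onboard_incrementally([good, bad], checks)  # raises: legacy_export broke checks
```

Connecting everything at once would produce the same failure, but with no way to tell which of the newly attached sources caused it.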