TLDR
In a recent survey of data scientists conducted by Mage, we found that the #1 pain point when working on AI projects is data cleaning. The main reasons cited are that it takes a long time to do well and that it's painful, tedious work.
Outline
What data needs to be cleaned
The magic question
A magic wand
Conclusion
What data needs to be cleaned
Data cleaning stems from the need to purge datasets of information that isn't useful to the model, a need that unfortunately arises more than once at each stage of the machine learning lifecycle. Because of that, cleaning is an iterative process that becomes tedious and whose time to completion is hard to estimate.
Uneven or biased data
Throughout the machine learning lifecycle there are plenty of ways for poor-quality data to seep into the dataset. It can stem from selection bias, where the sample collected wasn't diverse or representative enough. This leads to an uneven distribution that data scientists will need to even out as part of the cleaning process.
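As a rough sketch of what that evening-out step can look like (assuming a pandas DataFrame with a hypothetical label column), you could inspect the class distribution and downsample the over-represented classes:

```python
import pandas as pd

# Hypothetical dataset with an imbalanced "label" column
df = pd.DataFrame({
    "label": ["a"] * 90 + ["b"] * 10,
    "value": range(100),
})

# Inspect the class distribution to spot the imbalance
print(df["label"].value_counts())

# Downsample every class to the size of the smallest one
min_count = df["label"].value_counts().min()
balanced = df.groupby("label").sample(n=min_count, random_state=42)
print(balanced["label"].value_counts())
```

Downsampling is only one option; depending on the problem you might oversample the minority class or go back and collect more data instead.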
Inconsistent data
Poor-quality data can also come from user error, such as invalid values, or from inconsistencies that surface when joining different datasets together. For instance, when working with international datasets, users from the UK and the US don't record measurements the same way: metric versus imperial. If one dataset measures distance in kilometers and the other in miles, you'll need to convert them to a common unit before you can compare or correlate distances at all.
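A minimal sketch of that normalization, assuming two hypothetical DataFrames where one records distance in kilometers and the other in miles:

```python
import pandas as pd

# Hypothetical datasets: one records distance in kilometers, the other in miles
uk_df = pd.DataFrame({"location": ["A", "B"], "distance_km": [5.0, 12.5]})
us_df = pd.DataFrame({"location": ["C", "D"], "distance_miles": [3.1, 7.8]})

# Convert miles to kilometers so both datasets share one unit
us_df["distance_km"] = us_df["distance_miles"] * 1.60934

# Once the units match, the datasets can be combined and compared on one scale
combined = pd.concat(
    [uk_df[["location", "distance_km"]], us_df[["location", "distance_km"]]],
    ignore_index=True,
)
print(combined)
```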
Private, protected, or personal data
When merging countless datasets together it's easy to miss data that shouldn't be used to train the model. Due to legal and discrimination concerns, certain features can't be used without consent, and datasets should be purged of them when consent is missing. The problem is that the definition of what's private, protected, or personal varies from source to source. Some users may have signed a disclosure or ToS, while others opted out. Tracking and managing this becomes extremely tedious during cleaning, since you may still want to train on those features even though not everyone has consented to their use.
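One way to keep that manageable, sketched here with a hypothetical consented flag and protected-feature columns, is to make consent explicit in the data and filter on it before selecting training features:

```python
import pandas as pd

# Hypothetical merged dataset with a consent flag alongside protected features
df = pd.DataFrame({
    "age": [34, 29, 51],
    "gender": ["f", "m", "f"],
    "purchases": [3, 7, 1],
    "consented": [True, False, True],
})

protected_features = ["age", "gender"]

# Keep protected features only for users who gave consent;
# everyone else contributes just the non-protected columns.
with_protected = df.loc[df["consented"]].drop(columns=["consented"])
without_protected = df.loc[~df["consented"]].drop(columns=protected_features + ["consented"])
```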
🤔 The magic question
To answer the magic question, we first need to identify what brings the pain. Here we'll treat pain as either boring, repetitive work or the frustration that comes with dealing with big data.
Boring pain points
Looking for uneven or biased data tends to be very manual, which makes it boring and hard to endure for long hours. You'll often find yourself running the same query over and over to find outliers and perform whatever analysis is needed to identify what to remove. On top of that, there are things computers can't catch yet, so fully automated solutions still need to be verified by a human.
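That repeated query is often some variant of an interquartile-range check; here's a hedged sketch assuming a numeric value column:

```python
import pandas as pd

# Hypothetical numeric column that keeps getting re-checked for outliers
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95, 11, 10]})

# Classic IQR rule: flag anything beyond 1.5 * IQR from the quartiles
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)]

# A human still decides whether these rows get dropped or kept
print(outliers)
```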
Frustrating parts of ML
A case I alluded to earlier is the inconsistency that creeps in when pulling data from many sources, especially during enrichment. Lots of things can and will go wrong when mixing datasets that are independent from one another. These minor inconsistencies, when missed, lead to models that make the wrong assumptions and can have dire consequences.
Since these datasets get touched by a whole team, missing documentation and poor project handoff lead to time wasted trying to catch up. Likewise, you'll most likely be working on more than one model at a time, so being able to switch gears quickly is a must, and you might need to revisit a model made weeks ago.
A magic wand 🪄
For me, reducing the pain would mean finding ways to streamline the boring process and providing better documentation or suggestions for the frustrating parts.
Solution to boring data
Ideally, I'd want something that resembles a linter: being able to identify which features aren't clean, which are biased, and where the outliers are would be a huge plus. A tool that reduces the manual strain for anyone trying to run insights or analysis on the data would greatly reduce the pain of a boring task. Given that knowledge, I'd then want to be able to make a choice.
Whether to cleanse the data or leave it in should be up to the user. Though some may disagree, having that choice is better than having cleaning performed automatically and instantaneously. After all, I only want to reduce the boring task of sifting through the data, and keep the fun parts like decision making and customizing the data!
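A rough sketch of what such a linter-style report might look like; the checks and thresholds below are my own assumptions, not any particular tool's behavior, and the report only flags issues without modifying the data:

```python
import pandas as pd

def lint_dataframe(df: pd.DataFrame) -> list[str]:
    """Flag potential data-quality issues per column without changing anything."""
    warnings = []
    for col in df.columns:
        series = df[col]
        missing = series.isna().mean()
        if missing > 0.05:
            warnings.append(f"{col}: {missing:.0%} missing values")
        if pd.api.types.is_numeric_dtype(series):
            skew = series.skew()
            if abs(skew) > 1:
                warnings.append(f"{col}: heavily skewed (skew={skew:.2f})")
        elif series.notna().any():
            top_share = series.value_counts(normalize=True).iloc[0]
            if top_share > 0.9:
                warnings.append(f"{col}: dominated by a single value ({top_share:.0%})")
    return warnings

# The report is informational; cleansing (or not) stays the user's call
df = pd.DataFrame({"age": [25, 30, None, 28, 300], "country": ["US"] * 5})
for warning in lint_dataframe(df):
    print(warning)
```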
Dealing with frustration
Similar to the responses we got from the survey, frustration comes when you lose track of your work or can't gauge how far along you are. With that uncertainty, it can feel like it'll take ages to train a model that satisfies your needs. To circumvent this, a tool that provides versioning would be ideal.
Git for data scientists
Essentially, this means tracking who performed which transformer actions on the dataset, when they did it, and a commit message describing what the objective was.
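A minimal sketch of that kind of tracking (an in-memory log of hypothetical transformer actions, not any specific versioning tool):

```python
import getpass
from datetime import datetime, timezone

# Hypothetical in-memory "commit log" for dataset transformations
transform_log = []

def log_transform(action: str, message: str) -> None:
    """Record who did what to the dataset, when, and why."""
    transform_log.append({
        "user": getpass.getuser(),
        "action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
    })

log_transform("drop_duplicates", "Remove duplicate user rows before joining sources")
log_transform("convert_units", "Normalize distance to kilometers across UK/US data")

for entry in transform_log:
    print(entry)
```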
Conclusion
Of course, this only covers the tip of the iceberg, looking at the data itself. Once a model is trained, you'll see its performance and turn either to data cleaning for retraining, or to model tuning by playing with hyperparameters or selecting a different training algorithm. There are many more areas that will most certainly require a human touch, like determining which features lead to data leakage, and striving to make a dataset that's understandable not just by a machine but also by other humans.