What are the main things a modern machine learning engineer does?
This seems like an easy question with a simple answer:
Build machine learning models and analyze data.
In reality, the answer is rarely that simple.
Efficient use of data is essential to the success of a modern enterprise. However, transforming data into tangible business outcomes requires that it go through a process: it must be acquired, securely shared, and analyzed in its own development lifecycle.
The explosion of cloud computing in the mid- to late-2000s and enterprise adoption of machine learning a decade later effectively addressed the beginning and end of this journey. Unfortunately, companies often encounter obstacles in the middle stage related to data quality, which is typically not on most executives’ radar.
Solutions consultant at Ataccama.
How poor data quality affects businesses
Poor quality and unusable data are a burden on those at the end of the data journey – the data users who leverage the data to build models and contribute to other revenue-generating activities.
Too often, data scientists are the people hired to “build machine learning models and analyze data,” but bad data prevents them from doing anything of the sort. Organizations spend a lot of effort and attention gaining access to this data, but no one thinks to check whether the data going into the model is usable. If the input data is flawed, the output models and analysis will be flawed as well.
It is estimated that data scientists spend between 60 and 80 percent of their time cleaning data so that the results of their projects are reliable. This cleaning process can involve guessing the meaning of the data and inferring gaps, and it can lead them to inadvertently discard potentially valuable data from their models. The result is frustrating and inefficient, as this dirty data prevents data scientists from doing the valuable part of their job: solving business problems.
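To make the time sink concrete, here is a minimal sketch of the kind of routine profiling checks that eat up those hours, written with pandas. The dataset and column names are purely illustrative, not taken from any real system:

```python
import pandas as pd

# Hypothetical customer dataset; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, 5],
    "email": ["a@x.com", None, "b@x.com", "not-an-email", "c@x.com"],
    "signup_date": ["2021-01-05", "2021-02-31", "2021-03-10", None, "2021-04-01"],
})

# Completeness: share of missing values per column.
missing = df.isna().mean()

# Uniqueness: duplicate keys that would corrupt joins downstream.
dup_keys = df["customer_id"].duplicated().sum()

# Validity: emails that fail a simple format check.
valid_email = df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)

# Parseability: dates that cannot be converted (e.g. 2021-02-31).
parsed = pd.to_datetime(df["signup_date"], errors="coerce")
bad_dates = parsed.isna() & df["signup_date"].notna()

report = {
    "missing_email_pct": float(missing["email"]),
    "duplicate_ids": int(dup_keys),
    "invalid_emails": int((~valid_email & df["email"].notna()).sum()),
    "unparseable_dates": int(bad_dates.sum()),
}
print(report)
```

Each of these checks is trivial on its own; the cost comes from rediscovering and re-running them for every dataset and every project.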
This massive, often invisible cost slows down projects and reduces their results.
The problem is compounded when data cleansing tasks are performed in repetitive silos. Just because one person has found and fixed a problem in one project doesn't mean they've fixed the problem for all their colleagues and their respective projects.
Even if a data engineering team can perform bulk cleaning, they may not be able to do it instantly and may not fully understand the context of the task and why they are doing it.
The impact of data quality on machine learning
Clean data is particularly important for machine learning projects. Whether the task is classification or regression, supervised or unsupervised learning, or deep neural networks, once a model goes into production your developers need to constantly evaluate new data.
A crucial part of the machine learning lifecycle is managing data drift to ensure the model remains effective and continues to deliver business value. After all, data is a constantly changing landscape. Source systems may merge after an acquisition, new governance may come into play, or the business landscape may change.
This means that prior assumptions about the data may no longer be valid. While tools like Databricks/MLflow, AWS SageMaker, or Azure ML Studio cover model promotion, testing, and retraining effectively, they are less equipped to investigate what part of the data has changed, why it has changed, and then rectify the issues, which can be tedious and time-consuming.
Strong data quality practices prevent these issues from arising in machine learning projects, but it’s not just about technical teams building models and processing pipelines – the entire business needs to be aligned. In practice, data may require a business workflow with someone to approve it, or a non-technical stakeholder from the front office may bring knowledge to the front end of the data journey.
The obstacle to creating ML models
Enabling business users as customers of your organization’s data is increasingly possible with AI. Natural language processing enables non-technical users to query data and extract contextualized insights.
AI is expected to grow at 37 percent annually between 2023 and 2030, 72 percent of executives see AI as the main business advantage, and AI-mature companies expect 20 percent of their EBIT to be generated by AI in the future.
Data quality is the backbone of AI. It improves the performance of algorithms and enables them to generate reliable forecasts, recommendations, and rankings. Among companies that report failed AI projects, 33 percent cite poor data quality as the reason. Conversely, organizations that invest in data quality can drive greater AI effectiveness across the board.
But data quality isn’t just a mandatory box to check. Organizations that make it an integral part of their operations can realize tangible business outcomes, from shipping more machine learning models per year to more reliable and predictable business results, because there is confidence in the models themselves.
How to overcome data quality barriers
Data quality shouldn’t be a matter of waiting for a problem to occur in production and then scrambling to fix it. Data should be constantly tested, wherever it is, against an ever-growing set of known issues. All stakeholders should contribute, and all data should have clear, well-defined owners. So when you ask a data scientist what they do, they can finally say: build machine learning models and analyze data.
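The idea of constantly testing data against a growing set of known issues can be sketched as a shared registry of named rules that any team runs against a dataset before using it. The rule names, columns, and checks below are hypothetical examples, not any particular vendor's API:

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class Rule:
    """A named, reusable data quality check; the name identifies the
    known issue so failures can be routed to the data's owner."""
    name: str
    check: Callable[[pd.DataFrame], bool]


# Shared, ever-growing registry of known issues (illustrative rules).
RULES = [
    Rule("no_duplicate_ids", lambda df: not df["customer_id"].duplicated().any()),
    Rule("amount_non_negative", lambda df: bool((df["amount"] >= 0).all())),
    Rule("country_known", lambda df: bool(df["country"].isin({"US", "DE", "FR"}).all())),
]


def run_rules(df: pd.DataFrame) -> dict:
    """Evaluate every registered rule against a dataset."""
    return {rule.name: bool(rule.check(df)) for rule in RULES}


# Hypothetical orders table; the negative amount should trip a rule.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [10.0, -5.0, 7.5],
    "country": ["US", "DE", "FR"],
})
print(run_rules(orders))
```

Because the rules live in one place rather than in each project's cleaning scripts, a fix made once benefits every team, which is exactly the alternative to the repetitive silos described earlier.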
This article was produced as part of TechRadarPro's Expert Insights channel, where we showcase the best and brightest minds in the tech industry today. The views expressed here are those of the author, and not necessarily those of TechRadarPro or Future plc.