“Code me a treasure hunting game.” “Make an Adele-esque cover of Psy’s ‘Gangnam Style.’” “Create a close-up photorealistic video of two pirate ships fighting each other while sailing inside a coffee cup.” Even that final message isn’t an exaggeration: Today’s best AI tools can create all of this and more in minutes, making AI seem like a kind of modern, real-world magic.
We know, of course, that it’s not magic. In fact, a tremendous amount of work, training, and information goes into the models that power GenAI and produce its output. AI systems need to be trained to learn patterns from data: GPT-3, a predecessor of the models behind ChatGPT, drew on roughly 45TB of raw text from Common Crawl, the equivalent of around 45 million 100-page PDF documents. In the same way that humans learn from experience, training helps AI models better understand and process information. Only then can they make accurate predictions, perform important tasks, and improve over time.
This means that the quality of the data we feed into our tools is crucial. So how can we ensure that we foster data quality to create practical and successful AI models? Let’s take a look.
Chief Operating Officer of Northern Data Group.
The risks of poor data
Good-quality data is accurate, relevant, complete, diverse, and unbiased. It is the backbone of effective decision-making, sound operational processes and, in this case, valuable AI results. However, maintaining good data quality is difficult. A survey by a data platform found that 91% of professionals say data quality has an impact on their organization, yet only 23% consider good data quality to be part of their organizational ethos.
Bad data also often contains limited or incomplete information that doesn’t accurately reflect the world at large. The resulting biases can shape how data is collected, analyzed, and interpreted, leading to unfair or even discriminatory outcomes. When Amazon built an automated hiring tool in 2014 to help speed up recruitment, the team trained it on data from the company’s existing pool of software engineers, who were overwhelmingly men. The project was eventually scrapped when it became clear that the tool was systematically penalizing female candidates. Another example is Microsoft’s now-cancelled chatbot Tay, which became infamous for posting offensive comments on social media after learning from the abusive messages users fed it.
Coming back to AI, messy or biased data can have an equally catastrophic effect on a model’s performance. Feeding messy or poor-quality data into an AI model and expecting it to deliver clear, actionable insights is futile; it’s like microwaving a plate of alphabet spaghetti and expecting the letters to spell out “The quick brown fox jumps over the lazy dog.” Data readiness – the availability and quality of data within an organization – is therefore a key hurdle to overcome.
Feeding the AI model correctly
Research shows that when it comes to global enterprises’ AI strategies, only 13% consider themselves leaders in terms of data readiness. Meanwhile, 30% are classified as chasers, 40% as followers, and a worrying 17% as laggards. These numbers need to change if data is to drive successful AI outcomes worldwide. To ensure good data readiness, we need to collect complete and relevant data from trusted sources, cleanse it to remove errors and inconsistencies, accurately label it, and standardize its formats and scales. Most importantly, we need to continuously verify and update data to maintain its quality.
To start, companies should create a centralized data catalog that incorporates data from multiple repositories and silos into a single, organized location. They should then classify and organize this data to make it easier to find, use, and surface contextual business insights. Next, engineers should implement a robust data governance framework that incorporates regular data quality assessments. Data scientists should continually detect and correct inconsistencies, errors, and missing values in data sets.
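The regular data quality assessments described above can be automated. Here is a minimal sketch in Python (schema, field names, and validity rules are illustrative assumptions, not a real framework) that flags missing values, duplicate identifiers, and out-of-range values before records reach a training pipeline:

```python
# Hypothetical data-quality check over a list of record dicts.
# Flags missing required fields, duplicate IDs, and implausible ages.

REQUIRED_FIELDS = {"id", "name", "age"}  # assumed schema for illustration

def assess_quality(records):
    """Return a list of (record_index, issue_description) pairs."""
    issues = []
    seen_ids = set()
    for i, rec in enumerate(records):
        # A field counts as missing if absent or None.
        missing = REQUIRED_FIELDS - {k for k, v in rec.items() if v is not None}
        if missing:
            issues.append((i, f"missing fields: {sorted(missing)}"))
        rec_id = rec.get("id")
        if rec_id is not None:
            if rec_id in seen_ids:
                issues.append((i, f"duplicate id: {rec_id}"))
            seen_ids.add(rec_id)
        age = rec.get("age")
        if age is not None and not (0 <= age <= 120):
            issues.append((i, f"age out of range: {age}"))
    return issues

records = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 1, "name": "Grace", "age": 45},   # duplicate id
    {"id": 2, "name": None, "age": 200},     # missing name, bad age
]
report = assess_quality(records)
```

In practice, checks like these would run on a schedule as part of the governance framework, so errors surface continuously rather than only when a model misbehaves.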
Finally, data lineage tracking involves developing a clear understanding of data origins, processing steps, and access points. This tracking ensures transparency and accountability in the event of a bad outcome. And it is becoming particularly crucial in the face of heightened concerns about AI privacy.
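At its simplest, lineage tracking means every dataset carries a record of where it came from and what was done to it. A minimal sketch (the dataset name, source label, and steps are invented for illustration):

```python
# Hypothetical lineage tracking: each dataset records its origin and a
# timestamped trail of every processing step applied to it.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Dataset:
    name: str
    source: str                           # where the data originated
    lineage: list = field(default_factory=list)

    def log_step(self, step: str):
        """Append a timestamped processing step to the lineage trail."""
        self.lineage.append((datetime.now(timezone.utc).isoformat(), step))

ds = Dataset(name="customers_v1", source="crm_export_2024")
ds.log_step("dropped rows with missing email")
ds.log_step("standardized country codes to ISO 3166-1")
```

When a bad outcome occurs, the trail shows exactly which source and which transformation to audit, which is what makes lineage valuable for accountability and privacy reviews.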
Ensuring data is fair and secure
Today, personal AI queries are fast becoming the new confidential Google search. But users wouldn’t trust them with private information if they knew it would be shared or sold. According to Cisco research, 60% of consumers are concerned about organizations using their personal data for AI, while nearly two-thirds (65%) have already lost some trust in organizations as a result of AI use. So, in addition to regulatory concerns, we all have an ethical and reputational responsibility to ensure absolute data privacy when developing and leveraging AI technology.
Privacy is about ensuring that people who interact with AI-based tools and systems – from patients to online shoppers – have control over their personal data and can rest easy knowing it is being used responsibly. In this regard, companies should operate under a concept of “privacy by design,” where their technology only collects strictly necessary data, stores it securely, and is transparent about its use.
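One concrete expression of “privacy by design” is data minimization at collection time: anything not on an allow-list is simply never stored. A sketch, with hypothetical field names:

```python
# Hypothetical data minimization: retain only an allow-listed set of fields
# from each incoming event, so unnecessary personal data is never stored.

ALLOWED_FIELDS = {"user_id", "query", "timestamp"}  # assumed minimal schema

def minimize(raw_event: dict) -> dict:
    """Drop every field not on the allow-list before storage."""
    return {k: v for k, v in raw_event.items() if k in ALLOWED_FIELDS}

event = {
    "user_id": "u42",
    "query": "best running shoes",
    "timestamp": "2024-05-01T10:00:00Z",
    "ip_address": "203.0.113.7",      # not needed, so dropped
    "device_fingerprint": "abc123",   # not needed, so dropped
}
stored = minimize(event)
```

The design choice here is that minimization happens before persistence, not as a later cleanup pass: data you never store cannot leak.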
A good option is to anonymize all the data you collect. That way, you can reuse it in future AI model training without compromising customer privacy. And, once you no longer need this data, you can delete it to eliminate the risk of future breaches. This seems simple, but it’s an often-forgotten step that can save you stress, reputational damage, and even regulatory fines.
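A common technique for this is pseudonymization with a keyed hash: raw identifiers are replaced with tokens, so the same user still maps to the same token for training purposes, but the original value cannot be recovered from the dataset alone. A sketch using Python’s standard library (the key and record are placeholders):

```python
# Hypothetical pseudonymization of identifiers with an HMAC keyed hash.
# The same input always yields the same token, enabling reuse in training
# without storing the raw email address.
import hashlib
import hmac

SECRET_KEY = b"placeholder-key"  # in production: generate, rotate, and vault

def pseudonymize(value: str) -> str:
    """Replace an identifier with a 64-character hex token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "purchase": "headphones"}
anon = {**record, "email": pseudonymize(record["email"])}
```

Note that keyed hashing is pseudonymization, not full anonymization: regulators such as those enforcing GDPR treat pseudonymized data as still personal, so the key must be protected and deleted along with the data when it is no longer needed.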
Keeping data sovereignty at the forefront
Compliance with regulatory requirements is, of course, paramount for any organization, and data residency is an increasingly important issue around the world. In Europe, for example, the GDPR restricts transfers of EU residents’ personal data outside the European Economic Area unless specific safeguards are in place. In practice, that means you or your cloud partner need data centers within the region; move data elsewhere without those safeguards and you risk breaking the law. Data residency is already a priority for regulators and users alike, and it will only become more important as more regulations are implemented around the world.
For businesses, compliance means either purchasing data storage facilities at specific sites outright or partnering with a specialized provider that offers data centers in strategic locations. Just ask the World Economic Forum, which says that “the backbone of sovereign AI lies in a robust digital infrastructure.” Simply put, data centers with high-performance computing capabilities, operating under policies that ensure generated data is stored and processed locally, are the foundation for the effective and compliant development and deployment of AI technologies around the world. It’s not exactly magic, but the results can be equally impressive.
This article was produced as part of TechRadarPro's Expert Insights channel, where we showcase the best and brightest minds in the tech industry today. The views expressed here are those of the author, and not necessarily those of TechRadarPro or Future plc.