Researchers at the Amazon Web Services (AWS) artificial intelligence lab have found that a large amount of online content is machine translated (MT) from other sources.
This content, which is translated into many different languages, is often of low quality, a finding the team says underscores the critical need for attention to data quality and source provenance when training large language models (LLMs).
The researchers also found that machine-translated content is especially common in lower-resource languages, where it makes up a significant portion of all web content.
Selection bias
“In fact, we became interested in this topic because several colleagues who work in MT and are native speakers of low-resource languages noticed that much of the internet in their native language seemed to be MT-generated,” Mehak Dhaliwal, a former applied science intern at AWS and current doctoral student at the University of California, Santa Barbara, told Motherboard.
“So the idea really came from low-resource language speakers, and we did the study to better understand the problem and see how widespread it was.”
The team developed a vast resource known as Multi-Way ccMatrix (MWccMatrix) to better understand the characteristics of machine-translated content. This resource contains 6.4 billion unique sentences in 90 different languages and includes translation tuples, which are sets of sentences in multiple languages that are translations of each other.
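To make the idea of a translation tuple concrete, the sketch below shows one minimal way such a record could be represented and how multi-way parallelism (the number of languages a sentence appears in) could be tallied over a corpus. The class name, fields, and example sentences are illustrative assumptions, not the researchers' actual code or data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TranslationTuple:
    """A set of sentences in different languages that are translations of each other."""
    sentences: dict[str, str] = field(default_factory=dict)  # language code -> sentence

    @property
    def multi_way_parallelism(self) -> int:
        """Number of languages this sentence appears in."""
        return len(self.sentences)


# One hypothetical tuple covering three languages.
example = TranslationTuple(sentences={
    "en": "The weather is nice today.",
    "fr": "Il fait beau aujourd'hui.",
    "de": "Das Wetter ist heute schön.",
})
print(example.multi_way_parallelism)  # -> 3

# Aggregating over many tuples gives the kind of statistic the study reports:
# how much content appears in many languages at once.
corpus = [example]
highly_parallel = [t for t in corpus if t.multi_way_parallelism >= 3]
print(f"{len(highly_parallel)}/{len(corpus)} tuples appear in 3+ languages")
```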
The study, which was submitted to the arXiv preprint server hosted by Cornell University, found that large amounts of web content are translated into many languages, primarily through machine translation. This content is not only prevalent in lower-resource languages but also constitutes a significant part of all web content in those languages.
The researchers also noted a selection bias in the type of content that is translated into multiple languages, likely for the purpose of generating advertising revenue.
The paper concludes: “MT technology has improved dramatically over the last decade, but it still falls short of human quality. MT content has been added to the web over many years using the MT systems available at the time, so much of the MT on the web is likely of very low quality by modern standards. This could produce less fluent LLM models with more hallucinations, and the selection bias indicates the data may be of lower quality even before considering MT errors. Data quality is crucial in LLM training, where high-quality corpora, such as books and Wikipedia articles, are typically sampled multiple times.”
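The closing point about sampling high-quality corpora multiple times refers to the common practice of up-weighting trusted sources in an LLM training mixture. The sketch below illustrates that idea only; the corpus names and epoch weights are assumptions made for the example and do not come from the paper.

```python
import random

# Illustrative sketch of up-sampling high-quality corpora in a training mixture.
# Corpora with epochs > 1 are repeated; epochs < 1 means only a fraction is used.
corpora = {
    "wikipedia": {"documents": ["wiki doc 1", "wiki doc 2"], "epochs": 3.0},            # sampled ~3x
    "books":     {"documents": ["book doc 1"], "epochs": 2.0},                          # sampled ~2x
    "web_crawl": {"documents": ["web doc 1", "web doc 2", "web doc 3"], "epochs": 0.5}, # down-sampled
}


def build_training_stream(corpora, seed=0):
    """Repeat each corpus according to its epoch weight, then shuffle the result."""
    rng = random.Random(seed)
    stream = []
    for spec in corpora.values():
        docs, epochs = spec["documents"], spec["epochs"]
        whole, frac = int(epochs), epochs - int(epochs)
        stream.extend(docs * whole)                              # full passes over the corpus
        stream.extend(rng.sample(docs, int(len(docs) * frac)))   # partial pass for the remainder
    rng.shuffle(stream)
    return stream


print(build_training_stream(corpora))
```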