Artificial intelligence models require as much useful data as possible to function, but some of the largest AI developers rely in part on YouTube videos transcribed without their creators' permission, in violation of YouTube's own rules, according to an investigation by Proof News and Wired.
The two outlets revealed that Apple, Nvidia, Anthropic and other major AI companies have trained their models on a dataset called YouTube Subtitles, which incorporates transcripts of nearly 175,000 videos across 48,000 channels, all without the creators' knowledge.
The YouTube Subtitles dataset contains the text of video captions, often with translations into multiple languages. It was created by EleutherAI, which described its goal as lowering the barriers to AI development for those outside big tech companies. The dataset is just one component of EleutherAI's much larger collection, the Pile, which also contains Wikipedia articles, European Parliament speeches, and, according to the report, even Enron emails.
The Pile has a strong following among major tech companies. Apple, for example, used it to train its OpenELM AI model, while a Salesforce AI model released two years ago was trained on the Pile and has since been downloaded more than 86,000 times.
The YouTube Subtitles dataset spans a range of popular news, education, and entertainment channels, including content from top YouTube stars such as MrBeast and Marques Brownlee, all of whom have had their videos used to train AI models. Proof News set up a search tool that checks whether any particular video or channel is in the collection. Even some TechRadar videos turn up in it.
Exchange of secrets
The YouTube Subtitles dataset appears to contradict YouTube's terms of service, which explicitly prohibit the automated extraction of its videos and associated data. That is exactly how the dataset was built, however: a script downloaded subtitles via the YouTube API, selecting videos that matched nearly 500 search terms, according to the investigation.
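The report does not publish the download script itself, so purely as an illustration of the request flow described, the sketch below assembles YouTube Data API v3 request URLs from a handful of stand-in search terms. The `search` and `captions` endpoints are real API endpoints, but the helper functions, placeholder API key, and example terms here are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"

def build_search_url(term: str, api_key: str) -> str:
    """Build a YouTube Data API v3 search request for videos matching a term."""
    params = {"part": "snippet", "type": "video", "q": term, "key": api_key}
    return f"{API_BASE}/search?{urlencode(params)}"

def build_captions_url(video_id: str, api_key: str) -> str:
    """Build a request listing the caption tracks available for one video."""
    params = {"part": "snippet", "videoId": video_id, "key": api_key}
    return f"{API_BASE}/captions?{urlencode(params)}"

# A scraper like the one described would loop over its search terms,
# collect matching video IDs, then request each video's caption tracks.
search_terms = ["machine learning", "news", "tutorial"]  # stand-ins for the ~500 real terms
search_urls = [build_search_url(t, "YOUR_API_KEY") for t in search_terms]
```

Repeated across hundreds of search terms, this kind of loop is precisely the automated extraction that YouTube's terms of service prohibit.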
The discovery sparked surprise and anger among the YouTube creators interviewed by Proof News and Wired. Some were upset at the idea of their work being used in AI models without payment or permission, a concern compounded for those who discovered that the dataset includes transcripts of deleted videos; in one case, the data came from a creator who has since deleted their entire online presence.
The report did not include any comment from EleutherAI, though it noted that the organization describes its mission as democratizing access to AI technologies by publishing trained models. If this dataset is any guide, that mission can conflict with the interests of content creators and platforms. Legal and regulatory battles over AI were already complex, and revelations like this will likely make the ethical and legal landscape of AI development more treacherous. Suggesting a balance between innovation and ethical responsibility is easy; producing one will be much harder.