Scraping to train Artificial Intelligence is raising issues

May 30, 2023

As the use of generative AI models grows, content owners affected by scraping (the practice of collecting data to train software) are taking action to restrict access to their platforms' programming interfaces.

Generative artificial intelligence software such as Midjourney, Dall-E, and ChatGPT makes wide-ranging use of content “caught” online. This practice is known as scraping. While until recently few were familiar with the term, it now frequently appears in discussions of AI model training. The volume of content involved is enormous: on Reddit, users post 41 comments per second; on Spotify, artists and labels upload 46 music tracks per second; and in the same timeframe, 1,100 images are shared on Instagram. According to some estimates, data in the possession of the Big Four (Google, Meta, Amazon, and Microsoft) amounts to approximately 1,200 petabytes (i.e., 1.2 million terabytes).

These figures are so massive that they are beyond human comprehension. Yet someone, or better, something, has been reading and cashing in on this information for years and can be expected to do so for many years to come. Artificial intelligence models seem to be able to tame this content behemoth. It remains to be seen whether current AI models will prove truly capable of selecting and understanding the stored information and producing reliable output from it. But what is currently fueling debate among those in the field is whether it is permissible to train AI models with data and content available online.

Artificial intelligence models such as Midjourney, Dall-E, and ChatGPT “learn” to produce text or images from a dataset of content scraped online. This activity, i.e., training by scraping, is at the center of legal debates and recent court actions taken by Getty Images, a well-known photo agency with a large database of photographic images. Getty Images sued the Stability AI platform, accusing it of reusing 12 million copyrighted images appropriated by its AI model without the permission of either Getty Images or the creators of the photographs.

A similar case is in motion in the music world: a video for a song created using artificial intelligence with the explicit goal of imitating rapper Drake went viral online. A few days later, Universal Music Group asserted that training artificial intelligence models with the works of its artists violates copyright law. Recently it was reported that Elon Musk is on the warpath against the scraping of data downloaded from Twitter. On April 19, via Twitter, the CEO of Tesla and owner of SpaceX threatened legal action against Microsoft, which allegedly trained an unidentified artificial intelligence model by scraping Twitter data. This came in response to Microsoft’s announcement that it would drop Twitter from its advertising platform after Twitter began charging a fee for accessing its information through its application programming interface (API).

Musk’s declaration of war is the latest indication that data ownership is rapidly becoming a battleground in the race for generative artificial intelligence. Large technology companies are working to develop cutting-edge AI models, such as OpenAI’s GPT, and data owners are thinking about how to exploit access to their content economically. Large language models (LLMs) such as GPT require terabytes of data for training, and much of that data is collected from platforms such as Reddit, Stack Overflow, and Twitter. Training data from social networks is valuable because it captures informal conversation, the type of text most needed by the models in question. As new AI models move from university research centers to the corporate world, however, the owners of the data are beginning to mark their territory and stake financial claims.

For example, Reddit announced in early April that it will charge companies for access to its application programming interface (API), which is used to collect conversations among Redditors for training AI models. This means Twitter is not an isolated case, though data from social networks such as Facebook, Instagram, and LinkedIn remain freely available for the time being. For owners of structured databases and copyrighted digital works, the regulatory environment in Europe may support opting out, meaning explicitly excluding third parties from unrestricted access to their protected content. Owners of such content can rely on the reservation of rights built into the “text and data mining” exception introduced by European Union Directive 2019/790 on Copyright and Related Rights in the Digital Single Market (the “Directive”), which, among other things, resulted in an amendment to Italian Copyright Law.

“Text and data mining” refers to any automated analytical technique designed to analyze text and data in digital form in order to generate information, including but not limited to patterns, trends, and correlations. Specifically, Articles 3 and 4 of the Directive provide that, under certain circumstances, copyright and exclusive database rights may not be used by their respective holders to prevent massive extraction of protected content, i.e., text and data mining, unless use of the works has been expressly “reserved” by their copyright holders. In essence, it is up to copyright holders to take action to protect their works from scraping by ensuring that they do not become part of the diet fed to generative platforms.

According to the Directive and the Italian implementing legislation, massive digital data extraction and reproduction activity is freely permitted when it is done by research organizations and cultural heritage institutions acting within the limits of nonprofit study and research, and provided that access to such data is lawful. That’s not all. According to Article 4 of the Directive, Member States must establish that text and data mining is always lawful as long as the use of the extracted works and other materials has not been expressly reserved by the rightsholders in an appropriate manner. In other words, from the European perspective, rightsholders must take appropriate action to keep the content to which they hold exclusive rights from being subject to massive data mining.

The recitals of the Directive specify that holders of the relevant exclusive rights to content made publicly available online should use machine-readable means, including website metadata and terms and conditions, to reserve their rights. Italian lawmakers chose not to give specific guidance on how rightsholders can reserve their rights and withhold authorization, even implicitly, for text and data mining activities. In Germany and the Netherlands, however, use of works accessible online may be reserved effectively only in machine-readable form.

One example of a technology that goes in this direction comes from DeviantArt, which created a metadata tag for images shared on the Web to warn AI researchers not to scrape them for content. But this applies only going forward. As for the past and the present, the website Have I Been Trained? (haveibeentrained.com) already allows creators to check whether their work is in the dataset used to train the Stable Diffusion deep learning model and to process opt-outs for individual pieces of content, removing them from the information set that feeds the model developed by Stability AI. In recent months, more than 80 million works have been opted out.
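To make the idea of a machine-readable reservation more concrete, here is a minimal sketch of how a scraper might check a page for an opt-out tag before collecting its content. It assumes the reservation is expressed as a robots meta tag carrying the values "noai" or "noimageai", which is how DeviantArt's tag has been publicly described; the function and class names are hypothetical, and whether such a tag amounts to a valid reservation under Article 4 of the Directive is a legal question this sketch does not answer.

```python
from html.parser import HTMLParser

# Assumption: opt-out values modeled on DeviantArt's publicly described
# "noai" / "noimageai" directives. Other sites may use different signals.
OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for part in attr_map.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def has_reserved_rights(html: str) -> bool:
    """Return True if the page carries one of the assumed opt-out directives."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return bool(parser.directives & OPT_OUT_DIRECTIVES)


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    if has_reserved_rights(page):
        print("Opt-out tag found: skip this page.")
    else:
        print("No opt-out tag found.")
```

A scraper that respects reservations would run a check like this on each page and skip any page where it returns True. The sketch looks only at the HTML itself; a reservation stated in HTTP headers or in a site's terms and conditions would need separate handling.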

Clearly, these are early technological attempts, often cumbersome and always limited, to deal with the interests of market players and their rights in a regulatory framework that is struggling to keep up. The solution of closing off free access to APIs could fit into that framework, provided that content that is opted out falls under the scope of the aforementioned legislation. We know, though, that the issue of access to data raises legal questions that go beyond intellectual property law and concern antitrust law and market dominance, especially when those who claim ownership of the data hold significant market shares, as is the case for social networks. It turns out that size does matter, at least when it comes to this issue.
