In the last year, there has been a surge in AI models that create art, music, and code by learning from the work of others.
However, as these tools become more prevalent, unanswered legal questions may shape the field’s future.
Generative AI has had a fantastic year. Corporations such as Microsoft, Adobe, and GitHub are incorporating the technology into their products; startups are raising hundreds of millions of dollars to compete; and the software has cultural clout, with text-to-image AI models spawning countless memes. But if you listen in on any industry discussion about generative AI, you’ll hear a question whispered in the background by advocates and critics alike in increasingly concerned tones:
is any of this legal?
The issue arises from how generative AI systems are trained. Like most machine learning software, they identify and replicate patterns in data. But because these programs are used to generate code, text, music, and art, that data is itself created by humans, scraped from the internet, and in one way or another protected by copyright.
This wasn’t a big deal for AI researchers in the distant misty past (aka the 2010s). At the time, cutting-edge models could only generate blurry, fingernail-sized black-and-white images of faces. They posed no direct threat to human artists. However, in 2022, when a lone amateur can use software like Stable Diffusion to copy an artist’s style in a matter of hours, or when companies sell AI-generated prints and social media filters that are explicit knock-offs of living designers, questions of legality and ethics become much more pressing.
Is it legal to train generative AI models on copyright-protected data?
Consider the case of Hollie Mengert, a Disney illustrator who discovered that her art style had been cloned as part of an AI experiment by a Canadian mechanical engineering student. The student downloaded 32 of Mengert’s pieces and spent several hours training a machine learning model to replicate her style. “For me, personally, it feels like someone is taking work that I’ve done, you know, things that I’ve learned — I’ve been a working artist since I graduated art school in 2011 — and using it to create art that I didn’t consent to and didn’t give permission for,” Mengert told technologist Andy Baio, who reported the case.
But is that right? And what can Mengert do about it?
The Verge spoke with a variety of experts, including lawyers, analysts, and employees at AI startups, to answer these questions and better understand the legal landscape surrounding generative AI. Some people were confident that these systems could infringe on copyright and would face serious legal challenges in the near future. Others, equally confident, suggested the opposite: that everything currently happening in the field of generative AI is legal and that any lawsuits are doomed to fail.
“I see people on both sides of this extremely confident in their positions,” Baio, who has been closely following the generative AI scene, told The Verge. “And anyone who claims to know how this will play out in court is mistaken.”
Andres Guadamuz, an academic specialising in AI and intellectual property law at the University of Sussex in the United Kingdom, suggested that while there were many unknowns, there were also only a few key questions from which the topic’s many uncertainties emerge. First, can the output of a generative AI model be copyrighted, and if so, who owns it? Second, does owning the copyright to the input used to train an AI give you any legal claim over the model or the content it generates? Once these questions are answered, a new one arises: how do you deal with the consequences of this technology? What legal constraints could — or should — be imposed on data collection? And can there be harmony between those who design these systems and those whose data is required to build them?
Let’s go through these questions one by one.
The output question: do you have the right to copy what an AI model creates?
At least for the first question, the answer is not too difficult. In the United States, there is no copyright protection for works created entirely by a machine. However, it appears that copyright may be possible in cases where the creator can demonstrate significant human input.
The US Copyright Office granted a first-of-its-kind registration for a comic book created with the help of text-to-image AI Midjourney in September. The comic is a finished product: an 18-page narrative with characters, dialogue, and a standard comic book layout. Although the USCO has since stated that it is reviewing its decision, the comic’s copyright registration has not yet been revoked. The amount of human input involved in creating the comic appears to be one factor in the review. The artist, Kristina Kashtanova, told IPWatchdog that the USCO asked her to “provide details of my process to show that there was substantial human involvement in the process of creating this graphic novel.” (The USCO does not comment on individual cases.)
According to Guadamuz, granting copyright for works generated with the help of AI will be a continuing issue. “I don’t think typing ‘cat by van Gogh’ is enough to get copyright in the US,” he says. “However, if you start experimenting with prompts, producing multiple images, fine-tuning your images, using seeds, and engineering a little more, I can totally see that being protected by copyright.”
Copyrighting the output of an AI model will most likely be determined by the level of human involvement.
With this criterion in mind, the vast majority of the output of generative AI models is unlikely to be copyright protected. They are typically mass-produced with only a few keywords as a prompt. However, more involved processes would make for stronger cases. These could include contentious works, such as the AI-generated print that won a state art fair competition. The creator in this case stated that he spent weeks honing his prompts and manually editing the finished piece, indicating a relatively high level of intellectual involvement.
According to Giorgio Franceschelli, a computer scientist who has written about the issues surrounding AI copyright, measuring human input will be “especially true” for deciding cases in the EU. And the law is even more complicated in the United Kingdom, another major jurisdiction of concern for Western AI startups. Surprisingly, the United Kingdom is one of only a few countries that grants copyright to works generated entirely by a computer, but it defines the author as “the person by whom the arrangements necessary for the creation of the work are undertaken.” Again, there is room for interpretation (is this “person” the model’s developer or operator?), but it establishes a precedent for some form of copyright protection to be granted.
Guadamuz cautions that registering copyright is only the first step. “The United States Copyright Office is not a court,” he explains. “You must register if you intend to sue someone for copyright infringement, but a court will decide whether or not that is legally enforceable.”
The input question: can you train AI models using copyrighted data?
For most experts, the most pressing concerns about AI and copyright concern the data used to train these models. Most systems are trained on massive amounts of web content, whether text, code, or imagery. For example, the training dataset for Stable Diffusion, one of the largest and most influential text-to-image AI systems, contains billions of images scraped from hundreds of domains, including personal blogs hosted on WordPress and Blogspot, art platforms like DeviantArt, and stock imagery sites like Shutterstock and Getty Images. Indeed, generative AI training datasets are so large that there’s a good chance you’re already in one (there’s even a website where you can check by uploading a picture or searching some text).
The justification given by AI researchers, startups, and multibillion-dollar tech companies alike is that using these images is protected (at least in the United States) by fair use doctrine, which encourages the use of copyright-protected work in order to promote freedom of expression.
There are several factors to consider when determining whether something is fair use, according to Daniel Gervais, a professor at Vanderbilt Law School who specialises in intellectual property law and has written extensively on how this intersects with AI. However, he claims that two factors are “much, much more prominent.” “What is the purpose or nature of the use, and what is the market impact?” In other words, does the use-case alter the nature of the material in some way (usually referred to as a “transformative” use), and does it endanger the original creator’s livelihood by competing with their works?
Training a generative AI on copyright-protected data is probably legal, but that same model could be used illegally.
Given the emphasis placed on these factors, Gervais believes that training systems on copyrighted data is “much more likely than not” to be covered by fair use. However, the same cannot be said for creating content. To put it another way, you can train an AI model using other people’s data, but what you do with that model could be illegal. Consider the difference between making fictitious money for a movie and attempting to buy a car with it.
Consider the same text-to-image AI model in various scenarios. It is extremely unlikely that copyright infringement will occur if the model is trained on millions of images and used to generate new images. The training data was transformed during the process, and the output poses no threat to the market for original art. However, if you train that model on 100 images by a specific artist and generate images that match their style, an unhappy artist will have a much stronger case against you.
“If you give an AI ten Stephen King novels and tell it, ‘Produce a Stephen King novel,’ you’re competing directly with Stephen King. Is that permissible? Probably not,” Gervais says.
But, crucially, there are countless scenarios between these two poles of fair and unfair use in which input, purpose, and output are all balanced differently and could sway any legal ruling one way or the other.
According to Ryan Khurana, chief of staff at generative AI firm Wombo, most companies selling these services are aware of these distinctions. “Using prompts that draw on copyrighted works to generate an output […] violates the terms of service of every major player,” he explained via email to The Verge. However, “enforcement is difficult,” he adds, and companies are more interested in “finding ways to prevent models from being used in copyright violating ways […] than limiting training data.” This is especially true for open-source text-to-image models such as Stable Diffusion, which can be trained and used with no supervision or filters. The company may have protected itself, but it could also be facilitating copyright infringement.
Another factor in determining fair use is whether the training data and model were created by academic researchers or nonprofits. This generally strengthens fair use defences, which startups are aware of. As a result, Stability AI, the company that distributes Stable Diffusion, did not directly collect or train the models behind the software. Instead, it funded and coordinated this work by academics, and the Stable Diffusion model was licenced by a German university. This allows Stability AI to commercialise the model (DreamStudio) while remaining legally separate from its creation.
Baio refers to this practice as “AI data laundering.” He mentions that this method has previously been used in the development of facial recognition AI software, citing the case of MegaFace, a dataset compiled by University of Washington researchers by scraping photos from Flickr. “Academic researchers took the data, laundered it, and commercial companies used it,” says Baio. He claims that this data, which includes millions of personal photographs, is now in the hands of “[facial recognition firm] Clearview AI, law enforcement, and the Chinese government.” A tried-and-true laundering procedure will almost certainly protect the creators of generative AI models from liability as well.
However, Gervais points out that the current interpretation of fair use may change in the coming months due to a pending Supreme Court case involving Andy Warhol and Prince, in which Warhol used photographs of Prince to create artwork.
Is this a case of fair use or copyright infringement?
“Because the Supreme Court doesn’t rule on fair use very often, when they do, it’s usually for a big reason. I believe they will do the same here,” Gervais predicts. “And it is risky to say anything is settled law while waiting for the Supreme Court to change the law.”
How can artists and AI companies coexist peacefully?
Even if it is determined that the training of generative AI models is covered by fair use, this will not solve the field’s problems. It will not appease the artists whose work has been used to train commercial models, nor will it necessarily apply to other generative AI fields such as code and music. With this in mind, the question is: what technical or non-technical solutions can be implemented to allow generative AI to thrive while giving credit or compensation to the creators whose work makes the field possible?
The most obvious solution is to licence the data and compensate the creators. For some, however, this will be the end of the industry. The authors of “Fair Learning,” a legal paper that has become the backbone of arguments promoting fair use for generative AI, Bryan Casey and Mark Lemley, claim that training datasets are so large that “there is no plausible option simply to licence all of the underlying photographs, videos, audio files, or texts for the new use.” Allowing any copyright claim, they argue, “amounts to saying not that copyright owners will be compensated, but that the use will be prohibited entirely.” Allowing for “fair learning,” as they put it, not only encourages innovation but also allows for the development of better AI systems.
Others, however, argue that we’ve already dealt with copyright issues of comparable size and complexity, and that we can do so again. Several experts The Verge spoke with compared it to the era of music piracy, when file-sharing programmes were built on massive copyright infringement and thrived only until legal challenges led to new agreements that respected copyright.
“In the early 2000s, there was Napster, which everyone loved but was illegal. And now we have Spotify and iTunes,” Matthew Butterick, a lawyer who is suing companies for scraping data to train AI models, told The Verge earlier this month. “How did these systems emerge? By companies entering into legitimate licencing agreements and bringing in content. All of the stakeholders came together and made it work, and the idea that a similar thing cannot happen for AI is, in my opinion, a little catastrophic.”
Companies and researchers are already experimenting with different methods of compensating creators.
Ryan Khurana of Wombo predicted a similar outcome. “Music has by far the most complex copyright rules due to the various types of licencing, the variety of rights-holders, and the various intermediaries involved,” he explained to The Verge. “Given the nuances [of AI’s legal issues], I believe the entire generative field will evolve into a licencing regime similar to that of music.”
Other alternatives are being tested as well. Shutterstock, for example, has stated that it intends to establish a fund to compensate individuals whose work has been sold to AI companies to train their models, while DeviantArt has created a metadata tag for images shared on the web that warns AI researchers not to scrape their content. (At least one small social network, Cohost, has already implemented the tag across its site and says it “won’t rule out legal action” if researchers continue to scrape its images.) However, artistic communities have had a mixed reaction to these approaches. Can one-time licence fees ever compensate for lost wages? And how does deploying a no-scraping tag now assist artists whose work has already been used to train commercial AI systems?
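DeviantArt’s approach relies on a directive embedded in a page’s robots meta tag, which cooperating crawlers can read before deciding whether to collect an image. A minimal sketch of what such markup looks like (the `noai` and `noimageai` directive names follow DeviantArt’s public announcement; honouring them is voluntary on the crawler’s part, and other sites may use different names):

```html
<!-- Placed in the <head> of a page hosting artwork.
     Asks AI dataset crawlers not to use this page's content
     for model training. Compliance is voluntary. -->
<meta name="robots" content="noai, noimageai">
```

Because the tag only signals intent, it offers no technical enforcement: a scraper that ignores it can still download the image, which is why sites like Cohost pair the tag with the threat of legal action.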
Many creators believe that the damage has already been done. However, AI startups are suggesting new approaches for the future. One obvious step forward is for AI researchers to simply create databases with no risk of copyright infringement — either because the material has been properly licenced or because it was created specifically for AI training. One such example is “The Stack,” a dataset for AI training that is specifically designed to avoid accusations of copyright infringement. It only includes code with the most permissive open-source licencing possible and provides developers with an easy way to remove their data upon request. Its creators claim that their model could be used across the industry.
“The Stack’s approach can absolutely be adapted to other media,” Yacine Jernite, Machine Learning & Society lead at Hugging Face, which collaborated with partner ServiceNow to create The Stack, told The Verge. “It is an important first step in exploring the wide range of consent mechanisms that exist — mechanisms that work best when they take into account the rules of the platform from which the AI training data was extracted.” Hugging Face, according to Jernite, wants to help create a “fundamental shift” in how AI researchers treat creators. However, the company’s approach is still unusual.
What comes next?
Regardless of where we land on these legal issues, the various actors in the field of generative AI are already preparing for… something. Companies making millions from this technology are firmly entrenched, repeatedly declaring that everything they do is legal (while presumably hoping no one actually challenges this claim). Copyright holders on the other side of no man’s land are staking out their own tentative positions without committing to action. Getty Images recently prohibited AI content due to the legal risk to customers (“I don’t think it’s responsible. I believe it may be illegal,” CEO Craig Peters told The Verge last month), while the RIAA declared that AI-powered music mixers and extractors are infringing on members’ copyright (though it did not go so far as to launch any actual legal challenges).
The first shot in the AI copyright wars was fired last week, with the filing of a proposed class action lawsuit against Microsoft, GitHub, and OpenAI. The case accuses all three companies of knowingly reproducing open-source code through Copilot, an AI coding assistant, without the necessary licences. The lawyers behind the suit told The Verge last week that it could set a precedent for the entire generative AI field (though other experts disputed this, saying any copyright challenges involving code would likely be separate from those involving content like art and music).
“Once someone breaks cover, I believe lawsuits will start flying left and right.”
Meanwhile, Guadamuz and Baio both say they’re surprised there haven’t been more legal challenges. “Honestly, I’m astounded,” Guadamuz says. “However, I believe this is due in part to these industries’ fear of being the first to sue and losing a decision. But once someone breaks cover, I believe lawsuits will begin to fly left and right.”
One issue, according to Baio, is that many of those most affected by this technology — artists and others — are simply not in a position to file legal challenges. “They lack the resources,” he says. “This type of litigation is very expensive and time-consuming, and you’ll only do it if you’re confident you’ll win. This is why I’ve long assumed that the first AI art lawsuits will come from stock image sites. They appear to stand to lose the most as a result of this technology; they can clearly demonstrate that a significant portion of their corpus was used to train these models, and they have the funds to take it to court.”
Guadamuz concurs. “Everyone knows how much it’s going to cost,” he says. “Whoever sues will get a decision in the lower courts, then appeal, then appeal again, and it could eventually go all the way to the Supreme Court.”