
The Suit Against Copilot and What It Means for Generative AI

In June 2021, Microsoft subsidiary GitHub unveiled Copilot, a cloud-based artificial intelligence tool that promises to “help you write code faster and with less work.” Copilot does this by suggesting lines of source code and whole functions as developers type either code or natural-language prompts into their editor. In principle, users can produce code without knowing a programming language; they need only describe what they want in plain language.

The launch was long-awaited: according to GitHub, more than 1.2 million developers used Copilot’s technical preview in the year before the final product was released. The launch, however, underscored the continuing legal uncertainty over intellectual property rights in generative AI systems.

Powered by OpenAI Codex, Copilot is a neural network trained on billions of lines of code taken from publicly accessible repositories.

In November 2022, lawyer and programmer Matthew Butterick filed a class-action lawsuit in the Northern District of California. He alleged that because Copilot was trained on copyrighted material and reproduced that material without the required attributions, Microsoft, GitHub, and OpenAI had violated the rights of developers whose code was used to train the program. To be precise, Butterick claims that the defendants ignored, violated, and removed the licenses attached to that code on a scale amounting to software piracy.

The Copilot lawsuit shows that many questions on the legality of generative AI systems remain unanswered, even as stakes remain high.

Broadly speaking, generative AI refers to unsupervised or semi-supervised machine learning algorithms that take in data, often scraped from the Internet, to generate new content. For example, in addition to Copilot, OpenAI released DALL-E 2 (an AI-powered image generation tool trained on approximately 650 million image-text pairs scraped from the Internet) and, more recently, ChatGPT (a general-purpose dialogue-based chatbot). Similarly, Stable Diffusion was trained on 2.3 billion images scraped from hundreds of domains, such as Pinterest, WordPress, and Getty Images, to form one of the largest text-to-image AI systems.

These systems, among many others, have attracted the attention of investors, who have poured millions of dollars into their development. But they have also introduced a host of problems relating to intellectual property rights. As in the case of Copilot, many artists have found that their works were scraped, often without permission or attribution, to train these AI systems. For example, illustrator Hollie Mengert discovered that a Redditor had used 32 of her illustrations, without her consent, to fine-tune Stable Diffusion into a model that could generate images in her style. Similarly, Greg Rutkowski learned that Stable Diffusion had been used to create thousands of images copying his style, causing online searches of his name to return not his work, but AI-generated work. As a result, many artists worry that they may lose their livelihoods if users opt for AI-generated images rather than their works.

These concerns also hold true for Copilot. The programmers who authored the code used to train OpenAI Codex have not been given their due attribution—even though Copilot generates near-identical reproductions of such authored code. Without the proper acknowledgment, programmers risk losing potential job opportunities. As one engineer said, “Attribution is a really big deal to me because that’s how I get all my clients…. If you take my attribution off, my career is over, and I can’t support my family. I can’t live.”

Suggested solutions and the fair use test do not adequately protect artists and creatives.

Some have posited non-legal solutions. For instance, Shutterstock has proposed compensating creators whose works are sold as training data for AI models. Absent any transparency into the algorithms behind the AI systems, however, this model is seriously complicated by the weighting of the works: different works may contribute in different proportions to the “generated” output. In simpler terms, if 10 artworks are used to train an AI model, one piece may be weighted more heavily than the other 9, such that the “generated” artwork relies primarily on that one piece. As such, compensation models may severely miscalculate the contributions of different artworks and artists to a “new,” “generated” work. And in the context of Copilot, compensation does not get at the underlying problem: these models bypass attribution and reduce creations to mere data points.
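
The weighting concern described above can be illustrated with a little arithmetic. The sketch below is purely hypothetical: the royalty pool, the contribution weights, and both function names are invented for illustration. Real generative models do not expose per-work contribution weights, which is precisely the transparency gap at issue.

```python
def equal_split(pool, n_works):
    """Pay every training work the same share of the royalty pool."""
    return [pool / n_works] * n_works


def weighted_split(pool, weights):
    """Pay each work in proportion to its (hypothetical) contribution weight."""
    total = sum(weights)
    return [pool * w / total for w in weights]


# Hypothetical scenario: 10 artworks, where the first one accounts for
# 55 of 100 "contribution points" and the other nine account for 5 each.
weights = [55] + [5] * 9
pool = 100.0  # hypothetical royalty pool, in dollars

flat = equal_split(pool, len(weights))
proportional = weighted_split(pool, weights)

# How much the most heavily used artist loses under a flat payout.
shortfall = proportional[0] - flat[0]

print(f"Flat payout to artist 1:         ${flat[0]:.2f}")
print(f"Proportional payout to artist 1: ${proportional[0]:.2f}")
print(f"Shortfall under flat payout:     ${shortfall:.2f}")
```

Under these assumed numbers, a flat payout gives the most heavily used artist a small fraction of what a contribution-based payout would. Without visibility into the actual weights, neither the platform nor the artist can tell which of the two schemes is closer to reality.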

The legality of training these models on scraped data without consent or attribution is unclear. Claims of copyright infringement are likely to be met with invocations of the fair use doctrine, which permits the unlicensed use of copyright-protected works based on a four-factor test. Among the four factors to be considered is the effect of the use on the potential market for the copyrighted work.

The key characteristic of Copilot, however, is that much of the source code it was trained on was released under open-source licenses, under which developers could use the code for free so long as they included the required attribution and copyright notices. This is significant. Because the code is free, there is arguably no effect on the market for the code itself. But attribution may lead to more work for many programmers, so stripping it does affect the market for programmers’ labor. Because this factor of the fair use test considers only the market for the copyrighted work, it will tend to favor the generative AI system (here, Microsoft and GitHub) precisely because there has been no effect on the market for the code. As a result, if fair use is found to apply to Copilot, programmers may be disincentivized from contributing to the open-source repositories of code that have long spurred technological innovation.

This is just the beginning. Copilot is one of the first such tools, but many others, such as Amazon CodeWhisperer, will likely soon follow. The Copilot dispute highlights the need to reconceptualize intellectual property rights not solely as the exclusive right to use a work, but also as the right to recognition.

Sherry Tseng

Georgetown Law Technology Review Articles Editor; Georgetown Law J.D. Candidate 2024; University of Pennsylvania M.A./B.A. 2020.