A lawsuit filed in Manhattan federal court last week by the New York Times claims that the defendants, Microsoft and OpenAI, have used millions of its articles to train and create their large language models (LLMs) and other products. The Times is seeking damages in the realm of billions of dollars, though it doesn’t give a specific number.
But yeah, it’s going to be looking for a pretty large payout if it does win.
“The law does not permit the kind of systematic and competitive infringement that Defendants have committed,” reads the official complaint (pdf warning). “This action seeks to hold them responsible for the billions of dollars in statutory and actual damages that they owe for the unlawful copying and use of The Times’s uniquely valuable works.”
The lawsuit states that the New York Times had been in negotiations with the defendants “for months” and that it was looking to reach an agreement “in accordance with its history of working productively with large technology platforms to permit the use of its content in new digital products.” The idea put forward in the court document is that its goal was both to get fair value out of its contribution to the training, because of the weighting The Times’ content was given during training, and to “facilitate the continuation of a healthy news ecosystem, and help develop GenAI technology in a responsible way that benefits society and supports a well-informed public.”
For its part, OpenAI spokesperson Lindsey Held is quoted in The New York Times’ own article on the lawsuit as saying the company thought negotiations had been constructive and was “surprised and disappointed” by the lawsuit.
“We’re hopeful that we will find a mutually beneficial way to work together,” they are quoted as saying, “as we are doing with many other publishers.”
One of the most intriguing parts of the lawsuit, and arguably the part that has got The Times’ hackles up, is that it seems like OpenAI has given particular weight to the publisher’s content during the training of its LLMs.
The lawsuit states that, during the training of GPT-3 specifically, one of the key datasets (one weighted as a high-quality set) used nearly 210k unique New York Times URLs, which amounted to 1.23% of all the sources in the dataset.
The largest and most heavily weighted dataset used to train GPT-3, however, includes “at least 16 million unique records of content from The Times across News, Cooking, Wirecutter, and The Athletic.”
The lawsuit goes on to state that OpenAI itself has said the datasets it sees as the highest quality are sampled more frequently during the training of a model. “By OpenAI’s own admission,” reads the court document, “high-quality content, including content from The Times, was more important and valuable for training the GPT models as compared to content taken from other, lower-quality sources.”
This isn’t the first lawsuit against OpenAI for copyright infringement in the training of its LLMs. As The Times notes, 17 authors, including George RR Martin and John Grisham, have also brought a lawsuit against the company for “systematic theft on a mass scale,” and Getty has brought one against Stability AI, the creators of the generative AI image maker Stable Diffusion, over the use of its images in the training of its model.
And it’s unlikely to be the last lawsuit against AI makers, either. But given the seeming reluctance of AI companies to tackle copyright infringement and fair compensation for the training of their multi-billion-dollar products themselves, it’s looking like legal proceedings might be one of the few ways to keep them in check.