OpenAI says it’s ‘impossible’ to create ChatGPT without copyrighted content, as if that’s somehow a good excuse

Just a couple of weeks after being sued by the New York Times over allegations that it copied and used “millions” of copyrighted news articles to train its large language models, OpenAI has told the UK’s House of Lords communications and digital select committee (via The Guardian) that it has to use copyrighted materials to build its systems because otherwise, they just won’t work.

Large language models (LLMs), which form the basis of AI systems like OpenAI’s ChatGPT chatbot, harvest massive amounts of data from online sources in order to “learn” how to function. That becomes a problem when questions of copyright come into play. The Times’ lawsuit, for instance, says Microsoft and OpenAI “seek to free-ride on The Times’ massive investment in its journalism by using it to build substitutive products without permission or payment.”

It’s not the only one taking issue with that approach: A group of 17 authors including John Grisham and George RR Martin filed suit against OpenAI in 2023, accusing it of “systematic theft on a mass scale.”

In its presentation to the House of Lords, OpenAI doesn’t deny the use of copyrighted materials, but instead says it’s all fair use—and anyway, it simply has no choice. “Because copyright today covers virtually every sort of human expression—including blog posts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” it wrote.

“Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

I don’t find it a particularly compelling argument. If I, for instance, got busted knocking over a bank, I don’t think it would carry much weight with the cops if I told them it was the only way to provide myself with the money that meets the needs of me. That is admittedly a bit simplistic, and it’s possible that OpenAI’s lawyers will be able to successfully argue that using copyrighted materials without permission to train its LLMs falls within the confines of fair use. But to my ear, the justification for using copyrighted works without a green light from the original creator ultimately boils down to, “But we really, really wanted to.”

Fair use is central to OpenAI’s position that the use of copyrighted materials doesn’t actually break any rules. It said in its filing with the House of Lords that “OpenAI complies with the requirements of all applicable laws, including copyright laws,” and went deeper on that point in an update released today.

“Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents,” OpenAI wrote. “We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.

“The principle that training AI models is permitted as a fair use is supported by a wide range of academics, library associations, civil society groups, startups, leading US companies, creators, authors, and others that recently submitted comments to the US Copyright Office. Other regions and countries, including the European Union, Japan, Singapore, and Israel also have laws that permit training models on copyrighted content—an advantage for AI innovation, advancement, and investment.”

“We build AI to empower people, including journalists. Our position on the @nytimes lawsuit:
• Training is fair use, but we provide an opt-out
• ‘Regurgitation’ is a rare bug we’re driving to zero
• The New York Times is not telling the full story”
https://t.co/S6fSaDsfKb (January 8, 2024)


OpenAI also drew a hard line against the New York Times’ lawsuit in the update, essentially accusing the Times of ambushing it in the midst of partnership negotiations. Perhaps taking a lesson from Twitter, which accused Media Matters of manipulating “inorganic combinations of advertisements and content” in order to make pro-Nazi ads appear next to posts by major advertisers, OpenAI also said the Times “manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate” its content and style. That sort of regurgitation is a central element of complaints against AI.

“Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” OpenAI wrote.

OpenAI said in its House of Lords filing that it is “continuing to develop additional mechanisms to empower rightsholders to opt out of training,” and is pursuing deals with various agencies, like the one it signed with the Associated Press in 2023, that it hopes will “yield additional partnerships soon.” But to me that lands like a “forgiveness instead of permission” approach: OpenAI is already scraping this stuff anyway, so agencies and outlets might as well sign some kind of deal before a court rules that AI companies can do whatever they want.
