OpenAI’s ChatGPT and its GPT-4 large language model may face copyright trouble: researchers have found evidence that the models were trained on text from copyrighted books.
Academics at the University of California, Berkeley, including Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman, made the claim in a paper titled, “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4.”
The academics conducted a “name cloze” test, which asks the model to predict a single masked character name in a 40-60 token passage from a copyrighted book; accurate predictions are taken as evidence that the model has memorized the associated text.
This indirect probe is necessary because the data behind ChatGPT and GPT-4 is fundamentally unknowable outside of OpenAI, the authors said in their paper.
“Our work carries out probabilistic inference to measure the familiarity of these models with a set of books, but the question of whether they truly exist within the training data of these models is not answerable.”
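A name-cloze probe of this kind can be sketched in a few lines. This is an illustrative reconstruction, not the authors’ code: the function names, the mask token, and the example passage are assumptions, and a real experiment would send the cloze prompt to the model under test rather than checking answers locally.

```python
def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace the single occurrence of a character name with a mask token."""
    # Cloze passages are chosen so the target name appears exactly once.
    assert passage.count(name) == 1, "passage must contain the name exactly once"
    return passage.replace(name, mask)


def score(prediction: str, name: str) -> bool:
    """Exact-match scoring: only the exact name counts as a correct guess."""
    return prediction.strip() == name


if __name__ == "__main__":
    # Short illustrative passage; the paper's passages run 40-60 tokens.
    passage = (
        "It was a bright cold day in April, and the clocks were striking "
        "thirteen. Winston Smith, his chin nuzzled into his breast, slipped "
        "quickly through the glass doors."
    )
    cloze = make_cloze(passage, "Winston")
    print(cloze)
    # A model that has memorized the book should fill in "Winston";
    # a high hit rate across many such passages suggests memorization.
    print(score("Winston", "Winston"))  # True
    print(score("Julia", "Winston"))    # False
```

Because the name is unguessable from the surrounding context alone, a correct answer is hard to explain without the model having seen the text during training.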
The researchers found that science fiction and fantasy books dominated the list of memorized titles.
More specifically, the chatbot has memorized copyrighted titles such as the Harry Potter children’s books, Orwell’s Nineteen Eighty-Four, The Lord of the Rings trilogy, the Hunger Games books, Hitchhiker’s Guide to the Galaxy, Fahrenheit 451, A Game of Thrones, and Dune, among others.
On the other hand, ChatGPT exhibits less knowledge of works in other genres.
According to the paper, the model knows “little about works of Global Anglophone texts, works in the Black Book Interactive Project and Black Caucus American Library Association award winners.”
Researchers Advocate Use of Public Data for Transparency
The Berkeley computer scientists focused less on the copyright implications of memorizing texts, and more on the black-box nature of these models.
For context, OpenAI and other AI development labs do not disclose the data they use to train their models, which raises concerns about the validity of their text analysis.
The researchers added that they advocate using public training data so that model behavior is more transparent.
“Data curation is still very immature in machine learning,” Margaret Mitchell, an AI researcher and chief ethics scientist for Hugging Face, told The Register.
“‘Don’t test on your training data’ is a common adage in machine learning, but requires careful documentation of the data; yet robust documentation of data is not part of machine learning culture.”
OpenAI and Google to Face Lawsuits Over AI Development Using Copyrighted Text
Some experts have warned that the copyright implications may be unavoidable, specifically if text-generating applications built on these models produce passages that are substantially similar or identical to copyrighted texts they have ingested.
Tyler Ochoa, a law professor at Santa Clara University, expects lawsuits against makers of large language models that generate text, including Google, OpenAI, and others.
Issues that could arise are similar to those related to AI-generated images. The first issue is whether copying texts or images for model training is fair use. The answer is likely yes, he said.
The second issue is whether text output generated by the model that is substantially similar or identical to copyrighted text constitutes infringement. According to Ochoa, the answer is almost certainly yes.
The third issue is whether AI-generated text that is not a copy of existing text is protected by copyright. It likely is not, as the laws in the US and certain other countries require human creativity for copyright protection.
“So far we’ve seen lawsuits over issues one and three,” said Ochoa. “Issue one lawsuits so far have involved AI image-generating models, but lawsuits against AI text-generating models are inevitable.”
He added that the Berkeley paper shows that text output by AI models can be identical to copyrighted text, which would encourage copyright holders to take legal action.