Nvidia allegedly greenlit the use of pirated books from illegal sources to train its AI models, according to an expanded class-action lawsuit

20.01.2026 20:05

Pcgamer.com

The capabilities of AI models, such as GPT-5, Gemini, Claude, and Grok, depend largely on the size and scope of the datasets used to train them. That training data has also been the source of multiple lawsuits, which claim that the companies performing the training had no right to freely use it. In an expanded class-action case against Nvidia, however, the accusation goes one step further, with claims that the GPU giant willingly used an illegal source of pirated books to train its models.

As reported by TorrentFreak, an amended complaint (pdf warning) filed last week at the district court in Oakland, California, specifically claims that staff at Nvidia contacted a so-called 'shadow library' known as Anna's Archive, a repository of pirated books and other documents.

The plaintiffs cite internal Nvidia communications as evidence, with the filed document purporting to show someone from the data strategy team at Nvidia writing, "we are exploring including Anna's Archive in pre-training data for our LLMs."

It continues with "We are figuring out internally whether we are willing to accept the risk of using this data, but would like to speak with your team to get a better understanding of LLM-related work you have done."

While Anna's Archive appears not to host any content directly itself, it does act as a 'search engine' for alleged pirate libraries. These third-party hosts aren't exclusively providing access to copyrighted materials, but that content is what they are most infamous for.

The original complaint against Nvidia was filed back in 2024, and as TorrentFreak reported at the time, Nvidia's response was essentially to claim that AI training on such material is not the same as owning an illegally obtained book, or even using it as a human does. "Training measures statistical correlations in the aggregate, across a vast body of data, and encodes them into the parameters of a model," it wrote in response.

In essence, Nvidia is saying that the use of such datasets falls under fair use. Given that the original complaint involved data garnered from another pirated source (Books3), it's possible that Nvidia will rely on the same counterargument it made in 2024.

Similar claims have been filed against Anthropic and Meta in the past. In the Anthropic case, the judge ruled that while training on the books did fall under fair use, "Anthropic had no entitlement to use pirated copies for its central library." How the case against Nvidia will fare, well, we'll just have to wait and see.