Harvard is releasing a massive free AI training dataset funded by OpenAI and Microsoft

Harvard University announced Thursday that it is releasing a high-quality dataset of nearly one million public-domain books that anyone can use to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It includes books scanned as part of the Google Books project that are no longer protected by copyright.

About five times the size of the notorious Books3 dataset, which was used to train AI models such as Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics by Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the kind of highly refined and curated content repositories that typically only established tech giants have the resources to put together. “It went through a rigorous review,” he says.

Leppert believes the new public domain database can be used in conjunction with other licensed materials to build artificial intelligence models. “I think of it a little bit like Linux has become the basic operating system for much of the world,” he says, noting that companies will still need to use additional training data to differentiate their models from those of their competitors.

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was consistent with its broader beliefs about the value of creating “pools of data accessible to AI startups” that are “managed in the public interest.” In other words, Microsoft is not necessarily planning to swap out all of the AI training data used in its models for public domain options like the books in the new Harvard database. “We use publicly available data for the purposes of training our models,” Davis says.

As the dozens of lawsuits filed over the use of copyrighted data for AI training wind their way through the courts, the future of how artificial intelligence tools are made hangs in the balance. If the AI companies win their cases, they will be able to keep scraping the internet without needing to enter into licensing agreements with copyright holders. But if they lose, AI companies could be forced to overhaul how their models are built. A wave of projects like the Harvard database is moving forward under the assumption that, no matter what happens, there will be an appetite for public domain datasets.

In addition to the book collection, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from various newspapers that are now in the public domain, and it says it is open to similar collaborations down the line. The exact way the books dataset will be released has not been settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the search giant has not yet publicly agreed to host it, although Harvard says it is optimistic it will. (Google did not respond to WIRED’s requests for comment.)