"The project’s leader says that allowing everyone to access
the collection of public-domain books will help “level the playing field” in
the AI industry.
Harvard University announced Thursday it’s releasing a
high-quality dataset of nearly 1 million public-domain books that could be used
by anyone to train large language models and other AI tools. The dataset was
created by Harvard’s newly formed Institutional Data Initiative with funding
from both Microsoft and OpenAI. It contains books scanned as part of the Google
Books project that are no longer protected by copyright.
Around five times the size of the notorious Books3 dataset
that was used to train AI models like Meta’s Llama, the Institutional Data
Initiative's database spans genres, decades, and languages, with classics from
Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math
textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of
the Institutional Data Initiative, says the project is an attempt to “level the
playing field” by giving the general public, including small players in the AI
industry and individual researchers, access to the sort of highly refined and
curated content repositories that normally only established tech giants have the
resources to assemble. “It's gone through rigorous review,” he says.
Leppert believes the new public domain database could be
used in conjunction with other licensed materials to build artificial
intelligence models. “I think about it a bit like the way that Linux has become
a foundational operating system for so much of the world,” he says, noting that
companies would still need to use additional training data to differentiate
their models from those of their competitors.
Burton Davis, Microsoft’s vice president and deputy general
counsel for intellectual property, emphasized that the company’s support for
the project was in line with its broader beliefs about the value of creating
“pools of accessible data” for AI startups to use that are “managed in the
public’s interest.” In other words, Microsoft isn’t necessarily planning to
swap out all of the AI training data it has used in its own models with public
domain alternatives like the books in the new Harvard database. “We use
publicly available data for the purposes of training our models,” Davis says.
Tom Rubin, OpenAI's chief of intellectual property and
content, said in a statement that the company was “delighted” to support the
project.
As dozens of lawsuits filed over the use of copyrighted data
for training AI wind their way through the courts, the future of how artificial
intelligence tools are built hangs in the balance. If AI companies win their
cases, they’ll be able to keep scraping the internet without needing to enter
into licensing agreements with copyright holders. But if they lose, AI
companies could be forced to overhaul how their models get made. A wave of
projects like the Harvard database is plowing forward under the assumption
that, no matter what happens, there will be an appetite for public domain
datasets.
In addition to the trove of books, the Institutional Data
Initiative is also working with the Boston Public Library to scan millions of
articles from different newspapers now in the public domain, and it says it’s
open to forming similar collaborations down the line. The exact way the books
dataset will be released is not settled. The Institutional Data Initiative has
asked Google to work together on public distribution, but the details are still
being hammered out. In a statement, Kent Walker, Google's president of global
affairs, said the company was "proud to support" the project.
However the IDI’s dataset is released, it will be joining a
host of similar projects, startups, and initiatives that promise to give
companies access to substantial and high-quality AI training materials without
the risk of running into copyright issues. Firms like Calliope Networks and
ProRata have emerged to issue licenses and manage compensation schemes designed
to get creators and rights holders paid for providing AI training data.
There are also other new public-domain projects. Last
spring, the French AI startup Pleias rolled out its own public-domain dataset,
Common Corpus, which contains an estimated 3 to 4 million books and periodical
collections, according to project coordinator Pierre-Carl Langlais. Backed by
the French Ministry of Culture, the Common Corpus has been downloaded more than
60,000 times this month alone on the open source AI platform Hugging Face. Last
week, Pleias announced that it is releasing its first set of large language
models trained on this dataset, which Langlais told WIRED constitute the first
models “ever trained exclusively on open data and compliant with the [EU] AI
Act.”
Efforts are underway to create similar image datasets as
well. The AI startup Spawning released its own this summer, called Source.Plus,
which contains public-domain images from Wikimedia Commons as well as a variety
of museums and archives. Several significant cultural institutions, like the
Metropolitan Museum of Art in New York, have long made their own archives
accessible to the public as standalone projects.
Ed Newton-Rex, a former executive at Stability AI who now
runs a nonprofit that certifies ethically trained AI tools, says the rise of
these datasets shows that there’s no need to steal copyrighted materials to
build high-performing and quality AI models. OpenAI previously told lawmakers
in the United Kingdom that it would be “impossible” to create products like
ChatGPT without using copyrighted works. “Large public domain datasets like
these further demolish the 'necessity defense' some AI companies use to justify
scraping copyrighted work to train their models,” Newton-Rex says.
But he still has reservations about whether the IDI and
projects like it will actually change the AI training status quo. “These
datasets will only have a positive impact if they're used, probably in
conjunction with licensing other data, to replace scraped copyrighted work. If
they're just added to the mix, one part of a dataset that also includes the
unlicensed life's work of the world's creators, they'll overwhelmingly benefit
AI companies,” he says."