"Companies racing to develop more powerful artificial intelligence are rapidly nearing a new problem: The internet might be too small for their plans.
Ever more robust systems developed by OpenAI, Google and others require larger oceans of information to learn from, straining the available pool of quality public data online at the same time that some data owners are blocking access to AI companies. Some executives and researchers say the industry's need for high-quality text data could outstrip supply within two years.
AI companies are hunting for untapped information sources, and rethinking how they train these systems. OpenAI, the maker of ChatGPT, has discussed training its next model on transcriptions of public YouTube videos, people familiar with the matter said.
Companies also are experimenting with using AI-generated, or synthetic, data as training material -- an approach many researchers say could cause crippling malfunctions.
These efforts are often secret because executives think solutions could be a competitive advantage.
"There is no established way of doing this," said Ari Morcos, an AI researcher who worked at Meta Platforms and Google's DeepMind before founding DatologyAI, whose backers include some AI pioneers. It builds tools to improve data selection, which could make training cheaper.
Data is among several essential AI resources in short supply. The chips needed to run so-called large-language models behind AI bots also are scarce. And industry leaders worry about a dearth of data centers and electricity to power them.
AI language models are built using text vacuumed up from the internet. That material is broken into tokens -- words and parts of words that the models use to learn how to formulate humanlike expressions. Generally, AI models become more capable the more data they train on.
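To make the idea concrete, here is a minimal tokenization sketch using the open-source tiktoken library; the specific encoding name is an illustrative assumption, not tied to any model discussed in the article.

```python
# Minimal tokenization sketch using the open-source tiktoken library.
# "cl100k_base" is a byte-pair-encoding vocabulary published by OpenAI;
# it is used here purely as an illustrative choice.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The internet might be too small for their plans."
token_ids = enc.encode(text)                        # text -> integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # map each ID back to its text piece

print(token_ids)  # a short list of integers
print(pieces)     # word and sub-word pieces, e.g. "The", " internet", " might", ...
```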
Pablo Villalobos, who studies artificial intelligence for the research institute Epoch, estimated that OpenAI's most advanced language model, called GPT-4, was trained on as many as 12 trillion tokens. On the current growth trajectory, an AI system like GPT-5 would need 60 trillion to 100 trillion tokens, Villalobos and other researchers have estimated. Even harnessing all the high-quality data available could leave a shortfall of 10 trillion to 20 trillion tokens or more, he said.
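The arithmetic behind that projection can be sketched in a few lines; the usable-supply figure below is a hypothetical placeholder, not Epoch's published estimate.

```python
# Back-of-envelope token arithmetic using the figures cited above.
# The usable-supply number is an assumption chosen for illustration only.
TRILLION = 10**12

gpt4_training_tokens = 12 * TRILLION                     # reported upper estimate for GPT-4
demand_low, demand_high = 60 * TRILLION, 100 * TRILLION  # projected need for a GPT-5-class model

assumed_usable_supply = 50 * TRILLION                    # hypothetical pool of high-quality text

shortfall_low = max(0, demand_low - assumed_usable_supply)
shortfall_high = max(0, demand_high - assumed_usable_supply)
print(f"Projected shortfall: {shortfall_low / TRILLION:.0f} "
      f"to {shortfall_high / TRILLION:.0f} trillion tokens")
```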
Two years ago, Villalobos and his colleagues wrote that there was a 90% chance demand for high-quality data would outstrip supply by 2026. They have since become a bit more optimistic, and plan to update their estimate to 2028.
Most of the data available online is useless for AI training because it contains flaws such as sentence fragments or doesn't add to a model's knowledge. Villalobos estimated that the useful sliver is perhaps just one-tenth of the information gathered by Common Crawl, whose web archive is widely used by AI developers.
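A toy quality filter gives a feel for why so little raw web text survives; the heuristics and thresholds below are arbitrary assumptions, not Common Crawl's or any lab's actual pipeline.

```python
# Toy data-quality filter illustrating why only a sliver of crawled text is usable.
# Heuristics and thresholds are arbitrary assumptions; production pipelines add
# deduplication, language identification and model-based quality scoring.
import re

def looks_useful(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                       # too short to add much to a model's knowledge
        return False
    sentences = re.split(r"[.!?]+\s+", doc)
    complete = [s for s in sentences if len(s.split()) >= 5]
    if len(complete) < 3:                     # mostly sentence fragments
        return False
    if len(set(words)) / len(words) < 0.3:    # highly repetitive boilerplate
        return False
    return True

raw_docs = ["Click here to subscribe. Click here to subscribe.",
            "Breaking news -- full story below"]
kept = [d for d in raw_docs if looks_useful(d)]
print(f"kept {len(kept)} of {len(raw_docs)} snippets")  # both toy snippets fail the length check
```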
Meanwhile, social-media platforms, news publishers and others have been curbing access to data for AI training over concerns about issues including compensation. And there is little public will to hand over private conversational data, such as chats over iMessage.
Meta Platforms CEO Mark Zuckerberg recently touted his company's access to data on its platforms as a significant advantage in its AI efforts.
Some tech companies, including OpenAI partner Microsoft, are building language models that are a fraction of the size of GPT-4 but could accomplish specific objectives.
OpenAI Chief Executive Sam Altman has indicated the company is working on new training methods. "I think we're at the end of the era where it's going to be these giant, giant models," he said last year." [1]
1. Seetharaman, Deepa. "Internet Is Too Small To Feed AI Ambitions." Wall Street Journal, Eastern edition, New York, N.Y., 2 Apr. 2024, p. B2.