Language models have achieved remarkable progress. However, for them to tackle even more complex tasks and become more versatile, they require more than just computational power.
Whether it’s ChatGPT, Gemini, or Qwen—all these language models ultimately rely on the same underlying technology. What distinguishes them is the training data used. The specific training data that is collected, filtered, and generated determines the quality of a language model: how reliably it reproduces facts, how well it executes tasks, and when it ‘hallucinates.’
In short: For tasks where high-quality training data is available, language models perform exceptionally well. For tasks lacking such data, they fail quickly.
Currently, language models are improving primarily because we are identifying their weaknesses and deliberately collecting or generating data to bridge those gaps. For instance, language models today operate at an expert level in mathematics and programming—two areas where, just two years ago, they still exhibited significant deficiencies.
A look at the training data reveals how this process works. As a reminder: Language models learn to predict the next word in a text—initially by training on massive datasets, and subsequently by fine-tuning on specific examples, such as questions paired with their corresponding answers. As simple as the act of generating the next word may sound, it is sufficient to enable the creation of entire texts, the answering of questions, and even programming.
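To make the next-word objective concrete, here is a minimal sketch of it as a standard classification problem. The tiny PyTorch model, the vocabulary size, and the random "text" are all illustrative assumptions, not taken from any real system; actual language models are large transformers, but the training signal is the same: predict each token from the tokens before it.

```python
import torch
import torch.nn as nn

# Toy stand-in for a language model: an embedding layer followed by a linear
# layer that scores every word in the vocabulary.
vocab_size, embed_dim = 1000, 64
model = nn.Sequential(nn.Embedding(vocab_size, embed_dim),
                      nn.Linear(embed_dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A "text" is a sequence of token ids; the target at every position is
# simply the next token in the sequence.
tokens = torch.randint(0, vocab_size, (1, 32))      # one random toy sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)                               # shape (1, 31, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
```

Repeating this step over billions of text snippets is, in essence, what the initial training phase consists of.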
The texts used for this initial training are largely sourced from the internet. To this end, AI developers first gather every available text they can find. This collection is highly diverse, encompassing Wikipedia entries, news articles, scientific papers, forum discussions, and, naturally, a great deal of advertising copy. However, a significant portion of the texts obtained in this manner is gibberish: low-quality material unsuitable for direct use in training. These unsuitable texts are identified, in some cases with the aid of smaller language models, and subsequently filtered out. What remains is a mere fraction of the internet, yet it still amounts to a colossal volume of text, equivalent to hundreds of millions of books.
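As an illustration of how such filtering might look, here is a deliberately simplified sketch: cheap heuristics weed out obvious gibberish, and a small quality model decides what survives. The `score` method is an assumed interface, not a call from any real library, and actual pipelines are far more elaborate.

```python
def looks_like_text(doc: str) -> bool:
    """Cheap heuristic check: long enough and mostly made of real words."""
    words = doc.split()
    if len(words) < 50:
        return False
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    return alpha_ratio > 0.7          # filters markup fragments and gibberish

def filter_corpus(docs, quality_model, threshold=0.5):
    """Keep documents that pass the heuristics and the quality model."""
    kept = []
    for doc in docs:
        if not looks_like_text(doc):
            continue
        if quality_model.score(doc) >= threshold:   # small classifier or LM, assumed API
            kept.append(doc)
    return kept
```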
Next, the texts are weighted: texts that appear frequently—and in very similar forms—within the training data are downweighted, while other texts of exceptionally high quality are duplicated, thereby appearing multiple times within the training set. This is a delicate process. On one hand, factual statements—such as "Berlin is the capital of Germany"—should appear multiple times to ensure the model correctly learns and reproduces such facts. On the other hand, developers aim to avoid having texts appear in identical or highly similar forms too frequently; otherwise, the language model tends to reproduce these texts verbatim.
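A minimal sketch of this weighting step, under the simplifying assumption that duplicates are exact copies (real pipelines also detect near-duplicates, for example with hashing techniques): repeated texts are downweighted, and texts judged to be of high quality are upweighted before training batches are sampled.

```python
from collections import Counter

def build_sampling_weights(docs, is_high_quality, quality_boost=3.0):
    """Assign each document a sampling weight for training."""
    counts = Counter(docs)                  # how often each exact text occurs
    weights = []
    for doc in docs:
        w = 1.0 / counts[doc]               # frequent duplicates count less
        if is_high_quality(doc):
            w *= quality_boost              # high-quality texts appear more often
        weights.append(w)
    return weights
```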
For instance, there are articles from *The New York Times* that the OpenAI language model GPT-4 can reproduce almost word-for-word with only minimal prompting. This occurs when such texts appear with very high frequency within the training data. Incidentally, this observation forms the basis of the ongoing lawsuit filed by *The New York Times* against OpenAI and Microsoft for copyright infringement.
**When Machines Generate Data**
But why are language models capable of handling such a diverse range of tasks in the first place? How can they learn to summarize texts, write code, and answer questions—simply by predicting the next word in texts scraped from the internet?
The internet contains such a vast volume of text that even rare formats—such as question-and-answer pairs, or texts accompanied by their corresponding summaries—occur in large numbers. However, the relative proportion of such examples remains very small. Consequently, following this initial training phase, a model will often respond to a question not with an answer, but with another question—precisely because the internet also hosts numerous webpages consisting solely of questions, such as quiz sites or practice exercises.
To transform a language model trained on internet data into a useful assistant capable of answering questions and following instructions, it must undergo a process known as "fine-tuning." The simplest, yet effective, method of fine-tuning involves training the model on data that exemplifies the desired behavior, for instance pairs of questions and their corresponding answers. In this way, the model learns to respond to a question instead of asking one itself.
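A sketch of what such fine-tuning data could look like, using an invented plain-text format (real systems use their own chat templates): each example pairs a question with a reference answer, and the model is trained with the same next-word objective to continue the question with the answer.

```python
examples = [
    {"question": "What is the capital of Germany?",
     "answer": "The capital of Germany is Berlin."},
    {"question": "Summarize: The cat sat on the mat.",
     "answer": "A cat was sitting on a mat."},
]

def to_training_text(example):
    # During fine-tuning the loss is typically computed only on the answer
    # part, so the model learns to answer rather than to ask back.
    return f"Question: {example['question']}\nAnswer: {example['answer']}"

training_texts = [to_training_text(ex) for ex in examples]
```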
In the first generation of language models, humans played a major role in generating such data. They wrote answers to questions and rated various responses as better or worse, enabling the model to learn which answers humans prefer.
Synthetic data, that is, data generated by language models themselves, is becoming increasingly important for training, for two reasons: more data generally helps, and the highest-quality texts on the internet have already been largely used up. Many texts from the internet are of low quality and are therefore discarded from the training set. Such texts, however, can serve as an excellent foundation for generating synthetic data, which in turn can prove beneficial for training: language models are employed to transform this poor or mediocre material into high-quality data.
Much like humans, language models learn more effectively when exposed to information presented in various forms. Consequently, it can be an effective strategy to utilize a language model to generate multiple variations of a single text and subsequently incorporate these into the training process. Such synthetic data is playing an increasingly pivotal role in the training of language models.
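The following sketch shows one way this could be done. Here, `generate` is a placeholder for any text-generation call and is an assumption, not a specific product's API: a mediocre source text is first rewritten cleanly and then paraphrased several times, and all versions can be added to the training set.

```python
def make_synthetic_variants(raw_text, generate, n_variants=3):
    """Turn one mediocre text into several clean training texts."""
    cleaned = generate(
        "Rewrite the following text clearly and correctly, "
        "keeping all factual content:\n\n" + raw_text
    )
    variants = [
        generate("Paraphrase the following text in a different style:\n\n" + cleaned)
        for _ in range(n_variants)
    ]
    return [cleaned] + variants
```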
Because training on synthetic data is so effective, it can also be leveraged to replicate the capabilities of other models. For instance, if a company like OpenAI releases a new, high-performance language model, other companies could utilize it to generate data that enhances their own models, even if OpenAI's terms of service explicitly prohibit such usage.
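In code, such capability copying (often called distillation) could look as simple as the sketch below: a stronger teacher model answers a list of prompts, and the resulting question-answer pairs become fine-tuning data for one's own model. `teacher_generate` is again a generic placeholder, not a reference to any particular provider's API.

```python
def distill(prompts, teacher_generate):
    """Build fine-tuning data from a stronger model's answers."""
    return [{"question": p, "answer": teacher_generate(p)} for p in prompts]
```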
An intriguing example of this is DeepSeek V3, a highly capable language model that the Chinese company DeepSeek released as a freely accessible resource in December 2024. V3 quickly made headlines, as the developers at DeepSeek had succeeded in training a top-tier model at relatively low cost. In the wake of this development, the share price of the AI chip manufacturer Nvidia plummeted by 17 percent in a single day in January. A key factor behind this Chinese success is that the DeepSeek team worked with exceptionally high-quality data. Better data makes it possible to train a model of equal caliber with less computational power, and thus at lower cost. When asked, "What model are you?", V3 responds: "I'm an AI language model called ChatGPT, created by OpenAI", a response strongly suggesting that its training data was, at least in part, derived from OpenAI models. Our own research supports this hypothesis: V3 responds to many prompts in a manner that is very difficult to distinguish from responses generated by GPT-4, suggesting that a portion of DeepSeek's training data was generated using GPT-4.
However, this could also be explained by the fact that the DeepSeek model was trained on data from the internet, for even by 2024 the internet already contained numerous texts that had been generated by OpenAI models.
**Breaking Down Thought Processes**
One of the most significant developments in the field of language models over the past year and a half has been teaching them to execute extended chains of thought. Most models now feature such a "think" function: when faced with questions requiring reflection, the model first executes a series of internal thought steps and then provides an answer based on those steps. Examples of this include OpenAI's o1 model, Google's Gemini Thinking, and DeepSeek's R1 model. Such thought steps are highly beneficial for answering more complex questions.
Let us illustrate this with a simple example. Question: Anne has three pears and buys two more—how many does she have in total? The thought steps—or the reasoning process—proceed as follows: Anne has three pears. She buys two more. Three plus two equals five. Answer: five pears.
Such thought steps not only make an answer comprehensible but, more importantly, significantly enhance its quality. This enables the models to answer far more difficult questions by systematically working out the solution through these intermediate steps. These thought steps prove particularly useful when tackling complex inquiries, such as mathematical problems. During this internal reasoning process, the model proposes various approaches to a solution, discards some, experiments with others, and corrects errors as they arise. For intricate mathematical problems, these thought steps can easily span 50 pages or more; typically, however, they are not displayed to the user. This comes at a price: for longer chains of thought, the language model must generate more text, which requires greater computational power.
The difficulty in teaching a model such reasoning processes lies in the fact that very little data of this kind exists in written form. For instance, when mathematicians solve a difficult problem, they typically record only the correct solution—not the intricate lines of reasoning and attempts that contributed to finding that solution.
To obtain such mathematical reasoning steps, a technique known as reinforcement learning plays a pivotal role. In this process, models experiment with various lines of reasoning; those lines of reasoning that lead to correct solutions are subsequently reinforced, making them more likely to occur in the future. Through this mechanism, language models learn to generate long and complex reasoning processes that culminate in a solution. This approach works only if the model already possesses strong mathematical reasoning capabilities; otherwise, it would be unable to generate any reasoning processes that lead to a solution.
Reinforcement learning can therefore be viewed as a method through which language models learn from self-generated, synthetic data. A prerequisite for this is the ability to automatically assess whether a given text or result is good or bad. In mathematics and programming, this is often possible—a key reason why language models have become so powerful in these fields.
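The core loop can be sketched as follows. This is a strong simplification of reinforcement learning (closer to rejection sampling): the model proposes several reasoning chains per problem, an automatic check verifies the final answer, and only verified chains are kept as new training data, so that successful reasoning becomes more likely in the future. `generate` and `check_answer` are assumed placeholders, not real APIs.

```python
def collect_verified_reasoning(problems, generate, check_answer, samples_per_problem=8):
    """Sample reasoning chains and keep only those whose answer verifies."""
    training_data = []
    for problem in problems:
        for _ in range(samples_per_problem):
            reasoning = generate("Solve step by step:\n" + problem["question"])
            if check_answer(problem, reasoning):        # e.g. compare the final number
                training_data.append((problem["question"], reasoning))
    return training_data
```

The crucial ingredient is `check_answer`: in mathematics a final result can be compared against a known solution, and in programming generated code can be run against tests, which is exactly why these two fields lend themselves so well to this approach.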
Language models are increasingly being deployed in systems where they interact autonomously with computers to perform tasks, for instance booking flights, analyzing data, or conducting research. Such systems are referred to as "agents." As with the thought steps discussed earlier, language models in this context also learn from examples where such tasks have been successfully completed, examples generated partly by humans and partly by the language models themselves.
To gauge how well a student at TU Munich understands my lecture material, I—like most of my colleagues—rely on exams and assignments. For language models, exam-style tasks are also frequently used to test the models' capabilities. One such example is the United States Medical Licensing Examination (USMLE), the licensing exam for physicians in the United States. This exam consists of multiple-choice questions covering foundational concepts, clinical knowledge, and disease management. General-purpose language models like GPT-4, as well as specialized medical language models, pass such exams with ease, achieving scores on par with those of medical professionals. The same holds true for other professional fields.
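Evaluating a model on such an exam is conceptually simple. A sketch, assuming a generic `generate` call and a list of multiple-choice items (both placeholders, not a real benchmark harness), might look like this:

```python
def exam_accuracy(questions, generate):
    """Fraction of multiple-choice questions answered correctly."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{letter}) {text}" for letter, text in q["options"].items())
        reply = generate(f"{q['question']}\n{options}\nAnswer with a single letter.")
        if reply.strip().upper().startswith(q["correct_letter"]):
            correct += 1
    return correct / len(questions)
```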
However, just as a high grade on an exam offers only limited predictive power regarding how effectively a student can apply lecture material in their professional work or research, a language model's strong exam performance says even less about whether that model can productively execute real-world professional workflows, such as those found in a hospital setting.
Humans are able to transfer knowledge to novel situations with far greater flexibility than language models can. Language models, conversely, demonstrate exceptional performance primarily when the training data closely resembles the specific tasks at hand. Therefore, it is certainly possible to deploy language models for highly complex professional workflows; however, they must be trained on appropriate data.
**Filling the Data Gaps**
What follows from the fact that training data lies at the very heart of artificial intelligence? Language models are particularly well-suited for tasks where an abundance of usable, high-quality data is available. Even the earliest language models excelled at generating content for which numerous good examples existed on the internet—such as recipes, summaries, and general knowledge.
Today’s models also perform exceptionally well in highly complex tasks within mathematics or programming, precisely because a vast amount of high-quality data exists for these domains—or could be generated.
For tasks where gathering or generating sufficient data proves more challenging—as is often the case in various scientific fields, legal services, or internal corporate workflows—the success of language models will hinge on the ability to fill these data gaps.
The fact that general-purpose models are constantly improving does little to alter this fundamental reality. Anyone wishing to deploy language models for specialized tasks must train them using data specifically tailored to those applications. Consequently, there remains immense potential to make language models significantly more effective and useful. And a substantial portion of this effort will involve the collection and generation of training data.
Prof. Dr. Reinhard Heckel holds the Chair of Machine Learning within the Department of Computer Engineering at the Technical University of Munich (TU Munich).
Source: Reinhard Heckel, "Wieso es in der Künstlichen Intelligenz jetzt erst recht auf hochwertige Daten ankommt," Frankfurter Allgemeine Zeitung, Frankfurt, February 9, 2026, p. 19.