ChatGPT triggered the biggest AI boom to date. But how far have language models come? And what's next? An overview.
1. The latest developments: Since the release of ChatGPT in November 2022, the AI world has developed explosively. New models and tools are sprouting up almost weekly. For users, this means unprecedented opportunities – but also an overwhelming jungle of offerings. Finding the right tool for a specific purpose now requires almost as much guidance as expertise.
New developments are also reaching the market at an ever faster pace. Shortly after the announcement of ChatGPT, GPT-4 was released on March 14, 2023. This was followed by GPT-4o on May 13, 2024, and then the well-known DeepSeek-R1 model on January 20, 2025. On August 5 of this year, the GPT-OSS model was released. Just a few days later, on August 7, GPT-5, the successor to GPT-4, followed. In comparison, there was a roughly two-year gap between ChatGPT and the last "groundbreaking" model before it, GPT-3.
However, not only are new language models being developed, but also new application methods. From the user's perspective, this primarily means that it is easier to either find a suitable model for the desired application or to adapt an existing model to it. One example of a new and increasingly popular application method is the use of so-called AI agents.
Often, an AI system is needed that is not limited solely to the knowledge it acquired during training, but is also able to access external resources or tools, either to ensure that its answers are correct and up to date or to retrieve information from a specific external data source. This ability to use external tools is precisely what distinguishes an AI agent.
AI agents are modular in design. If a user instructs the AI agent to solve a complex mathematical problem, the AI recognizes that this is a task requiring the capabilities of an external calculator. Instead of relying solely on its (potentially insufficient) training data, the AI therefore calls upon the calculator, which performs the calculation correctly and reliably. AI agents are very diverse – they can be designed, for example, to generate images based on a description or to perform an internet search.
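How such an agent can work is outlined in the following sketch. It is a minimal illustration of the tool-calling idea, not a specific framework; the function names and the prompt format are hypothetical, and `llm` stands for any callable language model.

```python
def calculator(expression: str) -> str:
    """External tool: evaluates arithmetic reliably instead of letting the model guess."""
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"calculator": calculator}

def run_agent(user_request: str, llm) -> str:
    # Step 1: the language model decides whether an external tool is needed.
    decision = llm(
        f"Task: {user_request}\n"
        "If a calculation is required, reply exactly with 'calculator: <expression>'. "
        "Otherwise answer directly."
    )
    if decision.startswith("calculator:"):
        # Step 2: the agent executes the tool call...
        result = TOOLS["calculator"](decision.split(":", 1)[1].strip())
        # Step 3: ...and the model formulates the final answer from the tool output.
        return llm(f"Task: {user_request}\nTool result: {result}\nWrite the final answer.")
    return decision
```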
To evaluate current language models on their agent capabilities, various benchmarks have been developed, such as the Tau benchmark. It specializes in evaluating how well agents interact with human users in realistic settings: the user asks vague questions, initially withholds information about the exact task, or even changes the task during the conversation. Previous benchmarks, in contrast, were specialized for specific, individual task areas such as shopping or travel. Currently, models like Anthropic's Claude and OpenAI's GPT-4o occupy the top ranks in the Tau benchmark, meaning that these models are the most capable of handling varied and sometimes complex tasks.
GPT-OSS, a new model from OpenAI, also performs well on the Tau benchmark and on other reasoning-focused benchmarks, almost as well as GPT-4o-mini. This is good news for users and researchers, because GPT-OSS has another remarkable characteristic: unlike previous well-known OpenAI models such as GPT-4 and GPT-4o, GPT-OSS is a so-called open-weight model. This means that its pre-trained parameters can be downloaded, inspected and modified through further training or fine-tuning. This allows much finer control over how the model behaves than prompting alone.
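As a rough illustration of what "open weight" means in practice, the following sketch loads such a model locally with the Hugging Face Transformers library. The repository name "openai/gpt-oss-20b" and the hardware requirements are assumptions here; a GPU with sufficient memory (and the accelerate package) would be needed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"  # assumed repository name of the open-weight model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Because the weights are stored locally, they can be inspected, fine-tuned, or adapted,
# instead of only being queried through a remote interface.
inputs = tokenizer("What distinguishes an open-weight model?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```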
2. How far artificial intelligence has really come: New releases don't always mean progress. One example is GPT-5: the model was supposed to represent a milestone, a leap roughly comparable to that from GPT-3 to GPT-4. Accordingly, GPT-5 was released with many ambitious promises. However, the reactions after the release were mixed. Many felt their expectations had not been met, as GPT-5 was not as groundbreaking as its predecessor. Furthermore, the release was accompanied by the unannounced removal of earlier models, which angered users.
Nevertheless, in terms of performance, GPT-5 is a very good model. In fact, GPT-5 leads LMArena, the most popular general-purpose benchmark, which indicates strong overall capabilities. However, it is not necessarily the best model for every task: on the SimpleBench reasoning benchmark, for example, GPT-5 only ranks fifth.
When it comes to benchmarks, it's important to note that while they provide some insight into the capabilities of a language model, this should be understood as an assessment rather than a universally valid conclusion. Models can perform very well on a benchmark, but when applied to "real-world" data in a similar task domain, they can still sometimes fail. This suggests that the model hasn't actually mastered the abilities measured by the benchmark – it only possesses limited abilities that are highly context-dependent, meaning they depend on the exact wording of the prompt or the specific examples presented to the model.
This is known to occur with so-called "counterfactual" tasks, where the correct solution or approach differs from what is normally expected. In a counterfactual task, we might specify that the word "dog" does not refer to an animal as usual, but instead to a vehicle. While humans can quickly adapt to this new meaning and reinterpret sentences accordingly, language models have great difficulty with it. The reason: they were predominantly trained on data in which words carry their everyday, conventional meaning. Unfortunately, this problem has not yet been solved – even new models like DeepSeek-R1, GPT-4 and GPT-5 make such seemingly trivial errors despite complex training procedures designed to foster deeper reasoning abilities. This indicates that the capabilities of the new models are not yet sufficiently general.
Furthermore, in cases where AI models outperform humans on benchmarks, it's crucial to carefully consider how the human results were obtained. Often, humans are evaluated differently than models, for example, by being tested on only a smaller subset of the questions. The evaluation methods also differ depending on the task. Sometimes the majority decision of several people is used, sometimes the result of the highest-performing individual is taken. A good performance of a large AI language model can therefore mean very different things depending on the benchmark.
3. The first Transformer architecture and BERT: The current developments and the success achieved by AI language models are based on the so-called Transformer architecture, which was published in 2017 in the well-known research paper "Attention Is All You Need". The paper showed that equally good (or even better) results could be achieved by keeping only the attention mechanism of earlier architectures and omitting other components such as recurrence. This represented an important paradigm shift compared to the previously dominant architectures for processing longer texts, known by abbreviations such as RNN and LSTM. Shortly afterwards, in 2018, the AI model BERT was developed by Google researchers. It is based on the Transformer architecture and was trained with a new training method called Masked Language Modeling: parts of a text are intentionally replaced with gaps, and the model learns to predict the missing words. This allowed BERT to capture language patterns very generally and to answer many different questions by providing the most likely completion. Through this training method, BERT was considered a language model (LM).
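A minimal sketch of Masked Language Modeling in action, using the publicly available bert-base-uncased model via the Hugging Face Transformers library (the example sentence is ours):

```python
from transformers import pipeline

# Load a pre-trained BERT model for the fill-mask task.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the most likely words for the [MASK] gap.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top prediction should be "paris".
```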
4. Scaling and large language models after BERT: It soon became clear that language models become even better when they are made larger and trained with more data. This process is called "scaling", and it turns a "language model" into a "large language model" (LLM). It was made possible by significantly increased computing power. Of course, various modifications were also made to the Transformer architecture over the years, which likewise contributed to the success of large language models. Nevertheless, the prevailing view was that complex tasks beyond the reach of smaller models were primarily solvable through scaling. This view was supported by what AI experts call "emergent abilities": the capacity to solve tasks that only appears above a certain model size, and in an unexpected way.
To expand the capabilities of language models, large companies therefore invested astronomical sums in scaling them up further. One example is Google's PaLM model, which had 540 billion parameters and outperformed smaller models on many tasks. A disadvantage of models of this size is that they cannot run on a normal PC. Instead, they are usually accessed via an interface and run on particularly powerful computing clusters with dedicated servers. While new models are usually more expensive than their predecessors, the added benefits can certainly justify the higher costs.
5. Other Training Methods: It's not just about scaling: Scaling is not the only method to improve the capabilities of a language model. In the post-BERT era, many complementary training methods have been developed that build upon the fundamental, general training (pre-training).
One example is fine-tuning, in which only a portion of a model is trained for a specific task. This requires far less data than pre-training and specializes the model for the target task – which improves performance on that task compared to the pre-trained "base model." A well-known form of fine-tuning is so-called instruction tuning. Using this method, models are trained on user instructions to better recognize and respond to them. Thanks to this, language models are able to interpret instructions like "Tell me about good restaurants nearby" as a request. Accordingly, they will generate a list of restaurants instead of simply trying to continue the text of the instruction, like a text in a novel.
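The following sketch shows the kind of (instruction, response) pairs such instruction tuning is typically trained on; the examples and the format are our own simplification. Fine-tuning updates the model's parameters on many such pairs, so that an instruction is later treated as a request rather than as text to be continued.

```python
# Hypothetical instruction-tuning examples (simplified format).
instruction_data = [
    {
        "instruction": "Tell me about good restaurants nearby.",
        "response": "Here are three options: 1. A small Italian trattoria around the corner ...",
    },
    {
        "instruction": "Summarize the following article in two sentences: ...",
        "response": "The article argues that ...",
    },
]
```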
In contrast, with so-called in-context learning, the user provides the model with a few examples directly in the prompt to show it how to solve a task. For example, before a difficult math problem, one can provide two or three similar solved problems. In this way, the model better understands how to proceed and often delivers more accurate results. Unlike fine-tuning, however, the model is not permanently altered – it only learns from the examples during the current request, without its internal parameters being adjusted through training.
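A minimal sketch of such an in-context-learning prompt (the arithmetic examples are ours): the solved examples are placed directly in the prompt, and the model's parameters remain untouched.

```python
# Two solved examples ("few-shot" demonstrations) followed by the new problem.
few_shot_prompt = """Q: A shirt costs 20 euros and is reduced by 25%. What does it cost now?
A: 25% of 20 euros is 5 euros, so the shirt costs 15 euros.

Q: A book costs 40 euros and is reduced by 10%. What does it cost now?
A: 10% of 40 euros is 4 euros, so the book costs 36 euros.

Q: A jacket costs 80 euros and is reduced by 15%. What does it cost now?
A:"""

# answer = llm(few_shot_prompt)  # the model imitates the demonstrated solution pattern
```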
Interestingly, instruction tuning and in-context learning are related to emergent capabilities. In a 2024 analysis, we showed that these capabilities are by no means unexpected – rather, they can be explained as a combination of knowledge acquired from training data and the way users reactivate this knowledge, either through in-context-learning (ICL) prompts as described above or through instruction tuning. This is another example demonstrating that scaling alone is not the key to further improving language models – multiple methods must be combined.
6. New Post-Training Methods: The importance of the right post-training method for a language model's performance can be demonstrated by the success of reinforcement learning. The basic idea is that many tasks require multiple intermediate steps. We humans usually don't arrive at the solution immediately, but think step by step – sometimes even out loud. Language models, on the other hand, are often only asked to give the correct answer directly. And this is precisely where they often fail when it comes to complex problems.
This initially led to the conclusion that language models do not possess complex reasoning abilities. However, this assumption proved false – language models do indeed have such abilities; they simply need to be activated through the right method. One such method is Chain-of-Thought (CoT) prompting, where the model is prompted not to give the solution immediately, but to first explain the thought process, as a human would. Surprisingly, this method often leads to more correct answers than normal prompting. The reason for this success is that the language model generates more tokens before the final answer, which makes it possible to model the complex probabilistic relationship between the prompt and the correct answer through several less complex relationships between the tokens of the thought process. Of course, CoT doesn't always lead directly to the correct answer. However, an important observation is that even when the generated chain of thought leads to the wrong answer, a chain of thought that would have led to the correct answer was found by the model during the process, but was simply not chosen as the most probable. The model was therefore able to find the correct chain of thought, but did not recognize it as such. To correct this error, the model does not need to be retrained; instead, the decoding process must be modified so that the correct chains of thought are ranked higher than the incorrect ones.
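A minimal sketch of what such a chain-of-thought prompt looks like (the arithmetic example and the sample output are illustrative, not an actual model transcript):

```python
# The model is asked to reason step by step before giving the final answer.
cot_prompt = (
    "A train travels 60 km in 40 minutes. How far does it travel in 2 hours "
    "at the same speed? Think step by step before giving the final answer."
)

# A typical chain of thought the model might generate:
# "40 minutes is 2/3 of an hour, so the speed is 60 / (2/3) = 90 km/h.
#  In 2 hours the train travels 2 * 90 = 180 km. Final answer: 180 km."
```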
Such optimization can be achieved through reinforcement learning. Models like OpenAI-o1 and DeepSeek-R1 were trained using precisely this process. Many different training variations exist for optimally finding the correct chains of thought. Two well-known ones go by the abbreviations DPO and PPO. Both methods teach the model to align its outputs, including its thought processes, more closely with those preferred by humans. The strategy by which the model chooses its outputs is called the "policy", and this is precisely what needs to be optimized for the model to behave correctly. The two methods differ in how this training is conducted: in PPO (Proximal Policy Optimization), an explicit reward model is trained that provides feedback to the main model; in DPO (Direct Preference Optimization), the main model is trained with an implicit reward model.
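To make the difference a little more concrete, here is a sketch of the DPO objective in PyTorch. It is a simplified illustration under our own assumptions (summed log-probabilities per answer, a fixed beta), not the exact training code of any particular model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss (simplified sketch).

    logp_*     : log-probabilities of the preferred / rejected answer under the
                 model being trained (the policy).
    ref_logp_* : the same quantities under a frozen reference model.
    beta       : controls how far the policy may drift from the reference.
    """
    # The implicit reward is the scaled log-probability ratio between policy and reference.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # The loss pushes the policy to rank the human-preferred answer above the rejected one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```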
Models like DeepSeek-R1 and OpenAI-o1 are distinguished by their ability to generate their own "chains of thought." DeepSeek-R1 is a particularly notable example, as its reasoning behavior was trained almost entirely through reinforcement learning. Models that are largely trained using reinforcement learning methods are called "Large Reasoning Models" (LRMs).
Another characteristic of these models is that they output their own chains of thought and also take more time to think through a task. Research has shown that more computation time during the generation of the response text leads to better results, similar to how humans often give better answers the longer they have thought about a problem.
Following the success of DeepSeek-R1 and OpenAI-o1, there is increasing interest in LRMs. It is possible that LRMs will soon replace conventional language models in benchmarks and as the focus of research.
An important point, however, is that LRMs do not differ from conventional language models in their architecture, but rather in their training and post-training processes. The Transformer architecture, which underlies language models, remains the same and is therefore still the "state of the art."
7. Are Transformers generally reaching their limits, and are there things they fundamentally cannot do? Some scientists, such as Meta's chief AI scientist Yann LeCun, believe that large language models have a fundamental problem rooted in the autoregressive Transformer architecture itself. Specifically, the issue is that the process of autoregressive prediction of the next token always involves a margin of error – and this error cannot be completely eliminated.
(Autoregressive means that a sequence, such as a text or a time series, is predicted one element at a time, with each new element predicted from the elements that came before it; the model essentially "predicts itself" from its own history.)
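The following sketch shows this autoregressive loop with a small publicly available model (GPT-2, via the Hugging Face Transformers library). Every predicted token, including a wrong one, is fed back as context for the next prediction, which is where the error can compound:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

tokens = tokenizer("Language models predict", return_tensors="pt").input_ids
for _ in range(20):
    logits = model(tokens).logits[:, -1, :]          # distribution over the next token
    next_token = torch.argmax(logits, dim=-1)        # greedy choice; any error made here...
    tokens = torch.cat([tokens, next_token[:, None]], dim=-1)  # ...is fed back as context
print(tokenizer.decode(tokens[0]))
```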
Therefore, there will always be a non-negligible probability that incorrect token sequences are predicted. Known consequences include hallucinations, i.e., the generation of plausible-sounding but fabricated or factually incorrect text, as well as the generation of toxic or harmful responses. Furthermore, some researchers believe that these behaviors cannot be fully controlled and thus represent an inherent limitation of language models.
Only further research can clarify whether this assumption is true. Recent studies have shown that large language models are controllable to a certain extent – particularly through careful selection and preparation of the data. There are also many ongoing efforts to reduce toxicity and hallucinations through targeted training or fine-tuning procedures. Even though there is currently no universally effective solution to these problems, it is quite possible that language models will be controllable to a satisfactory degree for certain end applications – although this may require greater effort on the part of developers and careful use by users.
It is currently unclear whether an alternative architecture can be developed that could replace such LLMs, for example, by combining the specialized but limited capabilities of classical machine learning methods with the generalized but hallucination-prone capabilities of language models, thus combining the best of both worlds. However, it is also conceivable that something entirely new will be developed in the future that could have a similarly significant impact as language models have (or had) at their time.
Prof. Dr. Iryna Gurevych researches and teaches at the Technical University of Darmstadt and focuses on how computers can understand and process language.
Irina Bigoulaeva is a PhD student in the research laboratory led by Iryna Gurevych. [A]
A. Iryna Gurevych and Irina Bigoulaeva: "Künstliche Intelligenz: Wo stehen wir wirklich?" Frankfurter Allgemeine Zeitung, Frankfurt, 20 October 2025, p. 18.