Wednesday, September 18, 2024

OpenAI: o1 is the beginning of a new paradigm

"Open AI's new o1 model represents a break with LLMs. Instead of making it as quick and cheap to use as possible, the model is optimized to solve complex tasks. This means slower and more expensive answers.

There are two phases in which large AI language models consume computing time. These phases in the life cycle of an LLM could not be more different.

The first phase is model building, commonly called training. During this time, the models' internal token structures emerge from processing large amounts of data. These structures are the basis that allows the models to recognize patterns and relationships in language and to generate language.

The second phase is what is known as inference, another word for drawing conclusions. Inference is the moment when the LLM applies the internal structures built up in training to process an input and, ideally, produce a meaningful output.
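To make the distinction concrete, here is a deliberately tiny sketch in Python: a toy "model" whose training phase does nothing more than count word transitions in a small corpus, and whose inference phase applies those counts to generate text. Real LLMs learn neural network weights rather than simple counts, but the separation between the two phases is the same.

```python
from collections import defaultdict, Counter
import random

corpus = "the model reads data the model learns patterns the model generates text".split()

# Phase 1: training -- build internal structures from data (expensive, done once per model)
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

# Phase 2: inference -- apply the learned structures to an input (done for every request)
def generate(prompt_token, length=5):
    out = [prompt_token]
    for _ in range(length):
        followers = bigram_counts.get(out[-1])
        if not followers:
            break
        out.append(random.choices(list(followers.keys()),
                                  weights=list(followers.values()))[0])
    return out

print(generate("the"))
```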

"Conventional" LLMs: Training becomes more expensive, inference cheaper

An analogy from traditional industry: the training phase is like the design and construction of a new factory with new production facilities. Inference is the production that takes place in this factory once construction is complete, i.e. the actual use of the production facilities.

To date, the industry has focused on the training phase when improving models. Training has become longer and more expensive over the years because the models have grown larger. Larger models with more parameters, i.e. the variables that models learn during training, usually lead to better, more capable models.

There are no official figures on training costs. 

However, experts estimate that the training of Claude 3.5 cost between 100 and 200 million dollars. 

GPT-4 is said to have cost around 78 million dollars according to, among others, the Stanford AI Index Report 2024. Gemini Ultra, the predecessor of Google's current Gemini 2, is said to have cost 191 million dollars to train. Dario Amodei, the CEO of Anthropic, the company behind Claude, has even publicly predicted that training costs could rise to up to ten billion dollars per model by 2025 or 2026. Whether this statement will prove true or is merely intended to deter potential new entrants remains to be seen.

 

The trend in training is clear, however: the largest, and therefore usually best, models keep getting bigger and thus more expensive to train.

 

The rising investment costs are borne by the model providers, but they are incurred only once per model and, due to the growing scale, increasingly represent a barrier to market entry. Only a few companies can actually afford to invest 100 million dollars or more in training a model. In addition to capital, you also need access to scarce computing power and equally scarce specialists.

 

There is currently no company in Germany that does this. In Europe, the only company that can get involved in this game is Mistral from Paris.

More expensive training does not automatically mean more expensive use, i.e. more expensive inference. Larger models are indeed more computationally intensive both in training and in use. But for almost a year now we have also seen a trend among the major model providers to make the use of their models cheaper and thus more attractive. This trend began with GPT-4 Turbo in November 2023, which was the first top model with lower usage costs. Since then, usage costs for the top models have been falling continuously.

This makes sense. High training costs are a burden on the balance sheet, but, as noted, they also provide a competitive advantage. Inference costs that are high, or at least perceived as high, on the other hand, prevent LLMs from spreading as products.

o1: A closer look at inference

o1 is in many ways a significant break with the LLM trend described above. Inference takes longer with o1. So the model feels slower. Much slower. OpenAI humanizes the longer computation time by calling it “thinking”.

But why is o1 a break? First, the model is not optimized for regular, run-of-the-mill requests like “reword this email in a more professional tone”. The “thinking time”, which is longer and more expensive than with other models, gives o1 new capabilities. It is better at logical tasks, such as mathematics or programming, than any other model. At the same time, it is no better, and often even worse, at formulating text than classic LLMs such as Claude or GPT-4o.

o1 is the first LLM that performs complex tasks better than simple ones, even if users would intuitively place both kinds of task in the same category. If you give o1 a simple task, OpenAI warns, the model may "think" too much about the solution and overcomplicate the result. The LLM landscape as a whole is not intuitive, and o1 exacerbates this.

Secondly, o1 represents a break because the model shows very clearly that accepting increased inference time opens up new options. Until now, the only axis for breakthroughs in LLMs was the training level. Be it more computing power, more or better data, or different architectural approaches, everything was focused on the training or construction phase of the models. With o1, inference time is transformed from an annoying cost factor into a potential driver of new approaches to language models.

Provided that users have a little patience. The maximum computation time between input and generated output before the model aborts seems to be just over 3 minutes for o1.

 

"Think step by step" as a model architecture

 

But why would o1 abort? What exactly is happening here?

 

This brings us to the third aspect of why o1 represents a break in LLMs. LLMs have so far worked in a strictly path-dependent manner. That is, they analyze the input and then begin to "predict" which words most likely form the response to that input. This approach gave rise last year to the misleading term "stochastic parrot", which ignored the complexity of LLMs and the resulting output quality. Errors in LLM output arise not only, but also, from the sequential generation of the text output. Once a token (a word or part of a word) has been generated, it determines from which direction the subsequent tokens can come.
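As an illustration of what strictly sequential generation means, the following sketch uses an invented next-token table in place of the network's learned distribution. The point is the control flow: each token is sampled given only the tokens already emitted, and once a token is emitted it is never revised.

```python
import random

# Invented next-token distributions; a real LLM computes these with a neural network.
next_token_probs = {
    ("2", "+", "2", "="):                 {"4": 0.9, "5": 0.1},
    ("2", "+", "2", "=", "4"):            {"<eos>": 1.0},
    ("2", "+", "2", "=", "5"):            {"because": 1.0},  # a wrong turn...
    ("2", "+", "2", "=", "5", "because"): {"<eos>": 1.0},    # ...and the rest builds on it
}

def sample_sequence(prompt, max_steps=10):
    seq = tuple(prompt)
    for _ in range(max_steps):
        dist = next_token_probs.get(seq)
        if not dist:
            break
        tokens, weights = zip(*dist.items())
        token = random.choices(tokens, weights=weights)[0]
        if token == "<eos>":
            break
        # The choice is final: later steps only condition on `seq`, they never revisit it.
        seq = seq + (token,)
    return seq

print(sample_sequence(("2", "+", "2", "=")))
```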

 

In simple terms, this path dependence means that once the model takes a wrong turn, it runs in the wrong direction for the rest of the output. Users have been able to mitigate this somewhat with prompting tricks. "Think through your answer step by step" and similar chain-of-thought approaches seem to nudge LLMs toward more systematic output. This can produce noticeably better results. However, like increasing model size, it only reduces the problem rather than eliminating it: larger models lower the probability of a wrong token, but it never disappears.
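Such prompting tricks live entirely in the prompt text, not in the model. The helper below is a minimal sketch of how a request might be wrapped with a step-by-step instruction; the message format mirrors common chat APIs, and no particular model or client is assumed.

```python
def with_chain_of_thought(task: str) -> list:
    """Wrap a user task with a step-by-step instruction (a common prompting trick)."""
    return [
        {"role": "system",
         "content": "Think through your answer step by step. "
                    "Write out the intermediate steps before giving the final answer."},
        {"role": "user", "content": task},
    ]

messages = with_chain_of_thought(
    "A train travels 240 km in 3 hours, then 100 km in 1 hour. What is its average speed?"
)
# `messages` would then be passed to whichever chat completion endpoint is in use.
print(messages)
```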

The term stochastic parrot is even less applicable to o1. This is the first time that OpenAI has gone beyond this sequential generation at inference. OpenAI does not say exactly how o1 was built. But we do know this much: Semafor reported in January 2023 that OpenAI had hired over 1,000 software developers worldwide as subcontractors to break down multi-stage programming projects into individual stages. The result of these efforts is likely to be data sets that help LLMs, during training, to form patterns for completing multi-step tasks.

In May 2023, OpenAI published a paper entitled Let's Verify Step by Step. In it, they describe, among other things, how they present data labelers with step-by-step solutions to math problems, as suggested by the Semafor report, and how the individual steps are evaluated. The goal of the paper: to build a "process-supervised reward model" (PRM). The PRM is meant to estimate the probability that an individual step is correct, evaluated after the step's last token has been generated.
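A hedged sketch of the idea: a scoring function, standing in for the trained reward model (whose details OpenAI has not released beyond the paper), estimates for each intermediate step the probability that it is correct, and the per-step estimates are aggregated into a score for the whole solution.

```python
import math

def score_solution(problem, steps, prm_score_step):
    """Aggregate per-step correctness estimates into a score for the whole solution.

    `prm_score_step(context, step)` stands in for a trained process-supervised
    reward model returning P(step is correct | context).
    """
    log_p = 0.0
    context = problem
    for step in steps:
        p = prm_score_step(context, step)
        log_p += math.log(max(p, 1e-9))  # one clearly wrong step drags the whole solution down
        context += "\n" + step
    return math.exp(log_p)

# Toy usage with a dummy scorer that trusts every step equally.
print(score_solution("2 + 2 * 3 = ?",
                     ["2 * 3 = 6", "2 + 6 = 8", "Answer: 8"],
                     prm_score_step=lambda ctx, step: 0.9))
```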

In summary: o1 was trained with a view to solving multi-step logic problems. With this focus, o1 is designed to generate several problem-solving paths within the inference time, evaluating each step individually and thus determining when it has "taken a wrong turn" and needs to start again.

o1's "thinking time" is longer because the model runs through several solution directions and can identify errors independently. This is why o1 can abort the calculation. The model determines that the previous result is wrong, but the maximum computing power allocated to it has expired.

Where OpenAI is headed

OpenAI has ten million subscribers. The higher-priced enterprise offering for companies, which is only a year old, already has one million subscribers. o1 offers enormous potential here. Solving multi-stage challenges broadens the possible uses. o1 is likely to be used in research in particular. But programming with an LLM also reaches a new level here. Think of our text on AI-supported programming. OpenAI can also link o1 with its other models: o1 works out the route, and the cheaper models do the "legwork". The biggest challenge remains on the product side of the model. OpenAI needs to communicate better what can and cannot be achieved with this model. LLMs are difficult to grasp, and o1 seems to reinforce this difficulty.

 

At the same time, however, o1 shows that the time of autonomous and semi-autonomous agents is near. o1 can be the basis for the first well-functioning agents.

 

It is interesting that OpenAI bills the additional inference-time computation for API users as invisible tokens. API usage is calculated from input tokens and output tokens; now an unpredictable variable is added to the costs. OpenAI does not say why it does this, but our guess is that OpenAI wants to prevent other models from being trained on the basis of o1. Such use is prohibited by the terms and conditions but still takes place via the API. o1 does not show the user the steps it took before producing the output. You cannot see which directions the system explored and rejected. All of these calculations cost money, but OpenAI does not want to disclose them.
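A small illustrative calculation shows why this matters for API users; the prices and token counts below are made-up example numbers, not OpenAI's actual rates.

```python
# Illustrative cost of an API call where the "thinking" happens in hidden
# reasoning tokens that are billed like output but never shown to the user.
price_per_input_token  = 15 / 1_000_000   # $ per input token (example value)
price_per_output_token = 60 / 1_000_000   # $ per output token (example value)

input_tokens     = 800     # the prompt
visible_output   = 300     # the answer the user actually sees
reasoning_tokens = 4_000   # hidden "thinking" tokens, unpredictable in advance

cost = (input_tokens * price_per_input_token
        + (visible_output + reasoning_tokens) * price_per_output_token)
print(f"${cost:.4f}")  # the hidden tokens dominate the bill
```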

Where LLMs are headed

If the current GPT-4-based LLMs teach us anything, it is that OpenAI is usually only the first, but not the only one, to make LLM breakthroughs. We will see more models that work similarly to o1 in the coming months.

Open-source models from Meta or Mistral could, unlike o1, reveal the internal reasoning processes, which should open up further applications.

AI agents are now becoming tangible, as are sophisticated model mixes with a division of labor between LLMs, as we have described here.

Conclusion

o1 shows that the end of the line in LLM development is still a long way off.

However, with the stronger focus on inference in this new, "thinking" type of model, the chips and computing power available to us become even more important.

 

Finally, o1 also shows how regulation is lagging behind rapid technological development. The EU's AI Act has focused on computing power in the training phase in order to be able to distinguish "dangerous" from "safe" AI. The AI Act sets a threshold of 10^25 FLOPs for the computing power used to train AI models. Models that exceed this value are classified as systems with "high systemic risk".
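For a sense of scale, training compute is often approximated with the rule of thumb of roughly 6 FLOPs per parameter per training token. The figures below are illustrative, not those of any real model.

```python
# Rough check against the AI Act's 10^25 FLOPs threshold, using the common
# approximation: training compute ≈ 6 × parameters × training tokens.
THRESHOLD_FLOPS = 1e25

params = 70e9    # 70 billion parameters (illustrative)
tokens = 15e12   # 15 trillion training tokens (illustrative)

training_flops = 6 * params * tokens  # ≈ 6.3e24 FLOPs in this example
print(f"{training_flops:.2e} FLOPs, exceeds threshold: {training_flops > THRESHOLD_FLOPS}")
```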

 

With a simple, slight shift in priorities, o1 has made this already questionable regulatory approach even more questionable. Because after o1, in the near future we will also see models that require far less training, use more computing time at inference, and whose capabilities exceed anything we know today. Also as open source. And also running locally.

 

Marcel Weiß

Marcel Weiß is an independent analyst and strategy consultant in Berlin. He has been working on platform issues and other strategy-relevant digital economic dynamics since the early years of the new millennium. He advises companies and gives keynote speeches on these topics."

 


 


 
