
Wednesday, October 30, 2024

o1 Is The Beginning Of A New Paradigm

 

"Open AI's new o1 model represents a break with LLMs. Instead of making it as quick and cheap to use as possible, the model is optimized to solve complex tasks. This means slower and more expensive answers.

 

There are two phases in which large AI language models consume computing time. These phases in the life cycle of an LLM could hardly be more different. The first is model building, commonly called training. During this phase, the internal token structures of the model emerge from the processing of large amounts of data. They are the basis for the model's ability to recognize patterns and relationships in language and to generate language itself. The second phase is known as inference. Inference is another word for drawing a conclusion: it is the moment when the LLM applies the internal structures built up during training to process an input and, ideally, produce a meaningful output.
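To make the two phases concrete, here is a minimal toy sketch in Python. It is not how a real LLM works internally (a bigram counter stands in for the neural network), but it shows the division of labor: "training" derives structures from data once, "inference" only applies the frozen structures to new input.

```python
# Toy illustration of the two phases (not how a real LLM is built):
# "training" derives internal structures from data, "inference" only applies them.
from collections import defaultdict
import random

def train(corpus: list[str]) -> dict:
    """Training phase: build token statistics from data (here: simple bigram counts)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def infer(model: dict, prompt: str, max_tokens: int = 10) -> str:
    """Inference phase: the frozen model is only applied, token by token."""
    tokens = prompt.split()
    for _ in range(max_tokens):
        options = model.get(tokens[-1])
        if not options:
            break
        # Each generated token constrains what can plausibly follow it.
        nxt = random.choices(list(options), weights=list(options.values()))[0]
        tokens.append(nxt)
    return " ".join(tokens)

model = train(["the model learns patterns", "the model generates language"])
print(infer(model, "the"))
```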

 

"Conventional" LLMs: Training becomes more expensive, inference cheaper

 

A comparison from traditional industry: the training phase is like the design and construction of a new factory with new production facilities. Inference is the production that takes place in this factory once construction is complete, i.e. the actual use of the production facilities.

 

To date, the industry has focused on the training phase when improving its models. Training has become longer and more expensive over the years because the models have become larger. Larger models with more parameters, i.e. the variables a model learns during training, usually lead to better, more capable models.

 

There are no official figures on training costs. However, experts estimate that training Claude 3.5 cost between 100 and 200 million dollars. GPT-4 is said to have cost around 78 million dollars, according to the Stanford AI Index Report 2024, among others. Gemini Ultra, the predecessor of Google's current Gemini models, is said to have cost 191 million dollars to train. Dario Amodei, the CEO of Anthropic, the company behind Claude, has even publicly predicted that training costs could rise to as much as ten billion dollars per model by 2025 or 2026. Whether this statement will come true or is just meant to deter potential new challengers remains to be seen. The trend in training is clear, however: the largest, and therefore usually best, models are getting bigger and thus more expensive to train.

 

The rising investment costs are borne by the model providers, but they are incurred once per model and, because of their growing size, increasingly represent a barrier to market entry. Only a few companies can afford to invest 100 million dollars or more in training models at all. In addition to capital, you also need access to scarce computing power and equally scarce specialists. There is currently no company in Germany that does this. In Europe, the only company that can get involved in this game is Mistral from Paris.

 

More expensive training does not automatically mean more expensive use, i.e. more expensive inference. Larger models are indeed more computationally intensive both in training and in use. But for almost a year now, we have also seen a trend among the major model providers to make the use of their models cheaper and thus more attractive. This trend began with GPT-4 Turbo in November 2023, which was the first top model with lower usage costs. Since then, usage costs for the top models have been falling continuously.

 

This makes sense. High training costs are a burden on the balance sheet, but as noted, they also provide competitive advantages. High inference costs, or costs perceived as high, on the other hand, hold back the spread of LLMs as products.

 

o1: A stronger focus on inference

 

o1 is in many ways a significant break with the LLM trend described above. Inference takes longer with o1, so the model feels slower. Much slower. Open AI humanizes the longer computing time as "thinking". But why does o1 represent a break? First, the model is not optimized for regular, run-of-the-mill requests like "reword this email in a more professional tone." The "thinking time," which is longer and more expensive than with other models, gives o1 new capabilities. It is better at logical tasks, such as mathematics or programming, than any other model. At the same time, it is no better, and often even worse, at formulating text than classic LLMs like Claude or GPT-4o.

 

o1 is the first LLM that can handle complex tasks better than simple ones, even though users would intuitively place both kinds of task in the same category. If you give o1 a simple task, Open AI warns, the model can "think" too much about the solution and overcomplicate the result. The LLM landscape as a whole is not intuitive, and o1 exacerbates this situation.

 

Secondly, o1 represents a break because the model shows very clearly that accepting longer inference times opens up new options. Until now, the only axis for breakthroughs in LLMs was at the training level. Be it more computing power, more or better data, or new architectural approaches, everything was focused on the training or construction phase of the models. With o1, inference time is transformed from an annoying cost factor into a potential enabler of new approaches in language models.

 

Provided that users have a little patience. The maximum computing time between input and generated output before the model aborts seems to be just over 3 minutes with o1.

 

"Think step by step" as a model architecture

 

But why does o1 represent a break? What exactly is happening here?

 

This brings us to the third aspect of why o1 represents a new paradigm in LLMs. LLMs have so far worked strictly path-dependently. That is, they analyze the input and then begin to "predict" which words are most likely to be the answer to that input. This approach gave rise last year to the misleading term "stochastic parrot", which ignored the level of complexity of LLMs and the resulting output quality. Errors in the output of LLMs arise not only, but also, from the sequential creation of the language output. Once a token (a word or part of a word) has been generated, it determines the direction from which subsequent tokens can come.

 

In simple terms, this means that once the model takes a wrong turn, it runs in the wrong direction for the rest of the output. Users have been able to mitigate this path dependency somewhat with a few prompting tricks. "Think through your answer step by step" in the prompt, and similar chain-of-thought approaches, seem to push LLMs in a direction that promotes more systematic output. This can produce noticeably better results. However, like model size, it only reduces the problem rather than eliminating it. Large models lower the probability of a false token, but here too it does not disappear.
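As an illustration of this prompting trick, here is a hedged sketch using the official OpenAI Python SDK; the model name and the exact wording of the instruction are our own illustrative choices, not a recommendation from the article.

```python
# Sketch of the "think step by step" prompting trick described above.
# Assumes the official OpenAI Python SDK; model name and wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_with_reasoning(task: str) -> str:
    """Wrap a task in a chain-of-thought instruction before sending it."""
    prompt = (
        f"{task}\n\n"
        "Think through your answer step by step, "
        "then state the final answer on its own line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any "classic" LLM works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask_with_reasoning("A train travels 120 km in 1.5 hours. What is its average speed?"))
```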

 

The term stochastic parrot is even less applicable to o1. This is the first time that Open AI has gone beyond this sequential generation in inference. Open AI does not say how exactly they built o1. But we do know this much: 

 

Semafor reported in January 2023 that Open AI had hired over 1,000 software developers worldwide as subcontractors to break down multi-step programming projects into individual stages. The result of these efforts is likely to be data sets that help LLMs, during training, to learn patterns for completing multi-step tasks.

 

In May 2023, Open AI published a paper entitled Let's Verify Step by Step. In it, they describe, among other things, how they present step-by-step solutions to math problems to data labelers, as suggested in the Semafor article, and how the individual steps are evaluated. The aim of the paper: to build a "process-supervised reward model" (PRM). The PRM is meant to estimate the probability that an individual step is correct, evaluated after the step's last generated token.
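As a sketch of the idea behind a process-supervised reward model: each intermediate step gets its own probability of being correct, and the step scores can be combined into a score for the whole solution. The scoring function below is a dummy placeholder; in the paper it is a trained neural reward model.

```python
# Sketch of process supervision as described in "Let's Verify Step by Step":
# a reward model assigns each intermediate step a probability of being correct.
# `score_step` is a stand-in here; in the paper it is a trained neural reward model.
import math

def score_step(problem: str, steps_so_far: list[str]) -> float:
    """Hypothetical placeholder: P(last step is correct | problem, previous steps)."""
    last = steps_so_far[-1]
    return 0.95 if "=" in last else 0.6  # dummy heuristic for illustration only

def solution_score(problem: str, steps: list[str]) -> float:
    """One common aggregation: the probability that *every* step is correct."""
    log_p = 0.0
    for i in range(len(steps)):
        log_p += math.log(score_step(problem, steps[: i + 1]))
    return math.exp(log_p)

steps = ["Let x be the unknown number.", "2x + 3 = 11", "2x = 8", "x = 4"]
print(round(solution_score("Solve 2x + 3 = 11", steps), 3))
```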

 

In summary, the following can be stated: o1 was trained with a view to solving multi-step logic problems. 

 

With this focus, o1 was designed to create several solution paths within the inference time, evaluating each step individually and thus determining when it has "taken a wrong turn" and needs to start again.

 

o1's "thinking time" is longer because the model goes through several solution directions and can recognize errors independently. 

 

This is also why o1 can abort a calculation: the model determines that its previous result is wrong, but the maximum computing time allocated to it has already been used up.
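Open AI has not disclosed how o1 actually implements this. The following toy sketch only illustrates the described behavior: several solution paths within a compute budget, a check after every step, a restart after a wrong turn, and an abort when the budget runs out. The generator and the verifier are dummy stand-ins.

```python
# Toy sketch of "search during inference": generate several solution paths,
# check every step, restart after a wrong turn, abort when the budget runs out.
# This is NOT Open AI's actual o1 implementation, only an illustration of the idea.
import random
import time

def propose_next_step(problem: str, steps: list[str]) -> str:
    """Stand-in for the generator model proposing the next reasoning step."""
    return f"step {len(steps) + 1} for: {problem}"

def step_is_plausible(problem: str, steps: list[str]) -> bool:
    """Stand-in for a process reward model; here a coin flip plays the verifier."""
    return random.random() > 0.2

def solve(problem: str, max_steps: int = 5, budget_seconds: float = 2.0):
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:           # several attempts within the budget
        steps: list[str] = []
        for _ in range(max_steps):
            steps.append(propose_next_step(problem, steps))
            if not step_is_plausible(problem, steps):
                break                            # wrong turn detected: discard this path
        else:
            return steps                         # all steps passed the check
    return None                                  # budget exhausted: abort without an answer

print(solve("2x + 3 = 11"))
```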

 

Where Open AI is headed

 

Open AI has ten million subscribers. The higher-priced Enterprise offering for companies, which is only a year old, already has one million subscribers. o1 offers enormous potential here. Solving multi-step challenges broadens the range of possible uses. o1 is likely to be used in research in particular. But programming with an LLM also reaches a new level here. Think of our text on AI-assisted programming. Open AI can also couple o1 with its other models: o1 creates a work plan, and the cheaper models do the "legwork". The biggest challenge remains on the actual product side of the model. Open AI needs to communicate better what can and cannot be achieved with this model. LLMs are difficult to grasp, and o1 seems to reinforce this difficulty of classification.
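Such a coupling is a common orchestration pattern: an expensive model plans, a cheap model executes. The sketch below is our own illustration of it, not Open AI's product; the model names and the SDK call are assumptions.

```python
# Sketch of the division of labor mentioned above: an expensive reasoning model
# drafts the work plan, a cheaper model carries out the individual steps.
# Model names and the SDK call are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def run_project(task: str) -> list[str]:
    plan = client.chat.completions.create(
        model="o1-preview",          # expensive: plans the steps
        messages=[{"role": "user", "content": f"Break this task into numbered steps:\n{task}"}],
    ).choices[0].message.content

    results = []
    for step in [s for s in plan.splitlines() if s.strip()]:
        answer = client.chat.completions.create(
            model="gpt-4o-mini",     # cheap: does the "legwork" for each step
            messages=[{"role": "user", "content": f"Task: {task}\nCarry out this step: {step}"}],
        ).choices[0].message.content
        results.append(answer)
    return results
```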

 

At the same time, o1 shows that the time of autonomous and semi-autonomous agents is near. o1 can be the basis for the first well-functioning agents.

 

It is interesting that Open AI bills the additional inference computing time to API users as invisible tokens. API usage is normally calculated from input tokens and output tokens. Now an unpredictable variable is added to the costs. Open AI does not say why they do this, but our guess is that Open AI wants to prevent other models from being trained on the basis of o1. This use is prohibited by the terms and conditions, but still takes place via the API. o1 does not show the user the steps it took before producing the output. You cannot see which directions the system explored and rejected. All of these calculations cost money, but Open AI does not want to disclose them.
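A small back-of-the-envelope calculation shows how invisible reasoning tokens change the bill. The prices are illustrative placeholders, not Open AI's official price list; only the structure of the calculation matters.

```python
# Rough cost calculation with invisible reasoning tokens.
# Prices are illustrative placeholders, not Open AI's official price list.
PRICE_PER_1M_INPUT = 15.00    # USD, assumed
PRICE_PER_1M_OUTPUT = 60.00   # USD, assumed; reasoning tokens are billed as output

def request_cost(input_tokens: int, visible_output_tokens: int, reasoning_tokens: int) -> float:
    billed_output = visible_output_tokens + reasoning_tokens  # the user never sees the latter
    return (input_tokens / 1e6) * PRICE_PER_1M_INPUT + (billed_output / 1e6) * PRICE_PER_1M_OUTPUT

# Same question, but the hidden "thinking" makes the answer considerably more expensive:
print(request_cost(input_tokens=500, visible_output_tokens=400, reasoning_tokens=0))      # classic LLM
print(request_cost(input_tokens=500, visible_output_tokens=400, reasoning_tokens=6000))   # o1-style
```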

 

Where LLMs are headed

 

If the current GPT-4-based LLMs teach us anything, it's that Open AI is usually the first to make LLM breakthroughs, but not the only one. We'll see more models that work similarly to o1 in the coming months. Open source models from Meta or Mistral could reveal the inner workings that o1 doesn't, which should open up more uses.

 

AI agents are now becoming tangible, as are sophisticated model mixes with division of labor between LLMs, as we've described here.

 

Conclusion

 

o1 shows that the end of the road for LLM development is still a long way off.

 

As we focus more on inference in this new, thoughtful type of model, however, the chips and computing power we have at our disposal become even more important.

 

Finally, o1 also shows how regulation lags behind rapid technological development. The EU's AI Act focuses on computing power in the training phase in order to distinguish between "dangerous" and "harmless" AI. The AI Act sets a threshold of 10^25 FLOPs for the computing power used to train AI models. Models that exceed this value are classified as systems with "systemic risk".
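To put the 10^25 FLOP threshold into perspective, a widely used rule of thumb estimates training compute as roughly 6 FLOPs per parameter per training token. The model sizes and token counts below are illustrative assumptions, not official figures.

```python
# Rule-of-thumb estimate of training compute: roughly 6 FLOPs per parameter per training token.
# The model sizes and token counts below are illustrative assumptions.
AI_ACT_THRESHOLD = 1e25  # FLOPs

def training_flops(parameters: float, training_tokens: float) -> float:
    return 6 * parameters * training_tokens

for name, params, tokens in [
    ("~70B model, 15T tokens", 70e9, 15e12),
    ("~400B model, 15T tokens", 400e9, 15e12),
]:
    flops = training_flops(params, tokens)
    print(f"{name}: {flops:.1e} FLOPs -> {'above' if flops > AI_ACT_THRESHOLD else 'below'} threshold")
```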

 

With a simple, slight shift in priorities, o1 has made this already questionable regulatory approach even more questionable. Because after o1, in the near future we will also see models that require far less training, use more computing time at inference, and whose capabilities exceed everything we know today. Also as open source. And also locally." [1]

 

1. Marcel Weiß, "o1 ist der Anfang eines neuen Paradigmas", Frankfurter Allgemeine Zeitung (online), Frankfurter Allgemeine Zeitung GmbH, Sep 18, 2024.
