"Open AI's new o1 model represents a break with LLMs. Instead
of making it as quick and cheap to use as possible, the model is optimized to
solve complex tasks. This means slower and more expensive answers.
There are two phases in the life cycle of a large AI language model in which it consumes computing power, and these two phases could hardly be more different.
The first phase is the building of the model, commonly called training. During this phase, the internal token structures of a model emerge from the processing of large amounts of data. These structures are the basis on which the model later recognizes patterns and relationships in language and generates language itself.
The second phase is what is known as inference, another word for drawing a conclusion. Inference is the moment when the LLM applies the internal structures built up during training to process an input and, ideally, produce a meaningful output.
"Conventional" LLMs: Training becomes more
expensive, inference cheaper
A comparison from traditional industry: the training phase is like the design and construction of a new factory with new production facilities. Inference is the production that takes place in this factory after construction is complete, i.e. the actual use of the production facilities.
To date, the industry has focused on the training phase when
improving the models. This has become longer and more expensive over the years
because the models have become larger. Larger models with more parameters, i.e.
variables that the models learn during training, usually lead to better, more
capable models.
There are no official figures on training costs.
However,
experts estimate that the training of Claude 3.5 cost between 100 and 200
million dollars.
GPT-4 is said to have cost around 78 million dollars according
to the Stanford AI Index Report 2024, among others. Gemini Ultra, the
predecessor of Google's current Gemini 2, is said to have cost 191 million
dollars to train. Dario Amodei, the CEO of Anthropic, the company behind
Claude, has even publicly predicted that training costs could rise to as much as ten billion dollars per model by 2025 or 2026. Whether this prediction will come true or is merely intended to deter potential new challengers remains to be seen.
The trend in training is clear, however: the largest, and therefore usually best, models are getting ever bigger and more expensive to train.
The rising investment costs are borne by the model providers. They are one-off costs per model, but due to the models' growing size they increasingly represent a barrier to market entry. Only a few companies can actually afford to invest 100 million dollars or more in training a model. In addition to capital, you also need access to scarce computing power and equally scarce specialists.
There is currently no company in Germany that does this. In
Europe, the only company that can get involved in this game is Mistral from
Paris.
More expensive training does not automatically mean more
expensive use, i.e. more expensive inference. Larger models are indeed more
computationally intensive both in training and in use. But for almost a year
now we have also seen a trend among the major model providers to make the use
of their models cheaper and thus more attractive. This trend began with GPT-4
Turbo in November 2023, which was the first top model with lower usage costs.
Since then, usage costs for the top models have been falling continuously.
This is only logical. High training costs are a burden on the balance sheet, but as noted, they also provide competitive advantages. High inference costs, or costs perceived as high, on the other hand, prevent LLMs from spreading as products.
o1: A closer look at inference
o1 is in many ways a significant break with the LLM trend
described above. Inference takes longer with o1. So the model feels slower.
Much slower. Open AI humanizes the longer computation time with “thinking”.
But why is o1 a break? First, the model is not optimized for regular, run-of-the-mill requests like “reword this email in a more professional tone”. The “thinking time”, which makes o1 slower and more expensive than other models, gives it new capabilities. It is better at logical tasks, such as mathematics or programming, than any other model. At the same time, it is no better, and often even worse, at formulating text than classic LLMs such as Claude or GPT-4o.
o1 is the first LLM that can handle complex tasks better than simple ones, even when users inadvertently treat both kinds of task as interchangeable. If you give o1 a simple task, Open AI warns, the model may "think" about the solution too much and overcomplicate the result. The LLM landscape as a whole is not intuitive, and o1 exacerbates this situation.
Secondly, o1 represents a break because the model shows very clearly that accepting longer inference times opens up new options. Until now, the only axis for breakthroughs in LLMs was at the training level. Be it more computing power, more or better data, or different architectural approaches, everything was focused on the training or construction phase of the models. With o1, inference time is transformed from an annoying cost factor into a potential enabler of new approaches to language models.
Provided that users have a little patience. The maximum
computation time between input and generated output before the model aborts
seems to be just over 3 minutes for o1.
"Think step by step" as a model architecture
But why would o1 abort? What exactly is happening here?
This brings us to the third aspect of why o1 represents a break in LLMs. LLMs have so far worked strictly path-dependently. That is, they analyze the input and then begin to "predict" which words are most likely to form a suitable response to it. This approach gave rise to the misleading term "stochastic parrot" last year, which ignored the level of complexity of LLMs and the resulting output quality. Errors in the output of LLMs arise not only, but also, from the sequential creation of the text output. Once a token (a word or part of a word) has been generated, it determines from which direction the subsequent tokens can come.
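To make this path dependency concrete, here is a minimal toy sketch of sequential, token-by-token generation in Python; the tiny "model" and its probabilities are entirely invented for illustration and have nothing to do with a real LLM.

import random

# Toy "next-token" table: given the text so far, return possible continuations
# with probabilities. Invented for illustration only.
def next_token_probs(text):
    table = {
        "The capital of France": [(" is", 0.9), (" was", 0.1)],
        "The capital of France is": [(" Paris", 0.8), (" Lyon", 0.2)],
    }
    return table.get(text, [(".", 1.0)])

def generate(prompt, max_tokens=5):
    text = prompt
    for _ in range(max_tokens):
        tokens, probs = zip(*next_token_probs(text))
        # One token is sampled at a time; once chosen, it constrains
        # every token that can follow -- a "wrong turn" is never revisited.
        text += random.choices(tokens, weights=probs)[0]
        if text.endswith("."):
            break
    return text

print(generate("The capital of France"))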
In simple terms, this means that if the model takes a wrong turn, it will run in the wrong direction for the rest of the output. Users have been able to mitigate this path dependency a bit with a few prompting tricks. "Think through your answer step by step" in the prompt and similar chain-of-thought approaches seem to nudge LLMs towards more systematic output. This can produce noticeably better results. However, like model size, it only reduces the problem rather than eliminating it. Larger models reduce the probability of a false token, but here too it does not disappear.
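As a concrete illustration of such a prompting trick, here is a minimal sketch using the OpenAI Python SDK; the model name and the example problem are placeholders, and an API key is assumed to be set in the environment.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

prompt = (
    "A train covers 120 km in 1.5 hours and then 80 km in 0.5 hours. "
    "What is its average speed over the whole journey? "
    "Think through your answer step by step."  # the chain-of-thought nudge
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any conventional chat model
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)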
The term stochastic parrot is even less applicable to o1.
This is the first time that Open AI has gone beyond this sequential generation
in inference. Open AI does not say how exactly they built o1. But we do know
this much: Semafor reported in January 2023 that Open AI hired over 1,000
software developers worldwide as subcontractors to break down multi-stage
programming projects into individual stages. The result of these efforts is likely to be data sets that help LLMs, during training, to learn patterns for completing multi-step tasks.
In May 2023, Open AI published a paper entitled Let's Verify Step by Step. In it, they describe, among other things, how they present data labelers with step-by-step solutions to math problems, as suggested in the Semafor article, and how the labelers evaluate the individual steps. The goal of the paper: to build a "process-supervised reward model" (PRM). The PRM estimates the probability that an individual step is correct as soon as its last token has been generated.
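Open AI has not published the PRM itself, so the following is only a conceptual sketch of the idea: a scoring function (here with hard-coded, invented values) assigns each solution step a probability of being correct and flags the first step that falls below a threshold.

from typing import List

def prm_score(problem: str, steps: List[str]) -> List[float]:
    # Stand-in for a process-supervised reward model: one estimated
    # probability of correctness per step. Values are invented.
    fake_scores = [0.97, 0.94, 0.41, 0.12]
    return fake_scores[: len(steps)]

def first_wrong_turn(problem: str, steps: List[str], threshold: float = 0.5) -> int:
    # Index of the first step the PRM considers likely wrong, or -1 if none.
    for i, score in enumerate(prm_score(problem, steps)):
        if score < threshold:
            return i
    return -1

steps = [
    "Total distance: 120 km + 80 km = 200 km.",
    "Total time: 1.5 h + 0.5 h = 2 h.",
    "Average speed: 200 km * 2 h = 400 km/h.",  # the arithmetic slip
    "Answer: 400 km/h.",
]
print(first_wrong_turn("Average speed of the train?", steps))  # prints 2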
In summary, the following can be stated: o1 was trained with a view to solving multi-step logic problems. With this focus, o1 was designed to generate several candidate solution paths within the inference time, evaluating each step individually and thus determining when it has "taken a wrong turn" and needs to start again.
o1's "thinking time" is longer because the model
runs through several solution directions and can identify errors independently.
This is why o1 can abort the calculation. The model determines that the
previous result is wrong, but the maximum computing power allocated to it has
expired.
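How this might look at inference time is, again, only a guess on our part. A highly simplified sketch of such a "generate, verify, restart" loop with a hard time budget (all components are invented stand-ins, not Open AI's implementation):

import random
import time

TIME_BUDGET_SECONDS = 200  # roughly the three-plus minutes observed for o1
STEP_THRESHOLD = 0.5       # minimum acceptable per-step correctness score

def propose_step(partial_solution):
    # Stand-in for the model proposing the next solution step.
    return f"step {len(partial_solution) + 1}"

def score_step(partial_solution, step):
    # Stand-in for a process reward model scoring the new step.
    return random.random()

def solve(problem, steps_needed=5):
    start = time.monotonic()
    while time.monotonic() - start < TIME_BUDGET_SECONDS:
        solution = []
        while len(solution) < steps_needed:
            step = propose_step(solution)
            if score_step(solution, step) < STEP_THRESHOLD:
                break              # "wrong turn": discard this path and restart
            solution.append(step)
        else:
            return solution        # every step passed the check
    return None                    # time budget exhausted: abort without an answer

print(solve("some multi-step problem"))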
Where Open AI is headed
Open AI has ten million subscribers. The higher-priced enterprise offering for companies, which is only a year old, already has one million subscribers. o1 offers enormous potential here. Being able to solve multi-stage challenges broadens the range of possible uses. o1 is likely to be used in research in particular. But programming with an LLM also reaches a new level here. Think of our text on AI-supported programming. Open AI can also link o1 with its other models: o1 first calculates a route to the goal, and the cheaper models do the "legwork". The biggest challenge remains on the actual product side of the model. Open AI needs to communicate better what can and cannot be achieved with this model. LLMs are difficult to grasp, and o1 seems to reinforce this difficulty.
At the same time, however, o1 shows that the time of
autonomous and semi-autonomous agents is near. o1 can be the basis for the
first well-functioning agents.
It is interesting that, for API users, Open AI has turned the additional computing time during inference into invisible tokens. API usage is billed on the basis of input tokens and output tokens. Now an unpredictable variable is added to the costs. Open AI does not say why they do this. But our
guess is that Open AI wants to prevent other models from being trained on the
basis of o1. This use is prohibited according to the terms and conditions, but
still takes place via the API. o1 does not show the user the steps it took
before the output. You cannot see which directions the system took and
rejected. All of these calculations cost money, but Open AI does not want to
disclose them.
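What this means for API costs can be sketched as follows; the prices per million tokens are placeholders, and the assumption that hidden reasoning tokens are billed at the output rate is ours, not an official Open AI statement.

# Hypothetical prices per million tokens (placeholders, not official figures).
PRICE_INPUT_PER_M = 15.0
PRICE_OUTPUT_PER_M = 60.0

def o1_request_cost(input_tokens, visible_output_tokens, reasoning_tokens):
    # Estimated cost of one API call, assuming hidden reasoning tokens
    # are billed like output tokens (our assumption).
    billed_output = visible_output_tokens + reasoning_tokens
    return (input_tokens * PRICE_INPUT_PER_M
            + billed_output * PRICE_OUTPUT_PER_M) / 1_000_000

# The reasoning tokens never appear in the answer, but they dominate the bill.
print(o1_request_cost(input_tokens=1_000, visible_output_tokens=500, reasoning_tokens=8_000))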
Where LLMs are headed
If the current GPT-4-based LLMs teach us anything, it is that Open AI is usually only the first, not the only one, to achieve LLM breakthroughs. We will see more models that work similarly to o1 in the coming months. Open-source models from Meta or Mistral could, unlike o1, expose their internal processes, which should open up further applications.
AI agents are now becoming just as tangible as sophisticated
model mixes with division of labor between the LLMs, as we have described here.
Conclusion
o1 shows that the end of the line in LLM development is
still a long way off.
However, as this new, more deliberate type of model shifts the focus towards inference, the chips and computing power available to us become even more important.
Finally, o1 also shows how regulation is lagging behind
rapid technological development. The EU's AI Act has focused on computing power
in the training phase in order to be able to distinguish "dangerous"
from "safe" AI. The AI Act sets a threshold of 1025 FLOPs for the
computing power used to train AI models. Models that exceed this value are
classified as systems with "high systemic risk".
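For a sense of scale, a back-of-the-envelope estimate using the common "6 x parameters x training tokens" approximation for training compute; this rule of thumb and all the numbers below are our illustrative assumptions, not the AI Act's own methodology.

# Rough training-compute estimate with the 6 * N * D rule of thumb.
params = 1.0e12          # hypothetical: 1 trillion parameters
tokens = 10.0e12         # hypothetical: 10 trillion training tokens
flops = 6 * params * tokens
AI_ACT_THRESHOLD = 1e25  # FLOPs threshold for "high systemic risk"

print(f"Estimated training compute: {flops:.1e} FLOPs")
print("Above AI Act threshold:", flops > AI_ACT_THRESHOLD)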
With a simple, slight shift in priorities, o1 has made this already questionable regulatory approach even more questionable. Because in the wake of o1, in the near future we will also see models that require far less training compute, use more computing time at inference and whose capabilities will exceed anything we know today. Also in open source. And also locally.
Marcel Weiß
Marcel Weiß is an independent analyst and strategy
consultant in Berlin. He has been working on platform issues and other
strategy-relevant digital economic dynamics since the early years of the new
millennium. He advises companies and gives keynote speeches on these topics.