The inner workings of AI language models are usually a well-kept secret. The Chinese high-flyer makes an exception and discloses its code. An in-depth read.
# A whale surfaces
At the end of January, global AI technology experienced one of those sudden disruptions that Silicon Valley so often conjures up. Only this time it did not originate in Californian garages or companies, say in the form of a new version of Claude, Gemini, or GPT-4. Rather, it came unexpectedly, from an unknown source: riding on the heels of developments in large language models, the technology that underpins ChatGPT and similar AI tools, a startup called DeepSeek (company logo: a small blue whale), overlooked and at best ridiculed even within the AI scene, had been preparing since its founding in July 2023 to attack the dominance of the Californian AI providers and their business model with full force. Because the development of its own large language model, called DeepSeek-V3, and of the chatbot R1 derived from it succeeded on conventional computer chips and with a fraction of the usual computing power and budget (the lab itself claims to have required only six percent of the usual costs), DeepSeek decided to offer its service for a fraction of the usual user fees.
It was only logical that DeepSeek-R1 quickly gained widespread acceptance, evidently less as a result of well-orchestrated press releases than through word of mouth, above all in classrooms and schoolyards, where the model initially went viral thanks to the prospect of saving pocket money while conveniently completing homework. The technological hype, triggered by the Chinese start-up, hardly coincidentally, on January 20, 2025, the American Inauguration Day, subsequently developed into a global storm, not least because of the unusual announcements, which promised enormous headwinds for the established models from Google, Meta, and Anthropic.
Compared to the current market leader, OpenAI, and its o1 model, DeepSeek-R1 achieved almost equivalent results; in some specific tests, the model even outperformed the market leader, whose technology has been powering ChatGPT since November 2022. A murmur swept through the AI industry, because a model with these capabilities at significantly lower production and operating costs has the potential to restructure the market, end Californian monopolies, and make customized AI tools mass-producible with minimal effort. As a result, tech stocks slumped; Nvidia, the quasi-monopolist among manufacturers of specialty chips for AI, fell like never before in stock-market history, temporarily shedding around $600 billion in market value.
For reasons not immediately apparent, the startup DeepSeek had a blue-and-white whale with a wide-open mouth designed as its logo, a species that would sit somewhere between orca, fin whale, and beluga. Presumably this is meant to symbolize the program: the ability of its products to dive, whale-like, into the deepest depths of the internet, searching out the most remote tidbits and surfacing correspondingly profound insights from these (un)likely sources.
# The Promise
DeepSeek also announced that it would make its models available as open weights. This means that the "fully trained" products, which already possess considerable world knowledge, demonstrate impressive polyglot skills, and, last but not least, command programming and mathematical knowledge in abundance, can be downloaded in pre-packaged sizes for one's own use and, if necessary, further tailored to individual needs on one's home computer.
Large language models are ultimately doubly sealed black boxes, inside which an artificial neural network spans a so-called latent space. This is a fairly complex, ordered field of multidimensional vectors of decimal numbers that waits for queries from its users. On the one hand, this latent space is technically opaque by construction, since it stores all information solely as transition probabilities between vectors that stand for letter combinations. On the other hand, the exact construction of this latent space usually remains a strategic secret, because the code and scripts used to build and train AI models are, with the exception of Meta's Llama or the French Mistral, and despite programmatic company names like OpenAI, treated by their operators as the equivalent of the Coca-Cola recipe.
There are good reasons beyond corporate trade secrets to keep the code and the training data from which the model derives its knowledge of the world under wraps. After all, not only plagiarism at the code level but above all massive copyright infringements through the use of protected training data, such as extensive book collections from (shadow) libraries, would become easily detectable. The Chinese start-up nevertheless went a decisive step further. Having previously made no secret of the fact that the architecture of the earlier versions of its large language models (V1 and V2) was heavily influenced by Llama, the open-source model of Facebook's parent company Meta, it seemed only logical to make the source code of its own model available to the public as well, coupled with the aggressive promise of maximum transparency at overwhelmingly low operating costs and with technical capabilities that put its own model on par with its multi-billion-dollar competitors.
# Source Code Criticism as a Method
We examined this promise because we wanted to know how much insight the code actually provides: what can be deduced from it, from its structural forms, and from the language used. For this purpose, we used the method of source code criticism, which means reading the algorithms that construct the model not only technically but also with attention to their rhetorical implications and, quite literally, their ideological premises, that is, with an eye on the metaphorical nature of the descriptions and the figurative language of the commands, but also with a critical eye on the technical finesse employed. We not only examined the code disclosed by DeepSeek but also added exemplary explanatory comments directly between the individual commands, intended to make the background and functions of the algorithms understandable, especially for non-computer scientists (see github.com/nachsommer/DeepSeekV3-SCC).
# What can be seen?
First of all, the repository in which the code is published holds a real surprise. Unlike the millions of lines required by large software projects such as LibreOffice, the open-source alternative to Microsoft Office, and even in direct comparison to Meta's open-source Llama 4 model with its approximately three thousand lines, DeepSeek's V3 model needs just 1,387 lines of the Python programming language to assemble the depth-seeking whale. A first glance reveals only five rather sparsely commented files, two of which, generate.py (185 lines) and model.py (804 lines), are of interest here. From these, we single out just four passages where remarkable things happen.
generate.py: 100 and 119: Every program, loosely following Aristotle's Poetics, has a beginning, a middle, and an end. The beginning is usually marked by the main function, which starts here on line 100 (and ends on line 186 with the destroy command that frees the memory used). First, it defines which data will be accepted as input values and in which form. Immediately afterward, in line 119, things become relevant to the history of ideas. With the command world_size = int(os.getenv("WORLD_SIZE", "1")), the program probes how many GPUs (the coveted special chips for AI calculations that Nvidia sells at hefty prices) are available on the executing machine. If this value is greater than 1, a multiplication of possible worlds immediately unfolds, as conceived by Gottfried Wilhelm Leibniz in his Theodicy (1710) as a general argument against the late antique contradiction between an almighty God and the existence of evil in the world. Leibniz's pious solution to this fundamental contradiction of Western doctrines of salvation consisted, with recourse to his Monadology (1714), in assuming a multitude of possible worlds, among which reality appears as the one best of all possible worlds, created by God. In all other possible worlds, various metaphysical, physical, and moral evils prevail to varying degrees. AI models do not shy away from these virtual horrors and calmly calculate the possible worlds according to the number of GPUs provided to them. If the world size is only 1, however, everything suddenly becomes very slow: the model takes correspondingly long to generate, there is no access to the parallel worlds provided by Nvidia, no event occurs simultaneously with the smallest changes elsewhere, and we are left with no alternative but to remain in a single possible world.
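How this probing works can be shown in a few lines. What follows is a minimal sketch of our own, not DeepSeek's actual file (though generate.py proceeds along these lines); it assumes PyTorch and, for the multi-GPU branch, the environment variables that a launcher such as torchrun provides:

```python
# Minimal sketch of the world-probing described above (our own illustration).
import os
import torch.distributed as dist

world_size = int(os.getenv("WORLD_SIZE", "1"))  # how many possible worlds?
rank = int(os.getenv("RANK", "0"))              # which of them this process is

if world_size > 1:
    # join all processes into one group so they can exchange tensors
    dist.init_process_group("nccl", world_size=world_size, rank=rank)
    print(f"world {rank} of {world_size} reporting in")
else:
    print("a single possible world; everything runs, slowly, on one device")
```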
generate.py: 155 f.: Lines 155 and 156 – tokenizer.decode(generate(model, [tokenizer.encode("DeepSeek")], 2, -1, 1.)[0]) and load_model(model, os.path.join(ckpt_path, f"model{rank}-mp{world_size}.safetensors")) – appear to be the most consequential commands in the entire code, because here powerful structures that previously slumbered in the background of the imported libraries are brought to life by a single, nested command. The language mechanism (tokenizer) is created as a gateway to the latent space by calling the Transformers library. This language mechanism is then connected to the corresponding content, that is, to the already trained model, in the following line 156. Consequently, the language ability is brought together with the world knowledge stored in the latent space; tongue and memory merge, as it were, into a new functional, language-processing unit.
This mechanism (the tokenizer) operates in two elegant steps: First, it breaks the language (for example, a user's query) down into smaller components, not just words and letters but also other language particles such as prefixes or suffixes, and then replaces these language fragments with numbers. After all, the computer understands neither fun nor language, only mathematics. Each of these linguistic tokens is therefore assigned its own unique numerical house number. In the second step, the opposite happens: the answer is generated by the tokenizer looking up the house numbers and replacing them with the words associated with the individual tokens, thus replying with a sentence and then remaining in a state of evenly suspended attention, waiting for the next query.
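This round trip can be demonstrated in a few lines; the sketch below uses GPT-2's publicly available tokenizer from the transformers library as a stand-in vocabulary (DeepSeek-V3 ships its own tokenizer files, which the quoted line loads through the same library):

```python
# Encode/decode round trip as described above; GPT-2's vocabulary serves as
# a stand-in here, the token ids of DeepSeek's own tokenizer would differ.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode("DeepSeek dives deep")  # step 1: language -> house numbers
print(ids)                               # a list of integers, one per token
text = tok.decode(ids)                   # step 2: house numbers -> language
print(text)                              # "DeepSeek dives deep"
```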
model.py: 17 and 441 ff.: These passages show that attention is a crucial factor for DeepSeek, and it comes with a technical innovation: attn_impl: Literal["naive", "absorb"] = "absorb" (in Python terms). The variable attn_impl, short for attention implementation, is literally (Literal) assigned either naive or absorbing attention. How the generated language fragments (tokens) relate to each other is determined not only by word proximity (based on the assumption that words standing closer together must also be closer in meaning), but primarily by different "heads" that are able to read in different states of attention. During a reading process, the model can only consider the connections within a limited area and, within that, only the most important, i.e., most frequently encountered, parts of the text. How DeepSeek views the world as a result of this, by philological standards, rather superficial reading can be seen in the fact that the two types of attention available to the present model, naive and absorptive attention, are by no means arbitrary coinages but refer to concepts from psychology. Naive attention would accordingly be a simple, everyday way of viewing the world, while absorptive attention aims to block out everything else, to dive like a whale in order to engage intensively with the object of observation. The training data fed in (Bible passages, Herman Melville novels, cat pictures, etc.) are, like Jonah, delivered to the beast, skin and all. Hardly surprisingly, DeepSeek assumes this nerdy mode of absorption as its default state.
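In DeepSeek's multi-head latent attention, the technical point behind the flag is that "absorb" avoids reconstructing full keys from the compressed cache by folding (absorbing) the decompression matrix into the query side. The following toy version is our own simplification with invented dimensions, not model.py itself; it merely shows that both code paths yield the same attention scores:

```python
# Toy contrast of the two code paths behind attn_impl (our simplification).
from typing import Literal
import torch

d_model, d_latent = 64, 16  # invented sizes for illustration, not V3's
W_down = torch.randn(d_model, d_latent) / d_model**0.5   # compress tokens
W_up   = torch.randn(d_latent, d_model) / d_latent**0.5  # decompress keys

def scores(q: torch.Tensor, cache: torch.Tensor,
           attn_impl: Literal["naive", "absorb"] = "absorb") -> torch.Tensor:
    if attn_impl == "naive":
        k = cache @ W_down @ W_up        # materialize full keys first ...
        return q @ k.T                   # ... then compare query and keys
    # "absorb": fold W_up into the query once and compare in latent space
    return (q @ W_up.T) @ (cache @ W_down).T

q = torch.randn(1, d_model)        # one incoming query
cache = torch.randn(10, d_model)   # ten previously seen tokens
assert torch.allclose(scores(q, cache, "naive"),
                      scores(q, cache, "absorb"), atol=1e-4)
```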
Among the model's technological innovations is also the use of MoE. What literary scholars know as an acronym for Robert Musil's century-old novel (in the German original, "Der Mann ohne Eigenschaften," the man without qualities) refers in machine learning to a mixture of experts. This means that various specialized neural networks combine into a construct, also known as a committee machine, which breaks a question down into subproblems and then attempts to solve them by selecting suitable experts. DeepSeek's delicate whale thus represents an entire whale committee, which deliberates on and processes a user query from different perspectives. The source code provides information about the architecture of this committee: its basic configuration activates six experts for each query, selected from a pool of no fewer than 64 routed experts held ready in the background. The supposedly opaque neural network, with its arcane, sparsely illuminated number vectors in which its world knowledge is stored, thus evidently possesses a hierarchically and precisely structured management of different expertise, even if neither the criteria for its division nor the actual selection during problem solving are apparent.
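Such a committee can be sketched in miniature. The following is our own illustrative toy (hypothetical dimensions, no load balancing, a plain loop instead of an optimized dispatch), activating six of 64 experts per token as in the repository's small demo configuration:

```python
# Toy mixture of experts (our illustration): a gate scores all experts per
# token; only the top-k compute, their outputs weighted and summed.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim: int = 32, n_experts: int = 64, k: int = 6):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)  # routing scores
        self.experts = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.gate(x).softmax(-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):          # loop for clarity, not speed
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

moe = TinyMoE()
print(moe(torch.randn(4, 32)).shape)         # torch.Size([4, 32])
```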
# What is not visible
At least as revealing as an exemplary look at the metaphors and semantics used in naming the structures and commands is what precisely remains invisible despite the disclosed sources. Among these are at least three important elements. First, the published code gives no information about the actual training process of the model. With the provided Python code alone, the model cannot easily be made to train itself from scratch on new world knowledge (that is, the entire present-day internet). The algorithms merely allow one to understand how the model yet to be trained is constructed. The training data itself, however, is not included, for the reasons already mentioned: it presumably comprises more than one novel and newspaper article that should not have been used.
Second, the V3 repository contains no indication of any censorship or filtering mechanisms applied to the model after training. This is the significant difference between the app DeepSeek provides on the basis of R1 and its open-weight R1 model: immediately after release, both products were subjected to the Tiananmen test by the online community, i.e., the question of what happened on June 3 and 4, 1989, on Tiananmen Square in Beijing. While the official DeepSeek app continues to give an evasive answer consistent with Chinese state doctrine, the freely available open-weight model is far more informative: an initial response in English ("thought for 12.55 seconds") provides an overview of the tragic events; upon request, the answer is also given in German ("thought for 39.05 seconds"): "In summary, the days of June 3 and 4, 1989, mark a dark episode in Chinese history, in which the government violently cracked down on pro-democracy protests. The exact number of victims is disputed, but it is clear that many people lost their lives when the government cracked down on dissidents."
The fact that we cannot see any censorship mechanisms in the "open" source code raises the fundamental question of what else we cannot see. After all, the Tiananmen test is only a first, obvious procedure for probing the model for a known obfuscation. But what about the unknown obfuscations that lurk unnoticed in DeepSeek's depths?
Furthermore, the V3 code offers no clue as to how the model arrives at its results. The R1 model distilled from V3 behaves differently: there, a prompt is first met with preliminary reasoning about the user input, the so-called chain of thought (CoT). This feature is another innovation in the dialogue between AI models and their users: the model first reflects on the user request in a monologue, fenced off with appropriate markers. Remarkably, it occasionally happens that a request posed in German is reasoned about in English before the answer is given, again in the language of the prompt: "I see, so I'm supposed to provide direct information in German about how my inner reflection has been programmed. I still have to think about that myself, though..."
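Schematically, such a reply looks as follows; the <think> markers are the convention R1-style models use to fence off the monologue, while the wording here is invented:

```python
# Schematic shape of a chain-of-thought reply (wording invented); the model
# fences its self-talk in markers, and only the part after them is the answer.
reply = (
    "<think>The question concerns June 1989; reason here, answer below."
    "</think>"
    "In summary, the days of June 3 and 4, 1989, mark a dark episode ..."
)
monologue, answer = reply.split("</think>")
print(monologue.removeprefix("<think>"))  # the preliminary reasoning
print(answer)                             # what the user actually sees
```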
In the fine print of DeepSeek's documentation for the R1 model, there is a brief hint that this CoT capability was transferred from the distilled R1 model back to the large language model V3 in the post-training process, using so-called reinforcement learning and a special data connection. Since R1's code is not visible, the construction of the actual thought process eludes our analysis. A pity: perhaps R1's programming resembles René Descartes in the winter of 1619, not far from Ulm, beginning the first of his "Meditationes" with doubtful brooding about his own perception before finding inspiration for answers in the famous "cogito, ergo sum." For the time being, however, the Chinese construction of this self-assuring, occasionally overly chatty chain of thought remains withheld from us. The whale has already dived again, which does not make it any easier to watch it think.
In March, DeepSeek released an improved version of V3 (0324), which now significantly outperforms OpenAI's latest model, GPT-4.5, in the relevant tests. Unfortunately, these changes have not yet found their way into the corresponding open-source repository. This is hardly surprising, however: the Chinese whale, too, seems increasingly interested in secrecy during its deep dives (keyword: Coca-Cola recipe). But even a blue whale cannot stay underwater forever. At some point, it has to surface for air and show itself. We'll keep searching.
Markus Krajewski is a professor of media studies at the University of Basel.
Ranjodh Singh Dhaliwal is a professor of digital humanities with a focus on artificial intelligence there. [1]
1. Markus Krajewski and Ranjodh Singh Dhaliwal, "How deep does DeepSeek reveal?" Frankfurter Allgemeine Zeitung (Frankfurt), July 9, 2025, p. N4.