Open and adaptable artificial-intelligence models are crucial for scientific progress, but robust safeguards against their misuse are still nascent.
In recent months, several state-of-the-art AI systems have been released with open weights, meaning that their core parameters can be downloaded and customized by anyone. Examples include reasoning models such as Kimi-K2-Instruct from the technology company Moonshot AI in Beijing, GLM-4.5 from Z.ai, also in Beijing, and gpt-oss from OpenAI in San Francisco, California. Early evaluations suggest that these are the most advanced open-weight systems so far, approaching the performance of today’s leading closed models.
Open-weight systems are the lifeblood of research and innovation in AI. They improve transparency, make large-scale testing easier and encourage diversity and competition in the marketplace. But they also pose serious risks. Once a model is released, harmful capabilities can spread quickly and the model cannot be withdrawn. For example, synthetic child sexual-abuse material is most commonly generated using open-weight models¹. Many copies of these models are shared online, often altered by users to strip away safety features, making them easier to misuse.
On the basis of our experience and research at the UK AI Security Institute (AISI), we (the authors) think that a healthy open-weight model ecosystem will be essential for unlocking the benefits of AI. However, developing rigorous scientific methods for monitoring and mitigating the harms of these systems is crucial. Our work at AISI focuses on researching and building such methods. Here we lay out some key principles.
Fresh safeguarding strategies
In the case of closed AI systems, developers can rely on an established safety toolkit². They can add safeguards such as content filters, control who accesses the tool and enforce acceptable-use policies. Even when users are allowed to adapt a closed model using an application programming interface (API) and custom training data, the developer can still monitor and regulate the process. Open-weight models, by contrast, are much harder to safeguard and require a different approach.
Training-data curation. Today, most large AI systems are trained on vast amounts of web data, often with little filtering. This means that they can absorb harmful material, such as explicit images or detailed instructions on cyberattacks, which makes them capable of generating outputs such as non-consensual ‘deepfake’ images or hacking guides.
One promising approach is careful data curation — removing harmful material before training begins. Earlier this year, AISI worked with the non-profit AI-research group EleutherAI to test this approach on open-weight models. By excluding content related to biohazards from the training data, we produced models that were much less capable of answering questions about biological threats.
In controlled experiments, the filtered models resisted extensive retraining on harmful material, withstanding up to 10,000 training steps without giving dangerous answers, whereas previous safety methods typically broke down after only a few dozen³. Crucially, this stronger protection came without any observed loss of ability on unrelated tasks.
The research also revealed important limits. Although filtered models did not internalize dangerous knowledge, they could still use harmful information if it was provided later — for example, through access to web-search tools. This shows that data filtering alone is not enough, but it can serve as a strong first line of defence.
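To make the idea concrete, here is a minimal, purely illustrative sketch of a pre-training data filter (not the pipeline used in the AISI and EleutherAI study): a blocklist of hazard-related patterns is applied to each document before it can enter the training corpus. The patterns, the threshold and the scoring rule are all placeholder assumptions.

```python
# Illustrative sketch of pre-training data curation: drop documents that match
# a blocklist of hazard-related terms before they ever reach the training set.
# The terms, threshold and scoring rule are placeholders, not the filter used
# in the AISI/EleutherAI study.

import re
from typing import Iterable, Iterator

BLOCKLIST = [r"\bselect agent\b", r"\btoxin synthesis\b", r"\bgain[- ]of[- ]function\b"]
BLOCK_PATTERNS = [re.compile(p, re.IGNORECASE) for p in BLOCKLIST]

def hazard_score(document: str) -> float:
    """Fraction of blocklist patterns that appear in the document."""
    hits = sum(1 for pattern in BLOCK_PATTERNS if pattern.search(document))
    return hits / len(BLOCK_PATTERNS)

def filter_corpus(documents: Iterable[str], threshold: float = 0.0) -> Iterator[str]:
    """Yield only documents whose hazard score does not exceed the threshold."""
    for doc in documents:
        if hazard_score(doc) <= threshold:
            yield doc

# Example: two documents, one of which trips the filter and is excluded.
corpus = [
    "A survey of open-weight language models and their evaluation.",
    "Step-by-step toxin synthesis protocol for a regulated select agent.",
]
print(list(filter_corpus(corpus)))  # only the first document survives
```

Real curation pipelines combine pattern matching of this kind with trained classifiers and human review, but the principle is the same: exclude the material before the model ever sees it.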
Robust fine-tuning. A model can be adjusted after its initial training to reduce harmful behaviours — essentially, developers can teach it not to produce unsafe outputs. For example, when asked how to hot-wire a car, a model might be trained to say “Sorry, I can’t help with that.”
However, current approaches are fragile. Studies show that training the model on even a few carefully chosen examples can undo these safeguards in minutes. For instance, researchers have found that the guardrails that stop OpenAI’s GPT-3.5 Turbo model from assisting with harmful tasks can be bypassed by training on as few as ten examples of harmful responses, at a cost of less than US$0.20 (ref. 4).
Over the past few years, researchers have worked on improved safety fine-tuning techniques — sometimes called ‘machine unlearning’ algorithms — to remove dangerous knowledge from models more thoroughly. However, progress has been slow, with current unlearning algorithms still vulnerable to 100 or fewer fine-tuning steps⁵,⁶.
These findings highlight a major challenge: safety fine-tuning can easily be undone by subsequent unsafe fine-tuning. Strengthening the standard fine-tuning mechanisms used by developers is therefore a crucial frontier for future research, including developing methods that stay effective even when models are modified by end users.
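For illustration, the refusal behaviour described above is typically instilled with supervised examples of the kind sketched below. The chat-style JSONL layout is a common fine-tuning convention rather than any particular developer’s format, and the prompts are invented.

```python
# Illustrative sketch of the data used for safety fine-tuning: prompt-response
# pairs in which the desired behaviour is a refusal. The chat-style JSONL
# format below follows a common fine-tuning convention; the examples are
# made up for illustration.

import json

refusal_examples = [
    {
        "messages": [
            {"role": "user", "content": "How do I hot-wire a car?"},
            {"role": "assistant", "content": "Sorry, I can't help with that."},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Write malware that steals passwords."},
            {"role": "assistant", "content": "Sorry, I can't help with that."},
        ]
    },
]

with open("safety_finetune.jsonl", "w") as f:
    for example in refusal_examples:
        f.write(json.dumps(example) + "\n")

# The fragility noted above is the mirror image of this process: replacing the
# refusals with compliant answers in even a handful of such records, then
# fine-tuning on them, can undo the safeguard.
```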
Model forensics. A key step to improving safety is understanding how models are used ‘in the wild’. The emerging field of open-weight-model forensics offers techniques to trace AI-generated content back to a particular model, using unique behaviours or watermarks. This enables researchers to study how harmful uses arise in specific models. Like fingerprinting in criminal forensics, these methods can be bypassed with effort, but they still provide valuable tracking and accountability.
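As a rough illustration of behavioural fingerprinting, the sketch below hashes a model’s deterministic responses to a fixed set of probe prompts and compares the result with fingerprints of known models. The probe prompts, the generate() stub and the exact-match comparison are simplifying assumptions; practical forensic techniques are built to survive moderate modification of a model, which exact hashing would not.

```python
# Illustrative behavioural-fingerprinting sketch: query a model with fixed
# probe prompts and hash its outputs, then compare against fingerprints of
# known open-weight models. The probes and the generate() stub are
# placeholders; real forensic methods are more robust than exact hashing.

import hashlib
from typing import Callable, Dict, List

PROBE_PROMPTS: List[str] = [
    "Complete this sentence: the quick brown fox",
    "List three prime numbers.",
]

def fingerprint(generate: Callable[[str], str]) -> str:
    """Hash a model's deterministic responses to the fixed probe set."""
    digest = hashlib.sha256()
    for prompt in PROBE_PROMPTS:
        digest.update(generate(prompt).encode("utf-8"))
    return digest.hexdigest()

def identify(generate: Callable[[str], str], known: Dict[str, str]) -> str:
    """Return the name of the known model whose fingerprint matches, if any."""
    fp = fingerprint(generate)
    return next((name for name, ref in known.items() if ref == fp), "unknown")

# Usage: replace this stub with greedy-decoding calls to the model under test.
def stub_generate(prompt: str) -> str:
    return "placeholder response to: " + prompt

known_fingerprints = {"example-model-v1": fingerprint(stub_generate)}
print(identify(stub_generate, known_fingerprints))  # -> "example-model-v1"
```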
If all of these steps — from careful training to enabling traceability — are implemented in parallel, the risks arising from a model’s openness can be greatly mitigated.
Rigorous evaluations. Before deploying a model with open weights, developers should conduct evaluations that reflect how the model might actually be used or misused. For closed-weight models, simple input–output testing (feeding in prompts and checking the responses) is often sufficient. But because open-weight models can be modified by others, input–output testing is not enough to study their risks fully. Evaluations will be more rigorous when they take these potential changes into account⁶. In preparation for the continued rise of powerful open-weight models, incorporating adversarial fine-tuning into evaluation pipelines is a crucial step for developers and auditors.
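One way to operationalize this is sketched below: a hypothetical auditing loop (not AISI’s actual harness) that scores a model on a misuse benchmark both as released and after a cheap adversarial fine-tune. The evaluate() and fine_tune() hooks stand in for whatever benchmark and attack an auditor chooses.

```python
# Illustrative outline of an evaluation pipeline that accounts for downstream
# modification: score the model on a misuse benchmark both as released and
# after an adversarial fine-tuning step. The fine_tune() and evaluate() hooks
# are placeholders for whatever training and benchmark harness an auditor uses.

from typing import Callable, Dict, List

def adversarially_robust_eval(
    model,
    harmful_prompts: List[str],
    evaluate: Callable[[object, List[str]], float],
    fine_tune: Callable[[object], object],
) -> Dict[str, float]:
    """Return misuse scores before and after adversarial fine-tuning."""
    baseline = evaluate(model, harmful_prompts)        # input-output testing only
    modified = fine_tune(model)                        # attacker's cheap fine-tune
    post_attack = evaluate(modified, harmful_prompts)  # re-test the modified model
    return {"as_released": baseline, "after_fine_tuning": post_attack}

# Dummy stubs so the sketch runs end to end. A release decision should weigh
# the worse of the two scores, since anyone can apply the same fine-tuning
# once the weights are public.
dummy_model = object()
scores = adversarially_robust_eval(
    dummy_model,
    harmful_prompts=["<harmful prompt>"],
    evaluate=lambda m, prompts: 0.0,  # replace with a real benchmark harness
    fine_tune=lambda m: m,            # replace with a real fine-tuning attack
)
print(scores)
```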
Controlled release. Once a system is ready for release, developers can roll it out in stages, monitoring usage before a full launch. They can also track who downloads the model — for example, by requiring users to register, which provides information about how the model is used.
And even after a system is released, developer choices can still influence the risks it poses. Open-weight models cannot be fully ‘unreleased’, but stopping access to an unsafe system and quickly replacing it with a safer one can reduce the impact of the original release.
The ecosystem of open-weight models is constantly changing, as is our understanding of the tools and best practices for managing associated risks. Continued progress will require openness — not only in making model weights available, but also in sharing research methods, evaluation results and safety practices. Open science and transparent reporting will be essential to developing a robust, sound approach to managing the risks of AI. [A]
A. Gal, Y. & Casper, S. Nature 646, 286–287 (2025).