Mokslas, studijos ir ekonomika: AI Researchers Push Computers To Doom Scenarios --- Anthropic's Frontier Red Team tests the ability to create superhuman harm

"In a glass-walled conference room in San Francisco, Newton Cheng clicked a button on his laptop and launched a thousand copies of an artificial intelligence program, each with specific instructions: Hack into a computer or website to steal data.

"It's looking at the source code," Cheng said as he examined one of the copies in action. "It's trying to figure out, where's the vulnerability? How can we take advantage of it?" Within minutes, the AI said the hack was successful.

Cheng works for Anthropic, one of the biggest AI startups in Silicon Valley, where he's in charge of cybersecurity testing for what's called the Frontier Red Team. The hacking attempts -- conducted on simulated targets -- were among thousands of safety tests, or "evals," the team ran in October to find out just how good Anthropic's latest AI model is at doing very dangerous things.

The release of ChatGPT two years ago set off fears that AI could soon be capable of surpassing human intellect -- and with that capability comes the potential to cause superhuman harm. Could terrorists use an AI model to learn how to build a bioweapon that kills a million people? Could hackers use it to run millions of simultaneous cyberattacks? Could the AI reprogram and even reproduce itself?

The technology has raced ahead anyway. There are no binding rules in the U.S. requiring companies to perform or submit to evals. It's so far been largely up to the companies to do their own safety testing, or submit to outside testing, with voluntary standards on how rigorous they should be and on what to do about the potential dangers.

AI developers including OpenAI and Google DeepMind conduct evals and have pledged to minimize any serious risks before releasing models, but some safety advocates are skeptical that companies operating in a highly competitive industry can be trusted to hold themselves accountable.

No one thinks today's AI models are capable of becoming the next HAL 9000 from "2001." But the timeline for if and when AI might get that dangerous is a hot topic of debate. Elon Musk and OpenAI Chief Executive Sam Altman both say artificial general intelligence, or AI that broadly exceeds human intelligence, could arrive in a few years. Logan Graham, who runs Anthropic's Frontier Red Team, is also planning for a short time frame.

"Two years ago, they were a friendly, somewhat weird high-schooler," Graham said of AI models. "Now maybe they're a grad student in some areas."

Anthropic, which was founded in 2021 by ex-OpenAI employees who believed the ChatGPT maker wasn't taking safety seriously enough, has been perhaps the most vocal AI developer about the need for testing. In an update to its public "Responsible Scaling Policy," released in October, Anthropic said if one of its AI models comes close in evals to specific capabilities -- such as giving significantly helpful advice for building a biological or chemical weapon -- it will delay the release until it can implement fixes to contain the risk.

Across the industry, even companies that take safety seriously could be tempted to prioritize speed, said Marius Hobbhahn, CEO and co-founder of U.K.-based Apollo Research, which conducts third-party evals. "If there are no hard constraints, then it is easy to do motivated reasoning, to say that in order to stay in the race with others, we kind of need to cut it a little bit short," he said.

Graham, whose job at Anthropic entails figuring out when a model is too dangerous to be released, says he's never felt a conflict between financial pressures to release new products and the company's safety promises. "Maybe there's a psychological tension, but there's never actually a tension," he said.

Dario Amodei, Anthropic's CEO, has said he believes that governments should make AI-safety testing obligatory. His company delayed the release of its first model for more safety testing before its release in early 2023. But Amodei says it's important not to be too restrictive too early.

"We don't want to harm our own ability to have a place in the conversation by imposing these very onerous burdens on models that are not dangerous today," Amodei told computer scientist and podcaster Lex Fridman last month. Instead, "you clamp down hard when you can show the model is dangerous."

Anthropic's evals for catastrophic risks are overseen by Graham, a 30-year-old Rhodes scholar with a Ph.D. in machine learning from Oxford. Growing up in Vancouver, Graham was diagnosed at age 4 with a severe form of childhood arthritis that affected his legs and also could have left him blind, if not for treatments. He says his recovery made him an extreme optimist -- with a nervous streak.

"I wake up one day and suddenly I can't walk. And I think that probably impressed pretty significantly on me," Graham said. "Like, everything could suddenly turn really bad if you're not careful."

Following Oxford, Graham worked on AI policy for the U.K. government. He joined Anthropic part-time in 2022, after pitching the company on the idea that society needed to figure out as soon as possible what significant risks AI would pose. Soon Anthropic hired him full-time to build the Frontier Red Team, which has grown to 11 people.

"We're in the business where we have to figure out whether a model can be bad," said Graham. "The first thing that's at stake is catastrophe."

Some critics argue the catastrophic risks from AI are overblown. Yann LeCun, Meta's chief AI scientist, has said today's models are dumber than a house cat and aren't even on a path to human-level intelligence.

Others worry about more immediate, tangible problems, such as sexism or racism being baked into AI-driven hiring software, or the outsize amounts of water and power used in data centers that power AI.

Among those worried about AI catastrophe, some think today's evals are inadequate to the task. "I actually think we don't have a method of safely and effectively testing these kinds of systems," said Stuart Russell, an AI scientist and professor at the University of California, Berkeley.

Eval practitioners acknowledge their field is nascent. There are not yet agreed standards on which risks deserve the most attention, where to draw the line for those risks, or how to establish if the line is being crossed.

The Biden administration last fall issued an executive order on AI, which included a provision requiring AI companies to regularly report the results of their safety testing to regulators. President-elect Trump has since promised to repeal the order.

California Gov. Gavin Newsom vetoed an AI safety bill earlier this year that would have regulated the largest models, saying that smaller models could cause harm and that regulation should focus on the AI's riskiest uses. He said he would push for more encompassing legislation next year.

Provisions in a European Union law passed last year will eventually make evals and safety fixes obligatory for the most sophisticated models -- but they won't go into effect for nearly a year. Companies that don't comply will be subject to fines.

Following an AI safety summit last year, the U.K., the U.S. and several other countries have established government-run AI safety institutes to conduct safety research, including developing and running evals on new AI models. Both the U.K. and U.S. institutes tested the latest models from Anthropic and OpenAI, under agreements with each.

Anthropic is also among AI developers that contract third-party evals from a handful of groups. Still, AI developers say that for now, at least, they play a special role in doing evals on their own models because they understand them the best -- and can help develop best practices for others.

"There's uncertainty everywhere, and one of the most major things that we do as a company is try to bring down this uncertainty," Graham said. "It's like an art that tends towards science, but it needs to happen really fast."

In the glass-walled conference room in October, Graham's team was ready to kick off its next series of evals. Anthropic was preparing to release an upgraded version of its Claude Sonnet 3.5 model.

When its last model came out in June, Anthropic rated it at AI Safety Level 2, or ASL 2, which according to the scale the company developed means the model showed early signs of dangerous capabilities.

After this new round of tests, the team would make a recommendation to Anthropic's leaders and its board for whether the new model was within striking distance of ASL-3, which means "systems that substantially increase the risk of catastrophic misuse." Some of Anthropic's ASL-3 safety protections aren't yet ready to deploy, meaning a model given that rating would have to be delayed, said Jared Kaplan, Anthropic's chief science officer.

"We haven't battle-tested it in the wild, and so that's what we're doing now," Kaplan said of those protections.

The Frontier Red Team had spent months consulting with outside experts and internal stress testers to figure out what evals to run for its main categories of risk: cyber (including hacking); biological and chemical weapons; and autonomy.

Anjali Gopal, the Anthropic researcher who leads the bio evals, set up questions related to chemical and biological weapons. Some ask things that aren't specifically dangerous but would suggest deep knowledge that could be misused, like knowing which nucleotide sequence to use when cloning a gene from one E. coli bacterium to another. Others drill down on how to acquire or create highly restricted pathogens like the bacteria that cause anthrax.

Gopal, who has a Ph.D. in bioengineering from Berkeley, also tasked a company named Gryphon Scientific, recently purchased by Deloitte, with seeing how much actionable information experts or novices could get on building a biological or chemical weapon from a version of Sonnet with its safety guardrails off. In one chat, a tester asked how to design and build a weapon that could kill one million people.

Daniel Freeman, a physics Ph.D. who later worked on topics including robotics and language models at Google, is in charge of testing the AI for autonomy. That skill could lead to some of doomers' worst scenarios, like escaping and getting smarter on its own. For this round, the goal was to see how close Sonnet could get to regularly completing computer-programming challenges that would take an entry-level developer at the company between two and eight hours.

They tested its ability to solve advanced machine-learning research problems, such as teaching a virtual robot with four legs to walk. Freeman was also testing whether the AI was smart enough to jailbreak another AI -- that is, to convince the other model to bypass its safety training and do something dangerous.

Cheng, the researcher who runs cyber evals and also has a Ph.D. in quantum physics, set up thousands of capture-the-flag hacking challenges for the model, giving it access to a set of hacking tools it could use.

"We are specifically interested in the most sophisticated, most damaging scenarios," said Cheng.

Nearly two weeks after Anthropic started its latest round of safety evals, there was a smile of qualified relief on Graham's boyish face. The new Sonnet 3.5 had crept closer to the company's next threshold for dangerous capabilities, but hadn't blasted past the red lines.

The team had submitted a recommendation the week before that the new Sonnet 3.5 should still be classified as ASL-2. Now Graham was gathering them for a final recap.

"This is your moment to raise any critical FUD or thing that we need to do imminently before this thing kicks off," Graham said at the 9 a.m. meeting with his lead staff, using an acronym meaning "fear, uncertainty and doubt."

Everyone in the meeting gave the thumbs up. Anthropic released the new Sonnet 3.5 publicly the next day.

Graham remains nervous. Developers at Anthropic and its competitors are improving their AI models quickly. He says his team has only a few months to ramp up what it does to try to keep up.

"What I'm actually concerned about now is how much time do we have until things get concerning," he said." [1]

1. AI Researchers Push Computers To Doom Scenarios --- Anthropic's Frontier Red Team tests the ability to create superhuman harm. Schechner, Sam. Wall Street Journal, Eastern edition; New York, N.Y.. 12 Dec 2024: A.1.

Mokslas, studijos ir ekonomika

Sekėjai

Ieškoti šiame dienoraštyje

Subscribe Now: Feed Icon

Tinklaraščio archyvas

Apie mane

2024 m. gruodžio 12 d., ketvirtadienis

AI Researchers Push Computers To Doom Scenarios --- Anthropic's Frontier Red Team tests the ability to create superhuman harm

Komentarų nėra:

Translate