Thursday, December 5, 2024

Amazon Plans A Cheap Supercomputer Powered by Homegrown AI Chips


"Amazon's cloud-computing arm Amazon Web Services on Tuesday announced plans for an "Ultracluster," a massive AI supercomputer made up of hundreds of thousands of its homegrown Trainium chips, as well as a new server, the latest efforts by its AI chip design lab based in Austin, Texas.

The chip cluster will be used by AI startup Anthropic, in which the retail and cloud-computing giant recently invested an additional $4 billion. The cluster, called Project Rainier, will be located in the U.S. When ready in 2025, it will be one of the largest in the world for training AI models, according to Dave Brown, Amazon Web Services' vice president of compute and networking services.

Amazon Web Services announced a new server called Ultraserver, made up of 64 of its own interconnected chips, at its annual re:Invent conference in Las Vegas on Tuesday. AWS also unveiled Apple as one of its newest chip customers.

Combined, Tuesday's announcements underscore AWS's commitment to Trainium, the in-house-designed silicon the company is positioning as a viable alternative to the graphics processing units, or GPUs, sold by chip giant Nvidia.

The market for AI semiconductors was an estimated $117.5 billion in 2024, and will reach an expected $193.3 billion by the end of 2027, according to research firm International Data Corp. Nvidia commands about 95% of the market for AI chips, according to IDC's December research. "Today, there's really only one choice on the GPU side, and it's just Nvidia," said Matt Garman, chief executive of Amazon Web Services. "We think that customers would appreciate having multiple choices."

A key part of Amazon's AI strategy is to update its custom silicon so that it can not only bring down the costs of AI for its business customers, but also give the company more control over its supply chain. That could make AWS less reliant on Nvidia, one of its closest partners, whose GPUs the company makes available for customers to rent on its cloud platform.

But there is no shortage of companies angling for their share of Nvidia's chip revenues, including AI chip startups such as Groq, Cerebras Systems and SambaNova Systems. Amazon's cloud peers, Microsoft and Alphabet's Google, also are building their own chips for AI and aiming to reduce their reliance on Nvidia.

Amazon has been working on its own hardware for customers since well before 2018, when it released a central processing unit called Graviton based on processor architecture from British chip-designer Arm. Amazon executives say the company aims to run the same playbook that made Graviton a success -- proving to customers that it is a lower-cost but no less capable option than the market leader.

As AI models and data sets have gotten larger, so, too, have the chips and chip clusters that power them. Tech giants aren't just buying up more chips from Nvidia, or designing their own; they're now trying to pack as many as they can in one place.

That's one goal of Amazon's chip cluster, which was built as a collaboration between Amazon's Annapurna Labs and Anthropic: for the AI startup to use the cluster to train and run its future AI models. It is five times larger, by exaflops, than Anthropic's current training cluster, AWS said. By comparison, Elon Musk's xAI recently built a supercomputer it calls Colossus with 100,000 Nvidia Hopper chips.

Amazon's Ultraserver links 64 chips into a single package, combining four servers, each containing 16 Trainium chips. Certain Nvidia GPU servers, by comparison, contain eight chips, Brown said. To link them together to work as one server, which can reach 83.2 petaflops of compute, Amazon's other secret sauce is its networking: creating a technology it calls NeuronLink that can get all four servers to communicate.

That's as much as Amazon could pack into the Ultraserver without overheating it, the company said. But the message isn't strictly, "Choose us or Nvidia," Amazon executives say. Amazon says it is telling customers they can stick with whatever combination of hardware they prefer on its cloud platform.

Eiso Kant, co-founder and chief technology officer of AI coding startup Poolside, said it is getting roughly 40% price savings compared with running its AI models on Nvidia's GPUs. But a downside is that the startup needs to spend more of its engineers' time to get Amazon's associated chip software to work.

However, Amazon fabricates its silicon directly through Taiwan Semiconductor Manufacturing and puts it into its own data centers, making it a "safe bet" for the AI startup, Kant said. Where it places its bets is key, because even a six-month hardware delay could mean the end of its business, he said.

Benoit Dupin, a senior director of machine learning and AI at Apple, said that the smartphone giant is testing Trainium2 chips, and expects to see savings of about 50%.

For most businesses, the choice of Nvidia versus Amazon isn't a pressing question, analysts say. That's because large companies are mostly concerned with how they can get value out of running AI models, rather than getting into the nitty-gritty of actually training them.

The trend is a good thing for Amazon, because it doesn't really need customers to peek under the hood. It can work with firms like cloud-data company Databricks to put Trainium beneath the covers, and most businesses won't notice a difference because computing should just work -- ideally at a lower cost.

Amazon, Google and Microsoft are building their own AI chips because they know their custom designs save time and cost while improving performance, said Chirag Dekate, an analyst at market research and IT consulting firm Gartner. They customize the hardware to offer very specific parallelization functions, he said, which could beat the performance of more general-purpose GPUs.

Company leaders, though, are realistic about how far AWS's chip ambitions can go.

"I actually think most will probably be Nvidia for a long time, because they're 99% of the workloads today, and so that's probably not going to change," AWS CEO Garman said. "But, hopefully, Trainium can carve out a good niche where I actually think it's going to be a great option for many workloads."

---

Lab Team Has 'Scrappy Mindset'

The heart of AWS's efforts is in Austin, Texas, home to an AI chip lab run by Annapurna Labs, an Israeli microelectronics company Amazon acquired for about $350 million in 2015.

The chip lab has been there since Annapurna's startup days, when it was seeking to land in a location where chip giants already had offices, said Gadi Hutt, a director of product and customer engineering who joined the company before the Amazon acquisition.

Inside, engineers might be on the assembly floor one day, while soldering the next, said Rami Sinno, the lab's director of engineering. They do anything that needs to be done, right away -- the sort of scrappy mindset more commonly found among startups than trillion-dollar companies like Amazon.

That's by design, Sinno said, because Annapurna doesn't look for specialists like the rest of the sector. It looks for a board designer, for instance, who is also fluent in signal integrity and power delivery, and who can also write code.

"We design the chip, and the core, and the full server and the rack at the same time. We don't wait for the chip to be ready so we can design the board around it," Sinno said. "It allows the team to go super, super fast."

AWS announced Inferentia in 2018, a machine-learning chip dedicated to inference, which is the process of running data through an AI model so it generates an output. The team went after inference first, because it's a slightly less demanding task than training, said James Hamilton, an Amazon senior vice president and distinguished engineer.

By 2020, Annapurna was ready to go with Trainium, its first chip for customers to train AI models on. Last year, Amazon announced its Trainium2 chip, which the company said is now available for all customers to use. AWS also said it is now working on Trainium3 and Trainium3-based servers, which will be four times more powerful than its Trainium2-based servers." [1]

1. Amazon Plans Supercomputer Powered by Homegrown AI Chips. Lin, Belle. Wall Street Journal, Eastern edition; New York, N.Y. 04 Dec 2024: B.4.
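
A note on the numbers quoted above: a few of the article's figures can be cross-checked with simple arithmetic. The sketch below is not from the article. It only recombines the quoted values (four servers of 16 Trainium chips, 83.2 petaflops per Ultraserver, and IDC's $117.5 billion and $193.3 billion market estimates). The function names are made up for illustration, the per-chip figure is a plain average that ignores numeric precision and interconnect overhead, and treating 2024 to the end of 2027 as three years of growth is an assumption.

```python
# Figures quoted in the article; everything derived from them is illustrative.
SERVERS_PER_ULTRASERVER = 4
CHIPS_PER_SERVER = 16
ULTRASERVER_PETAFLOPS = 83.2

MARKET_2024_BILLION = 117.5   # IDC estimate for 2024
MARKET_2027_BILLION = 193.3   # IDC forecast for the end of 2027
GROWTH_YEARS = 3              # assumption: 2024 -> 2027 counted as three years


def chips_per_ultraserver() -> int:
    """Total chips when four 16-chip servers are linked into one Ultraserver."""
    return SERVERS_PER_ULTRASERVER * CHIPS_PER_SERVER


def petaflops_per_chip() -> float:
    """Plain average of the quoted Ultraserver compute across its chips."""
    return ULTRASERVER_PETAFLOPS / chips_per_ultraserver()


def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start value and an end value."""
    return (end / start) ** (1 / years) - 1


if __name__ == "__main__":
    growth = implied_annual_growth(
        MARKET_2024_BILLION, MARKET_2027_BILLION, GROWTH_YEARS
    )
    print(f"Chips per Ultraserver: {chips_per_ultraserver()}")
    print(f"Average petaflops per chip: {petaflops_per_chip():.2f}")
    print(f"Implied annual market growth: {growth:.1%}")
```

Under these assumptions the chip count matches the 64 stated in the article, the average works out to about 1.3 petaflops per chip, and the IDC figures imply market growth of roughly 18% a year.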