Microsoft's investment in ChatGPT doesn't just involve money sunk into its maker, OpenAI, but a massive hardware investment in data centers as well, which shows that for now, AI at this scale is something only top-tier companies can afford to build.
The partnership between Microsoft and OpenAI dates back to 2019, when Microsoft invested $1 billion in the AI developer. It upped the ante in January with an additional $10 billion investment.
But ChatGPT has to run on something, and that something is Azure hardware in Microsoft data centers. How much has not been disclosed, but according to a report by Bloomberg, Microsoft had already spent "several hundred million dollars" on hardware used to train ChatGPT.
In a pair of blog posts, Microsoft detailed what went into building the AI infrastructure to run ChatGPT as part of the Bing service. It already offered virtual machines for AI processing built on Nvidia's A100 GPU, called ND A100 v4. Now it is introducing the ND H100 v5 VM based on newer hardware, with VM sizes ranging from eight to thousands of Nvidia H100 GPUs.
In his blog post, Matt Vegas, principal product manager of Azure HPC+AI, wrote that customers will see significantly faster performance for AI models than with the ND A100 v4 VMs. The new VMs are powered by Nvidia H100 Tensor Core GPUs ("Hopper" generation) interconnected via next-gen NVSwitch and NVLink 4.0, along with Nvidia's 400Gb/s Quantum-2 CX7 InfiniBand networking and 4th Gen Intel Xeon Scalable processors ("Sapphire Rapids") with PCIe Gen5 interconnects and DDR5 memory.
Just how much hardware is involved he did not say, but he did say that Microsoft is delivering multiple exaFLOPs of supercomputing power to Azure customers. There is only one exaFLOP supercomputer that we know of, as reported by the latest TOP500 semiannual list of the world's fastest: Frontier at Oak Ridge National Laboratory. But that's the thing about the TOP500: not everyone reports their supercomputers, so there may be other systems out there just as powerful as Frontier that we simply don't know about.
In a separate blog post, Microsoft talked about how the company started working with OpenAI to help create the supercomputers needed for ChatGPT's large language model (and for Microsoft's own Bing Chat). That meant linking up thousands of GPUs in a new way that even Nvidia hadn't thought of, according to Nidhi Chappell, Microsoft head of product for Azure high-performance computing and AI.
“This is not something that you just buy a whole bunch of GPUs, hook them together, and they’ll start working together. There is a lot of system-level optimization to get the best performance, and that comes with a lot of experience over many generations,” Chappell said.
To train a large language model, the workload is partitioned across thousands of GPUs in a cluster, and at certain steps in the process, the GPUs exchange information on the work they've done. An InfiniBand network pushes the data around at high speed, since this synchronization step must be completed before the GPUs can start the next step of processing.
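The compute-then-synchronize pattern described above can be illustrated with a toy sketch. This is not Microsoft's or OpenAI's actual training stack; it simulates the idea in plain Python, with lists standing in for GPUs and a simple averaging function standing in for the all-reduce exchange that would travel over InfiniBand on a real cluster. All names and numbers here are illustrative assumptions.

```python
# Toy sketch of data-parallel training: the workload is split across
# workers ("GPUs"), each computes a partial gradient on its own data
# shard, then a synchronization step averages the gradients so every
# worker applies the same update before the next step begins.

def all_reduce_mean(partials):
    """Average per-worker gradients; every worker gets the same result.
    On a real cluster this exchange is what the InfiniBand fabric carries."""
    n = len(partials)
    return [sum(vals) / n for vals in zip(*partials)]

def train_step(shards, weights, lr=0.1):
    # Each worker computes a gradient on its own shard (toy objective:
    # pull the weights toward the worker's data points).
    partial_grads = [[x - w for x, w in zip(shard, weights)]
                     for shard in shards]
    # Synchronization barrier: no worker proceeds until the averaged
    # gradient is available, mirroring the step described in the article.
    grad = all_reduce_mean(partial_grads)
    return [w + lr * g for w, g in zip(weights, grad)]

# Four "GPUs", each holding one shard of the data; a two-weight model.
shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [2.0, 2.0]]
weights = [0.0, 0.0]
for _ in range(100):
    weights = train_step(shards, weights)
print(weights)  # converges toward the per-coordinate mean: [2.75, 3.5]
```

The point of the sketch is the ordering constraint: the averaging call is a barrier, so the speed of that exchange (the network, in a real system) bounds how fast every step can go.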
The Azure infrastructure is optimized for large language model training, but it took years of incremental improvements to its AI platform to get there. The combination of GPUs, networking hardware, and virtualization software required to deliver Bing AI is immense and is spread across 60 Azure regions around the world.
ND H100 v5 instances are available for preview and will become a standard offering in the Azure portfolio, but Microsoft has not said when. Interested parties can request access to the new VMs.
Copyright © 2023 IDG Communications, Inc.