
The integration of CPU, GPU, and DPU is the inevitable architecture of the future data center

Feb 02
It has become a worldwide consensus that data is an important resource and factor of production. The fulcrum behind all of this, the data center, where data is computed and stored, will surely be the ground on which future technology companies compete.

The data center has moved past the mainframe era, in which it handled a single critical task, and it has also worked through the software-defined data center's problem of how to make the best use of resources when running multiple services. At present, data centers are shifting from vertical scaling to horizontal scale-out, and existing computing power has become a bottleneck, said Song Qingchun, senior director of market development for the Asia-Pacific region of NVIDIA's networking business unit.

The GPU is a good answer to the computing power bottleneck, but only within a single machine. For the wider data center, and especially for security and performance isolation, how should the problem be solved?

NVIDIA chose the DPU. "CPU, GPU, and DPU, the '3U', are now all indispensable in the data center. This is the basis for the data center to become the computing unit, and for computing power to become a service," Song Qingchun pointed out.

The DPU, or Data Processing Unit, is a processor for data center infrastructure. From one perspective, its arrival frees up CPU and GPU resources. In NVIDIA's view, it brings a new way of thinking to a data-centric computing architecture: the DPU runs the communication framework, storage framework, security framework, and workload isolation, offloading that work so that CPU and GPU computing resources go to the application and performance is released more fully. Song Qingchun said that with the DPU, communication and computation can be overlapped, so that communication in HPC workloads is accelerated through the DPU while the CPU and GPU perform the real floating-point computation.
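A minimal sketch, in C with MPI, of the overlap pattern described here, assuming a simple ring exchange: the host posts non-blocking communication, does floating-point work while the transfer is in flight, then completes the exchange. On a BlueField-equipped cluster the progression of the transfer is handled by the networking stack and DPU rather than by this application code; the buffer length, neighbor choice, and workload are illustrative.

```c
/* Sketch: overlapping communication with computation using non-blocking MPI.
 * The ring exchange, buffer length, and workload are illustrative only.
 * Build with an MPI compiler wrapper, e.g. `mpicc overlap.c -o overlap`. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)   /* illustrative message length, in doubles */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *send = malloc(N * sizeof(double));
    double *recv = malloc(N * sizeof(double));
    double *work = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) { send[i] = (double)rank; work[i] = 1.0; }

    int next = (rank + 1) % size;          /* ring neighbors */
    int prev = (rank - 1 + size) % size;
    MPI_Request reqs[2];

    /* 1. Post the exchange without blocking the host. */
    MPI_Irecv(recv, N, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send, N, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2. Do floating-point work while the fabric (and DPU, where present)
     *    moves the data, instead of waiting idle for the transfer. */
    double sum = 0.0;
    for (int i = 0; i < N; i++) sum += work[i] * work[i];

    /* 3. Complete the exchange before using the received buffer. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    free(send);
    free(recv);
    free(work);
    MPI_Finalize();
    return 0;
}
```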

He pointed out that the DPU makes up for the data center's missing acceleration of infrastructure services and enables a new architecture that integrates the 3U, turning the data center into a new computing unit; in his view this architecture is inevitable.

At GTC 2021, NVIDIA released Quantum-2, a new generation of its InfiniBand networking platform, comprising the NVIDIA Quantum-2 switch, the ConnectX-7® network adapter, the BlueField-3® DPU, and all the software that supports the new architecture. It is the most advanced end-to-end networking platform to date.

Song Qingchun said that Quantum-2 is a computing network that truly meets the needs of both supercomputing and cloud-native networking: when supercomputers and cloud-native supercomputing systems need to deliver high performance, every resource must take part in the computation.

As data moves across the network, many communication patterns constrain the performance of the overall system, and the traditional von Neumann computing model leads to network congestion. Neither raising bandwidth nor cutting latency solves this problem, so how to keep improving data center performance has become a new challenge for the industry.

Computation should happen where the data is, Song Qingchun pointed out. A data-centric architecture can address packet loss and other bottlenecks in network transmission and reduce communication latency by more than a factor of ten, which is why in-network computing has become one of the key technologies of the current data-centric architecture.

With a throughput of 400Gb/s, NVIDIA Quantum-2 InfiniBand doubles network speed and triples the number of network ports. While improving performance by 3x, it also reduces the number of switches a data center network requires by a factor of 6, and cuts the data center's energy consumption and space each by 7%.

The NVIDIA Quantum-2 platform provides performance isolation between tenants, so one tenant's behavior does not interfere with the others. It also uses advanced telemetry-based, cloud-native congestion control to keep data throughput reliable and unaffected by peaks in user or application demand.

NVIDIA Quantum-2's SHARPv3 in-network computing technology provides 32 times more acceleration engines for AI applications than the previous generation. Together with the NVIDIA UFM® Cyber-AI platform, it offers data centers advanced InfiniBand network management capabilities, including predictive maintenance.
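The collective this acceleration targets in AI training is typically an allreduce over gradients. Below is a minimal sketch in C with MPI of that collective as the application sees it, with an illustrative gradient length; whether the reduction is actually executed inside the switch fabric by SHARP depends on the MPI library and cluster configuration, not on this code.

```c
/* Sketch: the allreduce collective that in-network computing (SHARP)
 * accelerates. The application calls ordinary MPI; whether the partial
 * sums are aggregated in the switches is decided by the MPI/collective
 * library and cluster configuration. Gradient length is illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define NGRAD (1 << 22)   /* illustrative gradient length */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *grad = malloc(NGRAD * sizeof(float));
    for (int i = 0; i < NGRAD; i++)
        grad[i] = (float)rank;   /* stand-in for locally computed gradients */

    /* Sum gradients across all ranks in place; with SHARP enabled the
     * reduction can be performed inside the InfiniBand fabric instead of
     * on the hosts, freeing CPU/GPU cycles for computation. */
    MPI_Allreduce(MPI_IN_PLACE, grad, NGRAD, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    free(grad);
    MPI_Finalize();
    return 0;
}
```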

The NVIDIA Quantum-2 platform integrates nanosecond-precision timing that can synchronize distributed applications, such as database processing, helping to reduce waiting and idle time. This new capability lets the cloud data center become part of the telecommunications network and host software-defined 5G wireless services.
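As a rough illustration of what precise time synchronization enables, the sketch below timestamps an event with nanosecond resolution using standard POSIX calls. The event name is hypothetical, and such timestamps are only directly comparable across nodes when the nodes' clocks are tightly disciplined by a platform-level timing service like the one described here.

```c
/* Sketch: tagging a distributed event with a nanosecond wall-clock
 * timestamp. Cross-node ordering by timestamp only works when node
 * clocks are tightly synchronized; the event name is illustrative. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in nanoseconds since the Unix epoch. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    /* On synchronized nodes, such timestamps can be compared directly to
     * order events (e.g. transaction commits) without extra round trips. */
    uint64_t t = now_ns();
    printf("commit_event ts_ns=%llu\n", (unsigned long long)t);
    return 0;
}
```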

Compared with traditional supercomputing platforms, Song Qingchun explained, Quantum-2 lets the network itself take part in computation. On the Quantum-2 platform, advanced in-network computing, dynamic routing, and congestion control keep workloads performance-isolated, so when multiple workloads run at once each can reach its best performance, and a supercomputer moved onto the cloud can keep bare-metal performance. The Quantum-2 InfiniBand DPU can even overlap computation and communication, which opens up another optimization: put the computation on the CPU and GPU and the communication framework on the DPU. For some workloads, such as fast Fourier transforms (for example 3D FFT), this can achieve performance even better than bare metal. Therefore, for anyone building a cloud-native technology platform, Quantum-2 is the best network platform to support cloud native.
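A distributed 3D FFT alternates local 1-D transforms with a global transpose, which is an all-to-all exchange. The sketch below, in C with MPI, shows how a non-blocking all-to-all lets the transpose of one batch progress over the fabric while the host computes the next batch. The two-batch pipeline, slab sizes, and stand-in transform are illustrative assumptions; production FFT libraries organize this differently.

```c
/* Sketch: overlapping the all-to-all "transpose" step of a distributed
 * FFT with local compute, using a non-blocking collective. Sizes and the
 * stand-in transform are illustrative. Build e.g. `mpicc fft_overlap.c -lm`. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

#define CHUNK 4096   /* illustrative per-rank slab size per peer */

/* Stand-in for a local 1-D transform pass over one slab. */
static void local_fft_pass(double *slab, int n)
{
    for (int i = 0; i < n; i++)
        slab[i] = sin(slab[i]);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = CHUNK * size;                   /* CHUNK doubles exchanged with every rank */
    double *a = calloc(n, sizeof(double));  /* batch currently being exchanged */
    double *b = calloc(n, sizeof(double));  /* next batch, computed during the exchange */
    double *r = calloc(n, sizeof(double));  /* receive buffer for the transpose */

    local_fft_pass(a, n);                   /* finish batch A locally */

    MPI_Request req;
    /* Start the global transpose of batch A without blocking the host. */
    MPI_Ialltoall(a, CHUNK, MPI_DOUBLE, r, CHUNK, MPI_DOUBLE,
                  MPI_COMM_WORLD, &req);

    /* Compute batch B while the network (and DPU, where present)
     * progresses the transpose of batch A. */
    local_fft_pass(b, n);

    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* transpose of A is now complete */

    free(a); free(b); free(r);
    MPI_Finalize();
    return 0;
}
```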

On the concept of cloud native, Song Qingchun said that from NVIDIA's perspective the name may change in the future, but the technology will certainly keep moving in that direction. Computing power has become a resource, and the goals of energy saving and emission reduction called for by governments, of improving performance while cutting power consumption, all come down to the hope that the data center can deliver maximum performance with the lowest power consumption and the least equipment. So there is no doubt that the cloud-native direction of improving performance is the right one.