Talking Data Center Architecture With Jensen Huang




The deal for Nvidia to acquire Mellanox, which was announced last March for $6.9 billion, has finally been approved by all of the regulatory bodies around the world and closed today. The combination of the two companies presents many possibilities, some of which we have explored here and there at The Next Platform over the past year.

We all have ideas, but the ones that matter most are the ones that have been conceived and thought through by Nvidia co-founder and chief executive officer Jensen Huang. It was a very busy day for Huang, to be sure, but he took the time to talk to us about how Nvidia and Mellanox together will help create the system architecture of the future. It was a long conversation between two people who think in paragraphs and who love systems, so grab a coffee.

Timothy Prickett Morgan: I have been dying to have this conversation with you since March last year. It was pretty clear in 2018 that someone was going to buy Mellanox. It could have been AMD, Intel, or IBM. I am personally glad it was Nvidia. Networking companies have historically struggled when they merged into Intel, to put it politely, but it looks like it may be different with Barefoot Networks. A long time ago, I argued that what IBM should do is roll the OpenPower Consortium into one company and put it all together to become a really strong single competitor to Intel in the market. That obviously did not happen.

But what I am trying to figure out now is this: You have Mellanox, and you have it for what I think is a very good price. It turned out to be a much stronger company than many of us had realized, which is interesting in its own right. And there is all kinds of really good technology that you can deploy. So what is it that makes Mellanox such a good fit for Nvidia?

Jensen Huang: The first thing we know is that Amdahl's Law holds. And one of the things we do, as you know, is accelerate computing. We take a problem and refactor it from the software down through the system, the chips, and the interconnects, and as a result we accelerate the application by orders of magnitude. It seemed almost illogical when we started doing it that there was somehow enough performance left on the table for a company to speed up an application by a couple of orders of magnitude. In some cases we were delivering 10X, 20X, even 30X speedups, taking something that would have taken weeks to run and cutting it down to hours.
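
To make the Amdahl's Law point concrete: if a fraction $p$ of an application's runtime is accelerated by a factor $s$, the overall speedup $S$ is bounded by the part that is not accelerated. (The numbers below are illustrative, not Nvidia's.)

$$S = \frac{1}{(1 - p) + \dfrac{p}{s}}$$

With $p = 0.95$ and $s = 100$, $S = 1/(0.05 + 0.0095) \approx 16.8$. Even a 100X accelerator yields less than a 17X overall speedup, which is why, as Huang explains next, "everything else," and ultimately the network, becomes the problem.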

It took a decade and a half for people to realize that this computing model makes perfect sense for problems that are difficult to solve and will remain difficult to solve for a long time to come. And so we created accelerated computing, and it has taken a long time, but the tipping point is behind us.

Now when you take a large-scale problem that spans the entire data center and doesn't fit on a single computer, and you speed up the computational part by several orders of magnitude, the bottleneck becomes everything else. So we started solving everything else, piece by piece. But the one piece you can never eliminate is the connection between multiple computers, because we will always have problems that are bigger than one computer, hopefully. And when a problem is bigger than a computer, the network becomes the problem, and it has to be very fast. That is why our relationship with Mellanox goes back a decade, and we have been working with them for quite some time.

The network problem is much, much more complex than simply making the network faster and faster. And the reason is the amount of data you are transmitting, synchronizing, collecting, and reducing across this distributed, data center-scale computer; computing on the fabric itself is complicated.

TPM: When you say that, are you referring to the computation that you are embedding in the switches, or are you also talking about SmartNICs? And I want to talk about that in a second because, in my opinion, the SmartNICs and the ideas that Mellanox has developed are probably more important right now than how much bandwidth we can get next year. Taking things off the CPU, or the GPU, that don't belong there is key to boosting computational efficiency for the entire system.

Jensen Huang: That's right. So, for example, you never want to copy the same data twice. Ideally, you never move the data at all. And if you have to move the data, ideally you compress it, or somehow reduce it, before moving it. So the intelligence about what data to move, when to move it, in what format to move it, and what computation you do on it before moving it, all of that is in-network computing. And we do some of that. We did some of that with NCCL, which was the breakthrough that allowed us to do RDMA directly into the memory of the GPU and do collectives and reductions on the network using our GPUs. Mellanox does the same on the network switch side.
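
Huang is describing collective operations such as allreduce, in which partial results from many GPUs are summed and redistributed. As a rough illustration of why the communication pattern matters as much as raw bandwidth, here is a toy ring allreduce in plain Python; NumPy arrays stand in for GPU buffers, and in a real system NCCL, RDMA NICs, or the switch itself would do this work:

```python
import numpy as np

def ring_allreduce(bufs):
    """Toy simulation of a ring allreduce over N peers.

    Each peer's buffer is split into N chunks; partial sums travel
    around the ring (reduce-scatter), then the finished chunks travel
    around again (allgather). Each peer moves only about 2x its buffer
    in total, regardless of N, which is why this pattern is used for
    multi-GPU gradient reduction.
    """
    n = len(bufs)
    chunks = [list(np.array_split(b.astype(float), n)) for b in bufs]

    # Reduce-scatter: at each step, peer r sends one chunk to peer r+1,
    # which accumulates it into its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][(r - step) % n].copy())
                 for r in range(n)]
        for r, c, data in sends:
            chunks[(r + 1) % n][c] += data

    # Allgather: circulate the fully reduced chunks around the ring
    # so that every peer ends up holding all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][(r + 1 - step) % n].copy())
                 for r in range(n)]
        for r, c, data in sends:
            chunks[(r + 1) % n][c] = data

    return [np.concatenate(cs) for cs in chunks]

# Four peers, each holding a buffer of its own rank; everyone ends up
# with the elementwise sum 0 + 1 + 2 + 3 = 6.
peers = [np.full(8, i) for i in range(4)]
out = ring_allreduce(peers)
assert all(np.allclose(o, 6.0) for o in out)
```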

And the point is, when you move data around a big computer, it's not just about brute force, because there is simply too much data; you want to be smart about it. So the name SmartNIC is apt, because the NIC could preprocess the data, it could compress the data, or it could avoid the move entirely.
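
That "compress it before moving it" decision reduces to simple arithmetic: compression only wins when the time to compress plus send the smaller payload beats sending the raw bytes. A minimal sketch, with throughput numbers that are purely illustrative assumptions:

```python
def compress_wins(bytes_raw, ratio, link_gbps, compress_gbps):
    """Return True if compress-then-send beats sending raw bytes.

    ratio: compressed size / raw size (e.g. 0.4 for 2.5:1 compression)
    link_gbps / compress_gbps: link bandwidth and compression
    throughput in gigabits per second (illustrative assumptions).
    """
    bits = bytes_raw * 8
    t_raw = bits / (link_gbps * 1e9)
    t_comp = bits / (compress_gbps * 1e9) + (bits * ratio) / (link_gbps * 1e9)
    return t_comp < t_raw

# A 1 GB transfer over a 100 Gb/sec link with 2.5:1 compression running
# at 400 Gb/sec: compressing at line rate on the NIC saves time, while
# a slow software compressor on the CPU path would not.
print(compress_wins(1 << 30, 0.4, link_gbps=100, compress_gbps=400))  # True
```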

Putting intelligence into the network, and processing on the network, is vital to performance. And it's not just about the data rate, exactly, because the data rate can only take you so far, and it only improves about as fast as Moore's Law, if that. You want to trick the laws of physics, not confront them.

TPM: My point was about how important it is for a SmartNIC to do offload work for the GPU. I can see why you sacrifice a third of your cores when you buy an X86 processor if you keep all of these networking functions there. There must be some work that the GPU is doing on its own that is network related, and a SmartNIC might do a much cheaper version of that preprocessing and lighten the load on the GPU so it can do more work. I don't know if that's logically true or not; that's one of the things I'm trying to understand.

Jensen Huang: That part of it is not super logical, but everything else you said is right. I mean, the point is, we don't want to run the network software on the CPU; it doesn't make sense. A lot of the data movement is done on the CPU, and that makes no sense either. You need to offload that to a data processing unit, or DPU, which is what a SmartNIC really is. Today, many data centers secure every packet that is transmitted, because you want to shrink the attack surface of the data center down to basically every transaction. There is no way you can do that on the CPU. Therefore, you have to move the network stack, you want to move the security stack, and you want to move the data processing and data movement stacks. And this is something you want to do on the NIC before the data even enters the computer, and on the NIC before it leaves the computer.

TPM: There are cases that Mellanox has demonstrated with storage, for example, where you do not need a host processor as we know it. And it is very streamlined. They use the Arm-based "Bluefield" chip in the NIC, you put Ceph or whatever on it, you cluster them together, you have NVM-Express over Fabrics, and bam, you have a distributed storage system with no host as we know it. There is no X86 processor. And I envision a world where you could make disaggregated, composable GPU compute blocks with these kinds of Bluefield hosts doing some of the housekeeping work for them. GPUs may require a host, but a full server is not necessarily required.

Jensen Huang: Onion, celery, and carrots, you know, the holy trinity of computer soup, are the CPU, GPU, and DPU. These three processors are fundamental to computing, and if you had a world-class processor for each, you would have a really great computer. What you want to do is run the right job on the right processor. There is a place for CPUs. In fact, there are three types of processors that are necessary. The first is the CPU. The CPU is the catch-all for everything that doesn't fit anywhere else, and it is good to have it. If I had to bet my life on it, I would always, always want to have a CPU, and the reason is that I will always think of an idea that needs a CPU, and it is always there for you. However, once you know what algorithm you want to run, what the data formats are, and how you want to stream it, the best way to do it is on the other two processors.

In the case of Mellanox, of course, things move between computers, and they move between storage. The bits and bytes that traverse the network need to be secured with deep packet inspection and all that, and all of that processing has to be done in the SmartNIC, which will eventually become a DPU. A DPU will be programmable, it will do all the processing that you and I have talked about, it will offload data movement and do fine-grained processing of the data as it is transmitted, and it will keep CPUs and GPUs from ever being bothered, avoiding redundant copies of data. That is the architecture of the future. And that is why we are so excited about Mellanox.

And the combination of Mellanox and Nvidia makes so much sense because we push computing to its limits more than anyone, and we expose the weaknesses of all the other elements of the computer more severely and sooner than anyone else. And if we solve those problems, we solve them for everyone.

TPM: Yes, I understand. It is rare to see a server like the DGX-2, which has eight 100 Gb/sec NICs. And soon you will be able to double that to 200 Gb/sec, and not too long from now double it again to 400 Gb/sec.
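
As a back-of-the-envelope check on those numbers, eight 100 Gb/sec ports give a DGX-2 class machine on the order of 100 GB/sec of aggregate network bandwidth:

$$8 \times 100~\mathrm{Gb/sec} = 800~\mathrm{Gb/sec} = 100~\mathrm{GB/sec}$$

and the same arithmetic would double that to 200 GB/sec with 200 Gb/sec ports and double it again to 400 GB/sec with 400 Gb/sec ports.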

Jensen Huang: And even with that, the amount of algorithmic work used to reduce memory copies, to compress memory, and to preprocess data before any transmission is done is extreme.

On top of the amount of software being written, we call that entire layer Magnum IO. Magnum IO includes NCCL, and it includes RDMA into our GPUs and RDMA in the NICs: on the GPU side we call it NCCL, and on the Mellanox side they call it UCX. All of that software for efficient copying and efficient transmission of data is really complex, and it sits inside the layer we call Magnum IO. So the amount of software on top of the silicon is really quite substantial. This is one of the reasons why it makes so much sense for us to be together.

The first reason, the strategic reason, of course, is that we have now combined the forces of two companies that are intensely focused on high performance computing. We work on two of the biggest problems: one is computing, the other is networking. If these two problems can be solved in harmony, we can make significant progress in computing.

The second reason, and you mentioned the idea of disaggregation and composability earlier, is that this trend is moving very fast.

You know well that the most powerful computing revolution in the past two decades was cloud computing, and what made it possible was the simplicity of scaling out with hyperconverged servers, where everything fits in one box. If you want more storage, buy another server. If you want more CPU or memory, buy another server. That was easy to manage, easy to program, easy to scale, and it started the cloud computing revolution. What has happened in the past ten years, and is accelerating particularly now, is the emergence of artificial intelligence and the explosive growth of data. The hyperconverged way of scaling became very inefficient, and that is how we got to this idea of disaggregation and composability.

Disaggregation was really a concept that would not have been practical had it not been for the work Mellanox did with RDMA and with the storage providers. That disaggregation and composability logic applies perfectly to GPUs. So when cloud data centers started moving towards AI, they needed servers that were good at accelerating AI, and CPUs were not right for that. Instead of installing GPUs in every server and waiting until the data center was refreshed with new hyperconverged infrastructure with GPUs in the machines, they could disaggregate the GPUs, put GPU servers anywhere, put storage servers anywhere, and orchestrate everything using Kubernetes.
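
To make the orchestration point concrete, this is roughly what it looks like when a workload asks Kubernetes for a GPU somewhere, anywhere, in the data center. This is a minimal sketch using the Kubernetes Python client; the container image name is hypothetical, and it assumes the cluster runs Nvidia's device plugin, which exposes GPUs as the nvidia.com/gpu resource:

```python
from kubernetes import client, config

config.load_kube_config()  # use the local kubeconfig credentials

# Ask the scheduler for any node with a free GPU; the pod does not
# care which physical server in the data center that turns out to be.
pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="training-job"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="trainer",
            image="example.com/trainer:latest",  # hypothetical image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}  # one disaggregated GPU
            ),
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

The scheduler, not the application, decides which physical server supplies the GPU, which is exactly the disaggregation Huang is describing.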

And so, in this new world of microservices and containers, you are now composing your data center out of disaggregated computing elements, in whatever shape and size makes perfect sense for the workload. When you think about it, it is the fabric that made this possible, and that is why Mellanox has knocked it out of the park. They enabled disaggregation, and because of that, east-west traffic became intensely heavy. But the data center became much easier to compose, and utilization goes up and performance goes up because you can now place accelerators like GPUs anywhere you want. And so it all came together in this new style of data center that is disaggregated, composable, and accelerated.

TPM: I think you just don't want capacity stranded anywhere. That is what I am opposed to. And you will never be able to perfectly match CPU capacity to GPU capacity to FPGA capacity, or whatever you need in the mix, to the workflows running in the data center. This should be able to change on the fly, and we really aren't there yet. The hyperscalers do a good job of disaggregating, but I would say their kind of composability is not something mere mortals get to play with. I don't think anyone is fully good at it yet. Otherwise, we could all instantiate whatever components we wanted in a public cloud, and we can't. That is why there are instance types.

Jensen Huang: There are still some parts missing, and I'm eager to show you some of the technology we're creating that makes composing easier. But I will say this: the pieces are coming together. There is the fundamental ability of Kubernetes to compose a disaggregated data center. The networks are being upgraded; that's one of the reasons Mellanox is doing so well. People are upgrading to 25 Gb/sec as fast as they can. It took them a long time to move beyond 10 Gb/sec, but people are moving very fast now, and the reason is these composed and orchestrated microservices and containerized applications, which drive a lot of east-west traffic. Once you upgrade the switches and NICs, the performance of the whole data center really increases. And the added benefit is that once your east-west bandwidth is that high, you can communicate with a GPU server anywhere in the data center and incorporate it into your composition. Once you incorporate that into your composition, your deep learning performance simply skyrockets.

So two things have to happen. We need to upgrade data centers much faster to allow much more east-west traffic, which then makes Nvidia accelerators anywhere in the data center available to all workloads. And Nvidia has to make these AI accelerators much, much better at shifting between training and inference, between scale-up and scale-out. They just have to be much more fungible. If they are much more fungible, any workload can use them. Today, our Volta GPUs are really designed for scale-up training, and our Turing GPUs are designed for scale-out inference. That is fine at the start of the AI revolution, but if you want your data center to be fully programmable, then the processors in it, including the GPUs, really want to be much more flexible.

TPM: How do you do that? How do you reconcile that with NVSwitch, which under the covers makes a memory-addressable, atomic interconnect? In other words, a NUMA GPU server is essentially shared memory and shared compute, and you run it as a unit. Can you spread that over an InfiniBand or Ethernet fabric and still get those atomics? Can you ever do that? Or is it just stupid because of latency and other issues? In other words, will you always need something like NVSwitch to scale GPU compute up and then something like InfiniBand or Ethernet to scale it out, so you will need both? Your own Saturn-V supercomputer does both, but "Summit" and "Sierra" do not, because NVSwitch was not available at the time the bids went in to the US Department of Energy.

Jensen Huang: That is the challenge. A scale-up computer is designed in such a way that it is inefficient for scaling out.

TPM: But it is easier to program, so you get some benefit from that.

Jensen Huang: We want to find a solution. But of course, no solution will do both at once. You don't get a virtualized system that scales up and scales out simultaneously.

TPM: I have never seen one in all these years. I have seen people claim it, but there is always a little asterisk at the bottom: oh wait, this is only good for messaging apps; don't run a database and SAP applications on this. Software-based NUMA, for example, has generally had many limitations. I feel like the same kinds of problems apply to what we are talking about here.

Jensen Huang: If we narrow the problem down a bit, and consider it not as a multi-tenant system but as a configurable computer, it is probably possible to create something.

I think it is a solvable problem. Mellanox, like all great companies and their products, is not universal in everything, but it is excellent at the things it promises to do. And with the combination of Mellanox and Kubernetes and the trend towards disaggregation, we could create a new style of data center that is good in today's world but that gets us to a much more composable data center in the world of tomorrow.

TPM: I need to ask you a housekeeping question. How do you run this? I mean, you have partnerships with a lot of your competitors; it is the nature of the business. Mellanox also has partnerships with many of its competitors. You know you don't sell compute without networking. Do you run this at arm's length? Or do you just merge it into your data center group? IBM has taken a hands-off attitude so far with Red Hat, and I think it is working for them. But I don't think that is necessarily required in this case. What do you think about how to integrate Mellanox?

Jensen Huang: It will be a business unit, Mellanox will be our networking brand, and Israel will be our networking hub. We are going to use Mellanox technology in gaming, in high performance computing, in hyperscale, at the edge, in robotics, in self-driving cars. Remember, with data processing, high-speed data is essential to everything and anything related to high performance computing and artificial intelligence. They have such deep expertise in networking, storage, and security. My enthusiasm is to use Mellanox in all of these areas.

As for working with the rest of the industry, we will continue to be open. We work closely with Intel, for example, building laptops. If you look at our Max-Q laptops, they are very thin, but each one is a game console in a thin and small package, and putting an RTX 2080 in it is a technical miracle. And we work very closely with AMD. The relationships at the management and engineering levels are much more collaborative than people think. We give them our first samples. We get their first samples. We are all very good at maintaining confidentiality. We have teams that work with Intel, we have teams that work with AMD, and we have teams that work with other companies, and we will continue like this. The industry is not advanced by one company conquering it; it is advanced by all. Interoperability is important in building computers, and that is our sensibility.

TPM: I have one last question and then I will let you go, and it goes against the spirit of what you just said. Sort of. But I have to ask, because I am always curious. You have GPU computing, and you basically own that market. Yes, you have some competition coming from AMD, with the Radeon Instinct line improving, and who knows what Intel Xe is going to be, but it is coming and we will see. You could build a matrix math unit, not a GPU, any day you wanted: break the Tensor Cores out and have something that looks and smells like a TPU or some of these other neural network engines. You have the networking now. I wanted you to have server CPUs a long time ago, and Project Denver excited me way back when.

Now I know you don't need it. You don't have to do it. But it sure would be interesting if you did. So do you think there is a place for an Nvidia server CPU? You already make Tegra Arm chips for your client devices, so you could make an Arm server processor. You have easy access to the Arm Neoverse roadmap. You have a close relationship with TSMC. You really could do it all if you wanted to and still be open to everything else; nothing changes. But the question is: Can you get market share? Does it make money? Can you do a better job than the people who are already there? I could even see you doing RISC-V; you could be the first credible RISC-V server chip provider if you wanted to be.

So what do you think about that when you think about the CPU?

Jensen Huang: That's a great question, and there are all kinds of ways to dissect it. But really, I think about it through a single lens, the same lens I use for almost everything. And here is the question: What unique contribution can we make? I always start there.

You know, I have no problem working with other people's technologies, as long as in doing so we can make a unique contribution that moves the world forward. And it's something people on the Nvidia conference calls hear me say all the time: We shouldn't squander our incredible time, resources, and expertise doing something that somebody else already does, for the sole purpose of taking share. Taking share is not as constructive for the world as creating something new. And I prefer not to waste our resources if possible.

If we are caught in a situation where the only way to advance our state of the art is to become a world-class memory designer, then we will become one, and it turns out that Nvidia is, for example, a world-class SRAM designer. The reason, and people don't know this, is that a GPU has a ton more cache and bandwidth distributed across the chip than any other processor, and we had to learn that to create something new. I have no problem doing that. But in general, I have to ask myself: What are the new things we can do?

Now, in the case of Mellanox, it allows us to create something the world does not have. And you and I have spent a lot of time talking about it. This is the next giant architecture. The really exciting thing right now is not building yet another server. The exciting thing for the world is that the server is no longer the unit of computing; the data center is the unit of computing. You are going to program a data center, not a server.

TPM: Well, this is where I would come back at you: if I asked you to make a CPU for me, personally, I would say that we need a processor that does not have its memory and I/O so rigidly defined by it. That is the trick. I recently said the same thing to Renee James, the chief executive officer of Ampere Computing. Stop putting PCI-Express controllers and Ethernet controllers and memory controllers on the die, and start using more generic, faster SerDes, as IBM is doing in part with its Power chips. IBM is right. It can be done. Once we have these SerDes, they can be the network interface, they can be a NUMA link, or they can be part of an extended fabric. Then we can dial the amount of memory and I/O we need up or down, within the server or across servers; make that composable, too. The problem is that we have this old way of building CPUs, and it needs to be broken. I want the CPU literally broken apart from memory. I want disaggregated main memory and disaggregated I/O, not just flash pools and GPU pools. I think we are stuck, and this is what keeps the infrastructure from being composable.

Jensen Huang: I think your dream will come true, okay? It is a great dream, and it is the correct dream. But it is not an easy dream to achieve, and it turns out that building a new CPU might not be the answer. In fact, you and I have already talked around this. One of the most important things to disaggregate from the server node and its CPU is data processing. That is gigantic when you consider the number of CPU cores running unnecessary software in the data center. I don't know how many, maybe 30 to 50 percent.

TPM: I think you're right. It is likely that 30 percent of computing cycles could be offloaded from the CPU, and there is probably another 20 percent of work that never gets done because the clock cycles are simply spinning, waiting for data.

Jensen Huang: Absorbed by doing things that can be done almost infinitely fast on a DPU in a SmartNIC. My attitude is not to think of the server as the computer, but to think of a CPU server, a GPU server, a storage server, and a programmable switch as computing elements inside the computer, where the unit of computing is now the data center. In that world, networking is very important. And in that world, knowing how to build an end-to-end computer, and recognizing the importance of the software stack, which is so complicated from top to bottom, is where we focus. This new world where the data center is the computer is really exciting, and I think we have the essential pieces.

TPM: I think we can both retire when I’m done. [Laughter]

Jensen Huang: So, you know, we’ll go build something else.

TPM: It's true, I'm not going to stop working; that's stupid, that's what kills you.

Jensen Huang: But that is the radical change that we are seeing right now, and we are about to make it happen. The people running these data centers are smart, and they recognize the incredible underutilization of today's data center. I really think that when you offload data processing onto the SmartNIC, when you can disaggregate the converged server, you can put accelerators anywhere in the data center, and then you can compose and reconfigure that data center for a specific workload. That is a revolution.
