I’ve been writing about VMware and AI for several years, covering everything from the basic “can you virtualize AI workloads?” to how to architect those workloads. It’s been about four years now. So what’s new with VMware AI?
The changing AI infrastructure landscape
When we first started talking about virtualizing AI, the pushback from traditional HPC teams (ok, sales teams) was immediate. There were some good reasons for this. Historically, HPC, DL and ML were in the academic and military domains. These folks ran on-premises only, and they believed virtualizing these workloads would only add to the performance load. And in the beginning, they were correct.
However, as ESXi became more in tune with accelerators (Mellanox, NVIDIA, etc.), it was able to provide the same sort of performance and management gains it had delivered decades before with Oracle.
I’ve been fond of pointing out: AI workloads are workloads. And anyone in operations knows that workloads can be optimized. Virtualization is simply one tool for optimization. At this point, even if there are workloads where you pay a performance tax for virtualization, the management simplifications more than make up for that tax.
Think about it: why would people go to the cloud if this weren’t true? When you request resources on a public cloud, they are all based on VMs. But not all data needed for AI workloads will go to the cloud, so someone has to build for the masses on-prem.
The problem on the academic side is the symposiums that have built out architectural requirements for their members: those requirements don’t include virtualization. I imagine this will change as time goes on.
So what’s new?
Since I started writing about this, there honestly isn’t much new to report. You can virtualize AI workloads! VMware has just improved on what you can do.
For instance, this great VMware blog post talks about general categories where ESXi shines:
- Attaching GPUs to a VM (there are at least three ways!)
- Advice for running containers (of course Tanzu is front and center!)
- The NVIDIA partnership
Lots of new activity in 2021
That post also points to this sizing guide. This is so great! Recommendations on apps, sizing, networking, storage, GPUs, and VMs.
In addition to a sizing guide for AI, there is a vGPU operations guide. How do you virtualize GPUs with vSphere, and what tasks and considerations are there? This guide is great for ops people managing the infrastructure on which data scientists build. How do you upgrade ESXi versions? Can you vMotion during a compute-intense operation if the hardware is failing? This guide explains it for you.
There are sizing guidelines, deployment guides, and overviews to help you understand accelerators that don’t come from NVIDIA (oh hai RoCE).
There’s lots of current information, so to me that’s an indication that VMware is serious about supporting AI workloads.
Project Radium was announced at VMworld 2021. It seems to be led by the Bitfusion team. It is an xLabs project that builds on Bitfusion to extend that feature set beyond NVIDIA GPUs. There are many other accelerator vendors (AMD, Graphcore, Intel, etc.). Shouldn’t virtualization democratize access to them all?
The project is fascinating. There is an application monitor that operates within the context of an application, allowing Radium to split the application in half and run each half on physically different systems. The monitor portion maintains the application state and virtualizes interactions with the machine.
They describe it as ESXi-like features within a user-space application. This is really groundbreaking. The example they give is configuring Radium to send an entire imported Python TensorFlow module over to the remote side.
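To make the "split the application in half" idea a little more concrete, here is a toy sketch in plain Python. To be clear, this is not Radium's code or API; I have no idea what their implementation looks like. The names RemoteSide and RemotedModule are made up for illustration, the "remote side" lives in the same process, and the standard library math module stands in for TensorFlow. All it shows is the general pattern of a local proxy that serializes each call and hands it to whatever side owns the real module.

```python
import importlib
import pickle


class RemoteSide:
    """Hypothetical stand-in for the remote machine: it owns the real module."""

    def __init__(self, module_name):
        self._module = importlib.import_module(module_name)

    def handle(self, request_bytes):
        # Decode the request, run the call here, and serialize the result.
        name, args, kwargs = pickle.loads(request_bytes)
        result = getattr(self._module, name)(*args, **kwargs)
        return pickle.dumps(result)


class RemotedModule:
    """Hypothetical local proxy: looks like a module, forwards every call."""

    def __init__(self, transport):
        # transport is any callable that takes request bytes and returns
        # response bytes; here it is a direct function call, not a network.
        self._transport = transport

    def __getattr__(self, name):
        def call(*args, **kwargs):
            request = pickle.dumps((name, args, kwargs))
            return pickle.loads(self._transport(request))
        return call


# "math" is an assumption standing in for a heavyweight module like TensorFlow.
remote = RemoteSide("math")
math_proxy = RemotedModule(remote.handle)
result = math_proxy.sqrt(16.0)
print(result)
```

In a real system the transport would be a network hop to a box with the accelerator, and something would have to keep application state consistent across both halves, which is presumably the hard part the monitor is there to solve.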
They are promising comprehensive performance benchmarks later in the year (this year?). Additionally, they are working with the ThirdAI and Apache TVM teams to improve performance on existing infrastructure.
The Bitfusion team knows what customers want AI architectures to do, so it’s great to see the progress they are making after the VMware acquisition. Also, ya gotta root for the home team (the Bitfusion team is Austin-based).
The AI space only continues to accelerate. Despite all the hype, we’re still talking about workloads. How do you build the best infrastructure for AI? VMware seems to be doing more than supporting the basics: with things like Project Radium, they are also innovating for the future.