Why Aren’t There Software-Defined NUMA Servers Everywhere?

For decades, we have been using software to chop up servers with virtualization hypervisors to run many small workloads on a relatively big piece of iron.

This was a boon for X86 servers during the Great Recession, when server virtualization technology was mature enough to be put into production (although not yet perfect), allowing for server consolidation and higher server utilization and even helping some companies skip one or two generations of server upgrades at a time when they could not really spend the money on new iron.

So how come we don’t also use software to tightly couple many cheap servers together to create chunks of shared memory and associated compute to run workloads larger than a single machine? It is a lot easier to program for a shared memory space than it is for a loosely coupled distributed computing system, so this is a bit perplexing to us. Wouldn’t you like to have a machine with handfuls of sockets and thousands of threads and have low-level software manage resource allocation? Why scale out your applications and databases when you could be lazy and scale up?

The technology to create NUMA systems on the fly has existed for many years – remember Virtual Iron, RNA Networks, and ScaleMP? – and has perhaps culminated with TidalScale, which has been taking another run at the idea with its own twists on essentially the same story and which was founded back in March 2012 by Ike Nassi, Kleoni Ioannidou, and Michael Berman.

By the way, those other three software companies that created virtual NUMA servers out of commodity X86 machines are all gone.

Virtual Iron, founded in 2001, disappeared into the gaping maw of Oracle in May 2009. RNA Networks, founded in 2006, was eaten by Dell in June 2011, and that server maker messed around with it a bit and then we never heard about it again. And ScaleMP, founded in 2003, was still going pretty strong in the HPC sector with its eponymous NUMA hypervisor when The Next Platform was founded in 2015, ScaleMP was quietly acquired by SAP in June 2021. (There was no announcement of this as far as we are aware. )

Enter TidalScale

Nassi is the heavy hitter at TidalScale and is still chairman and chief technology officer.

After getting bachelor, master, and PhD degrees in computer science from Stony Brook University in New York, Nassi was an engineer at minicomputer innovator Digital Equipment, and in the 1980s was a vice president at supercomputing NUMA system innovator Encore Computer before joining Apple to help develop its MacOS operating system and languages. (Nassi is one of the creators of the Ada programming language and helped create the Dylan programming language for the Apple Newton handheld.) He founded mesh networking pioneer Firetide in 2001, and then moved on to be chief scientist at the SAP Research arm of the ERP software giant, serving from 2005 through 2011.

Nassi is also a long-time adjunct professor of computer science at the University of California Santa Cruz.

TidalScale has raised $43 million in two rounds of venture funding, and in 2016 brought in Gary Smerdon as chief executive officer. Smerdon is a longtime semiconductor executive from AMD who subsequently did stints at Marvell, Greenfield Networks (acquired by Cisco Systems), Tarari (acquired by LSI Logic where he stayed in various executive roles for six years), Fusion-io, and Pavalion Data Systems.

We spoke with Smerdon when TidalScale 3.0 was launched, back when we were doing Next Platform TV during the height of the coronavirus pandemic. And to our chagrin, we have not given the company the proper coverage of its TidalScale 4.0 release last year. But we are doing so during the TidalScale 4.1 release, which just came out.

The company launched its initial HyperKernel and related system management stack for virtual NUMA servers in 2016 and followed up with a 2.0 release in 2017 but it is with the TidalScale 3.0 product in September 2019 that improved the scalability of the NUMA machines that could be composed with HyperKernel, with a maximum of 64 TB of memory across a software-defined NUMA server, a maximum of 12 TB in any node in the NUMA cluster, up to the point that 128 TB is addressable. This is a limit in the Xeon SP server architecture from Intel, which at the moment is the only CPU that TidalScale supports.

There is nothing, in theory, that keeps TidalScale from supporting AMD Epyc, IBM Power, or Arm server CPU architectures with its HyperKernel. But in practice, a startup has to focus and despite all of the competition Intel is facing, the Xeon SP architecture is still the dominant one in the datacenter.

In a TidalScale NUMA cluster, the server nodes do not have to have the same configuration in terms of compute or memory. With TidalScale 3.0, a maximum of 128 sockets were permitted across the NUMA cluster. Obviously, with these maximums, you can create NUMA clusters with a wide variety of server nodes in an asymmetrical fashion or you can build them like NUMA hardware-based cluster sellers like HPE, IBM, and others do with every node in the cluster being the same.

With the TidalScale 4.0 release, all of these capacities were doubled up. So you could have a maximum of 24 TB for any given node, up to 256 sockets, and up to 128 TB of total memory across a software-defined NUMA cluster.

Pushing Against The Tide, But Also Riding A Rising Wave

In a way, TidalScale is pushing against the hardware tide a little bit. And in several other ways of looking at it, there has never been a better time for an “ubervisor” or a “hyperkernel” or whatever you want to call the kind of hypervisor that Virtual Iron, RNA Networks, ScaleMP, and TidalScale have all individually created.

Moreover, some of the innovations inside of TidalScale’s HyperKernel make it unique compared to its predecessors, and so long as someone doesn’t buy the company and sit on it, there is a chance this software-based NUMA clustering technology could become more pervasive – particularly with PCI-Express, Ethernet, and InfiniBand networks offering high bandwidth and low latency.

First, let’s talk about the nature of compute and NUMA clustering in the datacenter today and how it mitigates against the need for NUMA in a lot of cases.

For one thing, the few CPU designers left in the datacenter market have been implementing better and better NUMA hardware designs over the decades, and this really has involved adding more and faster coherent interconnects as smaller transistors allowed more things to be crammed onto the die. At this point, there is hardly any overhead at all in a two-socket NUMA system, and the operating systems that run on X86 iron – Linux and Windows Server – have long since been tuned up to take advantage of NUMA. Intel supports four-socket and eight-socket systems gluelessly – meaning without any additional external chipset.

With its upcoming “Sapphire Rapids” Xeon SPs, we expect at least 60 usable cores per socket and eight DDR5 memory controllers that will be able to support maybe 4 TB per socket with 256 GB memory sticks, possibly 8 TB per socket with 512 GB memory sticks. This is a very dense socket compared to the prior several generations of Xeon SPs. And Intel will support eight-way NUMA gluelessly, which should mean supporting up to 480 cores, 960 threads, and 64 TB gluelessly. That’s a pretty fat node by modern standards.

With its “Denali” Power10-based Power E1080 server, IBM can support 240 cores that do roughly 2X the work of an X86 core, with 1,920 threads (IBM has eight-way simultaneous multithreading on the Power10, as it did with Power8 and Power9 chips), across 16 sockets and up to 64 TB of physical memory across those sockets. IBM has up to 2 PB of virtual addressing on the Power10 architecture, so it can do more memory whenever it feels like it, and as far as we know, it doesn’t feel like it and is not going to make 512 GB differential DIMMs based on DDR4 memory; 256 GB DDIMMs are the max at the moment.

These hardware-based NUMA servers do not come cheap, which is why HPE and IBM still make them – often to support databases underneath massive, back office enterprise application stacks powered by relational databases. The SAP HANA in-memory database is driving a lot of NUMA iron sales these days.

But the expense and rigid configuration of hardware NUMA clusters is what presents an opportunity for TidalScale even as each socket is getting more powerful and a certain amount of NUMA is built into the chips. And so does an invention at the heart of the TidalScale HyperKernel called the wandering virtual CPU.

In hardware-based NUMA architectures providing a shared memory space and carved up by a server virtualization hypervisor, virtual machines are configured with compute, memory, and I/O and pinned down to those resources. Any data outside of that VM configuration has to be paged into that VM, and this creates latency.

With the HyperKernel, Nassi and his team created a new kind of virtual CPU, one that is not a program running on a carve-up hypervisor, but rather a low-level computational abstraction that has a set of registers and pointers assigned to it. This is a very small abstraction – a page or two of memory that fits inside of an Ethernet packet. There are zillions of these virtual CPUs created as applications are running, and instead of moving data to a fat virtual machines, these virtual CPUs flit around the server nodes and across the networks linked nodes in the software-defined NUMA cluster, moving to where the data is that they need to perform the next operation in an application.

This wandering vCPU architecture is described in detail in this whitepaper, but here is the crux of it: “Each vCPU in a TidalScale system moves to where it can access the data it needs, wherever that data physically sits. The larger the TidalScale system, the more physical CPUs and RAM are available for its wanderings, and the more efficient the wandering vCPU becomes compared to the other options for processing lots of data.”

Efficiencies are gained, in part, with machine learning algorithms that study computation and data access patterns on running applications and databases and vCPUs can be aggregated and dispatched to data, much like a vector math unit can bundle up data and chew on them in parallel.

The net of this is that for those who need a shared memory space of a big NUMA server, to run a fat database management system or in-memory analytics, for instance, customers can build one from normal Xeon SP servers and Ethernet switching, and save money over hardware-based NUMA systems.

“Customers have told us that we lower their overall cost by around 50 percent,” Smerdon tells The Next Platform. “And versus IBM Power Systems, I believe the all-in savings is much higher – probably on the order of 60 percent to 75 percent lower cost overall.”

We can think of other use cases, like the ones that ScaleMP was chasing a decade ago. In a lot of HPC clusters, there are hundreds or thousands of two-socket X86 nodes and then dozens of fat memory X86 nodes based on heftier NUMA iron for those parts of the workload that require more memory capacity to run well.

Imagine if these fat nodes could be configured on the fly as needed from commodity two-socket servers? Better still, you could make a cluster of only really fat memory nodes, or a mix of skinny and fat memory nodes, as needed. As workloads change, the NUMA cluster configuration could change.

Surviving Increasing Memory Instability

What is the most unreliable part of a server? It’s processor? It’s system software? It’s network? Nope. It’s main memory. And as that main memory is getting denser and denser, it is getting easier for a cosmic ray or alpha particle to flip bits and cause all kinds of havoc.

DRAM is getting less and less reliable, says Smerdon, as the capacity on a DRAM chip increases, citing a figure of 5.5X lower reliability from an AMD study from 2017. At the same time, the amount of main memory in the server has gone up by a factor of 10X, and weirdly, because Intel had to tweak memory controllers to support Optane 3D XPoint main memory, one of the things it did was back off error correction bits so it could shoehorn Optane into the architecture, which resulted in a higher rate of runtime uncorrectable errors in memory for the Xeon SP chips. When you multiply that all out, says Smerdon, memory has a failure rate in a Xeon SP server that is about 100X higher than it was a decade ago.

And there are all kinds of horror stories relating to this, which cannot be easily mitigated by saying that these errors can be covered by the fault tolerance mechanisms inherent in modern distributed computing platforms. Without naming names, Smerdon says one hyperscaler had to replace 6,000 servers with over 40 PB of main memory because of high failure rates. (Small wonder, then, that the AMD Epyc architecture is eating market share among the hyperscalers and cloud builders.)

The good news is that DRAM correctable errors, power supply and fan failures, and network and storage loss of connection and device failures all have a high probability of occurring and they all provide some kind of warning by way of system telemetry that a crash is coming. And on a virtual NUMA cluster with many nodes, when such a warning occurs, you therefore have a chance to move workloads and data off that server to an adjacent machine before that crash happens. You fix the server, or replace it, plug it in, let the HyperKernel take over, and you are back in action.

That, in a nutshell, is what the TidalGuard feature of TidalScale 4.1, just released, does. Downtime is expensive, so avoiding it is important. Smerdon says that 85 percent of corporations need to have a minimum of “four nines” of availability for their mission critical systems, and that means having redundant and replicated systems. This is expensive, but worth it given that a third of the enterprises polled by TidalScale say that an hour of downtime costs $1 million or more, and even smaller firms say they lose at least $100,000 per hour that their systems are offline.

By using TidalGuard on the HyperKernel, the mean-time to repair a system is about ten minutes, a factor of 10X to 1,000X improvement over other methods, and customers can add another “two nines” of availability to their infrastructure.

And TidalGuard can also take on a problem we haven’t seen a lot of talk about, which is server wear leveling in the datacenter.

“You know you have to rotate your tires, and we are all familiar with the fact that flash memory has to balance out wear leveling or it doesn’t work,” explains Smerdon. “Now servers – their main memory, really – are wearing out at an ever-increasing rate, but there has been no monitoring of this. But we can monitor this in our clusters, and based on the usage health, we are able to estimate amongst a pool which servers have the most life left and prioritize usage on them first.”

And with this feature, you can start to predict when you will need a new fleet and how long it might last.

Interestingly, reliability, uptime, and agility are the main reason companies are deploying TidalScale today, and scalability and cost savings are secondary priorities.

If software-defined NUMA clustering finally does take off – and we think it could be an important tool in the composable infrastructure toolbox – then TidalScale could become an acquisition target. And given how long it has been in the field and that we have no heard of widespread deployments of its technology, we would bet that TidalScale would be eager to be picked up by a much larger IT supplier to help push adoption more broadly.

If Intel has not returned some memory bits back to the DRAM with Sapphire Rapids Xeon SPs, Intel might be interested in acquiring TidalScale to help correct this problem. But that would mean admitting that it created a problem in the first place, so that seems unlikely.

Given that AMD is limited to machines with one and two sockets, it may be interested. The Arm collective could also use virtual NUMA servers, too, as could one of the cloud suppliers not wanting to buy distinct four-socket and eight-socket servers for their fleets. (Amazon Web Services has been playing around with TidalScale internally, and no doubt other hyperscalers and cloud builders have, too.) VMware, looking for a new angle on server virtualization, might be interested, and ditto for server makers like HPE, Dell, and Lenovo.

Why Aren’t There Software-Defined NUMA Servers Everywhere?

Why Aren’t There Software-Defined NUMA Servers Everywhere?

Enter TidalScale

Pushing Against The Tide, But Also Riding A Rising Wave

Surviving Increasing Memory Instability

Recommend

Update your iPhone ASAP to fix iOS 16's worst bugs

First Apple Watch Ultra and AirPods Pro 2 Orders Arriving to Customers in Austra...

Costco Won't Raise Prices for Membership or $1.50 Hot Dog/soda Combo

Capitol Rioter Who Testified at Jan. 6 Hearing Avoids Jail Time

Nvidia’s “Lovelace” GPU Enters The Datacenter Through The Metaverse

The Effects of Nuclear Weapons

Moving Java Forward with Java 19

电子工程师的新宠：MSO24-平板网络示波器

Viva Suite customers gain expert knowledge management with new Answers feature

‘You’re starting to see all the classic early signs’: Legendary investor Ray Dal...

About Joyk