I love chatting about tech, and anytime I get to do so with folks like Josh Odgers, Cody Hosterman, Exchange Goddess, Dan the builder, Duncan Epping, Frank Denneman, Chrissy LeMaire, etc., it’s a good day. They’re all IT celebrities in my book, and I haven’t even scratched the surface of all the folks I follow and interact with. Some folks follow their sports heroes; I follow my IT heroes.
Roland Dreier posted what I felt was a pretty clever tweet here https://twitter.com/rolanddreier/status/1093371982240862208 which was in turn rebutted by Josh. I have mad respect for both of them, but I’m no fan of HCI, so I felt the need to point out that most SAN vendors meet a lot of the differentiators Josh listed. Josh pointed out that he had already schooled me, and while that’s true to some degree, Twitter isn’t the best place to have a good discussion. I told him I might whip something together quick, so here we are.
Let’s start with HCI in general. I’m not a fan overall, but like any tool, it certainly has its place. I would never deny that HCI is perfect for VDI and web-scale apps. What’s more, I wouldn’t even deny that it’s a pretty great solution for SMBs. Where I start to disagree with the value of HCI is when we talk about larger generalized / high-performance workloads.
There are a lot of general points I could make about HCI, but to be fair, Nutanix has overcome a lot of them. The inability to scale storage and compute independently has always been my chief complaint about HCI, and Josh Odgers and Mike Webber have pointed out that Nutanix can in fact scale both independently if needed. So, what’s my issue then? In no particular order, here we go.
Capacity, and to a degree performance, is scaled independently of the compute: more disks = more capacity, performance, and potential for resiliency. My problem comes in not with this, but with my understanding that Nutanix needs CPU and memory from the host. In a world where Microsoft has forced companies to license per core instead of per socket (not a bad thing), there’s now a hard cost to every core in that host, beyond the HW itself. My company runs what I would consider a healthy-sized set of SQL clusters, and we need every single CPU core in each host, and every GB of memory. Not only for the performance, but because SQL Enterprise costs big $$$. To me, HCI imposes a tax of its own. I wish Microsoft, Oracle, and other such vendors would let you exclude certain cores in a host from being licensed, but they don’t, and thus there’s an indirect HCI tax. I am of course anticipating a rebuttal from Josh pointing out that improved storage latency and overall performance will make up for this, but it won’t. SQL will perform better when it has access to the memory for intelligent caching, and the more CPU cores SQL has access to, the more parallel queries it can execute. Ultimately, I consider the compute tax for storage performance unnecessary when SANs can deliver excellent, consistent performance, as well as scale capacity.
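To put a rough shape on that indirect tax, here’s a back-of-envelope sketch. Every number in it is an illustrative assumption on my part (host count, cores reserved for storage services, per-core license price), not a vendor figure:

```python
def hci_license_tax(hosts, cores_reserved_per_host, license_cost_per_core):
    """Licensing dollars tied up in cores the actual workload can't use,
    because per-core licensing doesn't let you exclude them."""
    return hosts * cores_reserved_per_host * license_cost_per_core

# Assumptions: 8 hosts, 4 cores per host consumed by the storage layer,
# and a placeholder per-core price in the ballpark of SQL Enterprise.
tax = hci_license_tax(hosts=8, cores_reserved_per_host=4,
                      license_cost_per_core=7000)
print(f"Illustrative licensing cost of reserved cores: ${tax:,}")
# -> Illustrative licensing cost of reserved cores: $224,000
```

Swap in your own host counts and license pricing; the point is simply that any cores the storage layer consumes still get licensed like workload cores.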
There is another point related to performance: your storage is now, to a degree, restricted by the configuration of your compute. If you are running lots of low-clock-rate cores, it’s unlikely to perform as well as a dedicated storage system with perhaps a better balance of clock rate and cores. Converging the storage system with the compute also means the storage is limited in the resources it can consume, because consuming more has a potential impact on the very workload it’s trying to serve. Truly separating storage from compute enables both tiers to perform at their absolute best.
Josh may also rightly point out that having all that IO coming from singular storage systems ultimately becomes a bottleneck on its own, but I disagree to a point. First, let’s start by acknowledging that HCI can probably deliver better throughput overall if there is storage in the host nodes. Essentially, in this case, Nutanix has the potential to deliver throughput that SAN / NAS can’t, or at least not without each host having massive uplinks. The reason I’m going to put that aside as a non-issue is that most SANs are fast nowadays, and 10Gb is on the way to being replaced by 25Gb or even 100Gb. It’s also worth noting that my experience has been that other than SQL, most of my VMs don’t drive a ton of throughput. Sure, backups might, but I never see my SANs stressed to a point where I wish my storage was local. So, for me the SAN bottleneck is a theoretical issue, not a reality. What is a reality, though, is that I now reap the benefits of having fewer, larger SANs delivering my data. Like virtualization, which served the purpose of piling lots of underutilized servers on a single host to better utilize resources, I find that SAN solves a similar problem for storage. With SAN we get to pile hundreds of VMs that might only require 50 IOPS on a busy day onto a few larger shared systems. To me, this is a better TCO than having all my storage in a host or even in a scaled-out storage system. This is also completely ignoring the port costs associated with scale-out storage designs.
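The consolidation math above can be sketched in a few lines. Again, every figure here is a made-up illustration (VM count, per-VM IOPS, array rating, headroom factor), not a benchmark:

```python
import math

def arrays_needed(vm_count, iops_per_vm, array_iops_capacity, headroom=0.7):
    """How many shared arrays cover the aggregate IOPS demand,
    while only running each array at a fraction of its rating."""
    usable = array_iops_capacity * headroom
    return math.ceil((vm_count * iops_per_vm) / usable)

# 400 VMs at ~50 IOPS each on a busy day = 20,000 IOPS aggregate.
# Against a hypothetical midrange array rated at 100,000 IOPS, run at 70%:
print(arrays_needed(vm_count=400, iops_per_vm=50, array_iops_capacity=100_000))
# -> 1
```

That’s the virtualization argument applied to storage: hundreds of mostly idle consumers fit comfortably on one well-utilized shared system.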
My next big concern with the Nutanix design is having truly discrete fault zones. The Nutanix software itself is ultimately a singular fault zone. All it takes is a bad bug: a condition that a developer didn’t consider has the potential to wreak havoc on your entire cluster. To my knowledge, Nutanix clusters have to own all the storage they access. It’s not that they don’t have fault zones; in fact, I love the concept of per-host / per-rack fault zones. But a Nutanix cluster itself is still a singular fault zone. When I utilize SAN + compute, I now have a true fault zone per SAN. What’s more, each compute cluster can have access to each SAN, thus allowing me to have multiple SAN fault zones AND multiple compute fault zones. Nutanix is effectively forced into each cluster being a singular fault zone. While this is probably fine for really large customers, it is cost inefficient for those of us with far fewer resources who are still concerned about resiliency.
Now Josh may point out that it’s unlikely a SW bug will cause a massive outage, and based on statistics, I suspect he’s probably right. Still, we’ve all heard about the HPE 3PAR issue in Australia; I also seem to recall an EMC Symmetrix issue in Virginia, and I too have experienced software issues in EqualLogic and DotHill arrays that have taken down storage. My point is, if you are truly paranoid about resiliency and fault zones, the best way to mitigate that risk is by separating as many dependencies as possible. HCI is not immune to SW bugs; in fact, one could argue there’s a higher likelihood given the increased integration.
There are other concerns I have with HCI, but these are the two I wanted to point out. Knowing the rebuttals I’ve seen from Josh in the past, I suspect he’ll tear me to shreds, but I’m cool with that. I have no problem being wrong. Ultimately, I love technology, and I love chatting about it. We’re here to solve business challenges, so if Josh can sway me, I’m all for it.