XtremIO Demonstrates World-Class Field-Proven Reliability and Availability

[Editor’s Note – the author of this post, Ehud Rokach, is co-founder and General Manager of XtremIO]

A recent article from our friends at The Register has prompted some speculation about SSD reliability in XtremIO arrays. Lacking both real data and the full context (see below), the article offered speculation that may lead to false conclusions.

To start, let me clear up any speculation. I’m proud to confirm that since we initiated Directed Availability of XtremIO (formally announced in March ’13), during which time we have shipped hundreds of XtremIO X-Bricks to customers worldwide, XtremIO’s system quality level has exceeded our ambitious goals in all dimensions of Reliability and Availability including, specifically, SSD reliability.

There is no need to speculate.

In this post I’ll present actual data we continually collect from our installed base. We monitor every XtremIO all-flash array that is deployed in the field in order to offer top quality service to our customers. It also allows us to accurately track field reliability.

So, below is the full context. No speculation, only facts and numbers.

XtremIO delivers world class 99.9999% (Six Nines) field-proven availability.

During the past 9 months (since initiating Directed Availability), as we deployed hundreds of XtremIO X-Bricks at customer sites globally and across all major verticals, XtremIO arrays demonstrated 99.9999% (Six Nines) field-measured product availability. That’s less than 32 seconds of unavailability in a year, and less than 3 minutes of unavailability over the lifetime of the product.
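For readers who want to check the arithmetic behind those downtime figures, here is a minimal sketch. The function name and the 5-year product lifetime are assumptions made for illustration; the post itself does not state the lifetime used.

```python
# Convert an availability percentage into expected downtime.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 seconds in a non-leap year


def downtime_seconds(availability: float, years: float = 1.0) -> float:
    """Expected unavailable seconds over `years` at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR * years


SIX_NINES = 0.999999
ASSUMED_LIFETIME_YEARS = 5  # assumption, not a figure from the post

per_year = downtime_seconds(SIX_NINES)
over_lifetime = downtime_seconds(SIX_NINES, years=ASSUMED_LIFETIME_YEARS)

print(f"Per year: {per_year:.1f} s")                    # about 31.5 seconds
print(f"Over lifetime: {over_lifetime / 60:.1f} min")   # about 2.6 minutes
```

Under that assumed 5-year lifetime, both results sit just under the 32-second and 3-minute figures quoted above.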

This is world-class Availability.

XtremIO’s outstanding field-measured SSD reliability

  • Our SSD Mean Time Between Part Replacement (MTBPR) was field-measured to be 922,240 hours, or 105 years.
  • Our Annual Replacement Rate (ARR) for SSDs was field-measured to be 0.009. For an entire X-Brick (holding 25 SSDs), the probability of encountering at least one SSD failure during a 1-year period equals 1 - 0.991^25, or about 0.2.
  • A 0.2 yearly probability means that on average, based on our actual field data, you’ll need to replace a failed SSD in an X-Brick roughly once every 5 years. (With XtremIO, SSD endurance limitations have ZERO impact on reliability, as we explain further below.)
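The figures in the list above can be reproduced with a few lines of arithmetic. This is only a sketch: the variable names are illustrative, and it assumes SSD failures are independent and that the per-drive ARR of 0.009 applies equally to all 25 drives.

```python
# Reproduce the SSD reliability arithmetic from the field-measured figures.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in a non-leap year

mtbpr_hours = 922_240   # field-measured MTBPR per SSD
arr_per_ssd = 0.009     # field-measured Annual Replacement Rate per SSD
ssds_per_brick = 25     # SSDs in one X-Brick

# MTBPR expressed in years (about 105).
mtbpr_years = mtbpr_hours / HOURS_PER_YEAR

# Probability that at least one of the 25 SSDs fails within a year,
# assuming independent failures: 1 - (1 - ARR)^25 (about 0.2).
p_any_failure = 1 - (1 - arr_per_ssd) ** ssds_per_brick

# Expected number of SSD replacements per X-Brick per year (0.225),
# and the average interval between replacements that it implies.
expected_per_year = arr_per_ssd * ssds_per_brick
years_between = 1 / expected_per_year

print(f"MTBPR: {mtbpr_years:.1f} years")
print(f"P(at least one failure per year): {p_any_failure:.3f}")
print(f"Average years between replacements: {years_between:.1f}")
```

The expected-count calculation gives an interval of a little over 4 years between replacements per X-Brick, in the same ballpark as the "roughly once every 5 years" reading of the 0.2 probability.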

This is world-class SSD Reliability.

XtremIO is engineered so that SSDs will never get close to their endurance limits—even under the most write-intensive workloads, even after 5 years in production.

Every bit of the XtremIO architecture, algorithms, and software implementation is designed to minimize write amplification, minimize write operations to flash, and ensure optimal wear leveling. We completely avoid system-level garbage collection, deduplicate 100% inline, protect data with XDP – the most write-efficient data protection scheme, and perfectly wear-level through our content addressing engine.

In addition, XtremIO only uses enterprise-grade SSDs—built and tested for enterprise workloads, stress, and reliability.

These are non-trivial design choices. In fact, some of our competitors employ consumer-grade SSDs. Some of them present huge amounts of write-amplification (through garbage collection, post-processing, re-balancing, off-line deduplication, standard RAID). Such consumer-SSD based solutions and write-amplifying designs from certain vendors may prove to be a deadly ticking time-bomb inside of Data Centers. Beware.

Superb quality and the unique EMC advantage

Since EMC acquired XtremIO in 2012, we have been busy integrating and leveraging EMC’s Quality practices across a wide range of disciplines:

  • We work closely with EMC’s Global Hardware Engineering organization. We leverage EMC-qualified hardware, gaining immediate benefit from years of experience and massively deployed hardware platforms we share with other EMC product lines such as VMAX, VNX, and DataDomain.
  • We rely on EMC’s supply chain management and close relationship with the industry’s top quality suppliers.
  • We leverage EMC’s SSD qualification engineering, which runs what’s rightfully considered to be the toughest, most rigorous SSD qualification process in the industry.
  • We build the XtremIO All-Flash Array in multiple EMC manufacturing facilities, the very same facilities that manufacture massive volumes of high-end Enterprise storage arrays every quarter. They know a thing or two about building top-quality Enterprise gear.

To close, our actual measured field performance demonstrates exceptional SSD and array-level reliability. Since initiating XtremIO’s Directed Availability program we have seen a grand total of single-digit SSD failures out of thousands of deployed SSDs. Chris’ Register story referenced (unknowingly) a couple of early DOA SSD failure events during Beta, prior to the product being released for Directed Availability. The two failures during pre-release Beta were analyzed, and corrective action was applied (a firmware update). Not surprisingly, ever since we started Directed Availability and to this very day, we have seen no excess SSD failures of any kind (in the field or DOA). This is indeed a non-issue.

Sorry, the only sensation I can report regarding XtremIO Reliability is our demonstrating second-to-none Six Nines in field-measured product availability.

I sincerely hope this clears up any misconceptions, and I am more than happy to respond to any comments or questions.
