
Article Number: 000145351


Scalability of Dell EMC Ready Solution for HPC Lustre Storage on PowerVault ME4

Article Content


Symptoms

Article written by Jyothi Bhaskar of the HPC and AI Innovation Lab in March 2019

Resolution

Introduction 


In our recent blog on the Dell EMC Ready Solution for HPC Lustre Storage, we introduced the reference architecture of the solution using Dell EMC PowerVault ME4084 storage arrays and presented a performance characterization of it. That original "base configuration" shows how four PowerVault ME4084 arrays can be attached to a pair of Dell EMC PowerEdge R740 servers used as an Object Storage Server (OSS) pair. To cater to a wider range of performance and capacity requirements, we are now introducing flexibility to this configuration.

This blog describes how to implement and scale the Lustre configuration in smaller increments by using fewer than four fully populated ME4084 arrays. These configurations can scale in both performance and capacity. All of these flexible base configurations, as well as the scaled-up configurations, use a single PowerEdge R640 management server that runs the Integrated Manager for Lustre (IML) web interface, and a single Metadata Server (MDS) pair, a pair of PowerEdge R740s attached to a single ME4024 array.

The metadata and object storage components of the Ready Solution for Lustre scale independently. The metadata component, as introduced in the initial blog, already includes flexibility options: a half-populated ME4024 array with 12 drives and a single MDT, or a fully populated ME4024 array with 24 drives and two MDTs using DNE. This metadata portion of the stack remains the same for all of the flexible and scaled-up configurations described in this blog.

Flexible Base Configurations 


The base configuration can now be consumed as a Small, Medium, or Large model. The smallest of the three, the Small Base Configuration, has a single OSS pair, a pair of Dell EMC PowerEdge R740s attached to one fully populated PowerVault ME4084 array, as shown in Figure 1.


Figure 1: Small Base Configuration

    
The next size up is the Medium Base Configuration, which has an OSS pair, a pair of R740s attached to two fully populated ME4084 arrays, as shown in Figure 2.
 

Figure 2: Medium Base Configuration


Next in size is the Large Base Configuration, shown in Figure 3, which has an OSS pair, a pair of R740s, with four fully populated ME4084 arrays. It is the largest of the base configurations because four is the maximum number of ME4084 arrays that an OSS pair with the server hardware configuration described in Table 1 of the initial blog [1] can accommodate while preserving load balancing and high availability.

                                   
Figure 3: Large Base Configuration


Scaling Guidance 


Once we reach the Large Base Configuration, further scaling in capacity and performance requires an additional OSS pair. This additional OSS pair can have one, two, or four ME4084 arrays, as described in the flexible configurations above, and the same pattern applies to any further scaling; a minimal sketch of this decomposition follows below. The base configurations, as well as examples of scaled-up configurations built from them, are shown in Figure 4. In all of the base configurations and scaled configuration examples, the ME4084 arrays are fully populated with 7.2K RPM NL-SAS HDDs, which come in four size options: 4 TB, 8 TB, 10 TB, or 12 TB. The metadata component for each configuration is a pair of R740s attached to a single ME4024 array, with the two options described in the introduction section and the initial blog.
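As an informal illustration of this sizing pattern (our own Python sketch, not a Dell sizing tool; the function name is hypothetical), the snippet below decomposes a target ME4084 array count into OSS pairs of four, two, or one arrays. Since only one-, two-, and four-array pairs are described, a remainder of three arrays is covered with a two-array and a one-array pair.

    def oss_pair_layout(total_arrays):
        """Decompose a target ME4084 array count into OSS pairs.

        Each OSS pair (two PowerEdge R740s) supports at most four ME4084
        arrays, so Large-style pairs are allocated first and any remainder
        is covered with the smaller flexible configurations.
        """
        layout = [4] * (total_arrays // 4)
        remainder = total_arrays % 4
        if remainder == 3:
            layout += [2, 1]          # no three-array pair is described, so split
        elif remainder:
            layout.append(remainder)  # one or two arrays -> Small or Medium pair
        return layout

    print(oss_pair_layout(6))  # [4, 2]: one Large-style pair plus one Medium-style pair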


Figure 4: Base configurations and scaling examples with projected performance and usable space

 

Performance Estimate


The performance of the Large Base Configuration in Figure 3 has been measured and evaluated, and is described in detail in the initial blog. For the rest of the configurations, the performance numbers shown in Figure 4 are estimates or extrapolations based on the fact that scaling up is linear with the addition of ME4084 arrays; scaling down by removing arrays is assumed to be linear as well.

The sustained performance is the steady-state performance of the solution stack over a longer period of time, or at higher thread counts, after saturation has been reached. The more conservative approach is to size a system using the sustained performance rather than occasional peaks. The peak performance, by contrast, shows the maximum extent to which the system can be pushed, in other words, the point where performance hits a bottleneck and cannot grow any further. All of the numbers shown assume minimized caching effects, since they are based on actual measurements from the Large Base Configuration that were performed while minimizing caching; more details can be found in the initial blog. The sketch below illustrates the extrapolation.
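To show how the Figure 4 projections are derived, here is a minimal Python sketch of the linear extrapolation. The per-array throughput figures are placeholders of our own, not measured values; the actual measurements are in the initial blog [1].

    # Placeholder per-array throughput in GB/s; substitute the measured
    # Large Base Configuration numbers from the initial blog divided by 4.
    PEAK_GBS_PER_ARRAY = 5.0       # assumed, NOT a measured value
    SUSTAINED_GBS_PER_ARRAY = 4.0  # assumed, NOT a measured value

    def projected_throughput(num_arrays, per_array_gbs):
        """Linear extrapolation: throughput scales with the ME4084 array count."""
        return num_arrays * per_array_gbs

    for arrays in (1, 2, 4, 8):  # Small, Medium, Large, and one scaled example
        print(f"{arrays} arrays: ~{projected_throughput(arrays, SUSTAINED_GBS_PER_ARRAY):.0f} GB/s sustained (assumed)")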

Lustre Usable Space Calculation

The most accurate way to obtain the usable space with Lustre is real-time monitoring of Lustre statistics: the file system reports its state based on the number of available blocks rather than a pre-calculated value. There is no simple, accurate way to manually calculate Lustre usable space ahead of time. One way to estimate it is to assume a fixed overhead from ldiskfs on top of the RAID 6 usable space calculation. On the live system validated in the initial blog, the overhead from Lustre was found to be slightly less than 1%. This overhead would of course change with different HDD sizes, but although not completely accurate, the 1% overhead can be conservatively assumed to be fairly consistent across large array sizes (considering the multi-TB range as large) with different HDD sizes.

With these assumptions, the formula to estimate usable space with Lustre and RAID 6 (8+2) volumes is shown below. The chart in Figure 4 uses this formula to estimate the usable space of every configuration for each supported capacity of the 7.2K RPM NL-SAS HDDs.
               
    Estimate of Lustre Usable Space in TiB = 0.99 * (number of ME4084 arrays) * (80 RAID HDDs per array) * 0.8 * (HDD size in TB) * 10^12 / 2^40

The usable space is calculated in TiB since most tools, including IML, report usable capacity in powers-of-two units. The 0.99 factor accounts for the ~1% overhead from the file system. 80 is the number of HDDs per ME4084, excluding the hot spares. The 0.8 factor reflects that 80% of the drives in a RAID 6 (8+2) volume hold data (the remaining 20% hold parity and do not count toward usable space). The last factor in the formula, 10^12/2^40, converts the usable space from TB to TiB.
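As an illustration, the following short Python sketch (our own; the function name is hypothetical) implements the formula above and reproduces the estimate for a Large Base Configuration with 12 TB HDDs.

    def lustre_usable_tib(num_arrays, hdd_size_tb):
        """Estimate Lustre usable space in TiB for RAID 6 (8+2) ME4084 volumes.

        0.99           : ~1% ldiskfs/Lustre overhead measured in the initial blog
        80             : RAID HDDs per ME4084 array, excluding hot spares
        0.8            : data fraction of a RAID 6 (8+2) volume
        10**12 / 2**40 : TB-to-TiB conversion
        """
        return 0.99 * num_arrays * 80 * 0.8 * hdd_size_tb * 10**12 / 2**40

    # Large Base Configuration: 4 arrays with 12 TB HDDs
    print(round(lustre_usable_tib(4, 12), 1))  # ~2766.0 TiB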

References 

[1] Initial blog describing the Dell EMC Ready Solution for HPC Lustre Storage with the ME4 storage line: https://www.dell.com/support/article/us/en/19/sln314777/
[2] White paper describing the Dell EMC Ready Solution for HPC Lustre Storage with the ME4 storage line (both EDR and OPA): https://www.dellemc.com/resources/en-us/asset/white-papers/solutions/h17632_ready_hpc_lustre_wp.pdf

Article Properties


Affected Product

High Performance Computing Solution Resources

Last Published Date

10 Apr 2021

Version

3

Article Type

Solution