Article Number: 124151


Dell EMC Ready Solutions for HPC Life Sciences: BWA-GATK Pipeline performance tests with BeeGFS

Article Content


Resolution

Overview

The purpose of this blog is to provide performance information for the BWA-GATK pipeline benchmark on Dell EMC Ready Solutions for HPC BeeGFS Storage. Unfortunately, we were not able to set up enough compute nodes and a BeeGFS storage system large enough to compare against the performance results previously published for Lustre storage. However, the results are helpful for estimating the computational resources required for a given variant-calling workload.

 

The test cluster configurations are summarized in Table 1.
 

Table 1 Tested compute node configuration

Server: Dell EMC PowerEdge C6420
CPU: 2x Intel Xeon® Gold 6248, 20 cores, 2.5 GHz (Cascade Lake)
RAM: 12x 16 GB at 2933 MT/s
OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)
Interconnect: Mellanox EDR InfiniBand
BIOS System Profile: Performance Optimized
Logical Processor: Disabled
Virtualization Technology: Disabled
BWA: 0.7.15-r1140
Sambamba: 0.7.0
Samtools: 1.6
GATK: 3.6-0-g89b7209

 

The tested compute nodes were connected to the BeeGFS storage via Mellanox EDR InfiniBand switches: the BeeGFS storage connects to a bridge EDR switch, which in turn connects to an additional EDR switch that serves all compute nodes. The storage configuration is summarized in Table 2.

 

Table 2 BeeGFS solution hardware and software specifications

Management server: 1x Dell EMC PowerEdge R640
MDS: 2x Dell EMC PowerEdge R740
Storage servers (SS): 2x Dell EMC PowerEdge R740
Processors: Management server: dual Intel Xeon Gold 5218; MDS and SS servers: dual Intel Xeon Gold 6230
Memory: Management server: 12x 8 GB 2666 MT/s DDR4 RDIMMs; MDS and SS servers: 12x 32 GB 2933 MT/s DDR4 RDIMMs
Local disks and RAID controller: Management server: PERC H740P integrated RAID with 8 GB NV cache, 6x 300 GB 15K SAS HDDs in RAID 10; MDS and SS servers: PERC H330+ integrated RAID, 2x 300 GB 15K SAS HDDs in RAID 1 for the OS
InfiniBand HCA: Mellanox ConnectX-6 HDR100 InfiniBand adapter
External storage controllers: On each MDS: 2x Dell 12 Gb/s SAS HBAs; on each SS: 4x Dell 12 Gb/s SAS HBAs
Object storage enclosures: 4x Dell EMC PowerVault ME4084, fully populated with a total of 336 drives
Metadata storage enclosure: 1x Dell EMC PowerVault ME4024 with 24 SSDs
RAID controllers: Duplex RAID controllers in the ME4084 and ME4024 enclosures
HDDs/SSDs: Each ME4084 enclosure: 84x 8 TB 3.5 in. 7.2K RPM NL SAS3 HDDs; ME4024 enclosure: 24x 960 GB SAS3 SSDs
Operating system: CentOS Linux release 8.1.1911 (Core)
Kernel version: 4.18.0-147.5.1.el8_1.x86_64
Mellanox OFED version: 4.7-3.2.9.0
BeeGFS file system version: 7.2 (beta2)


 

The test data was chosen from one of Illumina's Platinum Genomes. ERR194161 was sequenced on an Illumina HiSeq 2000, was submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. The description on the linked website states that this sample has a >30x depth of coverage; it actually reaches ~53x.

 

Performance Evaluation

Multiple Sample/Multiple Nodes Performance

A typical way of running an NGS pipeline is to process multiple samples on each compute node and use multiple compute nodes to maximize throughput. Eight C6420 compute nodes were used for the tests, with seven samples per node. Hence, up to 56 samples were processed concurrently to estimate the maximum number of genomes per day without a job failure.

As shown in Figure 1, a single C6420 compute node can process 3.69 50x whole human genomes per day when 7 samples are processed together. Each sample is allocated 5 cores and 20 GB of memory.
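For context, a per-sample run with the tool versions from Table 1 might look like the sketch below. This is an illustration, not the exact benchmark script: the reference path, FASTQ names, read-group string, and output names are placeholders, and the thread count and Java heap simply mirror the 5-core/20 GB per-sample allocation described above.

```shell
#!/bin/bash
# Hypothetical per-sample BWA-GATK command sequence (a sketch; all paths
# and names are placeholders, not the benchmark's actual inputs).
set -euo pipefail

THREADS=5                 # 5 cores per sample, matching the test allocation
REF=ref/genome.fa         # placeholder BWA-indexed reference genome
SAMPLE=ERR194161

# 1. Align reads and sort the output (BWA 0.7.15 + Samtools 1.6)
bwa mem -t ${THREADS} \
    -R "@RG\tID:${SAMPLE}\tSM:NA12878\tPL:ILLUMINA" \
    ${REF} ${SAMPLE}_1.fastq.gz ${SAMPLE}_2.fastq.gz |
    samtools sort -@ ${THREADS} -o ${SAMPLE}.sorted.bam -
samtools index ${SAMPLE}.sorted.bam

# 2. Mark duplicates (Sambamba 0.7.0)
sambamba markdup -t ${THREADS} ${SAMPLE}.sorted.bam ${SAMPLE}.dedup.bam

# 3. Call variants (GATK 3.6 HaplotypeCaller); -Xmx20g matches the
#    20 GB per-sample memory allocation
java -Xmx20g -jar GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R ${REF} \
    -I ${SAMPLE}.dedup.bam \
    -o ${SAMPLE}.g.vcf \
    --emitRefConfidence GVCF
```

The sketch requires the aligner, duplicate-marking, and variant-calling binaries plus an indexed reference and paired-end FASTQ files, so it is shown as a command fragment rather than a runnable script.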


Figure 1 Throughput tests with up to 8x C6420 nodes with BeeGFS

Fifty-six 50x whole human genomes can be processed on 8 C6420 compute nodes in ~54 hours. In other words, the test configuration delivers 25.11 genomes per day for whole human genomes with a 50x depth of coverage.
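As a quick sanity check on the figures above, dividing the 56 genomes by the approximate ~54-hour wall-clock time lands close to the reported 25.11 genomes per day (which is presumably based on the exact, unrounded runtime):

```shell
# Back-of-the-envelope throughput check using the rounded values
# quoted in the text (56 genomes in ~54 hours on 8 nodes).
samples=56   # 7 samples/node x 8 nodes
hours=54     # approximate wall-clock time
awk -v s="$samples" -v h="$hours" \
    'BEGIN { printf "cluster throughput: %.2f genomes/day\n", s / (h / 24) }'
# -> cluster throughput: 24.89 genomes/day
```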

 

Conclusion

The data size of whole-genome sequencing (WGS) has been growing constantly. The current average depth of WGS is about 55x, roughly 5 times larger than a typical WGS run 4 years ago, when we started benchmarking the BWA-GATK pipeline. The increasing data size does not strain the storage side, since most applications in the pipeline are also bound by CPU clock speed. Hence, a larger data size makes the pipeline run longer rather than generate heavier I/O.

However, more temporary files are generated during processing because the larger data must be split up for parallelization, and the increased number of temporary files opened at the same time can exhaust the open file limit of the Linux operating system. One of the applications silently fails to complete when it hits this limit. A simple solution is to increase the limit to >150K.
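One way to inspect and raise the limit, as a sketch for a typical Linux setup (the account name below is a placeholder, not from the original tests):

```shell
# Check the current per-process open file (nofile) soft limit:
ulimit -n

# To raise the limit above 150K persistently, add entries like these to
# /etc/security/limits.conf (requires root; takes effect at next login;
# "hpcuser" is a placeholder account name):
#   hpcuser  soft  nofile  163840
#   hpcuser  hard  nofile  163840
```

A session-local `ulimit -n <value>` also works for testing, but it cannot exceed the hard limit and is lost when the shell exits, so the limits.conf route is the durable fix for batch jobs.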

The results in Figure 1 show that the throughput tests did not hit the maximum capacity of the system. Since there was no sign of a significant slowdown as more samples were added, it should be possible to process more than 7 samples per node if the compute nodes are set up with more memory. Overall, the BeeGFS storage is a suitable scratch storage for NGS data processing.

Article Properties

Last Published Date: 23 Nov 2020
Version: 2
Article Type: Solution
