Dell EMC Ready Solution for HPC Life Sciences: BWA-GATK Pipeline throughput tests with Cascade Lake CPU and Lustre ME4 Refresh

Summary: Dell EMC Ready Solution for HPC Life Sciences: BWA-GATK Pipeline throughput tests with Cascade Lake CPU and Lustre ME4 Refresh

Symptoms

A 64-compute-node configuration of Dell EMC Ready Solutions for HPC Life Sciences can process 194 genomes per day (50x depth of coverage).

Overview

Variant calling is the process of identifying variants from sequence data. It helps determine whether there are single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and/or structural variants (SVs) at a given position in an individual genome or transcriptome. The main goal of identifying genomic variations is to link them to human diseases. Although not all human diseases are associated with genetic variations, variant calling can provide valuable guidance for geneticists working on a disease caused by genetic variations. BWA-GATK is one of the Next Generation Sequencing (NGS) computational tools designed to identify germline and somatic mutations from human NGS data. There are a handful of variant identification tools, and no single tool performs perfectly (1). However, we chose GATK, one of the most popular tools, as our benchmarking tool to demonstrate how well the Dell EMC Ready Solutions for HPC Life Sciences can process complex and massive NGS workloads.
The purpose of this blog is to provide performance information about the Intel® Xeon® Gold 6248 processor for the BWA-GATK pipeline benchmark with Dell EMC Ready Solutions for HPC Lustre Storage (ME4 series refresh) (2). The Xeon® Gold 6248 CPU features 20 physical cores, or 40 logical cores when Hyper-Threading is enabled. The test cluster configuration is summarized in Table 1.

Table 1 Tested compute node configuration

Dell EMC PowerEdge C6420
CPU	2x Xeon® Gold 6248 20 cores 2.5 GHz (Cascade Lake)
RAM	12x 16 GB at 2933 MT/s
OS	RHEL 7.6
Interconnect	Intel® Omni-Path
BIOS System Profile	Performance Optimized
Logical Processor	Disabled
Virtualization Technology	Disabled
BWA	0.7.15-r1140
Samtools	1.6
GATK	3.6-0-g89b7209

The tested compute nodes were connected to Dell EMC Ready Solutions for HPC Lustre Storage via Intel® Omni-Path. The summary configuration of the storage is listed in Table 2.
Table 2 Solution hardware and software specifications

Dell EMC Ready Solution for Lustre Storage
Number of nodes	1x Dell EMC PowerEdge R640 as Integrated Manager for Lustre (IML); 2x Dell EMC PowerEdge R740 as Metadata Server (MDS); 2x Dell EMC PowerEdge R740 as Object Storage Server (OSS)
Processors	IML server: Dual Intel Xeon Gold 5118 @ 2.3 GHz; MDS and OSS servers: Dual Intel Xeon Gold 6136 @ 3.00 GHz
Memory	IML server: 12 x 8 GB 2,666 MT/s DDR4 RDIMMs; MDS and OSS servers: 24 x 16 GiB 2,666 MT/s DDR4 RDIMMs
External storage controllers	2 x Dell 12 Gb/s SAS HBAs (on each MDS); 4 x Dell 12 Gb/s SAS HBAs (on each OSS)
Object storage enclosures	4x ME4084 with a total of 336 x 8TB NL 7.2K rpm SAS HDDs
Metadata storage enclosure	1x ME4024 with 24x 960GB SAS SSDs. Supports up to 4.68 B inodes
RAID controllers	Duplex SAS RAID controllers in the ME4084 and ME4024 enclosures
Operating system	CentOS 7.5 x86_64; Red Hat Enterprise Linux (RHEL) 7.5 x86_64
BIOS version	1.4.5
Intel Omni-Path IFS version	10.8.0.0
Lustre file system version	2.10.4
IML version	4.0.7.0

The test data was chosen from one of Illumina’s Platinum Genomes. ERR194161 was sequenced with an Illumina HiSeq 2000, was submitted by Illumina, and can be obtained from EMBL-EBI. The DNA identifier for this individual is NA12878. According to the data description on the linked website, this sample has a >30x depth of coverage.

Performance Evaluation

Single Sample Multiple Nodes Performance

Figure 1 summarizes the runtime for various numbers of samples and compute nodes with 50x Whole Genome Sequencing (WGS) data. The tests are designed to demonstrate performance at the server level, not to compare individual components. The data points in Figure 1 are based on the total number of samples processed concurrently, with one sample per compute node (X axis in the figure). Details of the BWA-GATK pipeline can be obtained from the Broad Institute web site (3). The maximum number of compute nodes used for the tests is 64 C6420s. C6420s with Lustre ME4 show better scaling behavior than with Lustre MD3.

Figure 1 Performance comparisons between Lustre MD3 and Lustre ME4

Multiple Sample Multiple Nodes Performance

A typical way of running an NGS pipeline is to run multiple samples on each compute node and to use multiple compute nodes to maximize the throughput of NGS data processing. The tests used 64 C6420 compute nodes with five samples per node. Up to 320 samples were processed concurrently to estimate the maximum number of genomes per day without a job failure.
As shown in Figure 2, a single C6420 compute node can process 3.24 whole human genomes at 50x coverage per day when 5 samples are processed concurrently. For each sample, 7 cores and 30 GB of memory are allocated.
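The per-sample allocation can be checked against the node configuration in Table 1 with a few lines of arithmetic. The following is only an illustrative sketch: the function name is hypothetical, and it assumes that all installed memory is available to jobs, which the tests do not state.

```python
# Node resources taken from Table 1 (2x 20-core CPUs, 12x 16 GB DIMMs).
NODE_CORES = 2 * 20
NODE_MEM_GB = 12 * 16

# Per-sample allocation quoted in the text.
CORES_PER_SAMPLE = 7
MEM_PER_SAMPLE_GB = 30


def max_concurrent_samples(node_cores, node_mem_gb, cores_per_sample, mem_per_sample_gb):
    """Return how many samples fit on one node, limited by cores or memory."""
    by_cores = node_cores // cores_per_sample
    by_mem = node_mem_gb // mem_per_sample_gb
    return min(by_cores, by_mem)


samples = max_concurrent_samples(
    NODE_CORES, NODE_MEM_GB, CORES_PER_SAMPLE, MEM_PER_SAMPLE_GB
)
cores_used = samples * CORES_PER_SAMPLE
mem_used_gb = samples * MEM_PER_SAMPLE_GB
print(f"samples per node: {samples}")
print(f"cores used: {cores_used}/{NODE_CORES}, memory used: {mem_used_gb}/{NODE_MEM_GB} GB")
```

Under these assumptions the core count, not memory, is the limiting resource: five samples occupy 35 of the 40 physical cores and 150 of the 192 GB installed.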

Figure 2 Throughput Tests with up to 64 C6420s and the Lustre ME4

With 64 C6420 compute nodes, 320 whole human genomes at 50x coverage can be processed in about 40 hours. In other words, the test configuration delivers 194 genomes per day for whole human genomes with 50x depth of coverage.
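The per-day figure follows from simple rate arithmetic. The sketch below, with helper names of our own choosing, converts a batch size and wall-clock time into genomes per day and compares the 64-node result with ideal linear scaling of the single-node rate. A runtime of exactly 40 hours would give 192 genomes per day, so the reported 194 implies a runtime slightly under 40 hours; the sketch treats the 40-hour figure as rounded, which is an assumption.

```python
def genomes_per_day(n_genomes, wall_hours):
    """Convert a batch of genomes finished in wall_hours into a daily rate."""
    return n_genomes * 24.0 / wall_hours


def hours_for_rate(n_genomes, rate_per_day):
    """Wall-clock hours needed to finish n_genomes at a given daily rate."""
    return n_genomes * 24.0 / rate_per_day


# Figures quoted in the text.
NODES = 64
SAMPLES = 320
REPORTED_RATE = 194.0     # genomes/day on 64 nodes
SINGLE_NODE_RATE = 3.24   # genomes/day on one node

rate_at_40h = genomes_per_day(SAMPLES, 40.0)
implied_hours = hours_for_rate(SAMPLES, REPORTED_RATE)
ideal_rate = NODES * SINGLE_NODE_RATE
efficiency = REPORTED_RATE / ideal_rate

print(f"rate if runtime were exactly 40 h: {rate_at_40h:.1f} genomes/day")
print(f"runtime implied by 194 genomes/day: {implied_hours:.1f} h")
print(f"ideal linear scaling: {ideal_rate:.1f} genomes/day")
print(f"scaling efficiency: {efficiency:.1%}")
```

Read this way, the 64-node run retains roughly 94% of the throughput that perfect linear scaling of the single-node result would give.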

Conclusion

The data size of WGS has been growing constantly. The current average size of WGS is 50x, which is 5 times larger than a typical WGS 4 years ago, when we started to benchmark the BWA-GATK pipeline. The increasing data does not strain storage capacity, since most applications in the pipeline are also bound by CPU clock speed. Hence, with growing data size, the pipeline runs longer rather than generating more writes.
However, a greater number of temporary files is generated during the process because more data needs to be parallelized, and the increased number of temporary files open at the same time exhausts the open file limit in a Linux operating system. One of the applications fails silently when it hits the limit on the number of open files. A simple solution is to increase the limit to >150K.
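One way to catch this condition before a job starts is to check the open-file limit from the process that launches the pipeline. The following is a minimal sketch using Python's standard `resource` module. The threshold of 150,000 is our reading of ">150K", and the function name is hypothetical. The article does not say how the limit was raised in the tests.

```python
import resource

# Threshold suggested in the text (">150K"); the exact value is an assumption.
REQUIRED_NOFILE = 150_000


def ensure_open_file_limit(required):
    """Raise the soft RLIMIT_NOFILE toward `required`, within the hard limit.

    Returns (soft, hard, ok) after the attempt; ok is True when the soft
    limit is at least `required`.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft != unlimited and soft < required:
        # An unprivileged process may raise its soft limit only up to the hard limit.
        target = required if hard == unlimited else min(required, hard)
        if target > soft:
            try:
                resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
            except (ValueError, OSError):
                pass  # Keep the current limit if the platform rejects the change.
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    ok = soft == unlimited or soft >= required
    return soft, hard, ok


soft, hard, ok = ensure_open_file_limit(REQUIRED_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}, sufficient={ok}")
if not ok:
    print("Hard limit is too low; an administrator needs to raise it system-wide.")
```

A process can only raise its own soft limit up to the hard limit, so if the check reports an insufficient limit, the hard limit itself has to be raised by an administrator. On RHEL-family systems this is commonly done through `/etc/security/limits.conf`.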
Nonetheless, the Ready Solution with Lustre ME4 as scratch space has better throughput than the previous version. A 64-node Ready Solution now delivers 194 genomes per day for 50x WGS.

Resources

1. Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform, 2014 Mar, Vol. 15 (2). doi: 10.1093/bib/bbs086.
2. Dell EMC Ready Solution for HPC Lustre Storage. (Article no longer available for reference, pulled by HPC team)
3. Genome Analysis Toolkit. https://software.broadinstitute.org/gatk/

Affected Products

ME Series, Dell EMC Ready Solution Resources, PowerEdge C6420, Dell EMC PowerVault ME4024, Dell EMC PowerVault ME4084, Red Hat Enterprise Linux Version 7

Article Number: 000176939

Article Type: Solution

Last Modified: 11 Jan 2024

Version: 6
