Dell EMC Ready Solution for HPC Life Sciences: Tuxedo Pipeline with Cascade Lake CPU and Lustre/ME4 Refresh

Summary: This article covers the whitepaper titled "Dell EMC Ready Solution for HPC Life Sciences: Tuxedo Pipeline with Cascade Lake CPU and Lustre/ME4 Refresh".

Instructions

Note: Article written by Kihoon Yoon of HPC and AI Innovation Lab in December 2019
Together, the new hardware and the updated pipeline increase throughput about three times compared with the previous Ready Solution.

Overview
Gene expression analysis is as important as identifying Single Nucleotide Polymorphisms (SNPs), insertions/deletions (indels), or chromosomal restructuring. Ultimately, all physiological and biochemical events depend on the final gene expression products, proteins. Although most mammals have an additional layer of control before protein expression, knowing how many transcripts exist in a system helps to characterize the biochemical status of a cell. Ideally, a technology that could quantify all the proteins in a cell would significantly advance Life Science; however, we are far from achieving that.
In this blog, we test a popular RNA-Seq data analysis pipeline known as the Tuxedo pipeline (1). The Tuxedo suite offers a set of tools for analyzing a variety of RNA-Seq data, including short-read mapping, identification of splice junctions, transcript and isoform detection, differential expression, visualization, and quality control metrics. The detailed steps in the pipeline are shown in Figure 1. Compared with the version tested in the previous blog (2), this updated version of the Tuxedo pipeline adds a Cuffquant step.

Figure 1 Updated Tuxedo Pipeline with Cuffquant Step
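
The article does not list the exact command lines that were benchmarked, so the following is only a minimal sketch of how the steps in Figure 1 could be chained. The file names, the output directory layout, the helper name tuxedo_commands, and the default of 40 threads (2x 20 cores with Logical Processor disabled) are assumptions for illustration; the options used (-p, -G, -o, -g, -s, -L) are the commonly documented TopHat and Cufflinks ones. The sketch treats each sample as its own condition.

```python
from pathlib import Path


def tuxedo_commands(samples, index, gtf, genome, threads=40, out="out"):
    """Build argv lists for the pipeline steps, in the order of Figure 1.

    samples maps a sample name to its (reads_1, reads_2) FASTQ pair.
    All paths and the thread count are illustrative assumptions.
    """
    out = Path(out)
    p = str(threads)
    cmds, bams, cxbs = [], {}, []

    # Tophat: align each sample's paired-end reads against the Bowtie2 index.
    for name, (r1, r2) in samples.items():
        th_dir = out / "tophat" / name
        cmds.append(["tophat", "-p", p, "-G", gtf, "-o", str(th_dir), index, r1, r2])
        bams[name] = str(th_dir / "accepted_hits.bam")

    # Cufflinks: assemble transcripts per sample.
    for name, bam in bams.items():
        cmds.append(["cufflinks", "-p", p, "-o", str(out / "cufflinks" / name), bam])

    # Cuffmerge: merge assemblies. The caller must write assemblies.txt,
    # one transcripts.gtf path per line.
    merged_dir = out / "merged"
    cmds.append(["cuffmerge", "-g", gtf, "-s", genome, "-p", p,
                 "-o", str(merged_dir), str(out / "assemblies.txt")])
    merged = str(merged_dir / "merged.gtf")

    # Cuffquant: quantify each sample once against the merged annotation.
    for name, bam in bams.items():
        q_dir = out / "cuffquant" / name
        cmds.append(["cuffquant", "-p", p, "-o", str(q_dir), merged, bam])
        cxbs.append(str(q_dir / "abundances.cxb"))

    # Cuffdiff and Cuffnorm both consume the Cuffquant abundance files.
    labels = ",".join(bams)
    for tool in ("cuffdiff", "cuffnorm"):
        cmds.append([tool, "-p", p, "-o", str(out / tool), "-L", labels, merged, *cxbs])
    return cmds
```

Each returned list can be passed to subprocess.run or written into a job script; the CummeRbund step is an R analysis of the Cuffdiff output and is not covered here.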

The configurations of the test cluster are summarized in Table 1.

Table 1 Tested compute node configuration
Dell EMC PowerEdge C6420
CPU	2x Xeon® Gold 6248 20c 2.5GHz (Cascade Lake)
RAM	12x 16GB @2933 MT/s
OS	RHEL 7.6
Interconnect	Intel® Omni-Path
BIOS System Profile	Performance Optimized
Logical Processor	Disabled
Virtualization Technology	Disabled
tophat	2.1.1
bowtie2	2.2.5
R	3.6
bioconductor-cummerbund	2.26.0

The tested compute nodes were connected to Dell EMC Ready Solution for Lustre Storage via Intel® Omni-Path (3). The summary configuration of the storage is listed in Table 2.

Table 2 Lustre Storage Solution hardware and software specifications
Dell EMC Ready Solution for Lustre Storage
Number of nodes	1x Dell EMC PowerEdge R640 as Integrated Manager for Lustre (IML); 2x Dell EMC PowerEdge R740 as Metadata Server (MDS); 2x Dell EMC PowerEdge R740 as Object Storage Server (OSS)
Processors	IML server: Dual Intel Xeon Gold 5118 @ 2.3 GHz; MDS and OSS servers: Dual Intel Xeon Gold 6136 @ 3.00 GHz
Memory	IML server: 12 x 8 GB 2,666 MT/s DDR4 RDIMMs; MDS and OSS servers: 24 x 16 GiB 2,666 MT/s DDR4 RDIMMs
External storage controllers	2 x Dell 12 Gb/s SAS HBAs (on each MDS); 4 x Dell 12 Gb/s SAS HBAs (on each OSS)
Object storage enclosures	4x ME4084 with a total of 336 x 8TB NL 7.2K rpm SAS HDDs
Metadata storage enclosure	1x ME4024 with 24x 960GB SAS SSDs. Supports up to 4.688B files/inodes
RAID controllers	Duplex RAID in the ME4084 and ME4024 enclosures
Operating system	CentOS 7.5 x86_64; Red Hat Enterprise Linux (RHEL) 7.5 x86_64
Kernel version	3.10.0-862.el7.x86_64
BIOS version	1.4.5
Intel Omni-Path IFS version	10.8.0.0
Lustre file system version	2.10.4
IML version	4.0.7.0

A performance study of an RNA-Seq pipeline is not trivial because the workflow, by its nature, requires non-identical input files. We collected 185 RNA-Seq paired-end read data sets from a public data repository. All the read data files contain around 25 Million Fragments (MF) and have similar read lengths. The samples for each test were randomly selected from this pool of 185 paired-end read files. Although randomly selected data carry no biological meaning, their high level of noise puts the tests in a worst-case scenario.
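
The random selection described above can be sketched as follows. The pool entries are hypothetical placeholder names (the article does not list the 185 accessions), and the fixed seed is an assumption added only so that a test run can be repeated.

```python
import random


def pick_samples(pool, n, seed=None):
    """Draw n distinct samples from the pool without replacement."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.sample(pool, n)


# Hypothetical names standing in for the 185 paired-end read files.
pool = [f"sample_{i:03d}" for i in range(1, 186)]
selected = pick_samples(pool, 8, seed=2019)
```

Sampling without replacement keeps every input file in a test distinct, which matches the requirement for non-identical inputs.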
Performance Evaluation
Two-Sample Test
Figure 2 plots the runtime of each step. The test ran on two compute nodes with two samples, each containing approximately 25 million RNA-Seq reads. The Tophat step starts for each sample on a compute node in parallel. Cufflinks then begins upon the completion of Tophat. The Cuffmerge step combines the results from the two Cufflinks runs. The Cuffquant step is added to quantify gene expression in each sample, and the results are examined further in the Cuffdiff and Cuffnorm steps. The last step, CummeRbund, is a statistical analysis step from the CummeRbund R package, and it generates a visualized report as shown in Figure 3.

Figure 2 Total runtime for Tuxedo pipeline with two samples: SRR1608490 and SRR934809.
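
The step ordering in the two-sample test can be expressed as a dependency graph. The sketch below is one way to derive which steps can run concurrently; the step labels and the helper name pipeline_stages are assumptions for illustration, and it uses graphlib from the Python standard library (3.9 or later) rather than whatever scheduler the test cluster actually used.

```python
from graphlib import TopologicalSorter


def pipeline_stages(samples):
    """Group pipeline steps into stages whose members can run in parallel."""
    deps = {}
    for s in samples:
        deps[f"tophat:{s}"] = set()                   # no prerequisites
        deps[f"cufflinks:{s}"] = {f"tophat:{s}"}      # needs that sample's alignments
    deps["cuffmerge"] = {f"cufflinks:{s}" for s in samples}
    for s in samples:
        deps[f"cuffquant:{s}"] = {"cuffmerge", f"tophat:{s}"}
    quant = {f"cuffquant:{s}" for s in samples}
    deps["cuffdiff"] = set(quant)                     # starts together with cuffnorm
    deps["cuffnorm"] = set(quant)
    deps["cummerbund"] = {"cuffdiff"}                 # reports on cuffdiff output

    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())            # everything runnable right now
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = pipeline_stages(["SRR1608490", "SRR934809"])
```

For the two samples in Figure 2 this yields six stages: both Tophat runs, both Cufflinks runs, Cuffmerge, both Cuffquant runs, Cuffdiff together with Cuffnorm, and finally CummeRbund.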

Figure 3 shows differentially expressed genes from an 8-sample run (each sample consists of 4 duplicates) in red; these genes have significantly lower p-values (Y-axis) than the other genes, which are shown in black (see the note after Resources). The X-axis is the fold change in log base 2, and the fold change of each gene is plotted against its p-value. More samples will give a better estimate of gene expression. The upper-right region of the plot shows gene expression in sample 2 compared with sample 1, whereas the lower-left region shows gene expression in sample 1 compared with sample 2. Genes shown as black dots are not significantly different between the two samples.

Figure 3 Volcano plot of the Cuffdiff results
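
As a rough illustration of how a point on such a volcano plot is derived, the sketch below computes the log2 fold change and a p-value-based Y coordinate for one gene. The function name, the use of -log10 for the Y-axis, and the 0.05 cutoff are assumptions for illustration; the article does not state the threshold used, and Cuffdiff's own significance call may differ.

```python
import math


def volcano_point(value_1, value_2, p_value, alpha=0.05):
    """Return (log2 fold change, -log10 p-value, is_significant) for one gene.

    value_1 and value_2 are the expression levels in sample 1 and sample 2
    and must be positive; alpha is an assumed significance cutoff.
    """
    if value_1 <= 0 or value_2 <= 0:
        raise ValueError("expression values must be positive to take a log ratio")
    log2_fc = math.log2(value_2 / value_1)  # > 0 means higher in sample 2
    neg_log10_p = -math.log10(p_value)      # smaller p-values plot higher
    return log2_fc, neg_log10_p, p_value < alpha


x, y, significant = volcano_point(1.0, 4.0, 0.001)
```

A gene with a fourfold higher expression in sample 2 lands at x = 2 on the log2 axis, and it would be drawn in red only if its p-value falls below the chosen cutoff.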
Throughput Test – Single pipeline with more than two samples, biological and technical duplicates
Typical RNA-Seq studies consist of multiple samples, sometimes hundreds of different samples, such as normal versus diseased or untreated versus treated samples. These samples tend to have a high level of noise for biological reasons; hence, the analysis requires a rigorous data preprocessing procedure.

We tested various numbers of samples (all different RNA-Seq data selected from the pool of 185 paired-end read files) to see how much data 8 nodes in a PowerEdge C6420 cluster can process. As shown in Figure 4, the runtimes with 2, 4, 8, 16, 32, and 64 samples grow exponentially as the number of samples increases. The throughput in Billion Fragments/Day increased nearly three times with Cascade Lake 6248/Lustre ME4 storage and the updated pipeline, compared with the previous Skylake 6148/H600 configuration.

Figure 4 Throughput comparisons with 8x C6420s between Cascade Lake 6248/LustreME4 and Skylake 6148/H600

The Cuffmerge step does not slow down as the number of samples grows, whereas the Cuffdiff and Cuffnorm steps slow down significantly. In particular, the Cuffdiff step becomes the bottleneck for the pipeline because its running time grows exponentially (Figure 5). Although Cuffnorm's runtime also increases exponentially, this is negligible because Cuffnorm's runtime is bounded by Cuffdiff's runtime. Adding the Cuffquant step improved the runtime of Cuffdiff significantly: Cuffdiff ran 30 hours faster, and Cuffnorm completed 20 hours faster. However, the gain from Cuffnorm is not visible in the total runtime because Cuffdiff and Cuffnorm start at the same time.

Figure 5 Runtime increment on Cuffdiff and Cuffnorm
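
Because Cuffdiff and Cuffnorm start at the same time, the wall-clock time of that stage is set by the slower of the two. The sketch below makes this explicit. The absolute runtimes are hypothetical placeholders chosen only to be consistent with the 30-hour and 20-hour reductions reported above; they are not measured values from Figure 5.

```python
def stage_wall_time(runtimes_hours):
    """Wall-clock time of steps that start together: the slowest one wins."""
    return max(runtimes_hours.values())


# Hypothetical runtimes in hours, not taken from the article's figures.
without_cuffquant = {"cuffdiff": 50.0, "cuffnorm": 35.0}
with_cuffquant = {"cuffdiff": 20.0, "cuffnorm": 15.0}  # 30 h and 20 h faster

visible_gain = stage_wall_time(without_cuffquant) - stage_wall_time(with_cuffquant)
```

With these placeholder numbers the visible gain is the 30 hours saved on Cuffdiff; the 20 hours saved on Cuffnorm stay hidden behind the longer Cuffdiff run.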
Conclusion
The throughput test results show that eight PowerEdge C6420 nodes with the Lustre storage can process roughly 2.7 Billion Fragments from 64 samples with ~50 million paired reads each (25 MF) through the Tuxedo pipeline illustrated in Figure 1. Because the Tuxedo pipeline is relatively fast compared with other popular pipelines, it is hard to generalize these results or use them to size an HPC system accurately. However, the results can help make a rough estimate of the size of an HPC system.
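
For rough sizing, the Billion Fragments/Day metric used in the throughput test can be computed from the sample count, the fragments per sample, and the total pipeline runtime. The sketch below shows the arithmetic; the 24-hour runtime in the example is a hypothetical placeholder, not a measured result from this study.

```python
def billion_fragments_per_day(n_samples, fragments_per_sample, runtime_hours):
    """Throughput of one pipeline run, in billions of fragments per day."""
    total_fragments = n_samples * fragments_per_sample
    return total_fragments / 1e9 * 24.0 / runtime_hours


# 64 samples at 25 million fragments each, with an assumed 24-hour runtime.
example = billion_fragments_per_day(64, 25e6, runtime_hours=24.0)
```

Substituting a measured runtime for the placeholder gives a figure that can be compared across hardware generations, which is how a rough estimate of cluster size could be approached.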

Resources
1. RNA-Seq Differential Gene Expression: Basic Tutorial. [Online] https://melbournebioinformatics.github.io/MelBioInf_docs/tutorials/rna_seq_dge_basic/rna_seq_basic_tuxedo/.
2. RNA-Seq pipeline benchmark with Dell EMC Ready Bundle for HPC Life Sciences. [Online] https://downloads.dell.com/manuals/all-products/esuprt_software/esuprt_it_ops_datcentr_mgmt/high-computing-solution-resources_white-papers86_en-us.pdf.
3. Dell EMC Ready Solution for HPC Lustre Storage. [Link dead as of 07/2024]

Note: The samples used for Figure 3 were randomly selected from a pool of samples without any meaningful associations among them.

Affected Products

ME Series, OEMR ME40XX and ME4XX, Dell EMC PowerVault ME4012, Dell EMC PowerVault ME4024, Dell EMC PowerVault ME4084, Dell EMC PowerVault ME412 Expansion, Dell EMC PowerVault ME424 Expansion, Dell EMC PowerVault ME484
