The Falcon Accelerated Genomics Pipeline with a single Intel FPGA Programmable Acceleration Card can process a whole human genome at 50x coverage in less than 3 hours using the Alternative Variant Calling Pipeline.
Overview: Market Challenge and the Falcon Solution
Precision medicine, genomics, and epigenetics are using genomic sequencing to conduct research, improve diagnosis, develop pharmaceuticals, increase the quality of care for healthcare providers, and optimize crop production. For life sciences, genome analysis is now a key application, due in part to the large cost reduction of data collection from advances in next-generation sequencing (NGS). In addition to increased data collection, there has also been significant growth in the range of genomic applications used across universities, genomic research centers, pharmaceutical companies, and healthcare organizations.
The amount of genome data doubles every seven months (1). Consequently, processing data efficiently and cost-effectively has become critical. The computational power of processor-only solutions is not scaling fast enough to keep up with genomic data growth, which has led to the need for hardware acceleration. Accelerators such as FPGAs are becoming pivotal in matching the computational demands of this genomic data explosion. Compared to other hardware-accelerated solutions, the Falcon Accelerated Genomics Pipeline (FAGP) offers flexibility, high throughput, and a lower cost per sample.
What Is an FPGA? The Intel PAC Offering and Its Advantages
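To make the seven-month doubling figure concrete, the short sketch below converts a doubling period into a growth factor over a longer horizon. The function name and the chosen horizons are illustrative assumptions; only the seven-month period comes from the text above.

```python
# Growth implied by a fixed doubling period.
# DOUBLING_PERIOD_MONTHS comes from the text; everything else is illustrative.
DOUBLING_PERIOD_MONTHS = 7


def growth_factor(months: float, doubling_period: float = DOUBLING_PERIOD_MONTHS) -> float:
    """Return the multiplicative growth in data volume after `months`."""
    if doubling_period <= 0:
        raise ValueError("doubling_period must be positive")
    return 2 ** (months / doubling_period)


if __name__ == "__main__":
    for years in (1, 2, 5):
        print(f"{years} year(s): about {growth_factor(12 * years):.1f}x more data")
```

At this rate the data volume grows by a factor of roughly 3.3 each year, which is the gap that processor-only scaling would have to close.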
FPGAs are silicon devices that can be dynamically reprogrammed with a data path that exactly matches your workloads, such as Genomic Sequencing, Data Analytics, or Compression as illustrated in Figure 1. This versatility enables the provisioning of faster processing, more power-efficient computation, and lower latency service – lowering your total cost of ownership and maximizing compute capacity within the power, space, and cooling constraints of your data centers.
Traditionally, FPGAs require deep domain expertise to program. To simplify the development flow and enable rapid deployment across the data center, Intel offers an Acceleration Platform that includes PCI Express* (PCIe*)-based Intel FPGA Programmable Acceleration Cards (Intel FPGA PAC) and the Intel® Acceleration Stack for Intel Xeon® CPU with FPGAs. These Intel platforms are qualified, validated, and deployed through Dell EMC. Together with ecosystem partners like Falcon Computing, the Intel Acceleration Platform offers a reliable, ready-to-go solution in which the underlying hardware is transparent to the user.
Figure 1
Improved accuracy and speed on standard GATK pipeline
The Genome Analysis Toolkit (GATK) is the gold standard for genomic data processing accepted by the genomics community (2). However, its Best Practice Workflow (BPW) is known to be computationally slow for large samples such as whole-genome sequencing (WGS) data. To address this issue, Falcon Computing Solutions has developed a flexible software package of tools that follows the BPW and can be easily deployed on multiple platforms and architectures. It runs up to 15x faster than CPU-based GATK pipelines (3).
Falcon Solution Details:
FAGP provides an end-to-end solution to cost-effectively analyze genomic data using the GATK pipeline with high performance, accuracy, and reproducibility. The solution delivers up to 15x speedup with the same accuracy as GATK (3). This means an analysis that typically takes 50 to 60 hours can be conducted in under 4 hours (3). FAGP provides exceptional levels of acceleration and accuracy in conjunction with high-performance, reliable Intel Arria 10 FPGAs and Intel® Xeon® processors.
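The relationship between the quoted speedup and the quoted runtimes can be checked with simple arithmetic. The sketch below is a back-of-the-envelope check, not a benchmark; the function name is an assumption, and the inputs (50 to 60 hours, 15x) are the figures stated above.

```python
def accelerated_hours(baseline_hours: float, speedup: float) -> float:
    """Return the runtime after applying a uniform speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_hours / speedup


if __name__ == "__main__":
    # Baseline range and speedup are taken from the paragraph above.
    for baseline in (50, 60):
        hours = accelerated_hours(baseline, 15)
        print(f"{baseline} h baseline at 15x -> {hours:.2f} h")
```

A 50-hour run at 15x lands near 3.3 hours and a 60-hour run at exactly 4 hours, so the "under 4 hours" figure corresponds to the full 15x speedup on the lower end of the baseline range.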
FAGP follows the GATK BPW. It accelerates many components of the pipeline, from alignment (BWA) to variant calling (HaplotypeCaller) (4). In addition to the accelerated BWA, it includes an accelerated version of the aligner Minimap2, which is part of the Alternate Genomic Pipeline from Falcon (5). The alternate pipeline provides an even faster solution: it can complete 50x whole-genome sequencing analysis within 3 hours. Both aligners can produce sorted reads with duplicates marked, without the need for additional tools.
FAGP achieves high performance and throughput by accelerating compute-intensive steps in the GATK pipeline on Intel FPGA PAC platforms. This differs from scale-out solutions, which achieve high throughput by adding more CPU resources. Such scale-out solutions have limited ability to reduce costs or per-sample latency.
Another advantage of the Falcon solution is that, like GATK, it is an open pipeline. Users can control individual steps in the pipeline, and intermediate data are saved and remain accessible.
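For orientation, the stages named above (alignment, sorting, duplicate marking, variant calling) can be sketched with the standard open-source tools. This is a simplified illustration of the CPU-based workflow that FAGP accelerates, not Falcon's own command-line interface, which is not described here. The file names, the sample name, and the thread count are hypothetical, and base quality score recalibration is omitted for brevity. The function only builds the command lines; it does not run them.

```python
import shlex
from typing import List


def build_pipeline_commands(reference: str, reads_1: str, reads_2: str,
                            sample: str, threads: int = 16) -> List[List[str]]:
    """Build argument lists for a simplified align-to-call workflow.

    All output file names are hypothetical and derived from `sample`.
    """
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    metrics = f"{sample}.dup_metrics.txt"
    vcf = f"{sample}.vcf.gz"
    # bwa expects literal "\t" separators in the read-group string.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # 1. Align paired-end reads to the reference.
        ["bwa", "mem", "-t", str(threads), "-R", read_group,
         "-o", sam, reference, reads_1, reads_2],
        # 2. Coordinate-sort the alignments.
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, sam],
        # 3. Mark duplicate reads.
        ["gatk", "MarkDuplicates", "-I", sorted_bam, "-O", dedup_bam, "-M", metrics],
        # 4. Index the BAM so the variant caller can read it.
        ["samtools", "index", dedup_bam],
        # 5. Call variants.
        ["gatk", "HaplotypeCaller", "-R", reference, "-I", dedup_bam, "-O", vcf],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline_commands("ref.fa", "sample_R1.fastq.gz",
                                       "sample_R2.fastq.gz", "sample"):
        print(shlex.join(cmd))
```

In this CPU-based form, sorting and duplicate marking are separate steps with their own intermediate files. The text above notes that the Falcon aligners emit sorted, duplicate-marked reads directly, which would fold steps 1 through 3 into one.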
Table 1 Advantages of Falcon Accelerated Genomics Pipeline
|Falcon Accelerated Genomics Pipeline (FAGP) Advantages
||Support for multiple GATK versions, including 4.0
||Run five whole genomes or 24 whole exomes in one day
||< 3-hour turnaround time on-prem for WGS (50X)
||Execute the GATK Best Practices pipeline up to 15x faster
||No need to rewrite working algorithms
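The daily-throughput and turnaround figures in Table 1 are related by simple division. The sketch below assumes samples are processed back to back on a single system with no idle time between runs, which is an assumption made for illustration and not something the table states; the function names are also illustrative.

```python
HOURS_PER_DAY = 24


def hours_per_sample(samples_per_day: int) -> float:
    """Average wall-clock budget per sample for a given daily throughput."""
    if samples_per_day <= 0:
        raise ValueError("samples_per_day must be positive")
    return HOURS_PER_DAY / samples_per_day


def max_samples_per_day(turnaround_hours: float) -> int:
    """Whole samples that fit in one day when run sequentially."""
    if turnaround_hours <= 0:
        raise ValueError("turnaround_hours must be positive")
    return int(HOURS_PER_DAY // turnaround_hours)


if __name__ == "__main__":
    print(f"5 genomes/day  -> {hours_per_sample(5):.1f} h per genome")
    print(f"24 exomes/day  -> {hours_per_sample(24):.1f} h per exome")
    print(f"3 h turnaround -> {max_samples_per_day(3)} genomes per day")
```

Five genomes per day corresponds to a budget of 4.8 hours per genome, while a turnaround under 3 hours would allow at least eight sequential 50x runs per day under the same assumption.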
Dell Hardware Configuration
Table 2 Dell EMC PowerEdge R740xd as a testbed
|Dell EMC PowerEdge R740xd
||2x Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
||384GB @ 32x 16GB RDIMM, 2666MT/s, Dual Rank
||4x 1.2TB 10K RPM SAS 12Gbps 512n 2.5in Hot-plug Hard Drive in RAID 0
||2x INTEL SSDPEDMD020T4 DC P3700 1.8T in software RAID 0
||Intel Programmable Acceleration Card with Intel Arria® 10 GX FPGA (Intel Acceleration Stack 1.1)
||Red Hat Enterprise Linux Server release 7.4 (Maipo) (3.10.0-693.el7.x86_64)
In our benchmark testing, we used whole human genome sequencing data at 10x, 30x, and 50x depth of coverage.
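Depth of coverage relates the amount of sequenced data to the genome size through the standard estimate C = N × L / G, where N is the number of reads, L the read length, and G the genome size. The sketch below inverts this to estimate how many reads each tested depth implies. The read length (150 bp) and genome size (3.1 billion bases) are assumed values chosen for illustration; the benchmark's actual read lengths and read counts are not given here.

```python
import math

# Assumed values for illustration only; not taken from the benchmark data.
ASSUMED_READ_LENGTH_BP = 150
ASSUMED_GENOME_SIZE_BP = 3_100_000_000


def reads_for_coverage(coverage: float,
                       read_length: int = ASSUMED_READ_LENGTH_BP,
                       genome_size: int = ASSUMED_GENOME_SIZE_BP) -> int:
    """Estimate the number of reads N needed so that N * L / G >= coverage."""
    if read_length <= 0 or genome_size <= 0:
        raise ValueError("read_length and genome_size must be positive")
    return math.ceil(coverage * genome_size / read_length)


if __name__ == "__main__":
    for depth in (10, 30, 50):
        reads = reads_for_coverage(depth)
        print(f"{depth}x coverage -> about {reads / 1e6:.0f} million reads")
```

Because the read count scales linearly with depth, a 50x sample carries five times the input data of a 10x sample, which is why runtimes are reported per depth of coverage.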
Table 3 Tested whole-genome sequencing data
Table 4 summarizes the time taken to complete the GATK 4.0 Best Practices Pipeline over three test cycles using FAGP and the Intel FPGA PAC housed in the Dell EMC PowerEdge R740xd server.
Table 4 Total runtimes from Best Practice Pipeline version 2.1.1
||Depth of Coverage
Table 5 summarizes the time (in minutes) taken to complete the alternative pipeline, Falcon Germline, over three test cycles using FAGP and the Intel FPGA PAC housed in the Dell EMC PowerEdge R740xd server.
Table 5 Total runtimes from Alternative Variant Calling Pipeline
||Depth of Coverage
Summary of Falcon Genomic Solution
The Falcon Accelerated Genomics Pipeline offers high throughput and a low cost per sample. Together with the Intel FPGA Programmable Acceleration Card and a certified Dell server, FAGP provides a complete solution that can be easily adopted for your genomic sequencing applications.
"At TCGB, we provide genome sequencing services to our nationwide clients. The Falcon Accelerated Genomics Pipeline* has enabled us to cut our turnaround from days into few hours while maintaining the accuracy of industry-standard GATK pipelines."
— Dr Xinmin Li, Director of Technology Center for Genomics & Bioinformatics (TCGB) UCLA
1. Sequencing the genome creates so much data we don’t know what to do with it. [Online] https://www.washingtonpost.com/news/speaking-of-science/wp/2015/07/07/sequencing-the-genome-creates-so-much-data-we-dont-know-what-to-do-with-it.
2. GATK. [Online] https://software.broadinstitute.org/gatk/
3. Accelerated Genomics. [Online] http://www.falconcomputing.com/falcon-accelerated-genomics-pipeline
4. BWA. [Online] http://bio-bwa.sourceforge.net/bwa.shtml
5. Minimap2. [Online] https://github.com/lh3/minimap2