Simplifying Big Data for Life Sciences Organizations

Isilon recently participated in a Bio-IT World Web Symposium with two life sciences organizations: BioTeam, a small, nine-person shop, and the Wellcome Trust Sanger Institute, which has over 700 employees. Both offered tips on how best to organize, store and share next-gen sequencing (NGS) data, which continues to grow at a fast clip.

BioTeam’s Chris Dwan suggests that, if you understand your data flow, you might not need backups, and that alternative solutions can be implemented daily or as needed. He says it’s more cost-effective to build your own compute farm but to buy data storage. Additional “decent” (not “best”) practices from Dwan include:

  • Retain descriptive, structured filenames and paths for instrument data.
  • Insist on annual data retention audits conducted by scientific staff.
  • Keep open lines of communication between IT and scientific staff.

Guy Coates of the Wellcome Trust Sanger Institute said that no matter the size of your organization, it’s easy to be overwhelmed by information and data storage needs, so it’s critical to understand your requirements. With storage and compute requirements doubling every 12 months, he advises being aggressive about throwing away data that is no longer needed. For example, the Institute keeps only BAM files, not raw images, SRF or FASTQ, which helps to create an automated analysis pipeline. Coates also suggests developing IT and sequencing budgets side by side to help avoid surprises.
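
To get a feel for what doubling every 12 months means for capacity planning, here is a minimal Python sketch. The 100 TB starting point and the function name are hypothetical, chosen only for illustration; the only figure taken from the talk is the 12-month doubling period, and the sketch assumes growth stays steady at that rate.

```python
def projected_storage_tb(current_tb: float, months: int,
                         doubling_months: int = 12) -> float:
    """Project storage needs, assuming capacity doubles every `doubling_months`."""
    return current_tb * 2 ** (months / doubling_months)


# Hypothetical starting point: 100 TB in use today.
for year in range(4):
    needed = projected_storage_tb(100, year * 12)
    print(f"Year {year}: {needed:.0f} TB")
```

Under these assumptions, a 100 TB footprint reaches 800 TB within three years, which is one way to see why discarding data that is no longer needed matters as much as buying more storage.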

To watch the 90-minute presentation, download the Bio-IT World Web Symposium “Simplified Big Data for Life Sciences” online.

About the Author: Nick Kirsch