Next-generation genome sequencing technology has reached a point at which it is almost cost-effective to sequence all patients. Biologists, doctors, and researchers face an oncoming deluge of genomic data, whose processing requires new and scalable bioinformatics architectures and systems. Processing raw genetic sequence data is computationally expensive, and datasets are large. Current software systems can require many hours to process a single genome and generally run only on a single computer.
To address these challenges, we initially built Persona, a cluster-scale, high-throughput bioinformatics framework. Persona currently supports paired-read alignment, sorting, and duplicate marking using well-known algorithms and techniques. Persona can significantly reduce end-to-end processing times for bioinformatics computations. In a case study on sequence alignment, Persona sustains 1.353 gigabases aligned per second with 101 base pair reads on a 32-node cluster and can align a full genome in ~16.7 seconds using the SNAP algorithm. Our results demonstrate that: (1) alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads; (2) on a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster; (3) with AGD, a 7-node COTS network storage system can serve up to 60 alignment nodes; (4) server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.
Protein clustering is an essential technique in the field of proteomes; the analysis of the protein sequences contained in an organism’s genome. Clustering reduces the number of comparisons needed to find similar pairs in a set of elements, such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster so that similarities can be identified by all-to-all comparison in each cluster rather than on the full set. ClusterMerge is a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach.
We apply ClusterMerge to the problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of similar pairs found by a full O(n2) comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604x on 768 cores (1400x faster than a comparable single-threaded clustering implementation), a strong scaling efficiency of 90%, and a weak scaling efficiency of nearly 100%.
Papers
- Parallel and Scalable Precise Clustering
Stuart Byma, Akash Dhasade, Adrian Altenhoff, Christophe Dessimoz, and James Larus.
2020 International Conference on Parallel Architectures and Compilation Techniques (PACT ’20), Georgia, October 2020. - IMPACT: Interval-based Multi-pass Proteomic Alignment with Constant Traceback
Sahand Kashani, Stuart Byma, James Larus
2nd HPCA Workshop on Accelerator Architectures in Computational Biology and Bioinformatics, Washington DC, February 2019. - Persona: A High-Performance Bioinformatics Framework
Stuart Byma; Sam Whitlock; Laura Flueratoru; Ethan Tseng; Christos Kozyrakis; Edouard Bugnion; James Larus
USENIX Annual Technical Conference (ATC) 2017, Santa Clara, California, USA, July 12-14, 2017