ResearchHub | Open Science Community

Accelerating genomic workflows using NVIDIA Parabricks

Kyle O’Connell et al., Jul 21, 2022

ABSTRACT

Background: As genome sequencing becomes a more integral part of scientific research, government policy, and personalized medicine, the primary challenge for researchers is shifting from generating raw data to analyzing these vast datasets. Although much work has been done to reduce compute times using various configurations of traditional CPU computing infrastructure, Graphics Processing Units (GPUs) offer the opportunity to accelerate genomic workflows by several orders of magnitude. Here we benchmark one GPU-accelerated software suite, NVIDIA Parabricks, on Amazon Web Services (AWS), Google Cloud Platform (GCP), and an NVIDIA DGX cluster. We benchmarked six variant calling pipelines, including two germline callers (HaplotypeCaller and DeepVariant) and four somatic callers (Mutect2, Muse, LoFreq, SomaticSniper).

Results: For germline callers, we achieved up to 65x acceleration, bringing HaplotypeCaller runtime down from 36 hours to 33 minutes on AWS, 35 minutes on GCP, and 24 minutes on the NVIDIA DGX. Somatic callers showed more variable runtimes across GPU counts and computing platforms. On cloud platforms, GPU-accelerated germline callers cost less than CPU runs, whereas somatic callers were often more expensive than CPU runs because their GPU acceleration was not sufficient to offset the higher cost of GPU instances.

Conclusions: Germline variant callers scaled with the number of GPUs across platforms, whereas for somatic variant callers the number of GPUs that gave the fastest runtime varied. This suggests that the somatic workflows are less GPU optimized and should be benchmarked on the platform of choice before being deployed at production scale. Our study demonstrates that GPUs can greatly accelerate genomic workflows, bringing urgent societal advances in biosurveillance and personalized medicine within closer reach.
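The cost argument in the Results can be made concrete with a little arithmetic. The following is a minimal sketch, not the authors' analysis: it assumes a run's cost is simply the hourly instance price times the runtime, so a GPU run is cheaper only when its speedup exceeds the GPU-to-CPU price ratio. The hourly prices and the somatic-caller speedup below are hypothetical placeholders; only the 36-hour and 33-minute runtimes come from the abstract.

```python
def speedup(cpu_minutes: float, gpu_minutes: float) -> float:
    """Ratio of CPU runtime to GPU runtime."""
    return cpu_minutes / gpu_minutes


def run_cost(price_per_hour: float, runtime_minutes: float) -> float:
    """Cost of one run, assuming billing is price times runtime."""
    return price_per_hour * runtime_minutes / 60.0


def gpu_is_cheaper(cpu_price: float, gpu_price: float, speedup_factor: float) -> bool:
    """A GPU run costs less only if the speedup exceeds the price ratio."""
    return speedup_factor > gpu_price / cpu_price


# Runtimes reported in the abstract for HaplotypeCaller on AWS.
cpu_minutes = 36 * 60
aws_gpu_minutes = 33
aws_speedup = speedup(cpu_minutes, aws_gpu_minutes)

# Hypothetical hourly prices (assumptions, not taken from the paper).
CPU_PRICE = 2.0
GPU_PRICE = 30.0

# Hypothetical speedup for a poorly accelerated somatic caller.
somatic_speedup = 5.0

print(f"AWS HaplotypeCaller speedup: {aws_speedup:.1f}x")
print("Germline GPU run cheaper:", gpu_is_cheaper(CPU_PRICE, GPU_PRICE, aws_speedup))
print("Somatic GPU run cheaper:", gpu_is_cheaper(CPU_PRICE, GPU_PRICE, somatic_speedup))
```

With these placeholder prices the break-even speedup is 15x, so the roughly 65x germline run comes out cheaper while a 5x somatic run does not, which matches the pattern the abstract describes.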

FAIRshake: toolkit to evaluate the findability, accessibility, interoperability, and reusability of research digital resources

Daniel Clarke et al., Jun 3, 2019

As the research community produces more datasets, tools, workflows, APIs, and other digital resources, it is becoming increasingly difficult to harmonize and organize these efforts so that the resources can be used together effectively. The Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge by making digital resources follow common standards and best practices, so that they become more integrated and organized. Making digital resources more FAIR first requires a way to measure what it means to be FAIR. Resources, communities, and stakeholders are diverse and have different goals and use cases, which makes assessing FAIRness particularly challenging. To begin resolving this challenge, the FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics, paired with manual, semi-automated, and fully automated FAIR assessment capabilities. The FAIRshake toolkit contains a database that lists registered digital resources with their associated metrics, rubrics, and assessments. The toolkit also has a browser extension and a bookmarklet that enable viewing and submitting assessments from any website. The FAIR assessment results are visualized as an insignia that can be viewed on the FAIRshake website or embedded within hosting websites. Using FAIRshake, a variety of bioinformatics tools, datasets listed on dbGaP, APIs registered in SmartAPI, workflows in Dockstore, and other biomedical digital resources were manually and automatically assessed for FAIRness. In each case, the assessments revealed room for improvement, which prompted enhancements that significantly upgraded the FAIRness scores of several digital resources.
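The relationship between metrics, rubrics, and assessments described above can be illustrated with a small data model. This is only a sketch under stated assumptions: the class names, fields, example metrics, and the averaging rule are hypothetical and are not FAIRshake's actual schema or API. It assumes a rubric is a named set of metrics, an assessment answers each metric with a value between 0 and 1, and the overall score is the mean of the answered metrics.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Metric:
    """One question about a resource (hypothetical structure)."""
    name: str
    principle: str  # "F", "A", "I", or "R"


@dataclass(frozen=True)
class Rubric:
    """A community-chosen collection of metrics (hypothetical structure)."""
    name: str
    metrics: Tuple[Metric, ...]


def score(rubric: Rubric, answers: Dict[str, float]) -> Optional[float]:
    """Mean answer over the rubric's answered metrics, or None if none are answered."""
    answered = [answers[m.name] for m in rubric.metrics if m.name in answers]
    if not answered:
        return None
    return sum(answered) / len(answered)


# Illustrative rubric; the metric names are made up for this example.
rubric = Rubric(
    name="example tool rubric",
    metrics=(
        Metric("has persistent identifier", "F"),
        Metric("is openly downloadable", "A"),
        Metric("uses a standard data format", "I"),
        Metric("declares a license", "R"),
    ),
)

# One assessment of one resource: 1.0 means yes, 0.0 means no, missing means unanswered.
answers = {
    "has persistent identifier": 1.0,
    "is openly downloadable": 1.0,
    "declares a license": 0.0,
}

fairness = score(rubric, answers)
print(f"FAIRness score: {fairness:.2f}")
```

Keeping rubrics separate from answers means different communities can score the same resource against different rubrics, which is the flexibility the abstract attributes to community-driven metrics.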