ResearchHub | Open Science Community

Challenges and recommendations to improve installability and archival stability of omics computational tools

Serghei Mangul et al., Oct 25, 2018

Abstract: Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through URLs published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all due to problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.

Molecular Biology

Software

Systematic evaluation of transcriptomics-based deconvolution methods and references using thousands of clinical samples

Brian Nadel et al., Mar 10, 2021

Abstract: Estimating cell type composition of blood and tissue samples is a biological challenge relevant in both laboratory studies and clinical care. In recent years, a number of computational tools have been developed to estimate cell type abundance using gene expression data. While these tools use a variety of approaches, they all leverage expression profiles from purified cell types to evaluate the cell type composition within samples. In this study, we compare ten deconvolution tools and evaluate their performance while using each of eleven separate reference profiles. Specifically, we have run deconvolution tools on over 4,000 samples with known cell type proportions, spanning both immune and stromal cell types. Twelve of these represent in vitro synthetic mixtures and 300 represent in silico synthetic mixtures prepared using single cell data. A final 3,728 clinical samples have been collected from the Framingham Cohort, for which cell populations have been quantified using electrical impedance cell counting. When tools are applied to the Framingham dataset, the tool EPIC produces the highest correlation while GEDIT produces the lowest error. The best tool for other datasets is varied, but CIBERSORT and GEDIT most consistently produce accurate results. In terms of reference choice, we find that the Human Primary Cell Atlas (HPCA) and references published by the EPIC authors produce accurate results for the largest number of tools and datasets. When applying deconvolution to blood samples, the leukocyte reference matrix LM22 is also a suitable choice, usually (but not always) outperforming HPCA and EPIC. Running time varies substantially across tools. For as many as 5052 samples, SaVanT and dtangle reliably finish in under one minute, while slower tools may require up to two hours. However, when using custom references, CIBERSORT can run very slowly, taking over 24 hours to complete for large datasets. We conclude that combining the best tools with optimal reference datasets can provide significant gains in accuracy when carrying out deconvolution tasks.
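The abstract reports that one tool can have the highest correlation while another has the lowest error. A minimal sketch of why these two metrics can disagree is given below. It assumes Pearson correlation and root-mean-square error as the metrics (the abstract does not name the exact measures), and the cell type fractions and the names `tool_a` and `tool_b` are made up for illustration, not taken from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def rmse(x, y):
    """Root-mean-square error between two equal-length lists of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# Hypothetical fractions of one cell type across four samples.
truth = [0.10, 0.20, 0.30, 0.40]
# Hypothetical tool A: tracks the trend perfectly but is offset by 0.2.
tool_a = [0.30, 0.40, 0.50, 0.60]
# Hypothetical tool B: close to the truth, with small unsystematic noise.
tool_b = [0.12, 0.17, 0.33, 0.38]

for name, pred in (("tool_a", tool_a), ("tool_b", tool_b)):
    print(name, "r =", round(pearson(truth, pred), 3),
          "rmse =", round(rmse(truth, pred), 3))
```

In this toy case the offset predictions correlate better with the truth but carry the larger error, which is one way a correlation ranking and an error ranking can pick different tools.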

Genetics

Artificial Intelligence

Benchmarking of computational error-correction methods for next-generation sequencing data

Keith Mitchell et al., May 20, 2019

Background: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error-correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.

Results: In this paper, we evaluate the ability of error-correction algorithms to fix errors across different types of datasets that contain various levels of heterogeneity. We highlight the advantages and limitations of computational error correction techniques across different domains of biology, including immunogenomics and virology. To demonstrate the efficacy of our technique, we apply the UMI-based high-fidelity sequencing protocol to eliminate sequencing errors from both simulated data and the raw reads. We then perform a realistic evaluation of error correction methods.

Conclusions: In terms of accuracy, we find that method performance varies substantially across different types of datasets with no single method performing best on all types of examined data. Finally, we also identify the techniques that offer a good balance between precision and sensitivity.
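The balance between precision and sensitivity mentioned above can be illustrated with a small sketch. It assumes a per-base definition (a true positive is an erroneous base that was fixed, a false positive is a correct base that was altered, a false negative is an error left unfixed) and equal-length reads with no insertions or deletions. The abstract does not spell out its definitions, so the function names and sequences below are hypothetical.

```python
def count_outcomes(true_seq, raw_read, corrected_read):
    """Count per-base outcomes of error correction (substitutions only)."""
    tp = fp = fn = 0
    for t, r, c in zip(true_seq, raw_read, corrected_read):
        if r != t and c == t:
            tp += 1  # sequencing error that was fixed
        elif r == t and c != t:
            fp += 1  # correct base that the tool wrongly changed
        elif r != t and c != t:
            fn += 1  # sequencing error that is still wrong after correction
    return tp, fp, fn

def precision(tp, fp):
    """Share of the tool's effective changes that were right."""
    return tp / (tp + fp) if (tp + fp) else float("nan")

def sensitivity(tp, fn):
    """Share of the true errors that the tool fixed."""
    return tp / (tp + fn) if (tp + fn) else float("nan")

# Hypothetical example: the raw read has errors at positions 3 and 7.
true_seq = "ACGTACGT"
raw_read = "ACGAACGA"
corrected = "ACGTTCGA"  # fixes position 3, breaks position 4, misses position 7

tp, fp, fn = count_outcomes(true_seq, raw_read, corrected)
print(tp, fp, fn, precision(tp, fp), sensitivity(tp, fn))
```

A tool that changes many bases tends to gain sensitivity at the cost of precision, and a conservative one does the reverse, which is why the two are reported together.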

Genetics

Molecular Biology

PUMA: A tool for processing 16S rRNA taxonomy data for analysis and visualization

Keith Mitchell et al., Nov 29, 2018

Microbial community profiling and functional inference via 16S rRNA analysis is quickly expanding across various areas of microbiology due to improvements in technology. There are numerous platforms for producing 16S rRNA taxonomic data, which often vary in file and sequence formatting, creating a common barrier in microbiome studies. Additionally, many of the methods for analyzing and visualizing this sequencing data each require their own specific formatting. As a result, efficient and reproducible comparative analysis of taxonomic data and corresponding metadata in multiple programs remains a challenge in the investigation of microbial communities. PUMA, the Program for Unifying Microbiome Analysis, alleviates this problem in microbiome studies by allowing users to take advantage of numerous 16S rRNA taxonomic identification platforms and analysis tools in an efficient manner. PUMA accepts sequencing results from several taxonomic identification platforms and then automates configuration of data and file types for analysis and visualization via many popular tools. The protocol accomplishes this by producing a variety of properly configured, annotated, and altered files for both analysis and visualization of taxonomic community profiles and inferred functional profiles. PUMA provides an easy and flexible interface that accommodates a variety of users and produces all files needed for all-inclusive analysis of targeted amplicon sequencing studies. PUMA is an unprecedented open-source solution for unifying multiple microbiome analysis software tools and uses an adaptable implementation with the potential to improve and consolidate the state of microbiome research.

Genetics

Ecology
