ResearchHub | Open Science Community

Critical Assessment of Metagenome Interpretation – a benchmark of computational metagenomics software

Alexander Sczyrba et al., Jan 9, 2017

Abstract: In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from ~700 newly sequenced microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performance, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions.

Genetics

Ecology

Computational Pan-Genomics: Status, Promises and Challenges

Tobias Marschall et al., Mar 12, 2016

Abstract: Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic datasets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this paper, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies, and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

Genetics

Artificial Intelligence

Critical Assessment of Metagenome Interpretation - the second round of challenges

Fernando Meyer et al., Jul 12, 2021

Abstract: Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the community-driven initiative for the Critical Assessment of Metagenome Interpretation (CAMI). In its second challenge, CAMI engaged the community to assess their methods on realistic and complex metagenomic datasets with long and short reads, created from ∼1,700 novel and known microbial genomes, as well as ∼600 novel plasmids and viruses. Altogether 5,002 results by 76 program versions were analyzed, representing a 22x increase in results. Substantial improvements were seen in metagenome assembly, some due to using long-read data. The presence of related strains was still challenging for assembly and genome binning, as was assembly quality for the latter. Taxon profilers demonstrated a marked maturation, with taxon profilers and binners excelling at higher bacterial taxonomic ranks, but underperforming for viruses and archaea. Assessment of clinical pathogen detection techniques revealed a need to improve reproducibility. Analysis of program runtimes and memory usage identified highly efficient programs, including some top performers on other metrics. The CAMI II results identify current challenges, but also guide researchers in selecting methods for specific analyses.

Genetics

Ecology

Clustering de Novo by Gene of Long Reads from Transcriptomics Data

Camille Marchet et al., Jul 30, 2017

Abstract: Long-read sequencing currently provides sequences of several thousand base pairs. This makes it possible to obtain complete transcripts, which offers an unprecedented view of the cellular transcriptome. However, the literature lacks tools to cluster such data de novo, in particular for Oxford Nanopore Technologies reads, because of their inherently high error rate compared to short reads. Our goal is to process reads from whole transcriptome sequencing data accurately and without a reference genome in order to reliably group reads coming from the same gene. This de novo approach is therefore particularly suitable for non-model species, but can also serve as a useful pre-processing step to improve read mapping. Our contribution is twofold: a new algorithm adapted to clustering reads by gene, and a practical, freely accessible tool that scales to the complete processing of eukaryotic transcriptomes. We sequenced a mouse RNA sample using the MinION device and used this dataset to compare our solution to other algorithms used in the context of biological clustering. We demonstrate that it is better suited to transcriptomics long reads. When a reference is available and mapping is thus possible, we show that it stands as an alternative method that predicts complementary clusters.

Genetics

Artificial Intelligence

kmtricks: Efficient and flexible construction of Bloom filters for large sequencing data collections

Téo Lemane et al., Feb 17, 2021

Abstract: When indexing large collections of short-read sequencing data, a common operation that has now been implemented in several tools (Sequence Bloom Trees and variants, BIGSI, ...) is to construct a collection of Bloom filters, one per sample. Each Bloom filter is used to represent a set of k-mers which approximates the desired set of all the non-erroneous k-mers present in the sample. However, this approximation is imperfect, especially in the case of metagenomics data. Erroneous but abundant k-mers are wrongly included, and non-erroneous but low-abundance ones are wrongly discarded. We propose kmtricks, a novel approach for generating Bloom filters from terabase-sized collections of sequencing data. Our main contributions are (1) an efficient method for jointly counting k-mers across multiple samples, including a streamlined Bloom filter construction by directly counting, partitioning and sorting hashes instead of k-mers, which is approximately four times faster than state-of-the-art tools; and (2) a novel technique that takes advantage of joint counting to preserve low-abundance k-mers present in several samples, improving the recovery of non-erroneous k-mers. Our experiments highlight that this technique preserves around 8x more k-mers than the usual yet crude filtering of low-abundance k-mers in a large metagenomics dataset. Availability: https://github.com/tlemane/kmtricks Funding: The work was funded by IPL Inria Neuromarkers, ANR Inception (ANR-16-CONV-0005), ANR Prairie (ANR-19-P3IA-0001), ANR SeqDigger (ANR-19-CE45-0008).

Genetics

Molecular Biology

fimpera: drastic improvement of Approximate Membership Query data-structures with counts

Lucas Robidou et al., Jun 29, 2022

Abstract: Motivation: High-throughput sequencing technologies generate massive amounts of biological sequence datasets as costs fall. One of the current algorithmic challenges for exploiting these data on a global scale consists in providing efficient query engines on these petabyte-scale datasets. Most methods indexing those datasets rely on indexing words of fixed length k, called k-mers. Many applications, such as metagenomics, require the abundance of indexed k-mers as well as their simple presence or absence, but no method scales up to petabyte-scale datasets. This deficiency is primarily because storing abundance requires explicit storage of the k-mers in order to associate them with their counts. Using counting Approximate Membership Query (cAMQ) data structures, such as counting Bloom filters, provides a way to index large amounts of k-mers with their abundance, but at the expense of a noticeable false positive rate. Results: We propose a novel algorithm, called fimpera, that improves the performance of any cAMQ. Applied to counting Bloom filters, our proposed algorithm reduces the false positive rate by two orders of magnitude and improves the precision of the reported abundances. Alternatively, fimpera allows for the reduction of the size of a counting Bloom filter by two orders of magnitude while maintaining the same precision. fimpera does not introduce any memory overhead and may even reduce the query time. Availability: https://github.com/lrobidou/fimpera Supplementary information: Supplementary data are available at Bioinformatics online.

Artificial Intelligence

Computer Networks And Communications

metaVaR: introducing metavariant species models for reference-free metagenomic-based population genomics

Romuald Laso-Jadart et al., Jan 31, 2020

Abstract: Motivation: The availability of large metagenomic data offers great opportunities for the population genomic analysis of uncultured organisms, especially for small eukaryotes that represent an important part of the unexplored biosphere while playing a key ecological role. However, the majority of these species lack a reference genome or transcriptome, which constitutes a technical barrier for classical population genomic analyses. Results: We introduce the metavariant species (MVS) model, a representation of the species only by intra-species nucleotide polymorphism. We designed a method combining reference-free variant calling, multiple density-based clustering and maximum weighted independent set algorithms to cluster intra-species variants into MVSs directly from multisample metagenomic raw reads without a reference genome or read assembly. The frequencies of the MVS variants are then used to compute population genomic statistics such as F_ST in order to estimate genomic differentiation between populations and to identify loci under natural selection. MVS construction was tested on simulated and real metagenomic data. MVSs showed the required quality for robust population genomics and allowed an accurate estimation of genomic differentiation (ΔF_ST < 0.0001 and < 0.03 on simulated and real data respectively). Loci predicted under natural selection on real data were all found by MVSs. MVSs represent a new paradigm that may simplify and enhance holistic approaches for population genomics and evolution of microorganisms. Availability: The method was implemented in an R package, metaVaR: https://github.com/madoui/MetaVaR Contact: amadoui@genoscope.cns.fr

Genetics

Ecology

ELECTOR: Evaluator for long reads correction methods

Camille Marchet et al., Jan 7, 2019

The error rates of third-generation sequencing data have been capped above 5%, mainly containing insertions and deletions. Consequently, an increasing number of diverse long-read correction methods have been proposed. The quality of the correction has a large impact on downstream processes. Therefore, developing methods for evaluating error correction tools with precise and reliable statistics is a crucial need. These evaluation methods rely on costly alignments to evaluate the quality of the corrected reads. Thus, key features must allow the fast comparison of different tools, and scale to the increasing length of the long reads. Our tool, ELECTOR, evaluates long-read correction and is directly compatible with a wide range of error correction tools. As it is based on multiple sequence alignment, we introduce a new algorithmic strategy for alignment segmentation, which enables us to scale to large instances using reasonable resources. To our knowledge, we provide the only method that allows producing reproducible correction benchmarks on the latest ultra-long reads (longer than 100k bases). It is also faster than the current state-of-the-art on other datasets and provides a wider set of metrics to assess the read quality improvement after correction. ELECTOR is available on GitHub and Bioconda.

Philosophy

Artificial Intelligence

A de novo approach to disentangle partner identity and function in holobiont systems

Arnaud Meng et al., Nov 17, 2017

Background: The study of meta-transcriptomic datasets involving non-model organisms poses bioinformatic challenges. The production of chimeric sequences and our inability to distinguish the taxonomic origins of the sequences produced are inherent and recurrent difficulties in de novo assembly analyses. The study of holobiont transcriptomes shares similarities with meta-transcriptomics, and hence is also affected by the challenges mentioned above. Here we propose an innovative approach to tackle such difficulties, which was applied to the study of marine holobiont models as a proof of concept. Results: We considered three holobiont models, of which two transcriptomes were previously assembled and published, and a yet unpublished transcriptome, to analyze their raw reads and assign them to the host and/or to the symbiont(s) using Short Read Connector, a k-mer based similarity method. We were able to define four distinct categories of reads for each holobiont transcriptome: host reads, symbiont reads, shared reads and unassigned reads. The result of the independent assemblies for each category within a transcriptome led to a significant reduction of de novo assembled chimeras compared to classical assembly methods. Combining independent functional and taxonomic annotations of each partner's transcriptome is particularly convenient to explore the functional diversity of a holobiont. Finally, our strategy allowed us to propose new functional annotations for two well-studied holobionts and a first transcriptome from a planktonic Radiolaria-Dinophyta system forming a widespread symbiotic association for which our knowledge is limited. Conclusions: In contrast to classical assembly approaches, our bioinformatic strategy not only allows biologists to study host and symbiont data from a holobiont mixture separately, but also generates improved transcriptome assemblies. The use of Short Read Connector has proven to be an effective way to tackle meta-transcriptomic challenges to study holobiont systems composed of either well-studied or poorly characterized symbiotic lineages, such as the newly sequenced marine plankton Radiolaria-Dinophyta symbiosis, and ultimately expand our knowledge about these marine symbiotic associations.

Genetics

Ecology

Investigating Population-scale Allele Specific Expression in Wild Populations of Oithona similis (Cyclopoida, Claus 1866)

Romuald Laso-Jadart et al., Apr 4, 2019

Abstract: Allele-specific expression (ASE) is a widely studied molecular mechanism at cell, tissue and organism levels. Here, we extrapolated the concept of ASE to the population scale (psASE), aggregating ASEs detected at smaller scales. We developed a novel approach to detect psASE based on metagenomic and metatranscriptomic data of environmental samples containing communities of organisms. This approach, which measures the deviation between the frequency and the relative expression of biallelic loci, was applied to samples collected during the Tara Oceans expedition (2009-2013), in combination with new transcriptomes of Oithona similis, a widespread marine copepod. Among a total of 25,768 single nucleotide variants (SNVs) of O. similis, 587 (2.3%) were targeted by psASE in at least one population. The distribution of SNVs targeted by psASE in different populations is significantly shaped by population genomic differentiation (p-value = 9.3×10⁻⁹), supporting a partial genetic control of psASE. To investigate the link between evolution and psASE, loci under selection were compared to loci under psASE. A significant number of SNVs (0.6%) were targeted by both selection and psASE (p-values < 9.89×10⁻³), supporting the hypothesis that natural selection and ASE may lead to the same phenotype. Population-scale ASE offers new insights into the gene regulation control in populations and its link with natural selection.

Genetics

Artificial Intelligence