ResearchHub | Open Science Community

Genome sequence-based species delimitation with confidence intervals and improved distance functions

Jan Meier‐Kolthoff et al., Feb 21, 2013

Background: For the last 25 years, species delimitation in prokaryotes (Archaea and Bacteria) was to a large extent based on DNA-DNA hybridization (DDH), a tedious lab procedure designed in the early 1970s that served its purpose astonishingly well in the absence of deciphered genome sequences. With the rapid progress in genome sequencing, the time has come to directly use the now available and easy-to-generate genome sequences for the delimitation of species. GBDP (Genome Blast Distance Phylogeny) infers genome-to-genome distances between pairs of entirely or partially sequenced genomes, a digital, highly reliable estimator for the relatedness of genomes. Its application as an in-silico replacement for DDH was recently introduced. The main challenge in the implementation of such an application is to produce digital DDH values that mimic the wet-lab DDH values as closely as possible to ensure consistency in the prokaryotic species concept. Results: Correlation and regression analyses were used to determine the best-performing methods and the most influential parameters. GBDP was further enriched with a set of new features such as confidence intervals for intergenomic distances, obtained via resampling or via the statistical models for DDH prediction, and an additional family of distance functions. As in previous analyses, GBDP obtained the highest agreement with wet-lab DDH among all tested methods, but improved models led to a further increase in the accuracy of DDH prediction. Confidence intervals yielded stable results when inferred from the statistical models, whereas those obtained via resampling showed marked differences between the underlying distance functions. Conclusions: Despite the high accuracy of GBDP-based DDH prediction, inferences from limited empirical data are always associated with a certain degree of uncertainty. It is thus crucial to enrich in-silico DDH replacements with confidence-interval estimation, enabling the user to statistically evaluate the outcomes. Such methodological advancements, easily accessible through the web service at http://ggdc.dsmz.de, are crucial steps towards a consistent and truly genome sequence-based classification of microorganisms.

Genetics

Artificial Intelligence

Genome-Based Taxonomic Classification of the Phylum Actinobacteria

Imen Nouioui et al., Aug 22, 2018

The application of phylogenetic taxonomic procedures led to improvements in the classification of bacteria assigned to the phylum Actinobacteria but even so there remains a need to further clarify relationships within a taxon that encompasses organisms of agricultural, biotechnological, clinical and ecological importance. Classification of the morphologically diverse bacteria belonging to this large phylum based on a limited number of features has proved to be difficult, not least when taxonomic decisions rested heavily on interpretation of poorly resolved 16S rRNA gene trees. Here, draft genome sequences of a large collection of actinobacterial type strains were used to infer phylogenetic trees from genome-scale data using the principles drawn from phylogenetic systematics. The majority of taxa were found to be monophyletic but several orders, families and genera, as well as many species and a few subspecies, were shown to be in need of revision, leading to proposals for the recognition of 2 orders, 10 families and 17 genera, as well as the transfer of over 100 species to other genera. In addition, emended descriptions are given for many species, mainly involving the addition of data on genome size and DNA G+C content; the former can be considered a valuable taxonomic marker in actinobacterial systematics. Many of the incongruities detected when the results of the present study were compared with existing classifications had been recognised from 16S rRNA gene trees though whole-genome phylogenies proved to be much better resolved. The few significant incongruities found between 16S/23S rRNA and whole genome trees underline the pitfalls inherent in phylogenies based upon single gene sequences. Similarly good congruence was found between the discontinuous distribution of phenotypic properties and taxa delineated in the phylogenetic trees though diverse non-monophyletic taxa appeared to be based on the use of plesiomorphic character states as diagnostic features.

Genetics

Ecology

TYGS is an automated high-throughput platform for state-of-the-art genome-based taxonomy

Jan Meier‐Kolthoff et al., May 16, 2019

Microbial taxonomy is increasingly influenced by genome-based computational methods. Yet such analyses can be complex and require expert knowledge. Here we introduce TYGS, the Type (Strain) Genome Server, a user-friendly high-throughput web server for genome-based prokaryote taxonomy, connected to a large, continuously growing database of genomic, taxonomic and nomenclatural information. It infers genome-scale phylogenies and state-of-the-art estimates for species and subspecies boundaries from user-defined and automatically determined closest type genome sequences. TYGS also provides comprehensive access to nomenclature, synonymy and associated taxonomic literature. Clinically important examples demonstrate how TYGS can yield new insights into microbial classification, such as evidence for a species-level separation of previously proposed subspecies of Salmonella enterica. TYGS is an integrated approach for the classification of microbes that unlocks novel scientific approaches to microbiologists worldwide and is particularly helpful for the rapidly expanding field of genome-based taxonomic descriptions of new genera, species or subspecies.

Genetics

Ecology

Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison

Alexander Auch et al., Jan 28, 2010

The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.

Genetics

Ecology

Analysis of 1,000 Type-Strain Genomes Improves Taxonomic Classification of Bacteroidetes

Marina García-López et al., Sep 23, 2019

Although considerable progress has been made in recent years regarding the classification of bacteria assigned to the phylum Bacteroidetes, there remains a need to further clarify taxonomic relationships within a diverse assemblage that includes organisms of clinical, piscicultural, and ecological importance. Bacteroidetes classification has proved to be difficult, not least when taxonomic decisions rested heavily on interpretation of poorly resolved 16S rRNA gene trees and a limited number of phenotypic features. Here, draft genome sequences of a greatly enlarged collection of genomes of more than 1000 Bacteroidetes and outgroup type strains were used to infer phylogenetic trees from genome-scale data using the principles drawn from phylogenetic systematics. The majority of taxa were found to be monophyletic but several orders, families and genera, including taxa proposed long ago such as Bacteroides, Cytophaga and Flavobacterium but also quite recent taxa, as well as a few species were shown to be in need of revision. Accordingly, proposals are made for the recognition of new orders, families and genera, as well as the transfer of a variety of species to other genera. In addition, emended descriptions are given for many species mainly involving information on DNA G+C content and (approximate) genome size, both of which can be considered valuable taxonomic markers. We detected many incongruities when comparing the results of the present study with existing classifications, which appear to be caused by insufficiently resolved 16S rRNA gene trees or incomplete taxon sampling. The few significant incongruities found between 16S rRNA gene and whole genome trees underline the pitfalls inherent in phylogenies based upon single gene sequences and the impediment in using ordinary bootstrapping in phylogenomic studies, particularly when combined with too narrow gene selections. While a significant degree of phylogenetic conservation was detected in all phenotypic characters investigated, the overall fit to the tree varied considerably, which is one of the probable causes of misclassifications in the past, much like the use of plesiomorphic character states as diagnostic features.

Genetics

Ecology

List of Prokaryotic names with Standing in Nomenclature (LPSN) moves to the DSMZ

Aidan Parte et al., Jul 23, 2020

The List of Prokaryotic names with Standing in Nomenclature (LPSN) was acquired in November 2019 by the DSMZ and was relaunched using an entirely new production system in February 2020. This article describes in detail the structure of the new site, navigation, page layout, search facilities and new features.

Ecology

Molecular Biology

TYGS and LPSN: a database tandem for fast and reliable genome-based classification and nomenclature of prokaryotes

Jan Meier‐Kolthoff et al., Sep 22, 2021

Microbial systematics is heavily influenced by genome-based methods and challenged by an ever increasing number of taxon names and associated sequences in public data repositories. This poses a challenge for database systems, particularly since it is obviously advantageous if such data are based on a globally recognized approach to manage names, such as the International Code of Nomenclature of Prokaryotes. The amount of data can only be handled if accurate and reliable high-throughput platforms are available that are able to both comply with this demand and to keep track of all changes in an efficient and flexible way. The List of Prokaryotic names with Standing in Nomenclature (LPSN) is an expert-curated authoritative resource for prokaryotic nomenclature and is available at https://lpsn.dsmz.de. The Type (Strain) Genome Server (TYGS) is a high-throughput platform for accurate genome-based taxonomy and is available at https://tygs.dsmz.de. We here present important updates of these two previously introduced, heavily interconnected platforms for taxonomic nomenclature and classification, including new high-level facilities providing access to bioinformatic algorithms, a considerable expansion of the database content, and new ways to easily access the data.

Genetics

Ecology

Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software

Alexander Sczyrba et al., Oct 2, 2017

The Critical Assessment of Metagenome Interpretation (CAMI) community initiative presents results from its first challenge, a rigorous benchmarking of software for metagenome assembly, binning and taxonomic profiling. Methods for assembly, taxonomic profiling and binning are key to interpreting metagenome data, but a lack of consensus about benchmarking complicates performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on highly complex and realistic data sets, generated from ∼700 newly sequenced microorganisms and ∼600 novel viruses and plasmids and representing common experimental setups. Assembly and genome binning programs performed well for species represented by individual genomes but were substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below family level. Parameter settings markedly affected performance, underscoring their importance for program reproducibility. The CAMI results highlight current challenges but also provide a roadmap for software selection to answer specific research questions.

Genetics

Ecology

Toward a Novel Multilocus Phylogenetic Taxonomy for the Dermatophytes

Sybren Hoog et al., Oct 25, 2016

Type and reference strains of members of the onygenalean family Arthrodermataceae have been sequenced for rDNA ITS and partial LSU, the ribosomal 60S protein, and fragments of β-tubulin and translation elongation factor 3. The resulting phylogenetic trees showed a large degree of correspondence, and topologies matched those of earlier published phylogenies demonstrating that the phylogenetic representation of dermatophytes and dermatophyte-like fungi has reached an acceptable level of stability. All trees showed Trichophyton to be polyphyletic. In the present paper, Trichophyton is restricted to mainly the derived clade, resulting in classification of nearly all anthropophilic dermatophytes in Trichophyton and Epidermophyton, along with some zoophilic species that regularly infect humans. Microsporum is restricted to some species around M. canis, while the geophilic species and zoophilic species that are more remote from the human sphere are divided over Arthroderma, Lophophyton and Nannizzia. A new genus Guarromyces is proposed for Keratinomyces ceretanicus. Thirteen new combinations are proposed; in an overview of all described species it is noted that the largest number of novelties was introduced during the decades 1920-1940, when morphological characters were used in addition to clinical features. Species are neo- or epi-typified where necessary, which was the case in Arthroderma curreyi, Epidermophyton floccosum, Lophophyton gallinae, Trichophyton equinum, T. mentagrophytes, T. quinckeanum, T. schoenleinii, T. soudanense, and T. verrucosum. In the newly proposed taxonomy, Trichophyton contains 16 species, Epidermophyton one species, Nannizzia 9 species, Microsporum 3 species, Lophophyton 1 species, Arthroderma 21 species and Ctenomyces 1 species, but more detailed studies remain needed to establish species borderlines. Each species now has a single valid name. Two new genera are introduced: Guarromyces and Paraphyton. The number of genera has increased, but species that are relevant to routine diagnostics now belong to smaller groups, which enhances their identification.

Genetics

Epidemiology

Taxonomic use of DNA G+C content and DNA–DNA hybridization in the genomic age

Jan Meier‐Kolthoff et al., Feb 1, 2014

The G+C content of a genome is frequently used in taxonomic descriptions of species and genera. In the past it has been determined using conventional, indirect methods, but it is nowadays reasonable to calculate the DNA G+C content directly from the increasingly available and affordable genome sequences. The expected increase in accuracy, however, might alter the way in which the G+C content is used for drawing taxonomic conclusions. We here re-estimate the literature assumption that the G+C content can vary up to 3-5 % within species using genomic datasets. The resulting G+C content differences are compared with DNA-DNA hybridization (DDH) similarities calculated in silico using the GGDC web server, with 70% similarity as the gold standard threshold for species boundaries. The results indicate that the G+C content, if computed from genome sequences, varies no more than 1% within species. Statistical models based on larger differences alone can reject the hypothesis that two strains belong to the same species. Because DDH similarities between two non-type strains occur in the genomic datasets, we also examine to what extent and under which conditions such a similarity could be <70% even though the similarity of either strain to a type strain was ≥ 70%. In theory, their similarity could be as low as 50%, whereas empirical data suggest a boundary closer (but not identical) to 70%. However, it is shown that using a 50% boundary would not affect the conclusions regarding the DNA G+C content. Hence, we suggest that discrepancies between G+C content data provided in species descriptions on the one hand and those recalculated after genome sequencing on the other hand ≥ 1% are due to significant inaccuracies of the applied conventional methods and accordingly call for emendations of species descriptions.
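As a rough illustration of computing G+C content directly from a genome sequence, the following minimal Python sketch may help. The function names (`gc_content`, `gc_difference_exceeds`) are hypothetical, and ignoring ambiguous bases such as N is an assumption, not something stated in the abstract. The 1% default threshold is taken from the abstract's finding on within-species variation; this is not the authors' statistical model.

```python
def gc_content(sequence: str) -> float:
    """Return the G+C percentage of a DNA sequence.

    Assumption: only unambiguous bases (A, C, G, T) are counted;
    ambiguous symbols such as N are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def gc_difference_exceeds(seq_a: str, seq_b: str, threshold: float = 1.0) -> bool:
    """Return True if the G+C contents differ by more than `threshold` percentage points.

    The 1.0 default mirrors the within-species bound reported in the abstract;
    a True result only hints that two strains may belong to different species.
    """
    return abs(gc_content(seq_a) - gc_content(seq_b)) > threshold
```

For a multi-contig genome, one would concatenate the contigs (or sum the base counts across them) before computing the percentage, so that each contig is weighted by its length.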

Genetics

Artificial Intelligence
