ResearchHub | Open Science Community

A general species delimitation method with applications to phylogenetic placements

Jiajie Zhang et al. (Aug 29, 2013)

Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets. Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing species to be delimited in high-throughput sequencing data. Availability and implementation: The code is freely available at www.exelixis-lab.org/software.html. Contact: Alexandros.Stamatakis@h-its.org Supplementary information: Supplementary data are available at Bioinformatics online.
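To illustrate why threshold-based OTU-picking is sensitive to the similarity cutoff, here is a minimal Python sketch. It is not the procedure of any tool compared in the paper: the greedy centroid clustering, the function names `identity` and `greedy_otus`, and the toy sequences are all assumptions made for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Toy greedy clustering: each sequence joins the first centroid it
    matches at >= threshold, otherwise it founds a new cluster."""
    centroids, clusters = [], []
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[i].append(s)
                break
        else:
            # No centroid was similar enough: start a new OTU.
            centroids.append(s)
            clusters.append([s])
    return clusters


# Hypothetical aligned sequences (not real data).
toy = [
    "AAAAAAAAAA",
    "AAAAAAAAAT",  # 90% identical to the first sequence
    "AAAAAAAATT",  # 80% identical to the first sequence
    "CCCCCCCCCC",  # unrelated
]

for t in (0.95, 0.85, 0.75):
    print(f"threshold={t}: {len(greedy_otus(toy, t))} OTUs")
```

On this toy input the same four sequences yield 4, 3 or 2 clusters depending only on the cutoff, which is the kind of sensitivity the abstract refers to.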

Genetics

Artificial Intelligence


Multi-rate Poisson tree processes for single-locus species delimitation under maximum likelihood and Markov chain Monte Carlo

Paschalia Kapli et al. (Jan 20, 2017)

In recent years, molecular species delimitation has become a routine approach for quantifying and classifying biodiversity. Barcoding methods are of particular importance in large-scale surveys as they promote fast species discovery and biodiversity estimates. Among those, distance-based methods are the most common choice as they scale well with large datasets; however, they are sensitive to similarity threshold parameters and they ignore evolutionary relationships. The recently introduced "Poisson Tree Processes" (PTP) method is a phylogeny-aware approach that does not rely on such thresholds. Yet, two weaknesses of PTP impact its accuracy and practicality when applied to large datasets: it does not account for divergent intraspecific variation and is slow for a large number of sequences. We introduce the multi-rate PTP (mPTP), an improved method that alleviates the theoretical and technical shortcomings of PTP. It incorporates different levels of intraspecific genetic diversity deriving from differences in either the evolutionary history or sampling of each species. Results on empirical data suggest that mPTP is superior to PTP and popular distance-based methods as it consistently yields more accurate delimitations with respect to the taxonomy (i.e., identifies more taxonomic species, infers species numbers closer to the taxonomy). Moreover, mPTP does not require any similarity threshold as input. The novel dynamic programming algorithm attains a speedup of at least five orders of magnitude compared to PTP, allowing it to delimit species in large (meta-) barcoding data. In addition, Markov Chain Monte Carlo sampling provides a comprehensive evaluation of the inferred delimitation in just a few seconds for millions of steps, independently of tree size. mPTP is implemented in C and is available for download at http://github.com/Pas-Kapli/mptp under the GNU Affero 3 license. A web-service is available at http://mptp.h-its.org. Contact: paschalia.kapli@h-its.org or alexandros.stamatakis@h-its.org or tomas.flouri@h-its.org. Supplementary data are available at Bioinformatics online.

Genetics

Ecology


SweeD: Likelihood-Based Detection of Selective Sweeps in Thousands of Genomes

Pavlos Pavlidis et al. (Jun 18, 2013)

The advent of modern DNA sequencing technology is the driving force in obtaining complete intra-specific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows SweeD to be deployed on cluster systems with queue execution time restrictions, as well as long-running analyses to be resumed after processor failures. In addition, the user can specify various demographic models via the command line to calculate their theoretically expected site frequency spectra. Therefore, in contrast to SweepFinder, the neutral site frequencies can optionally be calculated directly from a given demographic model. We show that an increase in sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the human 1000 Genomes Project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.
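For readers unfamiliar with the site frequency spectrum (SFS) that the abstract mentions, the following Python sketch shows how an unfolded SFS can be tallied from a small alignment. It is not SweeD's code or input format: the `'0'`/`'1'` encoding (ancestral/derived), the function name `unfolded_sfs` and the toy haplotypes are assumptions for illustration only.

```python
from collections import Counter


def unfolded_sfs(alignment):
    """Count, for k = 1..n-1, how many sites carry the derived allele
    in exactly k of the n sampled sequences.

    alignment: list of equal-length strings of '0' (ancestral) and
    '1' (derived); this encoding is an assumption of the sketch.
    """
    n = len(alignment)
    counts = Counter()
    for column in zip(*alignment):  # iterate over sites
        k = column.count("1")
        if 0 < k < n:  # skip monomorphic sites
            counts[k] += 1
    return [counts[k] for k in range(1, n)]


# Hypothetical sample of four haplotypes and five sites.
haplotypes = [
    "10010",
    "10100",
    "11000",
    "00000",
]

print(unfolded_sfs(haplotypes))  # [3, 0, 1]
```

In this toy sample three sites are singletons, none are doubletons, and one site has the derived allele in three of the four sequences. Sweep detection methods of this family compare such observed frequencies against a neutral expectation.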

Genetics

Molecular Biology


A Critical Assessment of Storytelling: Gene Ontology Categories and the Importance of Validating Genomic Scans

Pavlos Pavlidis et al. (May 23, 2012)

In the age of whole-genome population genetics, so-called genomic scan studies often conclude with a long list of putatively selected loci. These lists are then further scrutinized to annotate these regions by gene function, corresponding biological processes, expression levels, or gene networks. Such annotations are often used to assess and/or verify the validity of the genome scan and the statistical methods that have been used to perform the analyses. Furthermore, these results are frequently considered to validate "true-positives" if the identified regions make biological sense a posteriori. Here, we show that this approach can be potentially misleading. By simulating neutral evolutionary histories, we demonstrate that it is possible not only to obtain an extremely high false-positive rate but also to make biological sense out of the false-positives and construct a sensible biological narrative. Results are compared with a recent polymorphism data set from Drosophila melanogaster.

Genetics

Artificial Intelligence


Phylogenetic analysis of SARS-CoV-2 data is difficult

Benoît Morel et al. (Aug 6, 2020)

Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies using the example of a data snapshot comprising all virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be possible. Finally, an automatic classification of the current sequences into sub-classes based on statistical criteria is also not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.

Genetics

Artificial Intelligence


Population genomics insights into the recent evolution of SARS-CoV-2

Maria Vasilarou et al. (Apr 23, 2020)

The current coronavirus disease 2019 (COVID-19) pandemic is caused by the SARS-CoV-2 virus and is still spreading rapidly worldwide. Full-genome-sequence computational analysis of the SARS-CoV-2 genome will allow us to understand the recent evolutionary events and adaptability mechanisms more accurately, as there is still neither an effective therapeutic nor a prophylactic strategy. In this study, we used population genetics analysis to infer the mutation rate and plausible recombination events that may have contributed to the evolution of the SARS-CoV-2 virus. Furthermore, we localized targets of recent and strong positive selection. The genomic regions that appear to be under positive selection are largely co-localized with regions in which recombination from non-human hosts appears to have taken place in the past. Our results suggest that the pangolin coronavirus genome may have contributed to the SARS-CoV-2 genome by recombination with the bat coronavirus genome. However, we find evidence for additional recombination events that involve coronavirus genomes from other hosts, i.e., Hedgehog and Sparrow. Even though recombination events within human hosts cannot be directly assessed, due to the high similarity of SARS-CoV-2 genomes, we infer from a linkage disequilibrium analysis that recombination may have recently occurred within human hosts. In addition, we employed an Approximate Bayesian Computation approach to estimate the parameters of a demographic scenario involving an exponential growth of the size of the SARS-CoV-2 populations that have infected European, Asian and Northern American cohorts, and we demonstrated that a rapid exponential growth in population size can support the observed polymorphism patterns in SARS-CoV-2 genomes.

Genetics

Demography


Early split between African and European populations of Drosophila melanogaster

Adamandia Kapopoulou et al. (Jun 6, 2018)

Natural populations of the fruit fly Drosophila melanogaster have been used extensively as a model system to investigate the effect of neutral and selective processes on genetic variation. The species expanded outside its Afrotropical ancestral range during the last glacial period and numerous studies have focused on identifying molecular adaptations associated with the colonization of northern habitats. The sequencing of many genomes from African and non-African natural populations has facilitated the analysis of the interplay between adaptive and demographic processes. However, most of the non-African sequenced material has been sampled from American and Australian populations that have been introduced within the last hundred years following recent human dispersal and are also affected by recent genetic admixture with African populations. Northern European populations, in contrast, are expected to be older and less affected by complex admixture patterns and are therefore more appropriate for investigating neutral and adaptive processes. Here we present a new dataset consisting of 14 fully sequenced haploid genomes sampled from a natural population in Umeå, Sweden. We co-analyzed these new data with an African population to compare the likelihood of several competing demographic scenarios for European and African populations. We show that allowing for gene flow between populations in neutral demographic models leads to a significantly better fit to the data and strongly affects estimates of the divergence time and of the size of the bottleneck in the European population. Our results indicate that the time of divergence between cosmopolitan and ancestral populations is 30,000 years older than reported by previous studies.

Genetics

Ecology


Balancing selection on genomic deletion polymorphisms in humans

Alber Aqil et al. (Apr 28, 2022)

A key question in biology is why genomic variation persists in a population for extended periods. Recent studies have identified examples of genomic deletions that have remained polymorphic in the human lineage for hundreds of millennia, ostensibly owing to balancing selection. Nevertheless, genome-wide investigations of ancient and possibly adaptive deletions remain an imperative exercise. Here, we used simulations to show an excess of ancient allele sharing between modern and archaic human genomes that cannot be explained solely by introgression or ancient structure under neutrality. We identified 63 deletion polymorphisms that emerged before the divergence of humans and Neanderthals and are associated with GWAS traits. We used empirical and simulation-based analyses to show that the haplotypes that harbor these functional ancient deletions have likely been evolving under time- and geography-dependent balancing selection. Collectively, our results suggest that balancing selection may have maintained at least 27% of the functional deletion polymorphisms in humans for hundreds of thousands of years.

Genetics

Artificial Intelligence


Amylase copy number analysis in several mammalian lineages reveals convergent adaptive bursts shaped by diet

Petar Pajic et al. (Jun 5, 2018)

The amylase gene (AMY), which codes for a starch-digesting enzyme in animals, underwent several gene copy number gains in humans, dogs, and mice, presumably along with increased starch consumption during the evolution of these species. Here we present evidence for additional AMY copy number expansions in several mammalian species, most of which also consume starch-rich diets. We also show that these independent AMY copy number gains are often accompanied by a gain in enzymatic activity of amylase in saliva. We used multi-species coalescent modeling to provide further evidence that these recurrent AMY gene copy number expansions were adaptive. Our findings underscore the overall importance of gene copy number amplification as a flexible and fast adaptive mechanism in evolution that can independently occur in different branches of the phylogeny.

Genetics

Biochemistry


Read Length Dominates Phylogenetic Placement Accuracy of Ancient DNA Reads

Ben Bettisworth et al. (Jun 29, 2024)

One of the central problems facing researchers who analyze ancient DNA (aDNA) is identifying the species that corresponds to the recovered aDNA. Prior analyses of aDNA data normally use sequence matching tools (such as BLAST) to identify reads obtained from aDNA. However, as the source of aDNA is often a previously unsampled taxon, because the taxon went extinct prior to the advent of modern sequencing technology, it is likely that there is no exact match in any database. As a consequence, tools such as BLAST are of limited use in helping to place a read in a phylogenetic context, i.e., identifying the likely source of a read on a phylogenetic tree. Phylogenetic placement is a technique where a sequence or read is placed onto a specific branch of a phylogenetic tree. These tools offer the potential for a much finer resolution when identifying reads. However, phylogenetic placement has primarily been used to place reads obtained from extant sources. Phylogenetic placement's applicability to aDNA data is complicated by the characteristic pattern of degradation that aDNA undergoes. This characteristic damage is generally not accounted for by popular phylogenetic placement tools, and as a consequence some authors have cast doubt on the potential accuracy of such tools. To understand how the characteristic aDNA damage affects phylogenetic placement tools, we implemented a statistical model of aDNA damage as a tool, which we call PyGargammel, that takes sequences and applies damage characteristic of aDNA to them. We deploy PyGargammel, along with the existing phylogenetic placement assessment pipeline PEWO, on 7 empirical datasets. With this pipeline, we explore the parameter space of aDNA damage via a grid search in order to identify the factors of aDNA damage that are most impactful. We test 4 leading phylogenetic placement tools: APPLES, EPA-NG, PPLACER, and RAPPAS. We find that the frequency of DNA backbone nicks (and consequently read length) is the primary driver of error for aDNA reads. Additionally, we find that other factors, such as the rate of A to G misincorporations, have a negligible effect on the overall accuracy of phylogenetic placement tools.
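The link between nick frequency and read length can be illustrated with a toy Python simulation. This is not PyGargammel's damage model: the independent per-position nick probability, the function name `fragment`, and the parameter values are all assumptions chosen only to show that more frequent nicks produce shorter fragments.

```python
import random


def fragment(seq, nick_rate, rng):
    """Cut seq between adjacent bases, independently at each junction,
    with probability nick_rate (a simplifying assumption)."""
    fragments, start = [], 0
    for i in range(1, len(seq)):
        if rng.random() < nick_rate:
            fragments.append(seq[start:i])
            start = i
    fragments.append(seq[start:])  # trailing fragment
    return fragments


def mean_length(fragments):
    """Average fragment length."""
    return sum(len(f) for f in fragments) / len(fragments)


# Hypothetical 2000-base sequence; the seed is fixed for repeatability.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(2000))

for rate in (0.01, 0.05, 0.2):
    reads = fragment(genome, rate, random.Random(1))
    print(f"nick_rate={rate}: {len(reads)} fragments, "
          f"mean length {mean_length(reads):.1f}")
```

Under this simplification the expected fragment length is roughly the inverse of the nick rate, so raising the nick frequency directly shortens the reads available for placement.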

Genetics

Paleontology
