ResearchHub | Open Science Community

WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads

Murray Patterson et al., Feb 6, 2015

The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the weighted minimum error correction problem in runtime linear in the number of SNPs. WhatsHap is a fixed parameter tractable (FPT) approach with coverage as the parameter. We demonstrate that WhatsHap can handle datasets of coverage up to 20×, and that 15× are generally enough for reliably phasing long reads, even at significantly elevated sequencing error rates. We also find that the switch and flip error rates of the haplotypes we output are favorable when comparing them with state-of-the-art statistical phasers.
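To make the weighted minimum error correction (wMEC) objective concrete, the following is a minimal brute-force sketch in Python. It is not the WhatsHap algorithm: it enumerates every bipartition of the reads, so it is exponential in the number of reads and only usable on toy inputs. The read representation (a dict from SNP position to an allele and a flip weight) and the function names are assumptions made for illustration, and the sketch lets each haplotype pick its allele independently at every position.

```python
from itertools import product

# Assumed representation: a read maps SNP position -> (allele, weight),
# where allele is 0 or 1 and weight is the cost of flipping that allele.

def group_cost(reads, positions):
    """Minimum total flip weight that makes all reads in one group agree."""
    total = 0
    for pos in positions:
        # Cost of forcing the group's haplotype to carry allele 0 (or 1) here.
        cost_set_to_0 = sum(r[pos][1] for r in reads if pos in r and r[pos][0] == 1)
        cost_set_to_1 = sum(r[pos][1] for r in reads if pos in r and r[pos][0] == 0)
        total += min(cost_set_to_0, cost_set_to_1)
    return total

def brute_force_wmec(reads):
    """Try every bipartition of the reads and return (best cost, assignment)."""
    positions = sorted({p for r in reads for p in r})
    best_cost, best_assignment = None, None
    for assignment in product((0, 1), repeat=len(reads)):
        if assignment and assignment[0] == 1:
            continue  # mirror image of a bipartition already considered
        groups = ([], [])
        for read, side in zip(reads, assignment):
            groups[side].append(read)
        cost = group_cost(groups[0], positions) + group_cost(groups[1], positions)
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment

# Toy input: the last read disagrees with both haplotypes, but its allele at
# position 1 has a low weight, so correcting that single entry is cheapest.
toy_reads = [
    {0: (0, 10), 1: (0, 10)},
    {0: (0, 10), 1: (0, 10), 2: (1, 10)},
    {0: (1, 10), 1: (1, 10)},
    {1: (1, 10), 2: (0, 10)},
    {0: (0, 10), 1: (1, 3)},
]
cost, assignment = brute_force_wmec(toy_reads)
```

On this toy input the weights steer the correction toward the low-confidence allele, which is the point of the weighted variant of the problem.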

Genetics

Molecular Biology

Inferring Cancer Progression from Single-cell Sequencing while Allowing Mutation Losses

Simone Ciccolella et al., Feb 20, 2018

Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions seen as an accumulation of mutations. However, recent studies (Kuipers et al., 2017) leveraging Single-cell Sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. Still, established methods that can infer phylogenies with mutation losses are lacking. Results: We present SASC (Simulated Annealing Single-Cell inference), a new and robust tool based on simulated annealing for inferring cancer progression from SCS data. More precisely, we introduce a simple extension of the model of evolution in which mutations are only accumulated, by also allowing a limited number of back mutations in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real data sets and in comparison with some other available methods. Availability: The Simulated Annealing Single-cell inference (SASC) tool is open source and available at https://github.com/sciccolella/sasc. Contact: s.ciccolella@campus.unimib.it

Genetics

Artificial Intelligence

Felidae call type and species identification based on acoustic features

Danushka Bandara et al., Apr 1, 2022

The cat family Felidae is one of the most successful carnivore lineages today. However, the study of the evolution of acoustic communication between felids remains a challenge due to the lack of fossils, the limited availability of audio recordings because of their largely solitary and secretive behavior, and the underdevelopment of computational models and methods needed to address acoustic evolutionary questions. This study is a first attempt at developing a machine learning-based approach to the classification of felid calls as well as the identification of acoustic features that distinguish felid call types and species from one another. A felid call dataset was developed by extracting audio clips from diverse sources. The audio clips were manually annotated for call type and species. Due to the limited availability of samples, this study focused on the Pantherinae subfamily. Time-frequency features were then extracted from the Pantherinae dataset. Finally, several classification algorithms were applied to the resulting data. We achieved 91% accuracy for this Pantherinae call type classification. For the species classification, we obtained 86% accuracy. We also obtained the most predictive features for each of the classifications performed. These features can inform future research into the evolutionary acoustic analysis of the felid group.

Ecology

Biochemistry

From Alpha to Zeta: Identifying variants and subtypes of SARS-CoV-2 via clustering

Andrew Melnyk et al., Aug 27, 2021

The availability of millions of SARS-CoV-2 sequences in public databases such as GISAID and EMBL-EBI (UK) allows a detailed study of the evolution, genomic diversity and dynamics of a virus like never before. Here we identify novel variants and subtypes of SARS-CoV-2 by clustering sequences, adapting methods originally designed for haplotyping intra-host viral populations. We assess our results using clustering entropy, the first time it has been used in this context. Our clustering approach reaches lower entropies compared to other methods, and we are able to boost this even further through gap filling and Monte Carlo based entropy minimization. Moreover, our method clearly identifies the well-known Alpha variant in the UK and GISAID datasets, but is also able to detect the much less represented (< 1% of the sequences) Beta (South Africa), Epsilon (California), Gamma and Zeta (Brazil) variants in the GISAID dataset. Finally, we show that each variant identified has high selective fitness, based on the growth rate of its cluster over time. This demonstrates that our clustering approach is a viable alternative for detecting even rare subtypes in very large datasets.
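As a rough illustration of how an entropy score can rank clusterings of aligned sequences, here is a minimal Python sketch. The abstract does not state the exact formula, so the definition used here is an assumption: the Shannon entropy of each alignment column is summed within a cluster, and clusters are averaged with weights proportional to their size. The function names are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def cluster_entropy(sequences):
    """Sum of column entropies over equal-length aligned sequences."""
    return sum(column_entropy(col) for col in zip(*sequences))

def clustering_entropy(clusters):
    """Size-weighted average of the per-cluster entropies (assumed definition)."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters if c)

# Two ways of grouping the same four toy sequences.
homogeneous = [["ACGT", "ACGT"], ["TCGA", "TCGA"]]
mixed = [["ACGT", "TCGA"], ["ACGT", "TCGA"]]
low = clustering_entropy(homogeneous)
high = clustering_entropy(mixed)
```

Under this assumed definition, a clustering that groups identical sequences together scores lower than one that mixes them, which matches the abstract's use of lower entropy as the better outcome.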

Genetics

Artificial Intelligence

Molecular sequence classification using efficient kernel based embedding

Sarwan Ali et al., Jun 27, 2024

Artificial Intelligence

gpps: An ILP-based approach for inferring cancer progression with mutation losses from single cell data

Simone Ciccolella et al., Jul 17, 2018

Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progression where mutations are accumulated through histories. However, some recent studies leveraging Single Cell Sequencing (SCS) techniques have shown evidence of mutation losses in several tumor samples [Kuipers et al., 2017], making the inference problem harder. Results: We present a new tool, gpps, that reconstructs a tumor phylogeny from single cell data, allowing each mutation to be lost at most a fixed number of times. Availability: The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gppf.

Genetics

Philosophy

Effective clustering for single cell sequencing cancer data

Simone Ciccolella et al., Mar 23, 2019

Background: Single cell sequencing (SCS) technologies provide a level of resolution that makes them indispensable for inferring, from a sequenced tumor, evolutionary trees or phylogenies representing an accumulation of cancerous mutations. A drawback of SCS is elevated false negative and missing value rates, resulting in a large space of possible solutions, which in turn makes some approaches and tools infeasible. While this has not inhibited the development of methods for inferring phylogenies from SCS data, the continuing increase in size and resolution of these data begins to put a strain on such methods. One possible solution is to reduce the size of an SCS instance (usually represented as a matrix of presence, absence and missing values of the mutations found in the different sequenced cells) and infer the tree from this reduced-size instance. Previous approaches have used k-means to this end, clustering groups of mutations and/or cells, and using these means as the reduced instance. Such an approach typically uses the Euclidean distance for computing means. However, since the values in these matrices are of a categorical nature (having the three categories: present, absent and missing), we explore applying techniques for clustering categorical data, commonly used in data mining and machine learning, to SCS data, with this goal in mind. Results: In this work, we present celluloid, a new clustering procedure aimed at clustering categorical vector or matrix data, here representing SCS instances. We demonstrate that celluloid clusters mutations with high precision, never pairing too many mutations that are unrelated in the ground truth, and that it also obtains accurate results in terms of the phylogeny inferred downstream from the reduced instance produced by this method. Finally, we demonstrate the usefulness of a clustering step by applying the entire pipeline (clustering + inference method) to a real dataset, showing a significant reduction in the runtime and considerably raising the upper bound on the size of SCS instances which can be solved in practice. Availability: Our approach, celluloid: clustering single cell sequencing data around centroids, is available at under an MIT license.
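The contrast between k-means and categorical clustering can be sketched with a small k-modes style procedure in Python. This is an illustration of the general idea only, not celluloid's implementation. The encoding (0 = absent, 1 = present, 2 = missing), the dissimilarity that ignores missing entries, and all function names are assumptions made for this example. Each row is one mutation observed across the sequenced cells.

```python
from collections import Counter

MISSING = 2  # assumed encoding: 0 = absent, 1 = present, 2 = missing

def dissimilarity(a, b):
    """Count positions where both values are known and differ."""
    return sum(1 for x, y in zip(a, b) if MISSING not in (x, y) and x != y)

def mode_of(vectors):
    """Per-position most frequent known value; MISSING if nothing is known."""
    mode = []
    for column in zip(*vectors):
        known = [v for v in column if v != MISSING]
        mode.append(Counter(known).most_common(1)[0][0] if known else MISSING)
    return tuple(mode)

def k_modes(vectors, initial_modes, max_iter=100):
    """Alternate assignment and mode update until the labels stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)), key=lambda j: dissimilarity(v, modes[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = mode_of(members)
    return labels, modes

# Five mutations (rows) over four cells (columns), with some missing entries.
mutations = [
    (1, 1, 0, 0),
    (1, 2, 0, 0),
    (1, 1, 0, 2),
    (0, 0, 1, 1),
    (0, 2, 1, 1),
]
labels, modes = k_modes(mutations, initial_modes=[mutations[0], mutations[3]])
```

The resulting modes stay within the original categories, whereas a Euclidean mean of such rows would produce fractional values with no direct interpretation as present, absent or missing. The modes can then serve as the rows of the reduced instance.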

Artificial Intelligence

Oncology

Advancing Protein-DNA Binding Site Prediction: Integrating Sequence Models and Machine Learning Classifiers

Taslim Murad et al., Aug 23, 2023

Predicting protein-DNA binding sites is a challenging computational problem that has led to the development of advanced algorithms and techniques in the field of bioinformatics. Identifying the specific residues where proteins bind to DNA is of paramount importance, as it enables the modeling of their interactions and facilitates downstream studies. Nevertheless, the development of accurate and efficient computational methods for this task remains a persistent challenge. Accurate prediction of protein-DNA binding sites has far-reaching implications for understanding molecular mechanisms, disease processes, drug discovery, and synthetic biology applications. It helps bridge the gap between genomics and functional biology, enabling researchers to uncover the intricacies of cellular processes and advance our knowledge of the biological world. The method used to predict DNA binding residues in this study combines conventional bioinformatics tools, protein language models, and machine learning and deep learning classifiers. Our model is trained on a dataset of protein-DNA binding sites and then evaluated in several experiments. The results show superior performance compared to existing models, as indicated by higher AUC values on two benchmark datasets. The suggested model has a strong capacity for generalization and shows specificity for DNA-binding sites. We further demonstrated the adaptability of our model as a universal framework for binding site prediction by training it on a variety of protein-ligand binding site datasets. In conclusion, our approach for predicting protein-DNA binding residues holds great promise in advancing our understanding of molecular interactions, thus paving the way for several applications in the field of molecular biology and genetics. Our approach's demonstrated efficacy and versatility underscore its potential for driving discoveries in biomolecular research.

Genetics

Artificial Intelligence

Special Issue, Part I: 19th International Symposium on Bioinformatics Research and Applications (ISBRA 2023)

Murray Patterson, Jun 1, 2024

Molecular Biology

Biology

WhatsHap: fast and accurate read-based phasing

Marcel Martin et al., Nov 2, 2016

Read-based phasing reconstructs the haplotype structure of a sample purely from sequencing reads. While phasing is a required step for answering questions about population genetics and compound heterozygosity, and for aiding clinical decision making, accurate, usable and standards-based software has been lacking. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap also works well with second-generation data, is easy to use and will phase not only SNVs, but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, which further improves accuracy if multiple related samples are provided.

Genetics

Artificial Intelligence
