ResearchHub | Open Science Community

0

A High-Resolution Enhancer Atlas of the Developing Telencephalon

Axel Visel et al.Jan 31, 2013

Genetics

Geology

0

Paper

Save

Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale

Hani GirgisJul 23, 2015

With rapid advancements in technology, the sequences of thousands of species' genomes are becoming available. Within the sequences are repeats that comprise significant portions of genomes. Successful annotations thus require accurate discovery of repeats. As species-specific elements, repeats in newly sequenced genomes are likely to be unknown. Therefore, annotating newly sequenced genomes requires tools to discover repeats de-novo. However, the currently available de-novo tools have limitations concerning the size of the input sequence, ease of use, sensitivities to major types of repeats, consistency of performance, speed, and false positive rate.To address these limitations, I designed and developed Red, applying Machine Learning. Red is the first repeat-detection tool capable of labeling its training data and training itself automatically on an entire genome. Red is easy to install and use. It is sensitive to both transposons and simple repeats; in contrast, available tools such as RepeatScout and ReCon are sensitive to transposons, and WindowMasker to simple repeats. Red performed consistently well on seven genomes; the other tools performed well only on some genomes. Red is much faster than RepeatScout and ReCon and has a much lower false positive rate than WindowMasker. On human genes with five or more copies, Red was more specific than RepeatScout by a wide margin. When tested on genomes of unusual nucleotide compositions, Red located repeats with high sensitivities and maintained moderate false positive rates. Red outperformed the related tools on a bacterial genome. Red identified 46,405 novel repetitive segments in the human genome. Finally, Red is capable of processing assembled and unassembled genomes.Red's innovative methodology and its excellent performance on seven different genomes represent a valuable advancement in the field of repeats discovery.

Genetics

Molecular Biology

0

Paper

Save

FASTCAR: Rapid alignment-free prediction of sequence alignment identity scores using self-supervised general linear models

Benjamin James et al.Jul 31, 2018

Abstract Motivation Pairwise alignment is a predominant algorithm in the field of bioinformatics. This algorithm is quadratic — slow especially on long sequences. Many applications utilize identity scores without the corresponding alignments. For these applications, we propose FASTCAR. It produces identity scores for pairs of DNA sequences using alignment-free methods and two self-supervised general linear models. Results For the first time, the new tool can predict the pair-wise identity score in linear time and space. On two large-scale sequence databases, FASTCAR provided the best compromise between sensitivity and precision while being faster than BLAST by 40% and faster than USEARCH by 6–10 times. Further, FASTCAR is capable of producing the pair-wise identity scores of long DNA sequences — millions-of-nucleotides-long bacterial genomes; this task cannot be accomplished by any alignment-based tool. Availability FASTCAR is available at https://github.com/TulsaBioinformaticsToolsmith/FASTCAR and as the Supplementary Dataset 1. Contact hani-girgis@utulsa.edu Supplementary information Supplementary data are available online.

Genetics

Artificial Intelligence

0

Paper

Save

MeShClust v3.0: High-quality clustering of DNA sequences using the mean shift algorithm and alignment-free identity scores

Hani GirgisJan 17, 2022

Abstract Background Tools for accurately clustering biological sequences are among the most important tools in computational biology. Two pioneering tools for clustering sequences are CD-HIT and UCLUST , both of which are fast and consume reasonable amounts of memory; however, there is a big room for improvement in terms of cluster quality. Motivated by this opportunity for improving cluster quality, we applied the mean shift algorithm in MeShClust v1.0 . The mean shift algorithm is an instance of unsupervised learning. Its strong theoretical foundation guarantees the convergence to the true cluster centers. Our implementation of the mean shift algorithm in MeShClust v1.0 was a step forward; however, it was not the original algorithm. In this work, we make progress toward applying the original algorithm while utilizing alignment-free identity scores in a new tool: MeShClust v3.0 . Results We evaluated CD-HIT, MeShClust v1.0, MeShClust v3.0 , and UCLUST on 22 synthetic sets and five real sets. These data sets were designed or selected for testing the tools in terms of scalability and different similarity levels among sequences comprising clusters. On the synthetic data sets, MeShClust v3.0 outperformed the related tools on all sets in terms of cluster quality. On two real data sets obtained from human microbiome and maize transposons, MeShClust v3.0 outperformed the related tools by wide margins, achieving 55%—300% improvement in cluster quality. On another set that includes degenerate viral sequences, MeShClust v3.0 came third. On two bacterial sets, MeShClust v3.0 was the only applicable tool because of the long sequences in these sets. MeShClust v3.0 requires more time and memory than the related tools; almost all personal computers at the time of this writing can accommodate such requirements. MeShClust v3.0 can estimate an important parameter that controls cluster membership with high accuracy. Conclusions These results demonstrate the high quality of clusters produced by MeShClust v3.0 and its ability to apply the mean shift algorithm to large data sets and long sequences. Because clustering tools are utilized in many studies, providing high-quality clusters will help with deriving accurate biological knowledge.

Ecology

Artificial Intelligence

6

Paper

Save

EnhancerTracker: Comparing cell-type-specific enhancer activity of DNA sequence triplets via an ensemble of deep convolutional neural networks

Anthony Garza et al.Dec 23, 2023

Abstract Motivation Transcriptional enhancers — unlike promoters — are unrestrained by distance or strand orientation with respect to their target genes, making their computational identification a challenge. Further, there are insufficient numbers of confirmed enhancers for many cell types, preventing robust training of machine-learning-based models for enhancer prediction for such cell types. Results We present EnhancerTracker , a novel tool that leverages an ensemble of deep separable convolutional neural networks to identify cell-type-specific enhancers with the need of only two confirmed enhancers. EnhancerTracker is trained, validated, and tested on 52,789 putative enhancers obtained from the FANTOM5 Project and control sequences derived from the human genome. Unlike available tools, which accept one sequence at a time, the input to our tool is three sequences; the first two are enhancers active in the same cell type. EnhancerTracker outputs 1 if the third sequence is an enhancer active in the same cell type(s) where the first two enhancers are active. It outputs 0 otherwise. On a held-out set (15%), EnhancerTracker achieved an accuracy of 64%, a specificity of 93%, a recall of 35%, a precision of 84%, and an F1 score of 49%. Availability and implementation https://github.com/BioinformaticsToolsmith/EnhancerTracker Contact hani.girgis@tamuk.edu

Genetics

Artificial Intelligence

0

Paper

Save

HebbPlot: An intelligent tool for learning and visualizing chromatin mark signatures

Hani Girgis et al.Oct 23, 2017

Histone modifications play important roles in gene regulation, heredity, imprinting, and many human diseases. The histone code is complex, consisting of about 100 marks. Biologists need computational tools for characterizing general signatures representing the distributions of tens of chromatin marks around thousands of regions. To this end, we developed a software tool called HebbPlot, which utilizes a Hebbian neural network to learn such signatures. HebbPlot presents a signature as a digitized image, which can be easily interpreted. We validated HebbPlot in six case studies. HebbPlot is applicable to a wide array of studies, facilitating the deciphering of the histone code.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save

0

Benchmarking of alignment-free sequence comparison methods

Andrzej Zieleziński et al.Apr 16, 2019

Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. Here, we present a community resource ( ) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference and reconstruction of species trees under horizontal gene transfer and recombination events. The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save

0

MeShClust2: Application of alignment-free identity scores in clustering long DNA sequences

Benjamin James et al.Oct 24, 2018

Grouping sequences into similar clusters is an important part of sequence analysis. Widely used clustering tools sacrifice quality for speed. Previously, we developed MeShClust, which utilizes k-mer counts in an alignment-assisted classifier and the mean-shift algorithm for clustering DNA sequences. Although MeShClust outperformed related tools in terms of cluster quality, the alignment algorithm used for generating training data for the classifier was not scalable to longer sequences. In contrast, MeShClust2 generates semi-synthetic sequence pairs with known mutation rates, avoiding alignment algorithms. MeShClust2 clustered 3600 bacterial genomes, providing a utility for clustering long sequences using identity scores for the first time.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save

0

LtrDetector: A modern tool-suite for detecting long terminal repeat retrotransposons de-novo on the genomic scale

Joseph Valencia et al.Oct 22, 2018

Long terminal repeat retrotransposons are the most abundant transposons in plants. They play important roles in alternative splicing, recombination, gene regulation, and genomic evolution. Large-scale sequencing projects for plant genomes are currently underway. Software tools are important for annotating long terminal repeat retrotransposons in these newly available genomes. However, the available tools are not very sensitive to known elements and perform inconsistently on different genomes. Some are hard to install or obsolete. They may struggle to process large plant genomes. None are concurrent or have features to support manual review of new elements. To overcome these limitations, we developed LtrDetector, which uses signal-processing techniques. LtrDetector is easy to install and use. It is not species specific. It utilizes multi-core processors available in personal computers. It is more sensitive than other tools by 14.4%-50.8% while maintaining a low false positive rate on six plant genomes.

Genetics

Molecular Biology

0

Paper

Save

MeShClust: an intelligent tool for clustering DNA sequences

Benjamin James et al.Oct 23, 2017

Sequence clustering is a fundamental step in analyzing DNA sequences. Widely-used software tools for sequence clustering utilize greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Often times, a biologist may not know the exact sequence similarity. Therefore, clusters produced by these tools do not likely match the real clusters comprising the data if the provided parameter is inaccurate. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm, which has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, e.g. cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.

Genetics

Artificial Intelligence

0

Paper

Genetics

Artificial Intelligence

0

Save