ResearchHub | Open Science Community

Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification

Oliver Schwengers et al., Nov 5, 2021

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta . An accompanying web version is available at https://bakta.computational.bio .
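As a rough illustration of what alignment-free sequence identification can look like, the sketch below identifies a protein by looking up a hash digest of its full sequence instead of aligning it against a database. The abstract does not describe the mechanism in detail, so the MD5 digest, the `KNOWN_PROTEINS` table and its identifiers are assumptions made for illustration only; this is not Bakta's database or API.

```python
import hashlib

def seq_digest(protein: str) -> str:
    """Return a hex MD5 digest of a normalized amino acid sequence."""
    # Normalize case and drop a trailing stop symbol so equal proteins hash equally.
    normalized = protein.strip().upper().rstrip("*")
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# Hypothetical lookup table: digest -> (product, database cross-reference).
KNOWN_PROTEINS = {
    seq_digest("MKTAYIAKQR"): ("example protein A", "DB:0001"),
    seq_digest("MSLNQEELLK"): ("example protein B", "DB:0002"),
}

def identify(protein: str):
    """Exact-match identification: one dictionary lookup, no alignment."""
    return KNOWN_PROTEINS.get(seq_digest(protein))

if __name__ == "__main__":
    print(identify("mktayiakqr*"))  # identical sequence after normalization
    print(identify("MKTAYIAKQK"))   # one residue differs, so no exact match
```

A lookup of this kind only recognizes identical sequences, so a pipeline built on it would presumably still need a slower fallback for proteins without an exact match.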

Genetics

Ecology

0

Paper


Platon: identification and characterization of bacterial plasmid contigs in short-read draft assemblies exploiting protein-sequence-based replicon distribution scores

Oliver Schwengers et al., Apr 23, 2020

ABSTRACT Plasmids are extrachromosomal genetic elements replicating independently of the chromosome which play a vital role in the environmental adaptation of bacteria. Due to potential mobilization or conjugation capabilities, plasmids are important genetic vehicles for antimicrobial resistance genes and virulence factors with huge and increasing clinical implications. They are therefore subject to large genomic studies within the scientific community worldwide. As a result of rapidly improving next generation sequencing methods, the amount of sequenced bacterial genomes is constantly increasing, in turn raising the need for specialized tools to (i) extract plasmid sequences from draft assemblies, (ii) derive their origin and distribution, and (iii) further investigate their genetic repertoire. Recently, several bioinformatic methods and tools have emerged to tackle this issue; however, a combination of both high sensitivity and specificity in plasmid sequence identification is rarely achieved in a taxon-independent manner. In addition, many software tools are not appropriate for large high-throughput analyses or cannot be included into existing software pipelines due to their technical design or software implementation. In this study, we investigated differences in the replicon distributions of protein-coding genes on a large scale as a new approach to distinguish plasmid-borne from chromosome-borne contigs. We defined and computed statistical discrimination thresholds for a new metric: the replicon distribution score (RDS) which achieved an accuracy of 96.6%. The final performance was further improved by the combination of the RDS metric with heuristics exploiting several plasmid specific higher-level contig characterizations. We implemented this workflow in a new high-throughput taxon-independent bioinformatics software tool called Platon for the recruitment and characterization of plasmid-borne contigs from short-read draft assemblies. 
Compared to PlasFlow, Platon achieved a higher accuracy (97.5%) and more balanced predictions (F1=82.6%) tested on a broad range of bacterial taxa and better or equal performance against the targeted tools PlasmidFinder and PlaScope on sequenced E. coli isolates. Platon is available at: platon.computational.bio

Data Summary

Platon was developed as a Python 3 command line application for Linux. The complete source code and documentation is available on GitHub under a GPL3 license: https://github.com/oschwengers/platon and platon.computational.bio .
All database versions are hosted at Zenodo: DOI 10.5281/zenodo.3349651.
Platon is available via bioconda package platon
Platon is available via PyPI package cb-platon
Bacterial representative sequences for UniProt’s UniRef90 protein clusters, complete bacterial genome sequences from the NCBI RefSeq database, complete plasmid sequences from the NCBI genomes plasmid section, created artificial contigs, RDS threshold metrics and raw protein replicon hit counts used to create and evaluate the marker protein sequence database are hosted at Zenodo: DOI 10.5281/zenodo.3759169
24 Escherichia coli isolates sequenced with short read (Illumina MiSeq) and long read sequencing technologies (Oxford Nanopore Technology GridION platform) used for real data benchmarks are available under the following NCBI BioProjects: PRJNA505407, PRJNA387731

Impact Statement

Plasmids play a vital role in the spread of antibiotic resistance and pathogenicity genes. The increasing numbers of clinical outbreaks involving resistant pathogens worldwide pushed the scientific community to increase their efforts to comprehensively investigate bacterial genomes. Due to the maturation of next-generation sequencing technologies, nowadays entire bacterial genomes including plasmids are sequenced in huge scale. To analyze draft assemblies, a mandatory first step is to separate plasmid from chromosome contigs.
Recently, many bioinformatic tools have emerged to tackle this issue. Unfortunately, several tools are implemented only as interactive or web-based tools disabling them for necessary high-throughput analysis of large data sets. Other tools providing such a high-throughput implementation however often come with certain drawbacks, e.g. providing taxon-specific databases only, not providing actionable, i.e. true binary classification or achieving biased classification performances towards either sensitivity or specificity. Here, we introduce the tool Platon implementing a new replicon distribution-based approach combined with higher-level contig characterizations to address the aforementioned issues. In addition to the plasmid detection within draft assemblies, Platon provides the user with valuable information on certain higher-level contig characterizations. We show that Platon provides a balanced classification performance as well as a scalable implementation for high-throughput analyses. We therefore consider Platon to be a powerful, species-independent and flexible tool to scan large amounts of bacterial whole-genome sequencing data for their plasmid content.
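The abstract does not state how the replicon distribution score is computed, so the following is only a toy sketch of the general idea: score each protein by how skewed its known occurrences are towards plasmids or chromosomes, average the scores over a contig, and compare the result with a threshold. The scoring function, the hit counts in `REPLICON_HITS` and the threshold are all invented for illustration and are not Platon's formula, data or thresholds.

```python
from statistics import mean

# Hypothetical per-protein counts: protein id -> (plasmid hits, chromosome hits).
REPLICON_HITS = {
    "repA": (90, 10),
    "traG": (80, 20),
    "rpoB": (1, 99),
    "gyrA": (2, 98),
}

def protein_score(plasmid_hits: int, chromosome_hits: int) -> float:
    """Toy score in [-1, 1]: positive means mostly observed on plasmids."""
    total = plasmid_hits + chromosome_hits
    return (plasmid_hits - chromosome_hits) / total

def contig_score(protein_ids) -> float:
    """Average the scores of all proteins on a contig that have known counts."""
    scores = [protein_score(*REPLICON_HITS[p]) for p in protein_ids if p in REPLICON_HITS]
    return mean(scores) if scores else 0.0

def classify(protein_ids, threshold: float = 0.0) -> str:
    """Binary call against an arbitrary, illustrative threshold."""
    return "plasmid" if contig_score(protein_ids) > threshold else "chromosome"

if __name__ == "__main__":
    print(classify(["repA", "traG"]))  # proteins skewed towards plasmids
    print(classify(["rpoB", "gyrA"]))  # proteins skewed towards chromosomes
```

Moving the threshold up or down trades sensitivity against specificity, which is the balance the abstract refers to.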

Genetics

Molecular Biology

6

Paper


Bakta: Rapid & standardized annotation of bacterial genomes via alignment-free sequence identification

Oliver Schwengers et al., Sep 2, 2021

Abstract Command line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command line software pipelines heavily depend on taxon specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command line software tool for the robust, taxon-independent, thorough and nonetheless fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross references. Annotation results are exported in GFF3 and INSDC-compliant flat files as well as comprehensive JSON files facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references whilst providing comparable wall clock runtimes. Bakta is implemented in Python 3 and runs on MacOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta . An accompanying web version is available at https://bakta.computational.bio .

Genetics

Ecology

15

Paper


ReferenceSeeker: rapid determination of appropriate reference genomes

Oliver Schwengers et al., Dec 6, 2019

Abstract Summary: The large and growing number of microbial genomes available in public databases makes the optimal selection of reference genomes necessary for many in-silico analyses, e.g. single nucleotide polymorphism detection, scaffolding and comparative genomics, increasingly difficult. Here, we present ReferenceSeeker, a novel command line tool combining a fast kmer profile-based database lookup of candidate reference genomes with subsequent calculation of highly specific average nucleotide identity (ANI) values for the rapid determination of appropriate reference genomes. Pre-built databases for bacteria, archaea, fungi, protozoa and viruses based on the RefSeq database are provided for download. Availability and Implementation: ReferenceSeeker is open source software implemented in Python. Source code and binaries are freely available for download at https://github.com/oschwengers/referenceseeker under the GNU GPL3 license. Contact: referenceseeker@computational.bio
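To make the first stage of this two-step idea concrete, the sketch below ranks candidate references by the similarity of their k-mer content to a query. It uses plain Jaccard similarity over exact k-mer sets as a simple stand-in for the k-mer profile lookup, and it does not implement the subsequent ANI calculation. The function names, the toy sequences and the choice of k are assumptions for illustration, not ReferenceSeeker's implementation.

```python
def kmers(seq: str, k: int = 4) -> set:
    """Return the set of all overlapping k-mers in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two k-mer sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_candidates(query: str, references: dict, k: int = 4, top: int = 2):
    """Rank reference sequences by k-mer similarity to the query, best first."""
    q = kmers(query, k)
    scored = [(name, jaccard(q, kmers(seq, k))) for name, seq in references.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]

# Toy sequences standing in for genomes (hypothetical data).
REFERENCES = {
    "ref_close": "ACGTACGTTAGCTAGGCTAACG",
    "ref_far": "TTTTGGGGCCCCAAAATTTTGG",
}

if __name__ == "__main__":
    for name, score in rank_candidates("ACGTACGTTAGCTAGGCTAACC", REFERENCES):
        print(name, round(score, 3))
```

In a two-stage design like the one described, a cheap screen of this kind would narrow the candidate list so that the more expensive identity calculation only runs on a handful of genomes.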

Genetics

Ecology

0

Paper


Nanoliter-scale selection of optimized bioengineered peptide antibiotics that rescue mice with bacterial lung infection

Nils Böhringer et al., Jun 1, 2024

Abstract Increasing numbers of multi-drug resistant pathogens call for new chemical scaffolds, addressing novel targets, that can serve as lead structures for the development of life-saving drugs. For antibiotics, natural product-inspired molecules represent a most promising resource. Natural products evolved to high chemical complexity and occupy a chemical space different than synthetic libraries. However, clinical translation of promising natural products is often impeded by their relative inaccessibility to medicinal chemistry optimization, e.g. iterative synthesis of large series of derivatives. Here, this limitation is addressed with a randomized library of bicyclic heptapeptides based on the natural product darobactin that hits the clinically not addressed target BamA. Variants of the ribosomally synthesized and post-translationally modified peptides were generated using heterologous mutasynthesis. A parallelized screening assay is adapted in nanoliter-scale beads to test the darobactin derivatives against our sensor strain. Loss of fluorescence sorting prioritized 563 events out of the analyzed ∼500k beads. Re-testing confirmed 48 hit events, of which 40 proved to produce distinct darobactin-type molecules. Most promising structures were isolated and the growth inhibitory effects against Gram-negative pathogens validated. One of our current frontrunner compounds (i.e., darobactin B) was reinforced by the randomized screen. While microbiological investigations of the new derivatives is ongoing, darobactin B was profiled in later tier assays and compared to another promising, rationally-designed analog (i.e., darobactin B9, “D22”). Early ADMET profiling and efficacy tests in a mouse pneumonia model were performed. Darobactin B reduced bacterial load of Pseudomonas aeruginosa and Klebsiella pneumoniae by intraperitoneal, as well as intratracheal administration.
Our study showcases the potential of mutasynthetic libraries for high-throughput screening and identification of functional peptides for drug lead discovery.

Ecology

Artificial Intelligence

0

Paper


BakRep - A searchable large-scale web repository for bacterial genomes, characterizations and metadata

Linda Fenske et al., Jun 2, 2024

Abstract Bacteria are fascinating research objects in many disciplines for countless reasons, and whole-genome sequencing has become the paramount methodology to advance our microbiological understanding. Meanwhile, access to cost-effective sequencing platforms has accelerated bacterial whole-genome sequencing to unprecedented levels introducing new challenges in terms of data accessibility, computational demands, heterogeneity of analysis workflows, and thus, ultimately its scientific usability. To that end, Blackwell et al. released a uniformly processed set of 661,405 bacterial genome assemblies obtained from the European Nucleotide Archive as of November 2018. Building on these accomplishments, we conducted further genome-based analyses like taxonomic classification, MLST subtyping and annotation of all genomes. Here we present BakRep, a searchable large-scale web repository of these genomes enriched with consistent genome characterizations and original metadata. The platform provides a flexible search engine combining taxonomic, genomic and metadata information, as well as interactive elements to visualize genomic features. Furthermore, all results can be downloaded for offline analyses via an accompanying command line tool. The web repository is accessible via https://bakrep.computational.bio .

Ecology

Molecular Biology

0

Paper


sORFdb – A database for sORFs, small proteins, and small protein families in bacteria

Julian Hahnfeld et al., Jun 22, 2024

Small proteins with fewer than 100, particularly fewer than 50, amino acids are still largely unexplored. Nonetheless, they represent an essential part of bacteria's often neglected genetic repertoire. In recent years, the development of ribosome profiling protocols has led to the detection of an increasing number of previously unknown small proteins. Despite this, they are overlooked in many cases by automated genome annotation pipelines, and often, no functional descriptions can be assigned due to a lack of known homologs. To understand and overcome these limitations, the current abundance of small proteins in existing databases was evaluated, and a new dedicated database for small proteins and their potential functions, called 'sORFdb', was created. To this end, small proteins were extracted from annotated bacterial genomes in the GenBank database. Subsequently, they were quality-filtered, compared, and complemented with proteins from Swiss-Prot, UniProt, and SmProt to ensure reliable identification and characterization of small proteins. Families of similar small proteins were created using bidirectional best BLAST hits followed by Markov clustering. Analysis of small proteins in public databases revealed that their number is still limited due to historical and technical constraints. Additionally, functional descriptions were often missing despite the presence of potential homologs. As expected, a taxonomic bias was evident in over-represented clinically relevant bacteria. This new and comprehensive database is accessible via a feature-rich website providing specialized search features for sORFs and small proteins of high quality. Additionally, small protein families with Hidden Markov Models and information on taxonomic distribution and other physicochemical properties are available. 
In conclusion, the novel small protein database sORFdb is a specialized, taxonomy-independent database that improves the findability and classification of sORFs, small proteins, and their functions in bacteria, thereby supporting their future detection and consistent annotation. All sORFdb data is freely accessible via https://sorfdb.computational.bio.
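The family-building step mentioned above starts from bidirectional best hits. A minimal sketch of that criterion follows, assuming pairwise similarity scores are already available as a dictionary; the `SCORES` values and protein names are hypothetical, and the subsequent Markov clustering is not shown.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its highest-scoring subject, ignoring self-hits."""
    best = {}
    for (query, subject), score in scores.items():
        if query == subject:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(scores: dict) -> set:
    """Return unordered pairs in which each protein is the other's best hit."""
    best = best_hits(scores)
    return {frozenset((q, s)) for q, s in best.items() if best.get(s) == q}

# Hypothetical pairwise scores: (query, subject) -> bit score.
SCORES = {
    ("p1", "p2"): 95.0, ("p2", "p1"): 94.0,
    ("p1", "p3"): 40.0, ("p3", "p1"): 41.0,
    ("p3", "p4"): 30.0, ("p4", "p3"): 88.0,
}

if __name__ == "__main__":
    for pair in bidirectional_best_hits(SCORES):
        print(sorted(pair))
```

Here only p1 and p2 form a pair: p4's best hit is p3, but p3's best hit is p1, so that relationship is one-directional and is dropped.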

Genetics

Ecology

0

Paper


ASA³P: An automatic and scalable pipeline for the assembly, annotation and higher level analysis of closely related bacterial isolates

Oliver Schwengers et al., May 29, 2019

Whole genome sequencing of bacteria has become daily routine in many fields. Advances in DNA sequencing technologies and continuously dropping costs have resulted in a tremendous increase in the amounts of available sequence data. However, comprehensive in-depth analysis of the resulting data remains an arduous and time consuming task. In order to keep pace with these promising but challenging developments and to transform raw data into valuable information, standardized analyses and scalable software tools are needed. Here, we introduce ASA³P, a fully automatic, locally executable and scalable assembly, annotation and analysis pipeline for bacterial genomes. The pipeline automatically executes necessary data processing steps, i.e. quality clipping and assembly of raw sequencing reads, scaffolding of contigs and annotation of the resulting genome sequences. Furthermore, ASA³P conducts comprehensive genome characterizations and analyses, e.g. taxonomic classification, detection of antibiotic resistance genes and identification of virulence factors. All results are presented via an HTML5 user interface providing aggregated information, interactive visualizations and access to intermediate results in standard bioinformatics file formats. We distribute ASA³P in two versions: a locally executable Docker container for small-to-medium-scale projects and an OpenStack based cloud computing version able to automatically create and manage self-scaling compute clusters. Thus, automatic and standardized analysis of hundreds of bacterial genomes becomes feasible within hours. The software and further information is available at: http://asap.computational.bio.

Genetics

Ecology

0

Paper


Genome-based development and clinical evaluation of a customized LAMP panel to rapidly detect, quantify, and determine antibiotic sensitivity of Escherichia coli in native urine samples from urological patients

Moritz Fritzenwanker et al., Jan 7, 2025

Abstract Purpose: We designed and tested a point of care test panel to detect E. coli and antibiotic susceptibility in urine samples from patients at the point of care in the urological department. The aim of this approach is to facilitate choosing an appropriate antibiotic for urinary tract infections (UTI) at first presentation in the context of increasing antibiotic resistance in uropathogens worldwide. Methods: We analyzed 162 E. coli isolates from samples from a university urological department to determine phenotypic and genotypic resistance data. With this data we created customized LAMP (loop-mediated isothermal amplification) panels for a commercial machine with which to detect and possibly quantify E. coli and six antibiotic resistance determinants. In a second step we tested these panel(s) for diagnostic accuracy on 1596 urine samples and compared with routine microbiological culture. Results: E. coli was detected with 95.4% sensitivity and 96.1% specificity. Dynamics of the LAMP amplification could be used to gauge bacterial loads in the samples. Antibiotic sensitivity was detected with good negative (sensitive) predictive values: ampicillin 92.8%, ampicillin/sulbactam 96.4%, cefuroxime 92.8%, cefotaxime 97.8%, trimethoprim/sulfamethoxazole 96.5%, ciprofloxacin 96.8%. Conclusion: The LAMP panel provided E. coli detection and sensitivity information within one hour and thus could principally guide initial antibiotic therapy upon patients presenting with UTI. The panel helps to select initial adequate antibiotic therapy as well as providing diagnostic stewardship. Follow up investigations will expand the test system to other uropathogens.
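For readers less familiar with the reported measures, the sketch below shows how sensitivity, specificity and negative predictive value follow from the four cells of a confusion matrix, taking culture as the reference. The counts in the example are invented to illustrate the arithmetic and are not data from the study.

```python
def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard diagnostic accuracy measures from confusion counts."""
    return {
        # Share of reference-positive samples that the test also calls positive.
        "sensitivity": tp / (tp + fn),
        # Share of reference-negative samples that the test also calls negative.
        "specificity": tn / (tn + fp),
        # Share of negative test results that are truly negative.
        "npv": tn / (tn + fn),
    }

if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formulas.
    for name, value in diagnostic_metrics(tp=90, fp=5, tn=95, fn=10).items():
        print(f"{name}: {value:.3f}")
```

Unlike sensitivity and specificity, the negative predictive value depends on how common the condition is among the tested samples, so it does not transfer directly between settings with different prevalence.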

Genetics

Epidemiology

0

Paper

