ResearchHub | Open Science Community

MS2Query: Reliable and Scalable MS² Mass Spectral-based Analogue Search

Niek Jonge et al., Jul 23, 2022

Abstract: Metabolomics-driven discoveries in biological samples remain hampered by the grand challenge of metabolite annotation and identification. Only a few metabolites have an annotated spectrum in spectral libraries; hence, searching only for exact library matches generally returns few hits. An attractive alternative is to search for so-called analogues as a starting point for structural annotation; analogues are library molecules that are not exact matches but display high chemical similarity. However, current analogue search implementations are not yet very reliable and are relatively slow. Here, we present MS2Query, a machine learning-based tool that integrates mass spectral embedding-based chemical similarity predictors (Spec2Vec and MS2Deepscore) as well as detected precursor masses to rank potential analogues and exact matches. Benchmarking MS2Query on reference mass spectra and experimental case studies demonstrates improved reliability and scalability. MS2Query thereby offers opportunities for further increasing the annotation rate of complex metabolite mixtures and for discovering new biology.
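
The ranking idea described in the abstract (combining an embedding-based similarity with the precursor mass) can be illustrated with a minimal sketch. This is not MS2Query's model or API: the weighted sum, the `mass_scale` and `weight` parameters, the dictionary layout, and the toy embeddings are all assumptions made for illustration, standing in for the trained machine learning model and the Spec2Vec/MS2Deepscore embeddings the abstract mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, library, mass_scale=50.0, weight=0.8):
    """Rank library entries by a blended score (hypothetical scoring rule).

    The score mixes embedding similarity with a term that decays as the
    precursor m/z difference grows. A real tool would learn this
    combination from data rather than use fixed weights.
    """
    scored = []
    for entry in library:
        sim = cosine(query["embedding"], entry["embedding"])
        mass_diff = abs(query["precursor_mz"] - entry["precursor_mz"])
        mass_term = math.exp(-mass_diff / mass_scale)
        score = weight * sim + (1 - weight) * mass_term
        scored.append((score, entry["name"]))
    # Highest score first.
    return sorted(scored, reverse=True)


# Toy data: embeddings and masses are invented for the example.
query = {"embedding": [1.0, 0.0, 1.0], "precursor_mz": 300.0}
library = [
    {"name": "exact_like", "embedding": [1.0, 0.0, 1.0], "precursor_mz": 300.0},
    {"name": "analogue_like", "embedding": [1.0, 0.2, 0.9], "precursor_mz": 314.0},
    {"name": "unrelated", "embedding": [0.0, 1.0, 0.0], "precursor_mz": 520.0},
]
ranking = rank_candidates(query, library)
for score, name in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup the entry with an identical embedding and mass ranks first, the close analogue second, and the unrelated entry last; the point is only that exact matches and analogues can share one ranked list.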

Artificial Intelligence

Molecular Biology

NPOmix: a machine learning classifier to connect mass spectrometry fragmentation data to biosynthetic gene clusters

Tiago Leão et al., Oct 6, 2021

Abstract: Microbial specialized metabolites are an important source of, and inspiration for, many pharmaceutical and biotechnological products, and they play key roles in ecological processes. However, most bioactivity-guided isolation and identification methods widely employed in metabolite discovery programs do not explore the full biosynthetic potential of an organism. Untargeted metabolomics using liquid chromatography coupled with tandem mass spectrometry is an efficient technique to access metabolites from fractions and even environmental crude extracts. Nevertheless, metabolomics is limited in predicting structures or bioactivities for cryptic metabolites. Linking the biosynthetic potential inferred from (meta)genomics to the specialized metabolome would accelerate drug discovery programs. Here, we present a k-nearest neighbor classifier to systematically connect mass spectrometry fragmentation spectra to their corresponding biosynthetic gene clusters (independent of their chemical compound class). Our pipeline offers an efficient method to link biosynthetic genes to the known, analogous, or cryptic metabolites that they encode, as detected via mass spectrometry from bacterial cultures or environmental microbiomes. Using paired data sets that include validated gene-mass spectral links from the Paired Omics Data Platform, we demonstrate this approach by automatically linking 18 previously known mass spectra to their corresponding biosynthetic genes, which had previously been validated experimentally (i.e., via NMR or genetic engineering). Finally, we demonstrate that this new approach is a substantial step towards making in silico (and even de novo) structure predictions for peptidic metabolites and a glycosylated terpene. Altogether, we conclude that NPOmix minimizes the need for culturing and facilitates specialized metabolite isolation and structure elucidation based on integrative omics mining.

Significance: The pace of natural product discovery has remained relatively constant over the last two decades. At the same time, there is an urgent need to find new therapeutics to fight antibiotic-resistant bacteria, cancer, tropical parasites, pathogenic viruses, and other severe diseases. Here, we introduce a new machine learning algorithm that can efficiently connect metabolites to their biosynthetic genes. Our Natural Products Mixed Omics (NPOmix) tool provides access to genomic information for bioactivity, class, (partial) structure, and stereochemistry predictions to prioritize relevant metabolite products and facilitate their structural elucidation. Our approach can be applied to biosynthetic genes from bacteria (used in this study), fungi, algae, and plants where (meta)genomes are paired with corresponding mass fragmentation data.
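
As a rough illustration of the k-nearest neighbor classification the abstract refers to, the sketch below assigns a label to a query by majority vote among its most similar training examples. It is a generic k-NN, not the NPOmix pipeline: the set-based features, the Jaccard similarity, the `gcf_A`/`gcf_B` labels, and the function names are hypothetical choices made for the example.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity between two sets of feature identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_predict(query_features, training, k=3):
    """Return the majority label among the k most similar training items.

    `training` is a list of (feature_set, label) pairs. Ties in similarity
    keep their original order because Python's sort is stable.
    """
    neighbours = sorted(
        training,
        key=lambda item: jaccard(query_features, item[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy training set: feature identifiers and labels are invented.
training = [
    ({"s1", "s2", "s3"}, "gcf_A"),
    ({"s1", "s2"}, "gcf_A"),
    ({"s2", "s3", "s4"}, "gcf_A"),
    ({"s7", "s8"}, "gcf_B"),
    ({"s8", "s9"}, "gcf_B"),
]
prediction = knn_predict({"s1", "s3"}, training, k=3)
print(prediction)
```

The choice of `k` and of the similarity measure are the main design decisions in any k-NN setup; the values used here are arbitrary.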

Biochemistry

Pharmacology

iPRESTO: automated discovery of biosynthetic sub-clusters linked to specific natural product substructures

Joris Louwen et al., Aug 6, 2022

Abstract: Microbial specialised metabolism is full of valuable natural products that are applied clinically, agriculturally, and industrially. The genes that encode their biosynthesis are often physically clustered on the genome in biosynthetic gene clusters (BGCs). Many BGCs consist of multiple groups of co-evolving genes called sub-clusters, each responsible for the biosynthesis of a specific chemical moiety in a natural product. Sub-clusters therefore provide an important link between the structure of a natural product and its BGC, which can be leveraged for predicting natural product structures from sequence, as well as for linking chemical structures and metabolomics-derived mass features to BGCs. While some initial computational methodologies have been devised for sub-cluster detection, current approaches are not scalable, have only been run on small and outdated datasets, or produce an impractically large number of possible sub-clusters to mine through. Here, we constructed a scalable method for unsupervised sub-cluster detection, called iPRESTO, based on topic modelling and statistical analysis of co-occurrence patterns of enzyme-coding protein families. iPRESTO was used to mine sub-clusters across 150,000 prokaryotic BGCs from antiSMASH-DB. After annotating a fraction of the resulting sub-cluster families, we could predict a substructure for 16% of the antiSMASH-DB BGCs. Additionally, our method was able to confirm 83% of the experimentally characterised sub-clusters in MIBiG reference BGCs. Based on iPRESTO-detected sub-clusters, we could correctly identify the BGCs for xenorhabdin and salbostatin biosynthesis (which had not yet been annotated in BGC databases), as well as propose a candidate BGC for akashin biosynthesis. Additionally, we show for a collection of 145 actinobacteria how substructures can aid in linking BGCs to molecules by correlating iPRESTO-detected sub-clusters to MS/MS-derived Mass2Motifs substructure patterns. This work paves the way for deeper functional and structural annotation of microbial BGCs by improved linking of orphan molecules to their cognate gene clusters, thus facilitating accelerated natural product discovery.

Author summary: In this work, we introduce iPRESTO, a tool for scalable unsupervised sub-cluster detection in biosynthetic gene clusters. This detection is important because these biosynthetic hotspots encode many products useful for humanity, such as antibiotics, antitumor agents, or herbicides. Recent technological developments have made identification of biosynthetic loci in genomes straightforward. Yet, methods to connect these inferred biosynthetic genes to the final chemical structures of their cognate metabolites are largely lacking. Being able to reliably predict parts of the final product would constitute a real step forward in natural product genome mining. Therefore, we focussed on constructing a tool to systematically detect and annotate small regions called sub-clusters, which code for the biosynthesis of substructures in the final product, across all genomically inferred biosynthetic diversity. iPRESTO makes it possible to query unknown biosynthetic regions and infer which substructures are present in their metabolic products. This will facilitate more effective prioritization of chemical novelty, as well as linking activities from bioassays and microbiome-associated phenotypes to the metabolites responsible for them.
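
To make the phrase "statistical analysis of co-occurrence patterns" concrete, the sketch below tests whether two protein families appear together in more clusters than chance would suggest. It is not iPRESTO's implementation and does not cover the topic modelling part: the one-sided hypergeometric test, the `alpha` threshold, the function names, and the `famA` to `famE` toy data are assumptions chosen for illustration, and no multiple-testing correction is applied.

```python
from itertools import combinations
from math import comb


def cooccurrence_pvalue(n_total, n_a, n_b, n_both):
    """One-sided hypergeometric p-value P(X >= n_both).

    X is the number of clusters containing both families if family B's
    n_b occurrences were spread at random over n_total clusters, n_a of
    which contain family A.
    """
    upper = min(n_a, n_b)
    favourable = sum(
        comb(n_a, x) * comb(n_total - n_a, n_b - x)
        for x in range(n_both, upper + 1)
    )
    return favourable / comb(n_total, n_b)


def significant_pairs(clusters, alpha=0.05):
    """List family pairs that co-occur more often than expected by chance."""
    n_total = len(clusters)
    families = sorted(set().union(*clusters))
    counts = {f: sum(f in c for c in clusters) for f in families}
    hits = []
    for a, b in combinations(families, 2):
        both = sum(a in c and b in c for c in clusters)
        if both == 0:
            continue
        p = cooccurrence_pvalue(n_total, counts[a], counts[b], both)
        if p < alpha:
            hits.append((a, b, p))
    return hits


# Toy input: each set lists the (invented) protein families in one cluster.
clusters = [
    {"famA", "famB", "famC"},
    {"famA", "famB"},
    {"famA", "famB", "famD"},
    {"famA", "famB", "famC"},
    {"famC", "famD"},
    {"famD"},
    {"famC"},
    {"famD", "famE"},
]
hits = significant_pairs(clusters)
for a, b, p in hits:
    print(a, b, round(p, 4))
```

Here `famA` and `famB` always occur together, so they are the only pair flagged; a candidate sub-cluster could then be assembled from such linked families.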

Genetics

Biochemistry
