ResearchHub | Open Science Community

Fine-tuning Large Language Models for Chemical Text Mining

Wei Zhang et al., Jan 1, 2024

Extracting knowledge from complex and diverse chemical texts is a pivotal task for both experimental and computational chemists. The task is still considered to be extremely challenging due to the complexity of the chemical language and scientific literature. This study explored the power of fine-tuned large language models (LLMs) on five intricate chemical text mining tasks: compound entity recognition, reaction role labelling, metal-organic framework (MOF) synthesis information extraction, nuclear magnetic resonance spectroscopy (NMR) data extraction, and the conversion of reaction paragraphs to action sequences. The fine-tuned LLMs demonstrated impressive performance, significantly reducing the need for repetitive and extensive prompt engineering experiments. For comparison, we guided ChatGPT (GPT-3.5-turbo) and GPT-4 with prompt engineering and fine-tuned GPT-3.5-turbo as well as other open-source LLMs such as Mistral, Llama3, Llama2, T5, and BART. The results showed that the fine-tuned ChatGPT models excelled in all tasks. They achieved exact accuracy levels ranging from 69% to 95% on these tasks with minimal annotated data. They even outperformed those task-adaptive pre-training and fine-tuning models that were based on a significantly larger amount of in-domain data. Notably, fine-tuned Mistral and Llama3 show competitive abilities. Given their versatility, robustness, and low-code capability, leveraging fine-tuned LLMs as flexible and effective toolkits for automated data acquisition could revolutionize chemical knowledge extraction.
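The abstract reports "exact accuracy" but does not define the metric in this listing. One common reading is exact-match accuracy: a prediction counts as correct only if the whole extracted output equals the annotated reference. The sketch below illustrates that reading; the function name and the whitespace normalization are assumptions for illustration, not details taken from the paper.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference.

    Assumption: strings are compared after collapsing runs of
    whitespace; the paper may normalize differently or not at all.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(text):
        # Collapse whitespace so formatting differences do not count as errors.
        return " ".join(text.split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under this definition a partially correct extraction scores zero for that example, which is why exact-match figures tend to be stricter than token-level F1.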

Artificial Intelligence

Molecular Biology

SurfDock is a surface-informed diffusion generative model for reliable and accurate protein–ligand complex prediction

Duanhua Cao et al., Nov 27, 2024

Artificial Intelligence

Biochemistry

Transfer Learning Enhanced Graph Neural Network for Aldehyde Oxidase Metabolism Prediction and Its Experimental Application

Jiacheng Xiong et al., Jun 7, 2023

Aldehyde oxidase (AOX) is a molybdoenzyme that is primarily expressed in the liver and is involved in the metabolism of drugs and other xenobiotics. AOX-mediated metabolism can result in unexpected outcomes, such as the production of toxic metabolites and high metabolic clearance, which can lead to the clinical failure of novel therapeutic agents. Computational models can assist medicinal chemists in rapidly evaluating the AOX metabolic risk of compounds during the early phases of drug discovery and provide valuable clues for manipulating AOX-mediated metabolism liability. In this study, we developed a novel graph neural network called AOMP for predicting AOX-mediated metabolism. AOMP integrated the tasks of metabolic substrate/non-substrate classification and metabolic site prediction, while utilizing transfer learning from 13C nuclear magnetic resonance data to enhance its performance on both tasks. AOMP significantly outperformed the benchmark methods in both cross-validation and external testing. Using AOMP, we systematically assessed the AOX-mediated metabolism of common fragments in kinase inhibitors and successfully identified four new scaffolds with AOX metabolism liability, which were validated through in vitro experiments. Furthermore, for the convenience of the community, we established the first online service for AOX metabolism prediction based on AOMP, which is freely available at https://aomp.alphama.com.cn.

Biochemistry

Pharmacology

Computational target fishing by mining transcriptional data using a novel Siamese spectral-based graph convolutional network

Feisheng Zhong et al., Apr 3, 2020

Computational target fishing aims to investigate the mechanism of action or the side effects of bioactive small molecules. Unfortunately, conventional ligand-based computational methods only explore a confined chemical space, and structure-based methods are limited by the availability of crystal structures. Moreover, these methods cannot describe cellular context-dependent effects and are thus not useful for exploring the targets of drugs in specific cells. To address these challenges, we propose a novel Siamese spectral-based graph convolutional network (SSGCN) model for inferring the protein targets of chemical compounds from gene transcriptional profiles. Although the gene signature of a compound perturbation only provides indirect clues of the interacting targets, the SSGCN model was successfully trained to learn from known compound-target pairs by uncovering the hidden correlations between compound perturbation profiles and gene knockdown profiles. Using a benchmark set, the model achieved impressive target inference results compared with previous methods such as Connectivity Map and ProTINA. More importantly, the powerful generalization ability of the model observed with the external LINCS phase II dataset suggests that the model is an efficient target fishing or repositioning tool for bioactive compounds.

Artificial Intelligence

Molecular Biology

Identify compound-protein interaction with knowledge graph embedding of perturbation transcriptomics

Shengkun Ni et al., Apr 12, 2024

The emergence of perturbation transcriptomics provides a new perspective and opportunity for drug discovery, but existing analysis methods suffer from inadequate performance and limited applicability. In this work, we present PertKGE, a method designed to improve compound-protein interaction with knowledge graph embedding of perturbation transcriptomics. PertKGE incorporates diverse regulatory elements and accounts for multi-level regulatory events within biological systems, leading to significant improvements compared to existing baselines in two critical “cold-start” settings: inferring binding targets for new compounds and conducting virtual ligand screening for new targets. We further demonstrate the pivotal role of incorporating multi-level regulatory events in alleviating dataset bias. Notably, it enables the identification of ectonucleotide pyrophosphatase/phosphodiesterase-1 as the target responsible for the unique anti-tumor immunotherapy effect of tankyrase inhibitor K-756, and the discovery of five novel hits targeting the emerging cancer therapeutic target, aldehyde dehydrogenase 1B1, with a remarkable hit rate of 10.2%. These findings highlight the potential of PertKGE to accelerate drug discovery by elucidating mechanisms of action and identifying novel therapeutic compounds.

Artificial Intelligence

Biochemistry

FAPM: Functional Annotation of Proteins using Multi-Modal Models Beyond Structural Modeling

Wenpei Xiang et al., May 10, 2024

Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and “tail labels” with few known examples. Unlike previous methods that mainly focused on protein sequence features, we use a pretrained large natural language model to understand the semantic meaning of protein labels. Specifically, we introduce FAPM, a contrastive multi-modal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM’s flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation. The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo.

Artificial Intelligence

Pharmacology
