ResearchHub | Open Science Community

DeProt: A protein language model with quantized structure and disentangled attention

Mingchen Li et al., Apr 17, 2024

Abstract: Protein language models have exhibited remarkable representational capabilities in various downstream tasks, notably in the prediction of protein functions. Despite their success, these models traditionally grapple with a critical shortcoming: the absence of explicit protein structure information, which is pivotal for elucidating the relationship between protein sequences and their functionality. Addressing this gap, we introduce DeProt, a Transformer-based protein language model designed to incorporate protein sequences and structures. It was pre-trained on millions of protein structures from diverse natural protein clusters. DeProt first serializes protein structures into residue-level local-structure sequences and uses a graph-neural-network-based autoencoder to vectorize the local structures. These vectors are then quantized into discrete structure tokens using a pre-trained codebook. Meanwhile, DeProt uses disentangled attention mechanisms to integrate residue sequences with structure token sequences. Despite having fewer parameters and less training data, DeProt significantly outperforms other state-of-the-art (SOTA) protein language models, including those that are structure-aware and evolution-based, particularly in the task of zero-shot mutant effect prediction across 217 deep mutational scanning assays. Furthermore, DeProt exhibits robust representational capabilities across a spectrum of supervised-learning downstream tasks. Our comprehensive benchmarks underscore the innovative nature of DeProt’s framework and its superior performance, suggesting its wide applicability in the realm of protein deep learning. The code, model weights, and all associated datasets are accessible at: https://github.com/ginnm/DeProt .
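The quantization step mentioned in the abstract can be illustrated with a generic nearest-neighbor codebook lookup. This is only a minimal sketch of vector quantization in general, not DeProt's implementation: the function names, the 2-dimensional toy vectors, and the three-entry codebook are all hypothetical, and the real model would use learned, higher-dimensional encoder outputs.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each vector to the index of its nearest codebook entry.

    The returned indices play the role of discrete "structure tokens".
    """
    tokens = []
    for v in vectors:
        nearest = min(
            range(len(codebook)),
            key=lambda k: squared_distance(v, codebook[k]),
        )
        tokens.append(nearest)
    return tokens


# Hypothetical toy data: one vector per residue, three codebook entries.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
residue_vectors = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.6, 0.1)]

structure_tokens = quantize(residue_vectors, codebook)
print(structure_tokens)  # [0, 1, 2, 1]
```

Each residue-level vector is replaced by a single integer, so the structure becomes a token sequence of the same length as the residue sequence, which is what allows it to be fed to a Transformer alongside the amino-acid tokens.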

Artificial Intelligence

Molecular Biology


Simple, Efficient, and Scalable Structure-Aware Adapter Boosts Protein Language Models

Yang Tan et al., Aug 7, 2024

Fine-tuning pretrained protein language models (PLMs) has emerged as a prominent strategy for enhancing downstream prediction tasks, often outperforming traditional supervised learning approaches. Parameter-efficient fine-tuning, a widely applied and powerful technique in natural language processing, could potentially enhance the performance of PLMs. However, the direct transfer to life science tasks is nontrivial due to the different training strategies and data forms. To address this gap, we introduce SES-Adapter, a simple, efficient, and scalable adapter method for enhancing the representation learning of PLMs. SES-Adapter incorporates PLM embeddings with structural sequence embeddings to create structure-aware representations. We show that the proposed method is compatible with different PLM architectures and across diverse tasks. Extensive evaluations are conducted on 2 types of folding structures with notable quality differences, 9 state-of-the-art baselines, and 9 benchmark data sets across distinct downstream tasks. Results show that, compared to vanilla PLMs, SES-Adapter improves downstream task performance by a maximum of 11% and an average of 3%, and significantly accelerates convergence by a maximum of 1034% and an average of 362%; training efficiency also improves by approximately 2 times. Moreover, positive optimization is observed even with low-quality predicted structures. The source code for SES-Adapter is available at https://github.com/tyang816/SES-Adapter.

Artificial Intelligence

Biochemistry


A general temperature-guided language model to design proteins of enhanced stability and activity

Fan Jiang et al., Nov 27, 2024

Designing protein mutants with both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model, which can suggest protein mutants with improved stability and activity without any prior experimental mutagenesis data for the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive ability compared to current state-of-the-art models on the public mutagenesis dataset across 283 protein assays. Furthermore, we validated PRIME’s predictions on five proteins, examining the impact of the top 30 to 45 single-site mutations on various protein properties, including thermal stability, antigen-antibody binding affinity, and the ability to polymerize nonnatural nucleic acid or resilience to extreme alkaline conditions. More than 30% of PRIME-recommended mutants exhibited superior performance compared to their premutation counterparts across all proteins and desired properties. We developed an efficient and effective method based on PRIME to rapidly obtain multisite mutants with enhanced activity and stability. Hence, PRIME demonstrates broad applicability in protein engineering.

Biochemistry

Biophysics


The gut–brain axis underlying hepatic encephalopathy in liver cirrhosis

Xiaolong He et al., Jan 8, 2025

Epidemiology

Immunology


Development and validation of AI models using LR and LightGBM for predicting distant metastasis in breast cancer: a dual-center study

Wenhai Zhang et al., Jun 14, 2024

Objective: This study aims to develop an artificial intelligence model utilizing clinical blood markers, ultrasound data, and breast biopsy pathological information to predict distant metastasis in breast cancer patients. Methods: Data from two medical centers were used. Clinical blood markers, ultrasound data, and breast biopsy pathological information were separately extracted and selected. Feature dimensionality reduction was performed using Spearman correlation and LASSO regression. Predictive models were constructed using logistic regression (LR) and LightGBM machine learning algorithms and validated on internal and external validation sets. Feature correlation analysis was conducted for both models. Results: The LR model achieved AUC values of 0.892, 0.816, and 0.817 for the training, internal validation, and external validation cohorts, respectively. The LightGBM model achieved AUC values of 0.971, 0.861, and 0.890 for the same cohorts. Clinical decision curve analysis showed a superior net benefit of the LightGBM model over the LR model in predicting distant metastasis in breast cancer. Key features identified included creatine kinase isoenzyme (CK-MB) and alpha-hydroxybutyrate dehydrogenase. Conclusion: This study developed an artificial intelligence model using clinical blood markers, ultrasound data, and pathological information to identify distant metastasis in breast cancer patients. The LightGBM model demonstrated superior predictive accuracy and clinical applicability, suggesting it as a promising tool for early diagnosis of distant metastasis in breast cancer.
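For readers unfamiliar with the AUC figures reported above, the metric can be computed as the fraction of (positive, negative) patient pairs in which the model scores the positive case higher, with ties counted as half. The sketch below uses made-up labels and scores purely for illustration; it does not reproduce the study's data, models, or evaluation code, and the function name is hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a positive case (e.g. metastasis), 0 for a negative case.
    scores: predicted risk for each case; higher means more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up toy data: three positive and three negative cases.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]

auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))  # 0.778
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the training and validation values in the abstract should be read.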

Artificial Intelligence

Oncology


Decoupling of the onset of anharmonicity between a protein and its surface water around 200 K

Lirong Zheng et al., Aug 19, 2024

The protein dynamical transition at ~200 K, where the biomolecule transforms from a harmonic, non-functional form to an anharmonic, functional state, has been thought to be slaved to the thermal activation of dynamics in its surface hydration water. Here, by selectively probing the dynamics of protein and hydration water using elastic neutron scattering and isotopic labeling, we found that the onset of anharmonicity in the two components around 200 K is decoupled. The one in protein is an intrinsic transition, whose characteristic temperature is independent of the instrumental resolution time, but varies with the biomolecular structure and the amount of hydration, while the one of water is merely a resolution effect.

Biochemistry

Biophysics


Conditional Protein Denoising Diffusion Generates Programmable Endonucleases

Bingxin Zhou et al., Aug 14, 2023

Abstract: Computational or deep learning-based functional protein generation methods address the urgent demand for novel biocatalysts, allowing for precise tailoring of functionalities to meet specific requirements. This emergence leads to the creation of highly efficient and specialized proteins with wide-ranging applications in scientific, technological, and biomedical domains. This study establishes a conditional protein diffusion model, namely CPDiffusion, to deliver diverse protein sequences with desired functions. While the model is free from extensive training data and the sampling process involves little guidance on the type of generated amino acids, CPDiffusion effectively secures essential highly conserved residues that are crucial for protein functionalities. We employed CPDiffusion and generated 27 artificially designed Argonaute proteins, programmable endonucleases applied for easy-to-implement and high-throughput screenings in gene editing and molecular diagnostics, that mutated approximately 200–400 amino acids with 40% sequence identities to those from nature. Experimental tests demonstrate the solubility of all 27 artificially designed proteins (APs), with 24 of them displaying DNA cleavage activity. Remarkably, 74% of active APs exhibited superior activity compared to the template protein, and the most effective one showed a nearly nine-fold enhancement of enzymatic activity. Moreover, 37% of APs exhibited enhanced thermostability. These findings emphasize CPDiffusion’s remarkable capability to generate long-sequence proteins in a single step while retaining or enhancing intricate functionality. This approach facilitates the design of intricate enzymes featuring multi-domain molecular structures through in silico generation and throughput, all accomplished without the need for supervision from labeled data.

Ecology

Biochemistry
