ResearchHub | Open Science Community

End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman

Samantha Petti et al., Oct 24, 2023

Abstract: Multiple Sequence Alignments (MSAs) of homologous sequences contain information on structural and functional constraints and their evolutionary histories. Despite their importance for many downstream tasks, such as structure prediction, MSA generation is often treated as a separate pre-processing step, without any guidance from the application it will be used for. Here, we implement a smooth and differentiable version of the Smith-Waterman pairwise alignment algorithm that enables jointly learning an MSA and a downstream machine learning system in an end-to-end fashion. To demonstrate its utility, we introduce SMURF (Smooth Markov Unaligned Random Field), a new method that jointly learns an alignment and the parameters of a Markov Random Field for unsupervised contact prediction. We find that SMURF learns MSAs that mildly improve contact prediction on a diverse set of protein and RNA families. As a proof of concept, we demonstrate that by connecting our differentiable alignment module to AlphaFold and maximizing predicted confidence, we can learn MSAs that improve structure predictions over the initial MSAs. Interestingly, the alignments that improve AlphaFold predictions are self-inconsistent and can be viewed as adversarial. This work highlights the potential of differentiable dynamic programming to improve neural network pipelines that rely on an alignment and the potential dangers of relying on black-box methods for optimizing predictions of protein sequences.
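The idea of a "smooth" Smith-Waterman can be illustrated with a small sketch. It assumes one common way of smoothing a dynamic program, namely replacing every max in the recursion with a temperature-scaled log-sum-exp; the abstract does not spell out the authors' exact formulation, so the function names, the linear gap penalty, and the match/mismatch scores below are illustrative assumptions rather than the paper's implementation.

```python
import math

def smooth_max(values, temperature):
    """Temperature-scaled log-sum-exp: a smooth upper bound on max(values)."""
    scaled = [v / temperature for v in values]
    top = max(scaled)  # subtract the largest term for numerical stability
    return temperature * (top + math.log(sum(math.exp(s - top) for s in scaled)))

def local_alignment_score(a, b, match=2.0, mismatch=-1.0, gap=1.0, temperature=None):
    """Smith-Waterman score with a linear gap penalty (an assumed scoring scheme).

    temperature=None gives the classic hard-max recursion; a positive
    temperature swaps every max for smooth_max, which makes the score a
    smooth function of the substitution scores.
    """
    if temperature is None:
        reduce = max
    else:
        reduce = lambda vals: smooth_max(vals, temperature)
    H = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    cells = []
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = reduce([
                0.0,                    # start a new local alignment
                H[i - 1][j - 1] + s,    # align a[i-1] with b[j-1]
                H[i - 1][j] - gap,      # gap in b
                H[i][j - 1] - gap,      # gap in a
            ])
            cells.append(H[i][j])
    # A local alignment may end in any cell, so reduce over all of them.
    return reduce(cells)

hard = local_alignment_score("ACGT", "TACGA")
soft = local_alignment_score("ACGT", "TACGA", temperature=0.001)
```

As the temperature approaches zero the smooth score approaches the classic score from above. In an automatic-differentiation framework, the same recursion would yield gradients with respect to the substitution scores, which is what allows an alignment to be trained jointly with a downstream model.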

Computer Science

Markov Random Field

Smith–Waterman Algorithm


Tuned Fitness Landscapes for Benchmarking Model-Guided Protein Design

Neil Thomas et al., Oct 24, 2023

Abstract: Advancements in DNA synthesis and sequencing technologies have enabled a novel paradigm of protein design where machine learning (ML) models trained on experimental data are used to guide exploration of a protein fitness landscape. ML-guided directed evolution (MLDE) builds on the success of traditional directed evolution and unlocks strategies which make more efficient use of experimental data. Building an MLDE pipeline involves many design choices across the design-build-test-learn loop ranging from data collection strategies to modeling, each of which has a large impact on the success of designed sequences. The cost of collecting experimental data makes benchmarking every component of these pipelines on real data prohibitively difficult, necessitating the development of synthetic landscapes where MLDE strategies can be tested. In this work, we develop a framework called SLIP ("Synthetic Landscape Inference for Proteins") for constructing biologically-motivated synthetic landscapes with tunable difficulty based on Potts models. This framework can be extended to any protein family for which there is a sequence alignment. We show that without tuning, Potts models are easy to optimize. In contrast, our tuning framework provides landscapes sufficiently challenging to benchmark MLDE pipelines. SLIP is open-source and is available at https://github.com/google-research/slip.
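For readers unfamiliar with Potts models, the following minimal sketch shows how such a model assigns a score to a sequence: one per-site field term for each position plus one pairwise coupling term for each coupled pair of positions. The data layout (nested lists for fields, a dictionary keyed by position pairs for couplings) and the toy parameter values are assumptions made for illustration; this is not the SLIP API, and the tuning procedure described in the abstract is not shown.

```python
def potts_energy(seq, fields, couplings):
    """Sum of per-site fields and pairwise couplings for an integer-encoded sequence.

    seq:       list of state indices, one per position
    fields:    fields[i][a] is the field for state a at position i
    couplings: dict mapping (i, j) with i < j to a matrix J where
               J[a][b] is the coupling for state a at i and state b at j
    Whether a higher or lower value counts as "fitter" is a convention
    that this sketch does not fix.
    """
    energy = sum(fields[i][a] for i, a in enumerate(seq))
    for (i, j), matrix in couplings.items():
        energy += matrix[seq[i]][seq[j]]
    return energy

# Toy model: 3 positions, 2 states per position (hypothetical values).
toy_fields = [[0, 1], [2, 0], [0, 3]]
toy_couplings = {(0, 2): [[0, 0], [0, 5]]}

score = potts_energy([1, 0, 1], toy_fields, toy_couplings)
```

The pairwise terms are what make the landscape non-additive: the effect of a mutation at one position depends on the states at the positions it is coupled to.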

Benchmarking

Computer Science

Inference


Evaluating Protein Transfer Learning with TAPE

Roshan Rao et al., May 6, 2020


Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.

Computer Science

Artificial Intelligence

Machine Learning


Single Layers of Attention Suffice to Predict Protein Contacts

Nicholas Bhattacharya et al., Oct 23, 2023


Abstract: The established approach to unsupervised protein contact prediction estimates co-evolving positions using undirected graphical models. This approach trains a Potts model on a Multiple Sequence Alignment, then predicts that the edges with highest weight correspond to contacts in the 3D structure. On the other hand, increasingly large Transformers are being pretrained on protein sequence databases but have demonstrated mixed results for downstream tasks, including contact prediction. This has sparked discussion about the role of scale and attention-based models in unsupervised protein representation learning. We argue that attention is a principled model of protein interactions, grounded in real properties of protein family data. We introduce a simplified attention layer, factored attention, and show that it achieves comparable performance to Potts models, while sharing parameters both within and across families. Further, we extract contacts from the attention maps of a pretrained Transformer and show they perform competitively with the other two approaches. This provides evidence that large-scale pretraining can learn meaningful protein features when presented with unlabeled and unaligned data. We contrast factored attention with the Transformer to indicate that the Transformer leverages hierarchical signal in protein family databases not captured by our single-layer models. This raises the exciting possibility for the development of powerful structured models of protein family databases.
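The step of predicting contacts from "the edges with highest weight" can be sketched as follows. This sketch assumes that the weight of an edge is the Frobenius norm of its coupling matrix, which is a common choice; the abstract does not state the exact weighting, and corrections that are often applied in practice (such as an average product correction or a minimum sequence separation) are omitted. The function name and the toy couplings are hypothetical.

```python
import math

def rank_contacts(couplings, top_k):
    """Return the top_k position pairs ranked by coupling strength.

    couplings: dict mapping (i, j) with i < j to a coupling matrix
               (list of rows) from a trained Potts model.
    The edge weight is the Frobenius norm of the matrix (an assumed choice).
    """
    weights = {
        pair: math.sqrt(sum(v * v for row in matrix for v in row))
        for pair, matrix in couplings.items()
    }
    return sorted(weights, key=weights.get, reverse=True)[:top_k]

# Toy couplings over 2 states (hypothetical values).
toy = {
    (0, 3): [[1.0, 1.0], [1.0, 1.0]],   # strong coupling
    (1, 4): [[0.5, 0.0], [0.0, 0.5]],   # moderate coupling
    (0, 1): [[0.0, 0.0], [0.0, 0.1]],   # weak coupling
}

predicted = rank_contacts(toy, top_k=2)
```

The same ranking step could be applied to any symmetric score over position pairs, for example scores derived from attention maps, which is what makes the approaches in the abstract directly comparable.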

Transformer

Computer Science

Artificial Intelligence


Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening

Neil Thomas et al., May 27, 2024

Designing enzymes to function in novel chemical environments is a central goal of synthetic biology with broad applications. Guiding protein design with machine learning (ML) has the potential to accelerate the discovery of high-performance enzymes by precisely navigating a rugged fitness landscape. In this work, we describe an ML-guided campaign to engineer the nuclease NucB, an enzyme with applications in the treatment of chronic wounds due to its ability to degrade biofilms. In a multi-round enzyme evolution campaign, we combined ultra-high-throughput functional screening with ML and compared to parallel in-vitro directed evolution (DE) and in-silico hit recombination (HR) strategies that used the same microfluidic screening platform. The ML-guided campaign discovered hundreds of highly-active variants with up to 19-fold nuclease activity improvement, while the best variant found by DE had 12-fold improvement. Further, the ML-designed hits were up to 15 mutations away from the NucB wildtype, far outperforming the HR approach in both hit rate and diversity. We also show that models trained on evolutionary data alone, without access to any experimental data, can design functional variants at a significantly higher rate than a traditional approach to initial library generation. To drive future progress in ML-guided design, we curate a dataset of 55K diverse variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date.

Nuclease

High-throughput Screening

Throughput
