ResearchHub | Open Science Community

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo et al., Mar 30, 2022

Abstract Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold’s EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer’s row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer’s column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
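For readers unfamiliar with the quantity the column attentions are compared against, here is a minimal sketch of pairwise Hamming distances between the rows of an MSA. The toy alignment, the function names, the normalization by alignment length, and the choice to treat gaps as ordinary symbols are all assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations


def hamming_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    Assumption: gap characters ('-') are compared like any other symbol.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) == 0:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def pairwise_hamming(msa):
    """Return a symmetric matrix (list of lists) of normalized Hamming distances."""
    n = len(msa)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = hamming_distance(msa[i], msa[j])
    return dist


# Hypothetical three-sequence alignment, for illustration only.
toy_msa = ["MKV-LA", "MKI-LA", "MRVGLS"]
distances = pairwise_hamming(toy_msa)
for row in distances:
    print([round(d, 3) for d in row])
```

The resulting matrix is the kind of sequence-to-sequence distance table that the abstract says the combined column attentions correlate with; how the attentions themselves are combined is not shown here.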

Genetics

Artificial Intelligence

YeaZ: A convolutional neural network for highly accurate, label-free segmentation of yeast microscopy images

Nicola Dietler et al., May 12, 2020

Abstract The processing of microscopy images constitutes a bottleneck for large-scale experiments. A critical step is the establishment of cell borders (‘segmentation’), which is required for a range of applications such as growth or fluorescent reporter measurements. For the model organism budding yeast (Saccharomyces cerevisiae), a number of methods for segmentation exist. However, in experiments involving multiple cell cycles, stress, or various mutants, cells crowd or exhibit irregular visible features, which necessitate frequent manual corrections. Furthermore, budding events are visually subtle but important to detect. Convolutional neural networks (CNNs) have been successfully employed for a range of image processing applications. They require large, diverse training sets. Here, we present i) the first set of publicly available, high-quality segmented yeast images (>10’000 cells) including mutants, stressed cells, and time courses, ii) a corresponding U-Net-based CNN, iii) a Python-based graphical user interface (GUI) to efficiently use the system, and iv) a web application to test it (www.quantsysbio.com). A key feature is a cell-cell boundary test which avoids the need for additional input from fluorescent channels. A bipartite graph matching algorithm tracks cells in time with high reliability. Our network is highly accurate and outperforms existing methods on benchmark images recorded by others, suggesting it transfers well to other conditions. Furthermore, new buds are detected early with high reliability. We apply the system to detect differences in geometry between wild-type and cyclin mutant cells. Our results indicate that morphogenesis control occurs unexpectedly early in the cell cycle and is gradual, demonstrating how the efficient processing of large numbers of cells uncovers new biology. Our system can serve as a resource to the community, expanded continuously with new images. Furthermore, the techniques we develop here are likely to be useful for other organisms as well.

The identification of cell borders (‘segmentation’) in microscopy images constitutes a bottleneck for large-scale experiments. For the model organism Saccharomyces cerevisiae, current segmentation methods face challenges when cells bud, crowd, or exhibit irregular features. Here, we present i) the first set of publicly available, high-quality segmented yeast images (>10’000 cells) including mutants, stressed cells, and time courses, ii) a corresponding convolutional neural network (CNN), iii) a graphical user interface and a web application (www.quantsysbio.com) to efficiently employ, test, and expand the system. A key feature is a cell-cell boundary test which avoids the need for fluorescent markers. Our CNN is highly accurate, including for buds, and outperforms existing methods on benchmark images, indicating it transfers well to other conditions. To demonstrate how efficient, large-scale image processing uncovers new biology, we analyzed the geometries of ≈2200 wild-type and cyclin mutant cells and found that morphogenesis control occurs unexpectedly early and gradually.

Artificial Intelligence

Biophysics

Inferring interaction partners from protein sequences using mutual information

Anne-Florence Bitbol, Jul 26, 2018

Abstract Specific protein-protein interactions are crucial in most cellular processes. They enable multiprotein complexes to assemble and to remain stable, and they allow signal transduction in various pathways. Functional interactions between proteins result in coevolution between the interacting partners, and thus in correlations between their sequences. Pairwise maximum-entropy based models have enabled successful inference of pairs of amino-acid residues that are in contact in the three-dimensional structure of multi-protein complexes, starting from the correlations in the sequence data of known interaction partners. Recently, algorithms inspired by these methods have been developed to identify which proteins are specific interaction partners among the paralogous proteins of two families, starting from sequence data alone. Here, we demonstrate that a slightly higher performance for partner identification can be reached by an approximate maximization of the mutual information between the sequence alignments of the two protein families. This stands in contrast with structure prediction of proteins and of multiprotein complexes from sequence data, where pairwise maximum-entropy based global statistical models substantially improve performance compared to mutual information. Our findings entail that the statistical dependences allowing interaction partner prediction from sequence data are not restricted to the residue pairs that are in direct contact at the interface between the partner proteins.

Author summary Specific protein-protein interactions are at the heart of most intra-cellular processes. Mapping these interactions is thus crucial to a systems-level understanding of cells, and has broad applications to areas such as drug targeting. Systematic experimental identification of protein interaction partners is still challenging. However, a large and rapidly growing amount of sequence data is now available. Recently, algorithms have been proposed to identify which proteins interact from their sequences alone, thanks to the co-variation of the sequences of interacting proteins. These algorithms build upon inference methods that have been used with success to predict the three-dimensional structures of proteins and multi-protein complexes, and their focus is on the amino-acid residues that are in direct contact. Here, we propose a simpler method to identify which proteins interact among the paralogous proteins of two families, starting from their sequences alone. Our method relies on an approximate maximization of mutual information between the sequences of the two families, without specifically emphasizing the contacting residue pairs. We demonstrate that this method slightly outperforms the earlier one. This result highlights that partner prediction does not only rely on the identities and interactions of directly contacting amino-acids.
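To make the central quantity concrete, the sketch below computes the empirical mutual information between alignment columns of two families whose rows are assumed to be paired one-to-one, and sums it over all inter-family column pairs. It illustrates only the score, not the paper's approximate maximization procedure; the function names, the toy alignments, and the absence of pseudocounts or sequence reweighting are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(column_a, column_b) -> float:
    """Empirical mutual information (in bits) between two aligned columns."""
    if len(column_a) != len(column_b) or len(column_a) == 0:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p_ab / (p_a * p_b) simplifies to c * n / (count_a * count_b)
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi


def total_inter_family_mi(alignment_a, alignment_b) -> float:
    """Sum MI over every (column of A, column of B) pair.

    Assumption: row i of alignment_a is paired with row i of alignment_b.
    """
    if len(alignment_a) != len(alignment_b):
        raise ValueError("paired alignments need the same number of rows")
    cols_a = list(zip(*alignment_a))
    cols_b = list(zip(*alignment_b))
    return sum(mutual_information(ca, cb) for ca in cols_a for cb in cols_b)


# Hypothetical toy families: the first pairing co-varies, the second is scrambled.
family_a = ["AK", "AK", "VR", "VR"]
family_b_matched = ["L", "L", "I", "I"]
family_b_scrambled = ["L", "I", "L", "I"]

score_matched = total_inter_family_mi(family_a, family_b_matched)
score_scrambled = total_inter_family_mi(family_a, family_b_scrambled)
print(score_matched, score_scrambled)
```

In this toy case the matched pairing carries more mutual information than the scrambled one, which is the intuition behind searching for the pairing that maximizes it.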

Genetics

Artificial Intelligence

Generative power of a protein language model trained on multiple sequence alignments

Damiano Sgarbossa et al., Apr 15, 2022

Abstract Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally-validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.

Genetics

Artificial Intelligence

Pairing interacting protein sequences using masked language modeling

Umberto Lupo et al., Aug 14, 2023

Abstract Predicting which proteins interact together from amino-acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments, such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called DiffPALM that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves performance competitive with orthology-based pairing.

Significance statement Deep learning has brought major advances to the analysis of biological sequences. Self-supervised models, based on approaches from natural language processing and trained on large ensembles of protein sequences, efficiently learn statistical dependence in this data. This includes coevolution patterns between structurally or functionally coupled amino acids, which allows them to capture structural contacts. We propose a method to pair interacting protein sequences which leverages the power of a protein language model trained on multiple sequence alignments. Our method performs well for small datasets that are challenging for existing methods. It can improve structure prediction of protein complexes by supervised methods, which remains more challenging than that of single-chain proteins.

Genetics

Artificial Intelligence

Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins

Carlos Gandarilla-Pérez et al., Aug 25, 2022

Abstract Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.

Author summary When two protein families interact, their sequences feature statistical dependencies. First, interacting proteins tend to share a common evolutionary history. Second, maintaining structure and interactions through the course of evolution yields coevolution, detectable via correlations in the amino-acid usage at contacting sites. Both signals can be used to computationally predict which proteins are specific interaction partners among the paralogs of two interacting protein families, starting just from their sequences. We show that combining them improves the performance of interaction partner inference, especially when the average number of potential partners is large and when the total data set size is modest. The resulting paired multiple-sequence alignments might be used as input to machine-learning algorithms to improve protein-complex structure prediction, as well as to understand interaction specificity in signaling pathways.

Genetics

Artificial Intelligence

Extracting the phylogenetic dimension of coevolution reveals hidden functional signal

Alexandre Colavin et al., Sep 25, 2020

Abstract Despite the structural and functional information contained in the statistical coupling between pairs of residues in a protein, coevolution associated with function is often obscured by artifactual signals such as genetic drift, which shapes a protein’s phylogenetic history and gives rise to concurrent variation between protein sequences that is not driven by selection for function. Here, we introduce a method for explicitly defining a phylogenetic dimension of coevolution signal, and demonstrate that coevolution can occur on multiple phylogenetic timescales within a single protein. Our method, Nested Coevolution (NC), can be applied as an extension to any coevolution metric. We use NC to demonstrate that poorly conserved residues can nonetheless have important roles in protein function. Moreover, NC improved structural-contact prediction over gold-standard coevolution-based methods, particularly in subsampled alignments with fewer sequences. NC also lowered the noise in detecting functional sectors of collectively coevolving residues. Sectors of coevolving residues identified after NC correction were more spatially compact and phylogenetically distinct from the rest of the protein, and strongly enriched for mutations that disrupt protein activity. Our conceptualization of the phylogenetic separation of coevolution represents an advance from previous pragmatic attempts to reduce phylogenetic artifacts in measurements of coevolution. Application of NC broadens the application of protein coevolution measurements, particularly to eukaryotic proteins with fewer naturally available sequences, and further elucidates relationships among protein evolution and genetic diseases.

Genetics

Molecular Biology

Hydrodynamic flow and concentration gradients in the gut enhance neutral bacterial diversity

Darka Labavić et al., Apr 20, 2021

Abstract The gut microbiota features important genetic diversity, and the specific spatial features of the gut may shape evolution within this environment. We investigate the fixation probability of neutral bacterial mutants within a minimal model of the gut that includes hydrodynamic flow and resulting gradients of food and bacterial concentrations. We find that this fixation probability is substantially increased compared to an equivalent well-mixed system, in the regime where the profiles of food and bacterial concentration are strongly spatially-dependent. Fixation probability then becomes independent of total population size. We show that our results can be rationalized by introducing an active population, which consists of those bacteria that are actively consuming food and dividing. The active population size yields an effective population size for neutral mutant fixation probability in the gut.

Genetics

Ecology

Hydrodynamic flow and concentration gradients in the gut enhance neutral bacterial diversity

Darka Labavić et al., May 16, 2021

Abstract The gut microbiota features important genetic diversity, and the specific spatial features of the gut may shape evolution within this environment. We investigate the fixation probability of neutral bacterial mutants within a minimal model of the gut that includes hydrodynamic flow and resulting gradients of food and bacterial concentrations. We find that this fixation probability is substantially increased compared to an equivalent well-mixed system, in the regime where the profiles of food and bacterial concentration are strongly spatially-dependent. Fixation probability then becomes independent of total population size. We show that our results can be rationalized by introducing an active population, which consists of those bacteria that are actively consuming food and dividing. The active population size yields an effective population size for neutral mutant fixation probability in the gut.

Significance statement The human body harbors numerous and diverse bacteria, the vast majority of which are located in the gut. These bacteria can mutate and evolve within the gut, which is their natural environment. This can have important public health implications, e.g. when gut bacteria evolve antibiotic resistance. The gut features specific characteristics, including hydrodynamic flow and resulting gradients of food and bacterial concentrations. How do these characteristics impact the evolution and diversity of gut bacteria? We demonstrate that they can substantially increase the probability that neutral mutants reach high proportions and eventually take over the population. This is because only a fraction of gut bacteria is actively dividing. Thus, the specific environment of the gut enhances neutral bacterial diversity.
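As background for the comparison with a well-mixed system, the sketch below evaluates the textbook fixation probability of a single mutant in a well-mixed Moran process of constant size, which reduces to 1/N for a neutral mutant. This is a standard reference model, not the paper's spatial gut model; the function name and the population sizes are hypothetical, and reading the "active population" as a smaller N in this formula is only an illustrative assumption.

```python
def moran_fixation_probability(population_size: int, relative_fitness: float = 1.0) -> float:
    """Fixation probability of one mutant in a well-mixed Moran process.

    Uses rho = 1 / sum_{k=0}^{N-1} r^(-k), which equals 1/N when r = 1 (neutral).
    """
    if population_size < 1:
        raise ValueError("population_size must be at least 1")
    if relative_fitness <= 0:
        raise ValueError("relative_fitness must be positive")
    total = 0.0
    term = 1.0
    for _ in range(population_size):
        total += term
        term /= relative_fitness
    return 1.0 / total


# Hypothetical sizes: a total population and a smaller, actively dividing subpopulation.
total_population = 1000
active_population = 100

rho_total = moran_fixation_probability(total_population)
rho_active = moran_fixation_probability(active_population)
print(rho_total, rho_active)
```

Under this simplified reading, a neutral mutant fixes more easily when the relevant population size is the smaller active one, which is consistent in spirit with the increase described above.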

Genetics

Ecology

Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences

Andonis Gerardos et al., Nov 22, 2021

Abstract Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) make it possible to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.

Author summary In protein sequence data, the amino acid usages at different sites of a protein or of two interacting proteins can be correlated because of functional constraints. For instance, the need to maintain physicochemical complementarity among two sites that are in contact in the three-dimensional structure of a protein complex causes such correlations. However, correlations can also arise due to shared evolutionary history, even in the absence of any functional constraint. While these phylogenetic correlations are known to obscure the inference of structural contacts, we show, using controlled synthetic data, that correlations from structure and phylogeny combine constructively to allow the inference of protein partners among paralogs using just sequences. We also show that pairs of amino acids that are not in contact in the structure have a major impact on partner inference in a natural data set and in realistic synthetic ones. These findings explain the success of methods based on pairwise maximum-entropy models or on information theory at predicting protein partners from sequences among paralogs.

Genetics

Artificial Intelligence
