Abstract
Recent developments in protein design have adapted large neural networks with up to hundreds of millions of parameters to learn complex sequence-function mappings. However, it is unclear which dependencies between residues are critical for determining protein function, and a better empirical understanding could enable high-quality models that are also more data- and resource-efficient. Here, we observe that per-residue amino acid preferences, without considering interactions between mutations, are sufficient to explain much, and sometimes virtually all, of the combinatorial mutation effects across 7 datasets (R² ∼ 78-98%), including one generated here. These preference parameters (20*N, where N is the number of mutated residues) can be learned from as few as ∼5*20*N observations to predict a much larger number (potentially up to 20^N) of combinatorial variant effects with high accuracy (Pearson r > 0.8). We hypothesized that the local structural dependencies surrounding a residue could be sufficient to learn these required mutation preferences, and developed an unsupervised design approach, which we term CoVES for 'Combinatorial Variant Effects from Structure'. We show that CoVES outperforms not just model-free sampling approaches but also complicated, high-capacity autoregressive neural networks in generating functional and diverse sequence variants for two example proteins. This simple, biologically rooted model can be an effective alternative to high-capacity, out-of-domain models for the design of functional proteins.
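To make the parameter counting concrete, the following is a minimal sketch of a site-independent (additive) preference model: one weight per amino acid per mutated position, 20*N in total, fit from roughly 5*20*N observations and evaluated on unseen combinatorial variants. It is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the fit is a plain ridge regression on one-hot encodings, and the names `one_hot`, `fit_site_preferences`, `predict`, and `demo` are hypothetical.

```python
import numpy as np

AA = 20  # size of the amino acid alphabet


def one_hot(variants, n_sites):
    """Encode integer-coded variants [M, n_sites] as indicators [M, 20 * n_sites]."""
    x = np.zeros((variants.shape[0], n_sites * AA))
    rows = np.arange(variants.shape[0])
    for site in range(n_sites):
        x[rows, site * AA + variants[:, site]] = 1.0
    return x


def fit_site_preferences(variants, scores, n_sites, ridge=1e-3):
    """Fit 20 * n_sites additive preferences plus an intercept by ridge regression.

    The small ridge term is an assumption of this sketch; it keeps the
    rank-deficient one-hot design (indicators at each site sum to 1) solvable.
    """
    x = np.hstack([one_hot(variants, n_sites), np.ones((variants.shape[0], 1))])
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ scores)


def predict(weights, variants, n_sites):
    """Score variants as the sum of their per-site preferences plus the intercept."""
    return one_hot(variants, n_sites) @ weights[:-1] + weights[-1]


def demo(n_sites=4, seed=0):
    """Return the held-out Pearson r on a synthetic, purely additive landscape."""
    rng = np.random.default_rng(seed)
    true_pref = rng.normal(size=(n_sites, AA))  # synthetic ground truth, not real data

    def measure(v):
        additive = true_pref[np.arange(n_sites), v].sum(axis=1)
        return additive + rng.normal(scale=0.1, size=v.shape[0])  # assay noise

    # About 5 observations per parameter, as in the text: 5 * 20 * N variants.
    train = rng.integers(0, AA, size=(5 * AA * n_sites, n_sites))
    # Held-out random combinations drawn from the 20^N possible variants.
    test = rng.integers(0, AA, size=(2000, n_sites))

    weights = fit_site_preferences(train, measure(train), n_sites)
    return np.corrcoef(predict(weights, test, n_sites), measure(test))[0, 1]


if __name__ == "__main__":
    print(f"held-out Pearson r: {demo():.3f}")
```

Because the synthetic landscape here is additive by construction, the high held-out correlation only illustrates the sample-efficiency argument (20*N parameters versus up to 20^N variants); it says nothing about how well the assumption holds for a real protein.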