Removing bias in sequence models of protein fitness

Authors
Ada Y Shaw, Hansen B Spinner, +6 authors, Debora Marks
Published
Jan 1, 2023

Abstract

Unsupervised sequence models of protein fitness have emerged as powerful tools for designing therapeutics and industrial enzymes, yet they are strongly biased towards designs that are close to their training data. This bias hinders their ability to generate functional sequences that are far from natural sequences, as is often required when designing new functions. To address this problem, we introduce a de-biasing approach that enables the comparison of protein sequences across mutational depths, overcoming the sequence similarity bias in natural sequence models. We demonstrate our method's effectiveness at improving the relative predictions of natural sequence models for experimentally measured variant functions across mutational depths. Using case studies of proteins in which only a very small percentage of variants far from the wild type are functional, we demonstrate that our method improves the recovery of top-performing variants in these sparsely functional regimes. Our method is applicable to any unsupervised fitness prediction model, for any function of any protein, and can thus easily be incorporated into any computational protein design pipeline. These studies have the potential to yield more efficient and cost-effective computational methods for designing diverse functional proteins and to inform experimental library design so that it takes best advantage of machine learning capabilities.
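
The abstract does not specify the de-biasing procedure, so the following is only an illustrative sketch and should not be read as the authors' method. It assumes that "mutational depth" means the number of substitutions relative to the wild type, and it swaps in a deliberately simple stand-in technique, standardizing raw model scores within each depth, to show why scores from different depths may need rescaling before they can be compared. The function names `mutational_depth` and `standardize_within_depth` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def mutational_depth(variant: str, wild_type: str) -> int:
    """Count positions where the variant differs from the wild type.

    Assumes both sequences are aligned and of equal length.
    """
    if len(variant) != len(wild_type):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(variant, wild_type))


def standardize_within_depth(scores, depths):
    """Z-score each raw score against the other variants at the same depth.

    This is an assumed, simple stand-in for de-biasing, not the paper's
    procedure: it removes the depth-dependent offset and scale so that a
    variant is judged relative to its peers at the same depth.
    """
    by_depth = defaultdict(list)
    for score, depth in zip(scores, depths):
        by_depth[depth].append(score)

    # Per-depth mean and population standard deviation.
    stats = {d: (mean(v), pstdev(v)) for d, v in by_depth.items()}

    adjusted = []
    for score, depth in zip(scores, depths):
        mu, sigma = stats[depth]
        # Guard against depths with a single variant or identical scores.
        adjusted.append((score - mu) / sigma if sigma > 0 else 0.0)
    return adjusted


if __name__ == "__main__":
    wild_type = "ACDE"
    variants = ["ACDF", "ACGE", "AGDF", "TCDF"]
    raw_scores = [-1.0, -3.0, -10.0, -14.0]  # made-up model log-likelihoods
    depths = [mutational_depth(v, wild_type) for v in variants]
    print(depths)
    print(standardize_within_depth(raw_scores, depths))
```

In this toy input the raw scores of the double mutants are all far below those of the single mutants, so a ranking on raw scores would never surface a double mutant. After per-depth standardization, the best variant at each depth receives the same adjusted score, which is the kind of cross-depth comparability the abstract describes as its goal.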
