Abstract
Predictive modelling of protein properties has become increasingly important to the field of machine-learning-guided protein engineering. In one of the two existing approaches, sequences evolutionarily related to a query protein drive the modelling process, without any property measurements from the laboratory. In the other, a set of protein variants of interest is assayed, and a supervised regression model is then estimated with the assay-labelled data. Although a handful of recent methods have shown promise in combining the evolutionary and supervised approaches, this hybrid problem has not been examined in depth, leaving it unclear how practitioners should proceed and how method developers should build on existing work. Herein, we present a systematic assessment of methods for protein fitness prediction when both evolutionary and assay-labelled data are available. We find that a simple baseline approach we introduce is competitive with, and often outperforms, more sophisticated methods. Moreover, our simple baseline is plug-and-play with a wide variety of established methods and does not add any substantial computational burden. Our analysis highlights the importance of systematic evaluations and sufficient baselines.