Paper
Document
Download
Flag content
5

Mapping the glycosyltransferase fold landscape using deep learning

Save
TipTip
Document
Download
Flag content
5
TipTip
Save
Document
Download
Flag content

Abstract

Abstract Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learned distinguishing features free of primary sequence alignment constraints and, unlike other models, is highly interpretable and helped identify common secondary structural features shared by divergent families. The model delineated sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies identify targets for future structural studies and expand the GT fold landscape.

Paper PDF

This paper's license is marked as closed access or non-commercial and cannot be viewed on ResearchHub. Visit the paper's external site.