A general model for predicting substrates of enzymes
based on transformer networks and gradient boosting models
Tuesday, March 19th, 4-5pm EST | Alexander Kroll, PhD, Heinrich-Heine-University, Düsseldorf
Abstract: For most proteins annotated as enzymes, it is unknown which primary and/or secondary reactions they catalyze. Experimental characterization of potential substrates is time-consuming and costly. Machine learning predictions could provide an efficient alternative, but they are hampered by a lack of information about enzyme non-substrates, as the available training data comprises mainly positive examples. We have developed ESP, the first general machine learning model for predicting enzyme-substrate pairs, which achieves over 94% accuracy on independent and diverse test data. ESP can be successfully applied across widely varying enzymes and a broad range of metabolites included in the training data, outperforming models designed for single, well-studied enzyme families. To achieve these results, we developed ProSmith, a machine learning framework that uses a multimodal transformer network to process protein amino acid sequences and small molecule strings simultaneously in the same input. This approach facilitates the exchange of all relevant information between the two types of molecules during the computation of their numerical representations, allowing the model to account for their structural and functional interactions. Our final model combines gradient boosting predictions based on the resulting multimodal transformer network with independent predictions based on separate deep learning representations of the proteins and small molecules.
Alexander is currently a postdoc in the research group "Computational Cell
Biology" of Prof. Martin Lercher at Heinrich-Heine-University
Düsseldorf.
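
For readers unfamiliar with the two ideas in the abstract, the following is a minimal sketch, not the ESP or ProSmith implementation. It shows one way a protein sequence and a small molecule string could be placed in a single input sequence, and one simple way several model predictions could be combined. The function names, the special tokens `[CLS]` and `[SEP]`, the character-level tokenization, and the weighted average are all illustrative assumptions; the abstract does not specify any of them.

```python
# Hypothetical sketch: names, special tokens, and weights are illustrative
# assumptions and are not taken from ESP or ProSmith.


def build_joint_input(protein_seq: str, smiles: str) -> tuple[list[str], list[int]]:
    """Concatenate protein residues and molecule-string characters into one sequence."""
    protein_tokens = list(protein_seq)  # one token per amino acid letter
    molecule_tokens = list(smiles)      # character-level split; real tokenizers differ
    tokens = ["[CLS]"] + protein_tokens + ["[SEP]"] + molecule_tokens + ["[SEP]"]
    # Segment ids mark which part each token belongs to:
    # 0 for [CLS], the protein, and the first [SEP]; 1 for the molecule and final [SEP].
    segment_ids = [0] * (len(protein_tokens) + 2) + [1] * (len(molecule_tokens) + 1)
    return tokens, segment_ids


def combine_predictions(scores: list[float], weights: list[float]) -> float:
    """Return the weighted average of per-model probabilities (assumed scheme)."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and of equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Toy example with a three-residue peptide and ethanol ("CCO").
tokens, segment_ids = build_joint_input("MKT", "CCO")
# Three hypothetical model outputs, the first weighted twice as heavily.
ensemble_score = combine_predictions([0.9, 0.7, 0.8], [2.0, 1.0, 1.0])
```

Because both molecules share one sequence, a transformer's attention layers can relate any amino acid position to any position in the molecule string, which is the kind of information exchange the abstract describes.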
