Structure-informed language models are protein designers

Tuesday, November 28th, 4-5pm EST | Zaixiang Zheng, research scientist, ByteDance

Summary: This paper demonstrates that language models are strong structure-based protein designers. We present LM-Design, a generic approach to reprogramming sequence-based protein language models (pLMs), which have learned massive sequential evolutionary knowledge from the universe of natural protein sequences, to acquire an immediate capability to design preferable protein sequences for given folds. We conduct a structural surgery on pLMs: a lightweight structural adapter is implanted into a pLM and endows it with structural awareness. During inference, iterative refinement is performed to effectively optimize the generated protein sequences. Experiments show that LM-Design improves state-of-the-art results by a large margin, leading to 4% to 12% accuracy gains in sequence recovery (e.g., 55.65%/56.63% on the CATH 4.2/4.3 single-chain benchmarks, and >60% when designing protein complexes). We provide extensive and in-depth analyses, which verify that LM-Design can (1) indeed leverage both structural and sequential knowledge to accurately handle structurally non-deterministic regions, (2) benefit from scaling data and model size, and (3) generalize to other proteins (e.g., antibodies and de novo proteins).
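The iterative-refinement decoding described above can be sketched in miniature. This is not the LM-Design implementation: `toy_plm_predict` is a hypothetical stand-in for a structure-conditioned pLM forward pass (it peeks at a reference sequence with noise purely so the loop is runnable), and the re-masking schedule is a simplified assumption. It only illustrates the loop shape: predict all masked residues, re-mask the lowest-confidence positions, and repeat.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def toy_plm_predict(masked_seq, reference):
    """Hypothetical stand-in for a structure-conditioned pLM.

    For each masked position ('_') it proposes a residue with a
    confidence score. Here we cheat by peeking at `reference` with
    noise, purely to make the refinement loop executable.
    """
    preds = []
    for i, aa in enumerate(masked_seq):
        if aa == "_":
            if random.random() < 0.8:  # "correct" prediction, high confidence
                preds.append((i, reference[i], random.uniform(0.5, 1.0)))
            else:                       # noisy prediction, low confidence
                preds.append((i, random.choice(ALPHABET), random.uniform(0.0, 0.5)))
    return preds

def iterative_refine(reference, n_iter=5, remask_frac=0.3):
    """Start fully masked; each round, fill masks and re-mask the
    lowest-confidence positions so the next round can revise them."""
    seq = ["_"] * len(reference)
    conf = [0.0] * len(reference)
    for _ in range(n_iter):
        for i, aa, c in toy_plm_predict(seq, reference):
            seq[i], conf[i] = aa, c
        # re-mask the remask_frac lowest-confidence positions
        k = int(len(seq) * remask_frac)
        for i in sorted(range(len(seq)), key=lambda i: conf[i])[:k]:
            seq[i] = "_"
    # final pass: fill any remaining masks
    for i, aa, c in toy_plm_predict(seq, reference):
        seq[i], conf[i] = aa, c
    return "".join(seq)
```

In the real method the predictor is a pretrained pLM with a lightweight structural adapter attached, so predictions are conditioned on the target backbone rather than on a reference sequence; the loop structure, however, is the same.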

Website: https://zhengzx-nlp.github.io/

Paper: https://proceedings.mlr.press/v202/zheng23a/zheng23a.pdf

 

Zaixiang Zheng is a research scientist at ByteDance Research. His research interests lie in deep generative modeling and its applications to real-world problems, especially human language and AI for Science. At ByteDance Research, he works on protein design with the AI Drug Discovery Team. Prior to joining ByteDance, he received his Ph.D. in computer science from Nanjing University.