Generating All-Atom Protein Structure from Sequence-Only Training Data

Amy X. Lu1,2, Wilson Yan1, Sarah A. Robinson2, Kevin K. Yang3, Vladimir Gligorijevic2, Kyunghyun Cho2,4, Richard Bonneau2, Pieter Abbeel1, Nathan Frey2
1UC Berkeley, 2Prescient Design, Genentech, 3Microsoft Research, 4New York University
*Correspondence: amyxlu@berkeley.edu

PLAID generates all-atom protein structures from sequence-only training data by sampling from the latent space of ESMFold, with compositional conditioning on function and organism.

Abstract

We propose PLAID (Protein Latent Induced Diffusion), a method for multimodal protein generation that learns and samples from the latent space of a predictor, mapping from a more abundant data modality (e.g., sequence) to a less abundant one (e.g., crystallographic structure). Specifically, we sample from the latent space of ESMFold to address the all-atom structure generation setting, which requires producing both the 3D structure and 1D sequence to define side-chain atom placements. Importantly, PLAID only requires sequence inputs to obtain latent representations during training, enabling the use of sequence databases for generative model training and augmenting the data distribution by 2 to 4 orders of magnitude compared to experimental structure databases. Sequence-only training also allows access to more annotations for conditioning generation. As a demonstration, we use compositional conditioning on 2,219 functions from Gene Ontology and 3,617 organisms across the tree of life. Despite not using structure inputs during training, generated samples exhibit strong structural quality and consistency. Function-conditioned generations learn side-chain residue identities and atomic positions at active sites, as well as hydrophobicity patterns of transmembrane proteins, while maintaining overall sequence diversity. Model weights and code are publicly available at github.com/amyxlu/plaid.
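
As a rough illustration of the paradigm, the sketch below shows a single diffusion training step on predictor latents. It is a minimal sketch, not the released implementation: encoder stands in for ESMFold's sequence-to-latent mapping, denoiser for the diffusion network being trained, and the cosine noise schedule is an illustrative choice. Note that only sequences enter the step; no structures are needed.

import torch

def training_step(seq_batch, encoder, denoiser, optimizer):
    """One diffusion training step on predictor latents.

    encoder: maps a batch of sequences to ESMFold latents, shape (B, L, D).
    denoiser: the diffusion network being trained (epsilon-prediction).
    """
    with torch.no_grad():
        x0 = encoder(seq_batch)                         # clean latents from sequence alone
    t = torch.rand(x0.shape[0], device=x0.device)       # diffusion time in [0, 1)
    noise = torch.randn_like(x0)
    alpha = torch.cos(t * torch.pi / 2).view(-1, 1, 1)  # illustrative cosine schedule
    sigma = torch.sin(t * torch.pi / 2).view(-1, 1, 1)
    xt = alpha * x0 + sigma * noise                     # noised latent
    loss = torch.mean((denoiser(xt, t) - noise) ** 2)   # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()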

Approach

PLAID allows compositional control of protein function and organism type. By conditioning on Gene Ontology (GO) terms and organism labels, the model can produce all-atom proteins tailored to specific functional classes and taxonomic groups. This controllability extends naturally to motif scaffolding, binder design, and beyond.
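
As a rough illustration of what compositional conditioning can look like at sampling time, the sketch below combines function and organism guidance in the style of compositional classifier-free guidance. The denoiser signature, label indices, and guidance weights are illustrative assumptions, not the released API.

def compositional_eps(denoiser, xt, t, go_idx, org_idx, w_go=3.0, w_org=3.0):
    """Combine function and organism guidance directions (compositional CFG)."""
    eps_null = denoiser(xt, t, go=None, organism=None)    # unconditional estimate
    eps_go = denoiser(xt, t, go=go_idx, organism=None)    # function-conditioned
    eps_org = denoiser(xt, t, go=None, organism=org_idx)  # organism-conditioned
    return (eps_null
            + w_go * (eps_go - eps_null)
            + w_org * (eps_org - eps_null))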

[Figure: PLAID approach overview]

Unconditional Generation

PLAID outperforms previous methods in producing structurally diverse, designable proteins that better match real-world biophysical property distributions. It achieves improved multimodal cross-consistency and better scalability to longer protein lengths.
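
To make the cross-consistency claim concrete, one common check is sketched below: a generated (structure, sequence) pair is consistent if refolding the sequence recovers the structure and inverse-folding the structure recovers a similar sequence. Here fold, inverse_fold, and tm_score are hypothetical helpers (e.g., ESMFold, an inverse-folding model such as ProteinMPNN, and a TM-score implementation), not functions from the PLAID codebase.

def cross_consistency(gen_structure, gen_sequence, fold, inverse_fold, tm_score):
    """Score agreement between the two generated modalities."""
    refolded = fold(gen_sequence)              # refold the generated sequence
    cc_tm = tm_score(refolded, gen_structure)  # structural agreement
    redesigned = inverse_fold(gen_structure)   # redesign from the generated structure
    cc_seq = sum(a == b for a, b in zip(redesigned, gen_sequence)) / len(gen_sequence)
    return cc_tm, cc_seq                       # higher is more self-consistent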

[Figure: comparison with previous methods]

Examining Active Sites of Generations

PLAID-generated proteins match key features of natural proteins at active sites, even when the overall sequences differ. Each example shows a PLAID structure aligned with its most similar known protein structure (containing a bound molecule), found using Foldseek. RMSD and sequence-identity metrics are computed over the entire protein.
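
A rough sketch of this nearest-structure lookup is below, assuming Foldseek is installed and pdb_db points to a prebuilt structure database; the choice of output columns is illustrative.

import subprocess

def nearest_structure(query_pdb, pdb_db, out="hits.m8", tmp="tmp_foldseek"):
    """Return the top Foldseek hit for a generated structure."""
    subprocess.run(
        ["foldseek", "easy-search", query_pdb, pdb_db, out, tmp,
         "--format-output", "query,target,fident,alntmscore"],
        check=True,
    )
    with open(out) as f:
        best = f.readline().strip().split("\t")  # hits are sorted best-first
    return {"target": best[1],
            "seq_identity": float(best[2]),      # fraction of identical residues
            "tm_score": float(best[3])}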

[Figure: active-site comparisons]

Transmembrane Proteins

Generated transmembrane proteins display an appropriate spatial distribution of hydrophobic and hydrophilic residues, with hydrophobic residues concentrated in membrane-spanning regions. They also recover expected helix topologies, such as the seven transmembrane helices characteristic of GPCRs.
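
One simple way to check this pattern on a generated sequence is a Kyte-Doolittle hydropathy profile, sketched below; this is a generic analysis, not part of the PLAID pipeline. With the conventional window of 19 residues, sustained scores above roughly 1.6 suggest membrane-spanning helices.

# Kyte-Doolittle hydropathy values per amino acid.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def hydropathy_profile(seq, window=19):
    """Mean Kyte-Doolittle hydropathy over a sliding window."""
    scores = [KD[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]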

[Figure: transmembrane protein generations]

Motif Scaffolding

PLAID extends readily to other design tasks. Shown below is a motif-scaffolding demo in which the motif is held fixed and generation is conditioned on the fixed motif at each diffusion timestep; a minimal sketch of this loop follows.
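
The sketch below shows one reverse step of this replacement-style inpainting, with hypothetical helper names and an illustrative noising schedule: latent positions covered by the motif are overwritten with an appropriately noised copy of the fixed motif before each denoising update.

import torch

def scaffold_step(denoiser, xt, t, motif_latent, motif_mask, step_fn):
    """One reverse-diffusion step with the motif clamped.

    motif_mask: bool tensor, True at motif positions.
    step_fn: one reverse-diffusion update (e.g., a DDPM/DDIM step).
    """
    alpha = torch.cos(t * torch.pi / 2)             # illustrative cosine schedule
    sigma = torch.sin(t * torch.pi / 2)
    noised_motif = alpha * motif_latent + sigma * torch.randn_like(motif_latent)
    xt = torch.where(motif_mask, noised_motif, xt)  # clamp the motif region
    return step_fn(denoiser, xt, t)                 # denoise the scaffold region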

[Figure: motif scaffolding demo]

Additional Results

[Figure: additional results]

Future Directions

Our method is designed to leverage progress in data availability and model capabilities. As sequence-to-structure predictors become increasingly capable of producing structures of complexes involving ligands or nucleic acids (e.g., AlphaFold3, Boltz-1, and CHAI-1), it would be straightforward to extend the PLAID paradigm to generate those modalities as well. We hope this work can be viewed as proposing a paradigm rather than just a model, and we expect performance and capabilities to improve alongside the base models.

The current model can also be extended to more specific functions: because training requires only sequences, it is straightforward to finetune on datasets related to functions of interest. If you are interested in examining a specific protein design task in the wet lab, please reach out to us!

BibTeX

@article{lu2024generating,
  author    = {Lu, Amy X. and Yan, Wilson and Robinson, Sarah A. and Yang, Kevin K. and 
               Gligorijevic, Vladimir and Cho, Kyunghyun and Bonneau, Richard and 
               Abbeel, Pieter and Frey, Nathan},
  title     = {Generating All-Atom Protein Structure from Sequence-Only Training Data},
  journal   = {bioRxiv},
  year      = {2024},
  doi       = {10.1101/2024.12.02.626353}
}