🌱 Amy X. Lu

Hello! I'm a Computer Science PhD student at UC Berkeley (BAIR), advised by Pieter Abbeel, and a part-time researcher at Prescient Design (Genentech).

I'm broadly interested in artificial intelligence for drug discovery, especially via multimodal generation and foundation model approaches. My long-term goal is to understand general agentic behaviors in order to create scientifically native intelligence.

Previously, I was a Student Researcher at Google Brain and a Machine Learning Engineer at insitro. I completed my Master's at the University of Toronto, advised by Alan Moses and Marzyeh Ghassemi, and my undergrad at the University of Waterloo. My PhD is generously supported in part by the NSERC PGS-D award.



CV  /  Google Scholar  /  GitHub  /  Twitter  /  LinkedIn
✉️ amyxlu [at] berkeley [dot] edu

News

2024/12/09  •  In Vancouver for NeurIPS 2024 -- come say hi 👋
2024/12/06  •  Our preprint on PLAID is released 🎉
2024/10/22  •  Excited to give an invited talk at the Stanford AI + Biomedicine Seminar Series.
2024/10/11  •  Model weights for CHEAP are now released.
2024/10/08  •  Very excited to have two papers accepted as ✨oral presentations✨ at MLSB 2024!
2024/10/03  •  New preprint on understanding how training data affects protein language model likelihoods!


Research

Generating All-Atom Protein Structure from Sequence-Only Training Data

Amy X. Lu, Wilson Yan, Sarah A. Robinson, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Richard Bonneau, Pieter Abbeel, Nathan Frey
[Oral] NeurIPS Workshop on Machine Learning for Structural Biology (MLSB), 2024
bioRxiv, 2024
Paper / Code / Poster / Slides
PLAID is a multimodal generative model that produces all-atom protein structures from function and organism prompts, while requiring only sequence training data.

Protein Language Model Fitness Is a Matter of Preference

Cade Gordon, Amy X. Lu, Pieter Abbeel
[Oral] NeurIPS Workshop on Machine Learning for Structural Biology (MLSB), 2024
bioRxiv, 2024
Paper
Enabled by a one-pass pseudolikelihood algorithm, we use influence functions to show that pLM likelihoods capture artifacts of training-data selection rather than the true fitness landscape.

Tokenized and Continuous Embedding Compressions of Protein Sequence and Structure

Amy X. Lu, Wilson Yan, Kevin K. Yang, Vladimir Gligorijevic, Kyunghyun Cho, Pieter Abbeel, Richard Bonneau, Nathan Frey
bioRxiv, 2024
Paper / Poster / Code
CHEAP is a joint embedding of protein sequence and structure that can be obtained from sequence alone, and it unveils insights into the compressibility, tokenizability, and mechanistic interpretability of protein folding models.

TOPH: Adapting A Contrastive Question-Answering Framework for Protein Search

Ron Boger*, Amy X. Lu*, Seyone Chithrananda*, Kevin Yang, Petr Skopintsev, Ben Adler, Eric Wallace, Peter Yoon, Pieter Abbeel, Jennifer Doudna (*Equal Contribution.)
ICML Workshop on Computational Biology, 2023
Paper / Poster
We present a protein semantic similarity search method for RNA-guided endonuclease discovery, inspired by dense retrieval methods in open-domain question answering, and introduce a new dataset of CRISPR-Cas and evolutionarily related nucleases.

Pretraining strategies for effective promoter-driven gene expression prediction

Aniketh Janardhan Reddy, Michael H. Herschl, Sathvik Kolli, Amy X. Lu, Xinyang Geng, Aviral Kumar, Patrick D. Hsu, Sergey Levine, Nilah M. Ioannidis
bioRxiv, 2023
Paper
Pretraining and transfer learning strategies for improving model-based design of promoters for cell type-specific expression.

Data-Driven Optimization for Protein Design: Workflows, Algorithms and Metrics

Sathvik Kolli, Amy X. Lu, Xinyang Geng, Aviral Kumar, Sergey Levine
ICLR Workshop on Machine Learning for Drug Discovery (MLDD), 2022
Paper
Strategies for data curation, model training, optimization, and evaluation heuristics for data-driven design of de novo proteins.

Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning

Alex X Lu, Amy X. Lu, Iva Pritišanac, Taraneh Zarin, Julie D Forman-Kay, Alan M Moses
PLOS Computational Biology, 2022
Paper / Preprint
Reverse Homology is a self-supervised method that uses contrastive learning over homologous sequences to capture evolutionary information and discover molecular features of intrinsically disordered regions.

Learned embeddings from deep learning to visualize and predict protein sets

Christian Dallago, Konstantin Schütze, Michael Heinzinger, Tobias Olenyi, Maria Littmann, Amy X. Lu, Kevin K Yang, Seonwoo Min, Sungroh Yoon, James T Morton, Burkhard Rost
Current Protocols, 2021
Paper / Web Server / Code

Evolution Is All You Need: Phylogenetic Augmentation for Contrastive Learning

Amy X. Lu, Alex X. Lu, Alan Moses
Machine Learning for Computational Biology (MLCB), 2020
Paper / Poster
We outline how viewing evolution as a natural sequence augmentation for contrastive learning recapitulates comparative genomics and maximizes the mutual information between sequence and function.

Self-Supervised Contrastive Learning of Protein Representations by Mutual Information Maximization

Amy X. Lu, Haoran Zhang, Marzyeh Ghassemi, Alan Moses
Machine Learning for Computational Biology (MLCB), 2020
Paper / Poster / Code
CPCProt uses contrastive learning to learn parameter-efficient protein embeddings that perform competitively with large protein language models.

Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings

Haoran Zhang*, Amy X. Lu*, Mohamed Abdalla, Matthew McDermott, Marzyeh Ghassemi (*Equal Contribution.)
[Spotlight] ACM Conference on Health, Inference, and Learning (CHIL), 2020
Paper / arXiv / Poster / Code
We apply fairness definitions to quantify cross-group bias in BERT embeddings pretrained on medical notes, and find statistically significant differences in downstream classifier performance across patient groups.

The Cells Out of Sample (COOS) dataset and benchmarks for measuring out-of-sample generalization of image classifiers

Alex X Lu, Amy X. Lu, Wiebke Schormann, Marzyeh Ghassemi, David Andrews, Alan Moses
Neural Information Processing Systems (NeurIPS), 2019
Paper / arXiv
We introduce the COOS-7 dataset to benchmark the capacity of feature-learning methods to generalize to natural distribution shifts in microscopy images.

Talks

2024

Stanford AI + Biomedicine Seminar  •  Slides

ML Protein Engineering Seminar Series

South Park Commons Demo Night on Interpretability and Steerability  •  Video  •  Slides


Miscellaneous

Reviewing

Nature  •  2023
Neural Information Processing Systems (NeurIPS)  •  2024
International Conference on Learning Representations (ICLR)  •  2024
Artificial Intelligence and Statistics (AISTATS)  •  2024
Machine Learning for Health (ML4H)  •  2020 - 2024
Machine Learning for Computational Biology (MLCB)  •  2021
NeurIPS Workshop on ML for Structural Biology (MLSB)  •  2022 - 2024
ICLR Workshop on Generative and Experimental Perspectives (GEM)  •  2024
NeurIPS Workshop on Generative AI for Biology  •  2023
NeurIPS Workshop on AI for Science  •  2021 - 2023
NeurIPS Workshop on Distribution Shifts  •  2021 - 2023
ICML Workshop on AI4Science  •  2022
NeurIPS Workshop on Robustness in Sequence Modelling  •  2022

Teaching

BIOE 145: Introduction to Machine Learning for Computational Biology, UC Berkeley  •  2024
BIOL 239: Genetics, University of Waterloo  •  2016



Fun. I enjoy road biking through the East Bay redwoods, and playing the piano, especially Chopin and hip-hop covers. I'm usually coding to EDM or Beethoven's complete piano sonatas while eating 90% dark chocolate. My car and bikes are named after F. Scott Fitzgerald characters, and administrative entities call me Xiaoping Lu (逯晓萍).