Binder Design With RFdiffusion#
Designing a high-affinity binder starts with the right tools. This tutorial introduces how to use RFdiffusion on the OpenProtein AI platform, using our Python client, to generate and evaluate binder candidates against a specific protein target.
You’ll learn how to set up your environment, define a target structure and binding site constraints, configure RFdiffusion runs, submit and monitor jobs, and retrieve results programmatically.
We’ll also cover how to pass the designs to ProteinMPNN for inverse folding, which proposes sequences for each designed backbone, and then run structure prediction with ESMFold to evaluate the designed binders. Whether you’re new to RFdiffusion or looking to streamline your workflow, this guide will help you go from target definition to prioritized binder designs quickly and reproducibly.
This tutorial follows the approach described in Watson et al. (2023), “De novo design of protein structure and function with RFdiffusion”, using the publicly available 3DI3 structure (IL-7Rα) as our target. We also follow some of the methodology in Bennett et al. (2023), “Improving de novo protein binder design with deep learning”.
Prerequisites#
For this tutorial, you will need an OpenProtein Python session to access the models available on our platform and to manipulate job results, so make sure your credentials are set up!
[1]:
import openprotein
session = openprotein.connect()
session
[1]:
<openprotein.OpenProtein at 0x7f718f981400>
Target Selection#
For this tutorial, we’ll use the 3DI3 structure from the RCSB PDB, which contains the extracellular domain of human interleukin-7 receptor alpha (IL-7Rα). This receptor was one of the benchmark targets used in Watson et al. (2023) to evaluate binder design performance.
Download the structure from RCSB and load it as a Protein object:
[2]:
from pathlib import Path
from openprotein import Protein, Model
import numpy as np
import requests
DATA_DIR = Path("data/")
DATA_DIR.mkdir(exist_ok=True)
# Download 3DI3 from RCSB
pdb_id = "3DI3"
structure_filepath = DATA_DIR / f"{pdb_id}.pdb"
if not structure_filepath.exists():
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    response = requests.get(url)
    response.raise_for_status()  # fail early if the download did not succeed
    structure_filepath.write_text(response.text)
# Load the receptor chain (IL-7Ra is chain B)
target_protein = Protein.from_filepath(path=structure_filepath, chain_id="B")
print("target sequence:", target_protein.sequence)
print("target coordinates shape:", target_protein.coordinates.shape)
print("target plddt shape:", target_protein.plddt.shape)
print("target name:", target_protein.name)
target sequence: b'GSHMESGYAQNGDLEDAELDDYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTPEINNSSGEMD'
target coordinates shape: (223, 37, 3)
target plddt shape: (223,)
target name: 3DI3
Visualize#
We can visually inspect the target structure using molviewspec:
[3]:
%pip install molviewspec
[4]:
from molviewspec import create_builder
def visualize_pdb(pdb_string: str):
    builder = create_builder()
    structure = (
        builder.download(url="mystructure.pdb")
        .parse(format="pdb")
        .model_structure()
        .component()
        .representation()
        .color_from_source(
            schema="atom",
            category_name="atom_site",
            field_name="auth_asym_id",
            palette={
                "kind": "categorical",  # color by chain
                "colors": ["blue", "red", "green", "orange"],
                "mode": "ordinal",
            },
        )
    )
    builder.molstar_notebook(data={"mystructure.pdb": pdb_string}, width=500, height=400)

visualize_pdb(target_protein.make_pdb_string())
Binding region selection#
According to the supplementary information of Watson et al. (2023), the following hotspots (binding sites) were chosen for 3DI3. We will use the same ones in this walkthrough:
B58, B80, B139
Note: RFdiffusion has been trained with masking hotspots, so we only need to pick a few potential contact sites within our areas of interest. Refer to the official RFdiffusion docs for tips on picking hotspots.
To encode these into our generate query, we use the set_binding_at method for the Protein.
[5]:
from openprotein.protein import Binding
binding_sites = [58,80,139]
target_protein = target_protein.set_binding_at(binding_sites, Binding.BINDING)
# Verify the binding is set
target_protein.get_binding_at(binding_sites)
[5]:
array(['B', 'B', 'B'], dtype='<U1')
Generate designs with RFdiffusion#
Query design#
To generate a binder with RFdiffusion, we need to specify that the complex contains another, unknown chain. For this walkthrough, we’ll keep the full target chain and generate a separate binder chain of 80 residues. To encode this as a query, we first create a Protein chain of length 80. We can use Protein.from_expr as an easy constructor for specifying chains with unknown fragments.
The structure mask determines which parts of the structure should be designed. The X characters in the sequence below indicate the sequence mask, which is used by inverse folding in the step after structure generation. We can examine the structure mask using get_structure_mask.
[6]:
binder_chain = Protein.from_expr(80)
print("binder sequence:", binder_chain.sequence)
print("binder structure mask:", binder_chain.get_structure_mask())
binder sequence: b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
binder structure mask: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80]
As we can see, the whole structure of our binder chain is masked, which is telling the model to fully design the chain. And to indicate to the model that the design is to be done in the presence of another chain, we combine our binder and target Protein objects to create a Model, which represents a multimer.
But before that, let’s quickly examine the structure mask of our target protein to avoid doing unnecessary design.
[7]:
print("target structure mask:", target_protein.get_structure_mask())
target structure mask: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 214 215 216 217 218 219 220 221 222 223]
Positions 1 to 20 and 214 to 223 of the target have masked structure, so the model would treat them as regions to design. With RFdiffusion, we should drop any residues that we don’t actually want to design: this saves compute time, and leaving them in also seems to cause some errors when using the model. We should only do this after setting our hotspots or binding sites, since the deletion shifts the residue indices.
[8]:
target_protein = target_protein.delete(target_protein.get_structure_mask())
print("target structure mask:", target_protein.get_structure_mask())
target structure mask: []
That’s better. Now let’s combine our two chains to specify the full query Model object.
[9]:
query_model = target_protein & binder_chain
print("Chains in query:", list(query_model.proteins.keys()))
print("Chain A (target chain):", query_model.proteins["A"].sequence)
print("Chain B (binder chain):", query_model.proteins["B"].sequence)
Chains in query: ['A', 'B']
Chain A (target chain): b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
Chain B (binder chain): b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
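Because deleting residues shifts indices, it can be useful to check where the original hotspot positions end up in the trimmed target chain. The sketch below is plain Python and does not use the OpenProtein client. The helper name remap_positions is hypothetical, and it assumes that the hotspot positions and the structure mask are both 1-based indices into the same chain.

```python
from bisect import bisect_left


def remap_positions(positions, deleted):
    """Map 1-based positions to their new 1-based index after deleting `deleted`.

    Raises ValueError if a requested position is itself among the deleted ones.
    """
    deleted_sorted = sorted(set(deleted))
    deleted_set = set(deleted_sorted)
    remapped = []
    for pos in positions:
        if pos in deleted_set:
            raise ValueError(f"position {pos} was deleted")
        # Every deleted position before `pos` shifts it left by one
        n_removed_before = bisect_left(deleted_sorted, pos)
        remapped.append(pos - n_removed_before)
    return remapped


# Masked positions reported for the target earlier: 1-20 and 214-223
deleted = list(range(1, 21)) + list(range(214, 224))
hotspots = [58, 80, 139]
print(remap_positions(hotspots, deleted))  # [38, 60, 119]
```

Under these assumptions, the three hotspots sit at positions 38, 60 and 119 of the 193-residue trimmed chain, which is a quick way to sanity-check any later per-residue analysis.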
Run the design job#
Following Bennett et al. (2023), we reduce the noise added during generation, which has been found to help with binder design, albeit at the cost of some diversity:
[10]:
rfdiffusion_design_params = {
"denoiser.noise_scale_ca": 0.5,
"denoiser.noise_scale_frame": 0.5
}
With these inputs, we can run RFdiffusion to generate binder designs against the selected hotspots:
[11]:
# Number of designs to generate
N = 100
rfdiffusion_job = session.models.rfdiffusion.generate(
query=query_model,
N=N,
**rfdiffusion_design_params,
)
rfdiffusion_job
[11]:
RFdiffusionJob(job_id='d40002e7-95ec-41a4-9992-ab2edfea2656', job_type='/models/rfdiffusion', status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2025, 12, 22, 21, 40, 19, 404136, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None)
Wait for completion#
Wait for the designs to complete. Note that this can take some time depending on the queue:
[12]:
rfdiffusion_job.wait_until_done(verbose=True, timeout=60*60)
Waiting: 100%|██████████████████████████████████████████████████| 100/100 [00:00<00:00, 509.85it/s, status=SUCCESS]
[12]:
True
Analyze generated designs#
Let’s first retrieve our designs, and inspect the first design.
[13]:
rfdiffusion_models = rfdiffusion_job.get()
print("chains in design:", list(rfdiffusion_models[0].proteins.keys()))
print("target (chain A) sequence:", rfdiffusion_models[0].proteins["A"].sequence)
print("binder (chain B) sequence:", rfdiffusion_models[0].proteins["B"].sequence)
print("target (chain A) mask:", rfdiffusion_models[0].proteins["A"].get_structure_mask())
print("binder (chain B) mask:", rfdiffusion_models[0].proteins["B"].get_structure_mask())
chains in design: ['B', 'A']
target (chain A) sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
binder (chain B) sequence: b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
target (chain A) mask: []
binder (chain B) mask: []
As we can see, we get back two chains whose structures are fully specified (both structure masks are empty). Note that the sequence of the binder chain is still masked. We will infer it in the next step with inverse folding.
Before that, we can also quickly visually inspect the design:
[14]:
visualize_pdb(rfdiffusion_models[0].make_pdb_string())
Now let’s iterate through these designs and save them.
[15]:
from pathlib import Path

N = 100
OUTPUT_DIR = Path("data/outputs/3DI3_binder_designs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
for i in range(N):
    # Retrieve the completed design
    designed_model = rfdiffusion_models[i]
    # Save the full complex (files are numbered from 1)
    with open(OUTPUT_DIR / f"design{i+1}.pdb", "w") as f:
        f.write(designed_model.make_pdb_string())
Inverse Folding with ProteinMPNN#
Following Bennett et al. (2023), we’ll use ProteinMPNN for inverse folding to design sequences that adopt the designed binder structures.
For each of the 100 designs, we will generate 10 proposed sequences from inverse folding.
[16]:
proteinmpnn_jobs = []
for i in range(N):
    rfdiffusion_model = rfdiffusion_models[i]
    # Mask the binder sequence to indicate that it should be generated
    rfdiffusion_model.mask_sequence(chain_ids="B")
    # Use ProteinMPNN to design sequences for the binder backbone
    mpnn_job = session.models.proteinmpnn.generate(
        query=rfdiffusion_model,
        num_samples=10,
        temperature=0.1,  # Bennett et al. used low temperature
        seed=42,
    )
    proteinmpnn_jobs.append(mpnn_job)
# Wait for all jobs to complete
for mpnn_job in proteinmpnn_jobs:
    mpnn_job.wait_until_done(timeout=600)
    assert mpnn_job.status == "SUCCESS"
Let’s look at the output from one of the ProteinMPNN jobs.
[16]:
proteinmpnn_jobs[0].get()
[16]:
[Score(name='generated-sequence-1', sequence='EEEEEKEKKEEEERKKLIEEGKKAREELAKKADKALEELEKEEEEEEEEEEEEEEEEEEEEEEEEEEERRREEEEEELER', score=array([1.2638])),
Score(name='generated-sequence-2', sequence='SELEKKLEEEEKERKKLIEEGEKHREELAKKSEEALKKLEEKEKAEEAARAAEEAARRAAAAAAAAAAAAAAAAAAAAAA', score=array([1.278])),
Score(name='generated-sequence-3', sequence='EKEEEEKKKEEEELEEKIKEGEEARKKLAELSDKALKEREEKEREEEEKEEEEREEEEEEEAEEEEEEEEEEEEEEEEEE', score=array([1.3055])),
Score(name='generated-sequence-4', sequence='KEEEEKKKKEEEEKEKLIKEGKEALEERAKKAEEALAALEAEEAEREAAAAAARAAARAAAAAAAAAAAAAAAEAARAAA', score=array([1.2675])),
Score(name='generated-sequence-5', sequence='MEEEEKKKKEEEEKKKLIEEGKKAQEERAEKADKAYEELKKAEAEAEAAAAAAAAAAAAAAAAAAAAAAAAAAAALAAAA', score=array([1.2027])),
Score(name='generated-sequence-6', sequence='EEEEKEKKKEEEEKKKLIEEGEEARKKRAEEAEKALEELEKEEEEKEKKELEARLAAEKAAAAAAAAAAAAAEEAARLAA', score=array([1.3404])),
Score(name='generated-sequence-7', sequence='EEEEKKEKEEKEKKEKLIEEGKEALKKRAEESEKALEELQLKEALEELLEALLELLRELLAALEEALRLLEEELRRLEEE', score=array([1.5148])),
Score(name='generated-sequence-8', sequence='SEEEEKEKKEEEEKKKLIEEGKKAREERAKEAEKALEELEKKEEEEERRRREEREARRRREEEERERRLEEERRREEEER', score=array([1.2896])),
Score(name='generated-sequence-9', sequence='SEEEEKKKEEEEKKKKLIEEGKKAQEERAKKAEEALKKLEAAQAAEEAAKAAAAAAAAAAAAAAAAAAAAEAAAAAAAAA', score=array([1.1734])),
Score(name='generated-sequence-10', sequence='SEEEEKKEEEKKKKEELIKEAKKALEERAKKAEEALKELERKLEEEEERRRREEEERREREAEERRREEEERRRREEEER', score=array([1.2883]))]
Each of these 10 sequences corresponds to the first design from RFdiffusion. Let’s save the ProteinMPNN sequences together with the RFdiffusion designs, which gives us 1,000 candidate designs in total.
[17]:
scores = []
for i in range(N):
    rfdiffusion_model = rfdiffusion_models[i]
    mpnn_job = proteinmpnn_jobs[i]
    mpnn_results = mpnn_job.get()
    for j, (_, sequence, score) in enumerate(mpnn_results):
        # replace chain explicitly due to defensive copy
        binder = rfdiffusion_model.proteins["B"]
        binder.sequence = sequence
        rfdiffusion_model.proteins["B"] = binder
        scores.append(score.item())
        with open(f"{OUTPUT_DIR}/design{i+1}_mpnn{j+1}.pdb", "w") as f:
            f.write(rfdiffusion_model.make_pdb_string())
with open(f"{OUTPUT_DIR}/mpnn_scores.txt", "w") as f:
    f.write("\n".join([str(score) for score in scores]))
Let’s now verify that a newly designed and inverse-folded model looks correct:
[18]:
from openprotein import Model
OUTPUT_DIR = Path("data/outputs/3DI3_binder_designs")
proteinmpnn_model = Model.from_filepath(f"{OUTPUT_DIR}/design1_mpnn1.pdb")
print("chains in proteinmpnn + rfdiffusion design:", list(proteinmpnn_model.proteins.keys()))
print("target sequence:", proteinmpnn_model.proteins["A"].sequence)
print("binder sequence:", proteinmpnn_model.proteins["B"].sequence)
print("target mask:", proteinmpnn_model.proteins["A"].get_structure_mask())
print("binder mask:", proteinmpnn_model.proteins["B"].get_structure_mask())
chains in proteinmpnn + rfdiffusion design: ['B', 'A']
target sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
binder sequence: b'EEEEEKEKKEEEERKKLIEEGKKAREELAKKADKALEELEKEEEEEEEEEEEEEEEEEEEEEEEEEEERRREEEEEELER'
target mask: []
binder mask: []
Notice that what we have is a combination of the two models: the binder structure is from RFdiffusion and the inverse folded binder sequence is from ProteinMPNN. The next step is to check if the predicted multimer folds into what we expect.
Structure Prediction with ESMFold#
Whilst Bennett et al. (2023) and Watson et al. (2023) both used AlphaFold2 to re-fold their designs, we will use ESMFold instead to validate our designs.
The key insight from Bennett et al. is that AF2’s prediction confidence metrics, particularly pAE_interaction (the average pAE over interchain residue pairs), can effectively discriminate successful binders from failures, with sharp increases in success rates for designs with pAE_interaction < 10.
We can obtain the same metrics with ESMFold, which also runs a lot faster than AF2. The papers also used AF2 initial guess with templates, features that are not yet available for AF2 on our platform. This walkthrough will be updated if we add support for these features and find that the AF2 metrics perform better. The key point is that our platform allows easy drop-in replacements for various steps in your protein design pipeline.
[17]:
proteinmpnn_models = []
for i in range(N):
    for j in range(10):
        proteinmpnn_model = Model.from_filepath(f"{OUTPUT_DIR}/design{i+1}_mpnn{j+1}.pdb")
        proteinmpnn_models.append(proteinmpnn_model)
esmfold_job = session.fold.esmfold.fold(
    proteinmpnn_models
)
esmfold_job
/home/jmage/Projects/openprotein/openprotein-python-private/openprotein/base.py:147: UserWarning: The requested payload is >1MB. There might be some delays or issues in processing. If the request fails, please try again with smaller sizes.
warnings.warn(
[17]:
FoldJob(num_records=1000, job_id='80d96bc4-0d04-4aba-b2c4-1074341f18c0', job_type=<JobType.embeddings_fold: '/embeddings/fold'>, status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2025, 12, 22, 23, 25, 30, 319818, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None)
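The PDB files were written with 1-based names (design{i+1}_mpnn{j+1}.pdb) and then flattened into a single list for folding, so each position in that list corresponds to one design and one ProteinMPNN sample. The sketch below shows that bookkeeping in plain Python. The helper name flat_index_to_names is hypothetical, and it assumes that fold results come back in submission order, which is not confirmed here.

```python
def flat_index_to_names(k, samples_per_design=10):
    """Return (design_number, sample_number, filename), all numbered from 1.

    `k` is the 0-based position in the flattened list of models.
    """
    design_idx, sample_idx = divmod(k, samples_per_design)
    design_no, sample_no = design_idx + 1, sample_idx + 1
    return design_no, sample_no, f"design{design_no}_mpnn{sample_no}.pdb"


print(flat_index_to_names(0))    # (1, 1, 'design1_mpnn1.pdb')
print(flat_index_to_names(635))  # (64, 6, 'design64_mpnn6.pdb')
```

Keeping this mapping in one place helps avoid off-by-one mismatches between 0-based list positions and the 1-based file names on disk.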
Wait for completion. This will likely take around an hour.
[18]:
esmfold_job.wait_until_done(verbose=True, timeout=60*60)
Waiting: 100%|███████████████████████████████████████████████████| 100/100 [00:10<00:00, 9.25it/s, status=SUCCESS]
[18]:
True
Let’s retrieve and inspect the ESMFold fold results:
[19]:
esmfold_results = esmfold_job.get()
esmfold_seq, esmfold_model = esmfold_results[0] # a fold returns (seq, model) tuples
print("chains in folded model:", list(esmfold_model.proteins.keys()))
print("target sequence:", esmfold_model.proteins["A"].sequence)
print("binder sequence:", esmfold_model.proteins["B"].sequence)
print("target mask:", esmfold_model.proteins["A"].get_structure_mask())
print("binder mask:", esmfold_model.proteins["B"].get_structure_mask())
chains in folded model: ['A', 'B']
target sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
binder sequence: b'EEEEEKEKKEEEERKKLIEEGKKAREELAKKADKALEELEKEEEEEEEEEEEEEEEEEEEEEEEEEEERRREEEEEELER'
target mask: []
binder mask: []
As expected, the sequences are unchanged and both structure masks are empty, meaning ESMFold has predicted the whole structure of the complex.
We can also retrieve the pAE matrix predicted by ESMFold, which is a useful metric for measuring the success of our designs. Retrieving all the results could take a while.
[20]:
esmfold_pae_results = esmfold_job.pae
esmfold_seq, esmfold_complex_pae = esmfold_pae_results[0]
print("pae interaction shape:", esmfold_complex_pae.shape)
pae interaction shape: (273, 273)
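The 273 x 273 matrix covers both chains (193 target residues plus 80 binder residues), and the interaction score only uses its two off-diagonal, interchain blocks. As a minimal sketch of that calculation with NumPy, the hypothetical helper below averages the two blocks given the length of whichever chain comes first in the matrix. The chain ordering is an assumption you should verify against your own results before relying on the numbers.

```python
import numpy as np


def pae_interaction(pae, first_chain_len):
    """Mean pAE over the two interchain blocks of a square pAE matrix."""
    pae = np.asarray(pae, dtype=float)
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError("expected a square 2D pAE matrix")
    n = first_chain_len
    if not 0 < n < pae.shape[0]:
        raise ValueError("first_chain_len must split the matrix into two chains")
    block_ab = pae[:n, n:]  # rows from the first chain, columns from the second
    block_ba = pae[n:, :n]  # rows from the second chain, columns from the first
    return (block_ab.mean() + block_ba.mean()) / 2


# Toy example: a 3-residue chain followed by a 2-residue chain
toy = np.full((5, 5), 2.0)
toy[:3, 3:] = 20.0
toy[3:, :3] = 10.0
print(pae_interaction(toy, 3))  # 15.0
```

Averaging the two block means (rather than pooling all entries) matches the symmetric treatment of both directions; for two blocks of equal size the two choices give the same value.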
Ranking designs by metrics#
Following Bennett et al. (2023), we’ll rank our designs based on:
Binder pLDDT (confidence that the sequence folds to the designed structure)
Complex pAE interaction (confidence that the binder forms the intended interface)
Binder backbone RMSD to the designed structure
[21]:
import pandas as pd
design_files = []
plddt_scores = []
pae_scores = []
rmsd_scores = []
for i in range(N*10):
    # Get ESMFold predictions
    _, esmfold_model = esmfold_results[i]
    # Labels here are 0-based; the saved files are numbered from 1
    # (design{i//10 + 1}_mpnn{i%10 + 1}.pdb)
    design_files.append(f"design{i//10}_mpnn{i%10}.pdb")
    target = esmfold_model.proteins["A"]
    binder = esmfold_model.proteins["B"]
    # Get mean pLDDT of the binder
    plddt_score = np.mean(binder.plddt)
    plddt_scores.append(plddt_score)
    # Get pAE
    _, esmfold_complex_pae = esmfold_pae_results[i]
    binder_target_pae = esmfold_complex_pae.squeeze() # squeeze the shape
    # Average the two interchain blocks; this slicing treats the first
    # len(binder) rows/columns of the matrix as the binder
    pae_interaction_1 = np.mean(binder_target_pae[len(binder):,:len(binder)])
    pae_interaction_2 = np.mean(binder_target_pae[:len(binder),len(binder):])
    pae_interaction_total = (pae_interaction_1 + pae_interaction_2) / 2
    pae_scores.append(pae_interaction_total)
    # RMSD between designed binder and folded binder
    designed_binder = rfdiffusion_models[i//10].proteins["B"]
    folded_binder = binder
    binder_rmsd = designed_binder.rmsd(folded_binder, backbone_only=True)
    rmsd_scores.append(binder_rmsd)
df = pd.DataFrame({"design_file": design_files, "plddt": plddt_scores, "pae": pae_scores, "rmsd": rmsd_scores})
print(df.head(10))
design_file plddt pae rmsd
0 design0_mpnn0.pdb 76.457253 25.671998 2.206150
1 design0_mpnn1.pdb 77.294754 27.315724 2.512000
2 design0_mpnn2.pdb 76.244125 26.871473 2.287846
3 design0_mpnn3.pdb 75.910248 27.265617 1.644508
4 design0_mpnn4.pdb 53.631123 27.608766 8.185164
5 design0_mpnn5.pdb 78.709000 27.659219 1.758274
6 design0_mpnn6.pdb 72.471497 27.965097 32.389101
7 design0_mpnn7.pdb 76.930008 27.544944 1.285401
8 design0_mpnn8.pdb 72.673996 27.675875 2.939884
9 design0_mpnn9.pdb 80.238121 27.472807 2.103914
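For readers who want to see what an RMSD after optimal superposition involves, here is a minimal NumPy sketch of the Kabsch algorithm on two matched coordinate sets (for example, Cα atoms). It is an illustration only: the function name kabsch_rmsd is hypothetical, and whether the client's rmsd method superimposes the structures in the same way is not stated in this tutorial.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between matched (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove translation by centering both sets on their centroids
    P_c = P - P.mean(axis=0)
    Q_c = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD give the optimal rotation
    H = P_c.T @ Q_c
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P_c @ R.T  # apply R to each row vector of P
    return float(np.sqrt(np.mean(np.sum((P_rot - Q_c) ** 2, axis=1))))


# A rotated and translated copy should superimpose almost perfectly
rng = np.random.default_rng(0)
coords = rng.normal(size=(80, 3))
theta = np.pi / 3
rot_z = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
moved = coords @ rot_z.T + np.array([5.0, -2.0, 1.0])
print(kabsch_rmsd(coords, moved) < 1e-8)  # True
```

Rows with very large RMSD values, such as the entry above 30 in the table, are worth inspecting visually before drawing conclusions from them.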
Analysis and Ranking#
Let’s rank the designs by their ESMFold metrics, sorting by pLDDT first, then by pAE interaction, then by RMSD:
[22]:
import pandas as pd
df_sorted = df.sort_values(by=["plddt", "pae", "rmsd"], ascending=[False, True, True])
print(df_sorted.head(10))
# Save rankings
df_sorted.to_csv(OUTPUT_DIR / "rankings.csv", index=False)
design_file plddt pae rmsd
635 design63_mpnn5.pdb 84.924500 26.963812 0.389475
637 design63_mpnn7.pdb 84.507500 27.310462 0.527650
639 design63_mpnn9.pdb 84.464996 27.525507 0.401737
56 design5_mpnn6.pdb 84.348129 26.986829 3.080789
879 design87_mpnn9.pdb 84.326874 28.125531 33.819915
636 design63_mpnn6.pdb 84.314255 25.902550 0.429665
638 design63_mpnn8.pdb 84.234756 27.267720 0.369650
935 design93_mpnn5.pdb 84.018005 28.145075 0.630338
871 design87_mpnn1.pdb 84.014374 27.292237 33.734930
631 design63_mpnn1.pdb 83.806129 26.773962 0.511148
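Sorting by pLDDT alone can leave designs with a poor pAE interaction or a very large RMSD near the top of the list, so it may help to filter on thresholds before ranking. The pandas sketch below uses the pAE interaction cutoff of 10 mentioned earlier from Bennett et al.; the pLDDT and RMSD cutoffs, the helper name filter_and_rank, and the toy values are hypothetical placeholders rather than recommendations.

```python
import pandas as pd


def filter_and_rank(df, max_pae=10.0, min_plddt=80.0, max_rmsd=2.0):
    """Keep rows passing all three cutoffs, then rank by pAE and RMSD (lower is better)."""
    passed = df[
        (df["pae"] < max_pae)
        & (df["plddt"] > min_plddt)
        & (df["rmsd"] < max_rmsd)
    ]
    return passed.sort_values(by=["pae", "rmsd"], ascending=[True, True]).reset_index(drop=True)


# Toy values for illustration only (not results from this tutorial)
toy = pd.DataFrame({
    "design_file": ["a.pdb", "b.pdb", "c.pdb", "d.pdb"],
    "plddt": [85.0, 90.0, 84.0, 70.0],
    "pae": [8.0, 6.5, 27.0, 5.0],
    "rmsd": [0.5, 33.0, 0.4, 0.6],
})
print(filter_and_rank(toy)["design_file"].tolist())  # ['a.pdb']
```

Note that none of the rows displayed in this tutorial has a pAE interaction below 10, so a filter like this would reject all of them; the cutoffs should be tuned to the folding model you use.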
Summary#
In this tutorial, we’ve demonstrated the deep learning-augmented binder design workflow using RFdiffusion, ProteinMPNN and ESMFold:
Target Selection: Downloaded 3DI3 structure from RCSB PDB
Hotspot Identification: Selected hotspot residues following the supplementary information of Watson et al. (2023)
Structure Generation: Used RFdiffusion to generate binder backbones
Sequence Design: Applied ProteinMPNN for fast, efficient sequence design
Validation: Used ESMFold to rank designs based on:
Monomer folding confidence (pLDDT)
Complex formation confidence (pAE interaction)
Structural accuracy (RMSD)
Bennett et al. (2023) report that this kind of workflow achieves nearly 10-fold higher success rates than purely physics-based methods, by using deep learning models to identify Type I failures (sequences that don’t fold as intended) and Type II failures (structures that don’t bind as intended).
Next Steps#
The top-ranked designs from this workflow can be:
Expressed and purified for experimental validation
Tested for binding affinity
Further optimized through additional rounds of design