Binder Design With RFdiffusion#

Designing a high-affinity binder starts with the right tools. This tutorial introduces how to use RFdiffusion on the OpenProtein AI platform, using our Python client, to generate and evaluate binder candidates against a specific protein target.

You’ll learn how to set up your environment, define a target structure and binding site constraints, configure RFdiffusion runs, submit and monitor jobs, and retrieve results programmatically.

We’ll also cover how to pass the designs to ProteinMPNN for inverse folding, which suggests suitable protein sequences, and then run structure prediction with ESMFold to evaluate the designed binders. Whether you’re new to RFdiffusion or looking to streamline your workflow, this guide will help you go from target definition to prioritized binder designs quickly and reproducibly.

This tutorial follows the approach described in Watson et al. (2023), “De novo design of protein structure and function with RFdiffusion”, using the publicly available 3DI3 structure (IL-7Rα) as our target. We also follow some of the methodology in Bennett et al. (2023), “Improving de novo protein binder design with deep learning”.

Prerequisites#

For this tutorial, you will need an OpenProtein Python session to access the models available on our platform and to work with job results, so make sure you have your credentials set up!

[1]:
import openprotein
session = openprotein.connect()
session
[1]:
<openprotein.OpenProtein at 0x7f718f981400>

Target Selection#

For this tutorial, we’ll use the 3DI3 structure from the RCSB PDB, which contains the extracellular domain of human interleukin-7 receptor alpha (IL-7Rα). This receptor was one of the benchmark targets used in Watson et al. (2023) for evaluating binder design performance.

Download the structure from RCSB and load it as a Protein object:

[2]:
from pathlib import Path
from openprotein import Protein, Model
import numpy as np
import requests

DATA_DIR = Path("data/")
DATA_DIR.mkdir(exist_ok=True)

# Download 3DI3 from RCSB
pdb_id = "3DI3"
structure_filepath = DATA_DIR / f"{pdb_id}.pdb"

if not structure_filepath.exists():
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early instead of saving an error page
    structure_filepath.write_text(response.text)

# Load the receptor chain (IL-7Ra is chain B)
target_protein = Protein.from_filepath(path=structure_filepath, chain_id="B")
print("target sequence:", target_protein.sequence)
print("target coordinates shape:", target_protein.coordinates.shape)
print("target plddt shape:", target_protein.plddt.shape)
print("target name:", target_protein.name)
target sequence: b'GSHMESGYAQNGDLEDAELDDYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTPEINNSSGEMD'
target coordinates shape: (223, 37, 3)
target plddt shape: (223,)
target name: 3DI3

Visualize#

We can visually inspect the target structure using molviewspec:

[3]:
%pip install molviewspec
[4]:
from molviewspec import create_builder

def visualize_pdb(pdb_string: str):
    builder = create_builder()
    structure = builder.download(url="mystructure.pdb")\
        .parse(format="pdb")\
        .model_structure()\
        .component()\
        .representation()\
        .color_from_source(schema="atom",
                            category_name="atom_site",
                            field_name="auth_asym_id",
                            palette={"kind": "categorical", # color by chain
                                    "colors": ["blue", "red", "green", "orange"],
                                    "mode": "ordinal"}
                          )
    builder.molstar_notebook(data={'mystructure.pdb': pdb_string}, width=500, height=400)

visualize_pdb(target_protein.make_pdb_string())

Binding region selection#

According to the supplementary information of Watson et al. (2023), the following hotspots (binding sites) were chosen for 3DI3. We will use the same ones for our walkthrough:

  • B58, B80, B139

Note: RFdiffusion was trained with hotspots partially masked, so we only need to pick a few potential contact sites within our area of interest. Refer to the official RFdiffusion docs for tips on picking hotspots.

To encode these into our generate query, we use the set_binding_at method for the Protein.

[5]:
from openprotein.protein import Binding

binding_sites = [58,80,139]
target_protein = target_protein.set_binding_at(binding_sites, Binding.BINDING)
# Verify the binding is set
target_protein.get_binding_at(binding_sites)
[5]:
array(['B', 'B', 'B'], dtype='<U1')
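
The hotspots above are quoted in PDB numbering (chain B, residues 58, 80 and 139). This walkthrough does not state whether set_binding_at expects PDB residue numbers or positions along the loaded chain, so it can be worth checking how the two relate for your structure. The sketch below uses only the standard library; pdb_positions is a hypothetical helper, not part of the OpenProtein client. It counts only residues that have ATOM records and ignores insertion codes, so the offset will differ if the loaded sequence also contains unresolved residues.

```python
def pdb_positions(pdb_text: str, chain_id: str) -> dict[int, int]:
    """Map PDB residue numbers to 1-based positions among resolved residues.

    Hypothetical helper for sanity-checking hotspot numbering. Insertion
    codes are ignored, and only ATOM records are considered.
    """
    positions: dict[int, int] = {}
    for line in pdb_text.splitlines():
        # Fixed-column PDB format: chain ID is column 22,
        # residue sequence number is columns 23-26.
        if line.startswith("ATOM") and len(line) >= 26 and line[21] == chain_id:
            res_num = int(line[22:26])
            if res_num not in positions:
                positions[res_num] = len(positions) + 1
    return positions


# Example usage with the file downloaded earlier in this tutorial:
# mapping = pdb_positions(structure_filepath.read_text(), "B")
# print({r: mapping.get(r) for r in (58, 80, 139)})
```

If the printed positions differ from the PDB numbers, decide which numbering your query should use before setting the binding sites.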

Generate designs with RFdiffusion#

Query design#

To generate a binder with RFdiffusion, we need to specify that there is another, unknown chain. For this walkthrough, we’ll keep the full target chain and generate a separate binder chain of 80 residues. To encode this as a Query, we first create a Protein chain of length 80. We can use Protein.from_expr as an easy constructor for specifying chains with unknown fragments.

The structure mask determines which part of the structure should be designed. The X characters in the sequence below form the sequence mask, which is used in inverse folding, a step we run after generating the structure designs. We can examine the structure mask using get_structure_mask.

[6]:
binder_chain = Protein.from_expr(80)
print("binder sequence:", binder_chain.sequence)
print("binder structure mask:", binder_chain.get_structure_mask())
binder sequence: b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
binder structure mask: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
 73 74 75 76 77 78 79 80]

As we can see, the whole structure of our binder chain is masked, which tells the model to design the chain in full. To indicate to the model that the design should be done in the presence of another chain, we combine our binder and target Protein objects into a Model, which represents a multimer.

But before that, let’s quickly examine the structure mask of our target protein to avoid doing unnecessary design.

[7]:
print("target structure mask:", target_protein.get_structure_mask())
target structure mask: [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20 214 215 216 217 218 219 220 221 222 223]

This is important with RFdiffusion: we should drop any target residues that we don’t actually want to design. Dropping them saves compute time, and leaving them in also seems to cause errors when running the model. We should only do this after setting our hotspots or binding sites, since the deletion shifts our residue indices.

[8]:
target_protein = target_protein.delete(target_protein.get_structure_mask())
print("target structure mask:", target_protein.get_structure_mask())
target structure mask: []
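
Because the deletion shifts residue indices, any positions you track outside the Protein object (for example in notes or a spreadsheet) need to be renumbered as well. The following is a minimal sketch of that bookkeeping, assuming 1-based positions like those printed by get_structure_mask; remap_positions is a hypothetical helper and not part of the OpenProtein client.

```python
from bisect import bisect_left


def remap_positions(positions, deleted):
    """Return {old_position: new_position} after removing `deleted` positions.

    All positions are 1-based. A position that is itself deleted has no
    new index, so it raises ValueError.
    """
    deleted_sorted = sorted(set(deleted))
    deleted_set = set(deleted_sorted)
    remapped = {}
    for pos in positions:
        if pos in deleted_set:
            raise ValueError(f"position {pos} was deleted")
        # Number of deleted positions that come before `pos`.
        shift = bisect_left(deleted_sorted, pos)
        remapped[pos] = pos - shift
    return remapped


# The mask printed above covers positions 1-20 and 214-223.
deleted = list(range(1, 21)) + list(range(214, 224))
print(remap_positions([58, 80, 139], deleted))
# {58: 38, 80: 60, 139: 119}
```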

That’s better. Now let’s combine our two chains to specify the full query Model object.

[9]:
query_model = target_protein & binder_chain
print("Chains in query:", list(query_model.proteins.keys()))
print("Chain A (target chain):", query_model.proteins["A"].sequence)
print("Chain B (binder chain):", query_model.proteins["B"].sequence)
Chains in query: ['A', 'B']
Chain A (target chain): b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
Chain B (binder chain): b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

Run the design job#

Following Bennett et al. (2023), we reduce the noise added during generation, which has been found to help with binder design, albeit at the cost of some diversity:

[10]:
rfdiffusion_design_params = {
    "denoiser.noise_scale_ca": 0.5,
    "denoiser.noise_scale_frame": 0.5
}

With these inputs, we can run RFdiffusion to generate binder designs against the selected hotspots:

[11]:
# Number of designs to generate
N = 100
rfdiffusion_job = session.models.rfdiffusion.generate(
    query=query_model,
    N=N,
    **rfdiffusion_design_params,
)
rfdiffusion_job
[11]:
RFdiffusionJob(job_id='d40002e7-95ec-41a4-9992-ab2edfea2656', job_type='/models/rfdiffusion', status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2025, 12, 22, 21, 40, 19, 404136, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None)

Wait for completion#

Wait for the designs to complete. Note that this can take some time depending on the queue:

[12]:
rfdiffusion_job.wait_until_done(verbose=True, timeout=60*60)
Waiting: 100%|██████████████████████████████████████████████████| 100/100 [00:00<00:00, 509.85it/s, status=SUCCESS]
[12]:
True

Analyze generated designs#

Let’s first retrieve our designs, and inspect the first design.

[13]:
rfdiffusion_models = rfdiffusion_job.get()
print("chains in design:", list(rfdiffusion_models[0].proteins.keys()))
print("target (chain A) sequence:", rfdiffusion_models[0].proteins["A"].sequence)
print("binder (chain B) sequence:", rfdiffusion_models[0].proteins["B"].sequence)
print("target (chain A) mask:", rfdiffusion_models[0].proteins["A"].get_structure_mask())
print("binder (chain B) mask:", rfdiffusion_models[0].proteins["B"].get_structure_mask())
chains in design: ['B', 'A']
target (chain A) sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
binder (chain B) sequence: b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
target (chain A) mask: []
binder (chain B) mask: []

As we can see, the job returns two chains whose structures are fully specified (both structure masks are empty). Note also that the sequence of our binder chain remains masked. We will infer the masked sequence in the next step with inverse folding.

Before that, we can also quickly visually inspect the design:

[14]:
visualize_pdb(rfdiffusion_models[0].make_pdb_string())

Now let’s iterate through these designs and save them.

[15]:
from pathlib import Path

N = 100  # must match the number of designs requested above
OUTPUT_DIR = Path("data/outputs/3DI3_binder_designs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for i in range(N):
    # Retrieve the completed design
    designed_model = rfdiffusion_models[i]

    # Save the full complex (files are numbered from 1)
    with open(OUTPUT_DIR / f"design{i+1}.pdb", "w") as f:
        f.write(designed_model.make_pdb_string())

Inverse Folding with ProteinMPNN#

Following Bennett et al. (2023), we’ll use ProteinMPNN for inverse folding to design sequences that adopt the designed binder structures.

For each of the 100 designs, we will generate 10 proposed sequences from inverse folding.

[16]:
proteinmpnn_jobs = []
for i in range(N):
    rfdiffusion_model = rfdiffusion_models[i]
    # Mask the binder sequence to indicate that it should be generated
    rfdiffusion_model.mask_sequence(chain_ids="B")

    # Use ProteinMPNN to design sequences for the binder backbone
    mpnn_job = session.models.proteinmpnn.generate(
        query=rfdiffusion_model,
        num_samples=10,
        temperature=0.1,  # Bennett et al. used low temperature
        seed=42,
    )
    proteinmpnn_jobs.append(mpnn_job)

# Wait for all jobs to complete
for mpnn_job in proteinmpnn_jobs:
    mpnn_job.wait_until_done(timeout=600)
    assert mpnn_job.status == "SUCCESS"

Let’s look at the output from one of the ProteinMPNN jobs.

[16]:
proteinmpnn_jobs[0].get()
[16]:
[Score(name='generated-sequence-1', sequence='EEEEEKEKKEEEERKKLIEEGKKAREELAKKADKALEELEKEEEEEEEEEEEEEEEEEEEEEEEEEEERRREEEEEELER', score=array([1.2638])),
 Score(name='generated-sequence-2', sequence='SELEKKLEEEEKERKKLIEEGEKHREELAKKSEEALKKLEEKEKAEEAARAAEEAARRAAAAAAAAAAAAAAAAAAAAAA', score=array([1.278])),
 Score(name='generated-sequence-3', sequence='EKEEEEKKKEEEELEEKIKEGEEARKKLAELSDKALKEREEKEREEEEKEEEEREEEEEEEAEEEEEEEEEEEEEEEEEE', score=array([1.3055])),
 Score(name='generated-sequence-4', sequence='KEEEEKKKKEEEEKEKLIKEGKEALEERAKKAEEALAALEAEEAEREAAAAAARAAARAAAAAAAAAAAAAAAEAARAAA', score=array([1.2675])),
 Score(name='generated-sequence-5', sequence='MEEEEKKKKEEEEKKKLIEEGKKAQEERAEKADKAYEELKKAEAEAEAAAAAAAAAAAAAAAAAAAAAAAAAAAALAAAA', score=array([1.2027])),
 Score(name='generated-sequence-6', sequence='EEEEKEKKKEEEEKKKLIEEGEEARKKRAEEAEKALEELEKEEEEKEKKELEARLAAEKAAAAAAAAAAAAAEEAARLAA', score=array([1.3404])),
 Score(name='generated-sequence-7', sequence='EEEEKKEKEEKEKKEKLIEEGKEALKKRAEESEKALEELQLKEALEELLEALLELLRELLAALEEALRLLEEELRRLEEE', score=array([1.5148])),
 Score(name='generated-sequence-8', sequence='SEEEEKEKKEEEEKKKLIEEGKKAREERAKEAEKALEELEKKEEEEERRRREEREARRRREEEERERRLEEERRREEEER', score=array([1.2896])),
 Score(name='generated-sequence-9', sequence='SEEEEKKKEEEEKKKKLIEEGKKAQEERAKKAEEALKKLEAAQAAEEAAKAAAAAAAAAAAAAAAAAAAAEAAAAAAAAA', score=array([1.1734])),
 Score(name='generated-sequence-10', sequence='SEEEEKKEEEKKKKEELIKEAKKALEERAKKAEEALKELERKLEEEEERRRREEEERREREAEERRREEEERRRREEEER', score=array([1.2883]))]
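
Several of the sequences above contain long runs of a single residue (for example poly-A or poly-E stretches). Before spending compute on folding, it can help to flag such low-complexity candidates. The sketch below is one simple way to do that with the standard library; the function names and the default thresholds (a run longer than 8, or one residue making up more than half of the sequence) are illustrative assumptions, not values taken from the papers cited in this tutorial.

```python
from collections import Counter
from itertools import groupby


def longest_run(sequence: str) -> int:
    """Length of the longest stretch of one repeated residue."""
    return max((sum(1 for _ in group) for _, group in groupby(sequence)), default=0)


def top_residue_fraction(sequence: str) -> float:
    """Fraction of the sequence taken up by its most common residue."""
    if not sequence:
        return 0.0
    return Counter(sequence).most_common(1)[0][1] / len(sequence)


def is_low_complexity(sequence: str, max_run: int = 8, max_fraction: float = 0.5) -> bool:
    """Flag sequences with a long homopolymer run or a dominant residue.

    The thresholds are illustrative defaults; tune them for your project.
    """
    return longest_run(sequence) > max_run or top_residue_fraction(sequence) > max_fraction


# Example usage on ProteinMPNN results (each result exposes a `sequence`):
# flagged = [r.name for r in proteinmpnn_jobs[0].get() if is_low_complexity(r.sequence)]
```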

Each of these 10 sequences correspond to the first design from RFdiffusion. Let’s save the ProteinMPNN predictions together with the RFdiffusion designs so that we have a 1000 of these potential designs.

[17]:
scores = []
for i in range(N):
    rfdiffusion_model = rfdiffusion_models[i]
    mpnn_job = proteinmpnn_jobs[i]
    mpnn_results = mpnn_job.get()
    for j, (_, sequence, score) in enumerate(mpnn_results):
        # replace chain explicitly due to defensive copy
        binder = rfdiffusion_model.proteins["B"]
        binder.sequence = sequence
        rfdiffusion_model.proteins["B"] = binder
        scores.append(score.item())
        with open(f"{OUTPUT_DIR}/design{i+1}_mpnn{j+1}.pdb", "w") as f:
            f.write(rfdiffusion_model.make_pdb_string())
with open(f"{OUTPUT_DIR}/mpnn_scores.txt", "w") as f:
    f.write("\n".join([str(score) for score in scores]))

Let’s now verify that our newly designed and inverse-folded model looks correct:

[18]:
from openprotein import Model

OUTPUT_DIR = Path("data/outputs/3DI3_binder_designs")

proteinmpnn_model = Model.from_filepath(f"{OUTPUT_DIR}/design1_mpnn1.pdb")
print("chains in proteinmpnn + rfdiffusion design:", list(proteinmpnn_model.proteins.keys()))
print("target sequence:", proteinmpnn_model.proteins["A"].sequence)
print("binder sequence:", proteinmpnn_model.proteins["B"].sequence)
print("target mask:", proteinmpnn_model.proteins["A"].get_structure_mask())
print("binder mask:", proteinmpnn_model.proteins["B"].get_structure_mask())
chains in proteinmpnn + rfdiffusion design: ['B', 'A']
target sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
binder sequence: b'EEEEEKEKKEEEERKKLIEEGKKAREELAKKADKALEELEKEEEEEEEEEEEEEEEEEEEEEEEEEEERRREEEEEELER'
target mask: []
binder mask: []

Notice that what we have is a combination of the two models: the binder structure is from RFdiffusion and the inverse folded binder sequence is from ProteinMPNN. The next step is to check if the predicted multimer folds into what we expect.

Structure Prediction with ESMFold#

Whilst Bennett et al. (2023) and Watson et al. (2023) both used AlphaFold2 to re-fold their designs, we will use ESMFold instead to validate our designs.

The key insight from Bennett et al. (2023) is that AF2’s prediction confidence metrics (particularly pAE_interaction) can effectively discriminate successful binders from failures. They found that the pAE_interaction metric (the average pAE over interchain residue pairs) was extremely effective at identifying successful binders, with sharp increases in success rates for designs with pAE_interaction < 10.

We can obtain these same metrics with ESMFold, which also runs a lot faster than AF2. The papers also used AF2 initial guess with templates, features that are not yet available for AF2 on our platform. This walkthrough will be updated if we add support for these features and find that the AF2 metrics perform better. The key point is that our platform allows easy drop-in replacements for various steps in your protein design pipeline.

[17]:
proteinmpnn_models = []
for i in range(N):
    for j in range(10):
        proteinmpnn_model = Model.from_filepath(f"{OUTPUT_DIR}/design{i+1}_mpnn{j+1}.pdb")
        proteinmpnn_models.append(proteinmpnn_model)
esmfold_job = session.fold.esmfold.fold(
   proteinmpnn_models
)
esmfold_job
/home/jmage/Projects/openprotein/openprotein-python-private/openprotein/base.py:147: UserWarning: The requested payload is >1MB. There might be some delays or issues in processing. If the request fails, please try again with smaller sizes.
  warnings.warn(
[17]:
FoldJob(num_records=1000, job_id='80d96bc4-0d04-4aba-b2c4-1074341f18c0', job_type=<JobType.embeddings_fold: '/embeddings/fold'>, status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2025, 12, 22, 23, 25, 30, 319818, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None)

Wait for completion. This will likely take around an hour.

[18]:
esmfold_job.wait_until_done(verbose=True, timeout=60*60)
Waiting: 100%|███████████████████████████████████████████████████| 100/100 [00:10<00:00,  9.25it/s, status=SUCCESS]
[18]:
True

Let’s retrieve and inspect the ESMFold fold results:

[19]:
esmfold_results = esmfold_job.get()
esmfold_seq, esmfold_model = esmfold_results[0] # a fold returns (seq, model) tuples
print("chains in folded model:", list(esmfold_model.proteins.keys()))
print("target sequence:", esmfold_model.proteins["A"].sequence)
print("binder sequence:", esmfold_model.proteins["B"].sequence)
print("target mask:", esmfold_model.proteins["A"].get_structure_mask())
print("binder mask:", esmfold_model.proteins["B"].get_structure_mask())
chains in folded model: ['A', 'B']
target sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
binder sequence: b'EEEEEKEKKEEEERKKLIEEGKKAREELAKKADKALEELEKEEEEEEEEEEEEEEEEEEEEEEEEEEERRREEEEEELER'
target mask: []
binder mask: []

As expected, our sequences are the same and the structure masks are empty, meaning the whole structure of the complex has been predicted by ESMFold.

We can also retrieve the pAE matrices predicted by ESMFold, which are useful as metrics for measuring the success of our designs. Retrieving all the results could take a while.

[20]:
esmfold_pae_results = esmfold_job.pae
esmfold_seq, esmfold_complex_pae = esmfold_pae_results[0]
print("pae interaction shape:", esmfold_complex_pae.shape)
pae interaction shape: (273, 273)
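
The (273, 273) matrix covers both chains: 273 − 80 binder residues leaves 193 target residues. To turn it into a single interaction score, we average the two off-diagonal blocks that pair residues from different chains. The sketch below shows this with NumPy. The function name is hypothetical, and it assumes you know which chain comes first along the matrix axes, which this walkthrough does not state; the split index must be the length of that first chain (193 if the target is first, 80 if the binder is first).

```python
import numpy as np


def pae_interaction(pae, first_chain_len: int) -> float:
    """Mean pAE over interchain residue pairs of a two-chain complex.

    `pae` is a square (L, L) matrix (a leading singleton axis is allowed).
    `first_chain_len` is the length of the chain that occupies the first
    rows and columns; this ordering is an assumption you must verify.
    """
    pae = np.asarray(pae, dtype=float).squeeze()
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {pae.shape}")
    n = first_chain_len
    if not 0 < n < pae.shape[0]:
        raise ValueError("first_chain_len must split the matrix into two chains")
    first_vs_second = pae[:n, n:].mean()  # rows: first chain, cols: second chain
    second_vs_first = pae[n:, :n].mean()  # rows: second chain, cols: first chain
    return float((first_vs_second + second_vs_first) / 2)


# Example usage, assuming the target (193 residues) comes first:
# score = pae_interaction(esmfold_complex_pae, first_chain_len=193)
```

Passing the wrong split index mixes intrachain and interchain pairs, so it is worth confirming the chain order on one design before scoring all of them.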

Ranking designs by metrics#

Following Bennett et al. (2023), we’ll rank our designs based on:

  1. Monomer pLDDT (confidence that sequence folds to designed structure)

  2. Complex pAE interaction (confidence that binder forms intended interface)

  3. Complex Cα RMSD to designed structure

[21]:
import pandas as pd

design_files = []
plddt_scores = []
pae_scores = []
rmsd_scores = []
for i in range(N*10):
    # Get ESMFold predictions
    _, esmfold_model = esmfold_results[i]

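    # Note: these labels count from 0, whereas the files saved earlier
    # count from 1 (design{i+1}_mpnn{j+1}.pdb)
    design_files.append(f"design{i//10}_mpnn{i%10}.pdb")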

    target = esmfold_model.proteins["A"]
    binder = esmfold_model.proteins["B"]

    # Get pLDDT of binder
    plddt_score = np.mean(binder.plddt)
    plddt_scores.append(plddt_score)

    # Get pAE
    _, esmfold_complex_pae = esmfold_pae_results[i]
    binder_target_pae = esmfold_complex_pae.squeeze() # squeeze the shape
    pae_interaction_1 = np.mean(binder_target_pae[len(binder):,:len(binder)])
    pae_interaction_2 = np.mean(binder_target_pae[:len(binder),len(binder):])
    pae_interaction_total = (pae_interaction_1 + pae_interaction_2) / 2
    pae_scores.append(pae_interaction_total)

    # RMSD between designed binder and folded binder
    designed_binder = rfdiffusion_models[i//10].proteins["B"]
    folded_binder = binder

    binder_rmsd = designed_binder.rmsd(folded_binder, backbone_only=True)
    rmsd_scores.append(binder_rmsd)

df = pd.DataFrame({"design_file": design_files, "plddt": plddt_scores, "pae": pae_scores, "rmsd": rmsd_scores})
print(df.head(10))
         design_file      plddt        pae       rmsd
0  design0_mpnn0.pdb  76.457253  25.671998   2.206150
1  design0_mpnn1.pdb  77.294754  27.315724   2.512000
2  design0_mpnn2.pdb  76.244125  26.871473   2.287846
3  design0_mpnn3.pdb  75.910248  27.265617   1.644508
4  design0_mpnn4.pdb  53.631123  27.608766   8.185164
5  design0_mpnn5.pdb  78.709000  27.659219   1.758274
6  design0_mpnn6.pdb  72.471497  27.965097  32.389101
7  design0_mpnn7.pdb  76.930008  27.544944   1.285401
8  design0_mpnn8.pdb  72.673996  27.675875   2.939884
9  design0_mpnn9.pdb  80.238121  27.472807   2.103914
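
The RMSD column above comes from the client’s rmsd method, and this walkthrough does not describe how that method superimposes the two structures. For reference, a superposition-based RMSD can be computed with the Kabsch algorithm, sketched below with NumPy. The function name is hypothetical, and the inputs are assumed to be matched (N, 3) coordinate arrays in the same residue order, for example the Cα atoms of the designed and the folded binder.

```python
import numpy as np


def kabsch_rmsd(coords_a, coords_b) -> float:
    """RMSD between two matched point sets after optimal superposition."""
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    if a.shape != b.shape or a.ndim != 2 or a.shape[1] != 3:
        raise ValueError("inputs must both have shape (N, 3)")

    # Remove translation by centering both sets on their centroids.
    a_centered = a - a.mean(axis=0)
    b_centered = b - b.mean(axis=0)

    # Optimal rotation from the SVD of the covariance matrix.
    covariance = a_centered.T @ b_centered
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T

    a_aligned = a_centered @ rotation.T
    squared_dist = np.sum((a_aligned - b_centered) ** 2, axis=1)
    return float(np.sqrt(squared_dist.mean()))
```

Comparing this value with the reported one for a few designs is a quick way to check whether very large RMSDs reflect a genuinely different fold or just a difference in placement.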

Analysis and Ranking#

Let’s rank the designs by their ESMFold metrics:

[22]:
import pandas as pd

df_sorted = df.sort_values(by=["plddt", "pae", "rmsd"], ascending=[False, True, True])

print(df_sorted.head(10))

# Save rankings
df_sorted.to_csv(OUTPUT_DIR / "rankings.csv", index=False)
            design_file      plddt        pae       rmsd
635  design63_mpnn5.pdb  84.924500  26.963812   0.389475
637  design63_mpnn7.pdb  84.507500  27.310462   0.527650
639  design63_mpnn9.pdb  84.464996  27.525507   0.401737
56    design5_mpnn6.pdb  84.348129  26.986829   3.080789
879  design87_mpnn9.pdb  84.326874  28.125531  33.819915
636  design63_mpnn6.pdb  84.314255  25.902550   0.429665
638  design63_mpnn8.pdb  84.234756  27.267720   0.369650
935  design93_mpnn5.pdb  84.018005  28.145075   0.630338
871  design87_mpnn1.pdb  84.014374  27.292237  33.734930
631  design63_mpnn1.pdb  83.806129  26.773962   0.511148
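
Sorting on several columns uses the later columns only to break ties, so a design with a high pLDDT but an RMSD above 30 Å can still appear near the top, as in the table above. An alternative is to filter on each metric first and rank only the designs that pass. The sketch below does this on plain dictionaries (for example from df.to_dict("records")). The pAE cutoff of 10 follows the Bennett et al. observation quoted earlier; the pLDDT and RMSD cutoffs, like the function name, are illustrative assumptions that you should adjust.

```python
def rank_designs(rows, max_pae=10.0, min_plddt=80.0, max_rmsd=1.0):
    """Keep designs passing every cutoff, ordered best-first.

    `rows` is an iterable of dicts with "plddt", "pae" and "rmsd" keys.
    Ordering: lowest pAE first, then highest pLDDT, then lowest RMSD.
    """
    passing = [
        row
        for row in rows
        if row["pae"] < max_pae
        and row["plddt"] > min_plddt
        and row["rmsd"] < max_rmsd
    ]
    return sorted(passing, key=lambda row: (row["pae"], -row["plddt"], row["rmsd"]))


# Example usage with the DataFrame built above:
# shortlist = rank_designs(df.to_dict("records"))
# print(len(shortlist), "designs pass all filters")
```

None of the rows displayed above has a pAE below 10, so with these cutoffs the shortlist may well be empty; that is itself useful information about the run.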

Summary#

In this tutorial, we’ve demonstrated the deep learning-augmented binder design workflow using RFdiffusion, ProteinMPNN and ESMFold:

  1. Target Selection: Downloaded 3DI3 structure from RCSB PDB

  2. Hotspot Identification: Used the hotspot residues reported for 3DI3 in Watson et al. (2023)

  3. Structure Generation: Used RFdiffusion to generate binder backbones

  4. Sequence Design: Applied ProteinMPNN for fast, efficient sequence design

  5. Validation: Used ESMFold to rank designs based on:

    • Monomer folding confidence (pLDDT)

    • Complex formation confidence (pAE interaction)

    • Structural accuracy (RMSD)

Bennett et al. (2023) report that this approach achieves nearly 10-fold higher success rates than purely physics-based methods, by using deep learning models to identify Type I failures (sequences that don’t fold as intended) and Type II failures (structures that don’t bind as intended).

Next Steps#

The top-ranked designs from this workflow can be:

  1. Expressed and purified for experimental validation

  2. Tested for binding affinity

  3. Further optimized through additional rounds of design