Binder Design With RFdiffusion#

Designing a high-affinity binder starts with the right tools. This tutorial introduces how to use RFdiffusion on the OpenProtein AI platform, using our Python client, to generate and evaluate binder candidates against a specific protein target.

You’ll learn how to set up your environment, define a target structure and binding site constraints, configure RFdiffusion runs, submit and monitor jobs, and retrieve results programmatically.

We’ll also cover how to pass the designed backbones to ProteinMPNN for inverse folding, which proposes sequences for each backbone, and then how to evaluate the designed binders with ESMFold structure prediction. Whether you’re new to RFdiffusion or looking to streamline your workflow, this guide will help you go from target definition to prioritized binder designs quickly and reproducibly.

This tutorial follows the approach described in Watson et al. (2023), “De novo design of protein structure and function with RFdiffusion”, using the publicly available 3DI3 structure (IL-7Rα) as our target. We also follow some of the methodology in Bennett et al. (2023), “Improving de novo protein binder design with deep learning”.

Prerequisites#

For this tutorial, you will need an OpenProtein Python session to access the models available on our platform and to work with job results, so make sure you have your credentials set up!

[1]:
import openprotein
session = openprotein.connect()
session
[1]:
<openprotein.OpenProtein at 0x7fb07bc6e900>

Target Selection#

For this tutorial, we’ll use the 3DI3 structure from the RCSB PDB, which contains the extracellular domain of human interleukin-7 receptor alpha (IL-7Rα). This receptor was one of the benchmark targets used in Watson et al. (2023) for evaluating binder design performance.

Download the structure from RCSB and load it as a Protein object:

[2]:
from pathlib import Path
from openprotein import Protein, Model
import numpy as np
import requests

DATA_DIR = Path("data/")
DATA_DIR.mkdir(exist_ok=True)

# Download 3DI3 from RCSB
pdb_id = "3DI3"
structure_filepath = DATA_DIR / f"{pdb_id}.pdb"

if not structure_filepath.exists():
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail early instead of saving an error page
    structure_filepath.write_text(response.text)

# Load the receptor chain (IL-7Ra is chain B)
target_protein = Protein.from_filepath(path=structure_filepath, chain_id="B")
print("target sequence:", target_protein.sequence)
print("target coordinates shape:", target_protein.coordinates.shape)
print("target plddt shape:", target_protein.plddt.shape)
print("target name:", target_protein.name)
target sequence: b'GSHMESGYAQNGDLEDAELDDYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTPEINNSSGEMD'
target coordinates shape: (223, 37, 3)
target plddt shape: (223,)
target name: 3DI3
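
If you are unsure which chain of a PDB file holds your target, a quick residue count per chain can help before calling Protein.from_filepath. The following is a minimal standard-library sketch and is not part of the OpenProtein client: chain_lengths is a hypothetical helper that counts Cα atoms in fixed-column ATOM records, and it assumes a single-model file. The demo records are synthetic, not taken from 3DI3.

```python
def chain_lengths(pdb_text: str) -> dict[str, int]:
    """Count residues per chain by counting unique C-alpha ATOM records."""
    seen = set()
    counts: dict[str, int] = {}
    for line in pdb_text.splitlines():
        # Atom name lives in columns 13-16 of a PDB ATOM record
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain_id = line[21]  # column 22
        residue_key = (chain_id, line[22:27])  # residue number + insertion code
        if residue_key in seen:  # skip alternate locations of the same residue
            continue
        seen.add(residue_key)
        counts[chain_id] = counts.get(chain_id, 0) + 1
    return counts


def _atom_line(serial: int, name: str, res_name: str, chain_id: str, res_seq: int) -> str:
    """Build a truncated, synthetic ATOM record (coordinates omitted)."""
    return f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain_id}{res_seq:>4}    "


demo_pdb = "\n".join(
    [
        _atom_line(1, "N", "GLY", "A", 1),
        _atom_line(2, "CA", "GLY", "A", 1),
        _atom_line(3, "CA", "SER", "A", 2),
        _atom_line(4, "CA", "ASP", "B", 1),
    ]
)
demo_counts = chain_lengths(demo_pdb)
print(demo_counts)
```

On a real file you would pass structure_filepath.read_text() instead of the synthetic demo string.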

Visualize#

We can visually inspect the target structure using molviewspec:

[3]:
%pip install molviewspec
[4]:
from molviewspec import create_builder

def visualize_pdb(pdb_string: str):
    builder = create_builder()
    structure = builder.download(url="mystructure.pdb")\
        .parse(format="pdb")\
        .model_structure()\
        .component()\
        .representation()\
        .color_from_source(schema="atom",
                            category_name="atom_site",
                            field_name="auth_asym_id",
                            palette={"kind": "categorical", # color by chain
                                    "colors": ["blue", "red", "green", "orange"],
                                    "mode": "ordinal"}
                          )
    builder.molstar_notebook(data={'mystructure.pdb': pdb_string}, width=500, height=400)

visualize_pdb(target_protein.make_pdb_string())

Binding region selection#

According to the supplementary information of Watson et al. (2023), the following hotspots (binding sites) were chosen for 3DI3. We will use the same ones for our walkthrough:

  • B58, B80, B139

Note: RFdiffusion was trained with hotspots partially masked, so we only need to pick a few potential contact sites within our area of interest. Refer to the official RFdiffusion docs for tips on picking hotspots.

To encode these into our generate query, we use the set_binding_at method for the Protein.

[5]:
from openprotein.protein import Binding

binding_sites = [58,80,139]
target_protein = target_protein.set_binding_at(binding_sites, Binding.BINDING)
# Verify the binding is set
target_protein.get_binding_at(binding_sites)
[5]:
array(['B', 'B', 'B'], dtype='<U1')
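
Hotspots are usually written as chain-plus-number labels such as B58. As a small illustration, the sketch below splits such labels into per-chain residue numbers using only the standard library. parse_hotspots is a hypothetical helper, not an OpenProtein function, and it does not resolve numbering conventions: check whether your residue numbers follow the PDB file's numbering or sequence positions before passing them to set_binding_at.

```python
import re
from collections import defaultdict


def parse_hotspots(labels: list[str]) -> dict[str, list[int]]:
    """Group labels like 'B58' into {chain_id: [residue numbers]}."""
    by_chain: dict[str, list[int]] = defaultdict(list)
    for label in labels:
        match = re.fullmatch(r"([A-Za-z])(\d+)", label.strip())
        if match is None:
            raise ValueError(f"unrecognized hotspot label: {label!r}")
        by_chain[match.group(1)].append(int(match.group(2)))
    return dict(by_chain)


hotspots = parse_hotspots(["B58", "B80", "B139"])
print(hotspots)  # {'B': [58, 80, 139]}
```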

Generate designs with RFdiffusion#

Query design#

To generate a binder with RFdiffusion, we need to specify that the complex contains another, unknown chain. For this walkthrough, we’ll keep the full target chain and generate a separate binder chain of 80 residues. To encode this as a query, we first create a Protein chain of length 80. We can use Protein.from_expr as an easy constructor for specifying chains with unknown fragments.

The structure mask determines which parts of the structure should be designed. The X characters below are the sequence mask, which is used in inverse folding, the step we run after generating the structure designs. We can examine the structure mask using get_structure_mask.

[6]:
binder_chain = Protein.from_expr(80)
print("binder sequence:", binder_chain.sequence)
print("binder structure mask:", binder_chain.get_structure_mask())
binder sequence: b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
binder structure mask: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
 73 74 75 76 77 78 79 80]

As we can see, the whole structure of our binder chain is masked, which tells the model to design the chain in full. To indicate that the design should be done in the presence of another chain, we combine our binder and target Protein objects into a Model, which represents a multimer.

But before that, let’s quickly examine the structure mask of our target protein to avoid doing unnecessary design.

[7]:
print("target structure mask:", target_protein.get_structure_mask())
target structure mask: [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20 214 215 216 217 218 219 220 221 222 223]

This is important with RFdiffusion: we should drop any residues that we don’t actually want to design, which here means the target residues that appear in the structure mask. Dropping them saves compute time, and leaving them in seems to cause errors when running the model. We should only do this after setting our hotspots or binding sites, since the deletion shifts our residue indices.

[8]:
target_protein = target_protein.delete(target_protein.get_structure_mask())
print("target structure mask:", target_protein.get_structure_mask())
target structure mask: []
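
To make the index shift concrete, here is a minimal sketch of how positions move after a deletion. It assumes 1-based positions, as in the structure mask output above, and shift_positions is a hypothetical helper rather than part of the client. Setting the hotspots before deleting, as this tutorial does, is what lets you avoid doing this bookkeeping by hand.

```python
from bisect import bisect_left


def shift_positions(positions: list[int], deleted: list[int]) -> dict[int, int]:
    """Map each original 1-based position to its position after deletion."""
    deleted_sorted = sorted(set(deleted))
    deleted_set = set(deleted_sorted)
    shifted = {}
    for pos in positions:
        if pos in deleted_set:
            raise ValueError(f"position {pos} was deleted")
        # Each deleted position before `pos` moves it one step to the left
        shifted[pos] = pos - bisect_left(deleted_sorted, pos)
    return shifted


# The masked target positions printed above: 1-20 and 214-223
deleted = list(range(1, 21)) + list(range(214, 224))
new_positions = shift_positions([58, 80, 139], deleted)
print(new_positions)  # {58: 38, 80: 60, 139: 119}
```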

With the masked residues removed, let’s combine our two chains to specify the full query Model object.

[9]:
query_model = target_protein & binder_chain
print("Chains in query:", list(query_model.proteins.keys()))
print("Chain A (target chain):", query_model.proteins["A"].sequence)
print("Chain B (binder chain):", query_model.proteins["B"].sequence)
Chains in query: ['A', 'B']
Chain A (target chain): b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
Chain B (binder chain): b'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

Run the design job#

Following Bennett et al. (2023), we reduce the noise added during generation, which has been found to help with binder design, albeit at the cost of some diversity:

[10]:
rfdiffusion_design_params = {
    "denoiser.noise_scale_ca": 0.5,
    "denoiser.noise_scale_frame": 0.5
}

With these inputs, we can run RFdiffusion to generate binder designs against the selected hotspots:

[11]:
# Number of designs to generate
N = 100
rfdiffusion_job = session.models.rfdiffusion.generate(
    query=query_model,
    N=N,
    **rfdiffusion_design_params,
)
rfdiffusion_job
[11]:
RFdiffusionJob(job_id='3c61719e-67ea-4190-acdd-6e8e8aae7147', job_type='/models/rfdiffusion', status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2025, 12, 18, 21, 1, 46, 352527, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None)

Wait for completion#

Wait for the designs to complete. Note that this can take some time depending on the queue:

[12]:
rfdiffusion_job.wait_until_done(verbose=True, timeout=60*60)
Waiting: 100%|██████████████████████████████████████████████████| 100/100 [00:00<00:00, 600.93it/s, status=SUCCESS]
[12]:
True

Analyze generated designs#

Let’s first retrieve our designs, and inspect the first design.

[13]:
rfdiffusion_models = rfdiffusion_job.get()
print("chains in design:", list(rfdiffusion_models[0].proteins.keys()))
print("first design chain A sequence:", rfdiffusion_models[0].proteins["A"].sequence)
print("first design chain B sequence:", rfdiffusion_models[0].proteins["B"].sequence)
print("first design chain A mask:", rfdiffusion_models[0].proteins["A"].get_structure_mask())
print("first design chain B mask:", rfdiffusion_models[0].proteins["B"].get_structure_mask())
chains in design: ['A', 'B']
first design chain A sequence: b'GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG'
first design chain B sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
first design chain A mask: []
first design chain B mask: []

As we can see, the job returns two chains whose structures are fully specified. Note, however, that RFdiffusion has reordered our chains: the binder is now chain A and the target is chain B. Keep this in mind when indexing chains in later steps. RFdiffusion has also filled the binder sequence with G placeholders. This is not a problem, since we will mask the binder sequence for inverse folding in the next step anyway.
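
Because the chain order can change, it may be safer to locate the binder by its content than to hard-code a chain ID. The sketch below is one way to do that, assuming sequences are bytes as in the outputs above and that the binder is the only chain made up entirely of placeholder characters. find_binder_chain is a hypothetical helper, and the demo target sequence is truncated for brevity.

```python
def find_binder_chain(sequences: dict[str, bytes], placeholders: bytes = b"GX") -> str:
    """Return the ID of the single chain made up only of placeholder residues."""
    allowed = set(placeholders)
    matches = [
        chain_id
        for chain_id, seq in sequences.items()
        if len(seq) > 0 and set(seq) <= allowed
    ]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one placeholder chain, found {matches}")
    return matches[0]


demo_sequences = {
    "A": b"G" * 80,  # binder backbone with placeholder residues
    "B": b"DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTT",  # truncated target, for illustration
}
binder_chain_id = find_binder_chain(demo_sequences)
print(binder_chain_id)  # A
```

For a real design, the input dictionary could be built from the chain IDs and sequences of rfdiffusion_models[0].proteins.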

We can also visually inspect the design:

[14]:
visualize_pdb(rfdiffusion_models[0].make_pdb_string())

Now let’s iterate through these designs and save them.

[15]:
from pathlib import Path

OUTPUT_DIR = Path("data/outputs/3DI3_binder_designs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for i in range(N):
    # Retrieve the completed design
    designed_model = rfdiffusion_models[i]

    # Save the full complex
    with open(f"{OUTPUT_DIR}/design{i+1}.pdb", "w") as f:
        f.write(designed_model.make_pdb_string())

Inverse Folding with ProteinMPNN#

Following Bennett et al. (2023), we’ll use ProteinMPNN for inverse folding to design sequences that adopt the designed binder structures.

For each of the 100 designs, we will generate 10 proposed sequences from inverse folding.

[16]:
proteinmpnn_jobs = []
for i in range(N):
    rfdiffusion_model = rfdiffusion_models[i]
    # Mask the binder sequence to indicate that it should be generated
    rfdiffusion_model.mask_sequence(chain_ids="A")

    # Use ProteinMPNN to design sequences for the binder backbone
    mpnn_job = session.models.proteinmpnn.generate(
        query=rfdiffusion_model,
        num_samples=10,
        temperature=0.1,  # Bennett et al. used low temperature
        seed=42,
    )
    proteinmpnn_jobs.append(mpnn_job)

# Wait for all jobs to complete
for mpnn_job in proteinmpnn_jobs:
    mpnn_job.wait_until_done(timeout=600)
    assert mpnn_job.status == "SUCCESS"

Let’s look at the output from one of the ProteinMPNN jobs.

[17]:
proteinmpnn_jobs[0].get()
[17]:
[Score(name='generated-sequence-1', sequence='MKKTYTDTVRVIKTSPDTYSLSITVNLDGEKVTISMEVPNTKELTKKKTVTTSSGKKYKITLKLTLEGDEWKVEITIEEL', score=array([1.1707])),
 Score(name='generated-sequence-2', sequence='MTKKETTTAKAIEISPDTLDIVIYVNLNGETVTLAMTIPNTPKLKKKVTVTTSSGKKYEIDLEITLEGDEYKINITIKEL', score=array([1.1413])),
 Score(name='generated-sequence-3', sequence='MKKEEKTTAKAIKISPDTYEISIDIELDGEKVTISKTIPNTEELEKEVTVTTSSGKKYKIKLKLKLKGDEWEIEITIEEL', score=array([1.1164])),
 Score(name='generated-sequence-4', sequence='MTKTETTYVKAIEVSPDTLQAVLDITLDGEKVTLALTIPNTKEFTKEKTVTTSSGKKYKITLKGTLEGDKLKVTITIEEL', score=array([1.0946])),
 Score(name='generated-sequence-5', sequence='MTKTYTTTVRVIEISPNTLDYVLYVNLNGETVVIAKTIPNTPEFTHHDIVTTSSGKKYEIDIKGKLEGDNLNLKITIKEL', score=array([1.2136])),
 Score(name='generated-sequence-6', sequence='ATTTETTRARAIKISPDKYEISIDLTLNGETVTLNLVIPNTPTLTVTRTVTTSSGKKYKVTLKLTLEGDEWLIDITTEEL', score=array([1.15])),
 Score(name='generated-sequence-7', sequence='ETSKEHTTARAIQIDPTTYDTVIDITLGGEKQTIAMRVPNTPTLSKERTITTSSGEKYRINLKITRNGDTWNIDITIEKL', score=array([1.1796])),
 Score(name='generated-sequence-8', sequence='EKKEETTTVRAIEISPDTLDTVIDITLNGEKVTIAMRIPNSEELEKEKTVTTSSGKKYKIKMKFKREGDKLNVKITIEEL', score=array([1.1133])),
 Score(name='generated-sequence-9', sequence='KKEEYTTTVRAIKISPDTYEISIDVTLNGEKKTINMTVPNTEKLEKEKTITTSSGKKYKIKLELTKEGDTWKVKITIEEL', score=array([1.1253])),
 Score(name='generated-sequence-10', sequence='EKKEETQTVRAIKISPDKLETVLDINLNGEKKTISMIIPNSKELEKEKTITTSSGEKYKVKLKLKLEGDKLLVKITIEKL', score=array([1.1292]))]
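
One quick way to see how much the sampled sequences differ from each other is their pairwise identity. The sketch below uses only the standard library and the first three sequences from the output above. sequence_identity is a hypothetical helper that assumes equal-length, ungapped sequences, which holds here because every sequence is threaded onto the same 80-residue backbone.

```python
from itertools import combinations


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# First three ProteinMPNN sequences from the output above
seqs = [
    "MKKTYTDTVRVIKTSPDTYSLSITVNLDGEKVTISMEVPNTKELTKKKTVTTSSGKKYKITLKLTLEGDEWKVEITIEEL",
    "MTKKETTTAKAIEISPDTLDIVIYVNLNGETVTLAMTIPNTPKLKKKVTVTTSSGKKYEIDLEITLEGDEYKINITIKEL",
    "MKKEEKTTAKAIKISPDTYEISIDIELDGEKVTISKTIPNTEELEKEVTVTTSSGKKYKIKLKLKLKGDEWEIEITIEEL",
]
identities = [sequence_identity(a, b) for a, b in combinations(seqs, 2)]
mean_identity = sum(identities) / len(identities)
print(f"mean pairwise identity: {mean_identity:.2f}")
```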

Each of these 10 sequences corresponds to the first design from RFdiffusion. Let’s save the ProteinMPNN sequences together with the RFdiffusion designs, giving us 1,000 candidate designs.

[18]:
scores = []
for i in range(N):
    generated_model = rfdiffusion_models[i]
    mpnn_job = proteinmpnn_jobs[i]
    mpnn_results = mpnn_job.get()
    for j, (_, sequence, score) in enumerate(mpnn_results):
        # replace chain explicitly due to defensive copy
        binder = generated_model.proteins["A"]
        binder.sequence = sequence
        generated_model.proteins["A"] = binder
        scores.append(score.item())
        with open(f"{OUTPUT_DIR}/design{i+1}_mpnn{j+1}.pdb", "w") as f:
            f.write(generated_model.make_pdb_string())
with open(f"{OUTPUT_DIR}/mpnn_scores.txt", "w") as f:
    f.write("\n".join([str(score) for score in scores]))

Let’s just verify that our new designed model looks correct:

[19]:
from openprotein import Model

OUTPUT_DIR = Path("data/outputs/3DI3_binder_designs")

proteinmpnn_model = Model.from_filepath(f"{OUTPUT_DIR}/design1_mpnn1.pdb")
print("chains in proteinmpnn + rfdiffusion design:", list(proteinmpnn_model.proteins.keys()))
print("binder sequence:", proteinmpnn_model.proteins["A"].sequence)
print("target sequence:", proteinmpnn_model.proteins["B"].sequence)
print("binder mask:", proteinmpnn_model.proteins["A"].get_structure_mask())
print("target mask:", proteinmpnn_model.proteins["B"].get_structure_mask())

Notice that what we have is a combination of the two models: the binder structure is from RFdiffusion and the inverse folded binder sequence is from ProteinMPNN. The next step is to check if the predicted multimer folds into what we expect.

Structure Prediction with ESMFold#

Whilst Bennett et al. (2023) and Watson et al. (2023) both used AlphaFold2 to re-fold their designs, we will use ESMFold instead to validate our designs.

The key insight from Bennett et al. (2023) is that AF2’s prediction confidence metrics (particularly pAE_interaction) can effectively discriminate successful binders from failures. They found that pAE_interaction (the average pAE over interchain residue pairs) was extremely effective at identifying successful binders, with sharp increases in success rates for designs with pAE_interaction < 10.
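
To make the definition of the interaction pAE concrete, here is a minimal NumPy sketch on a toy matrix. It assumes the binder residues come first along both axes of a square pAE matrix, and it averages the two off-diagonal (interchain) blocks. pae_interaction is a hypothetical helper name, and the toy values are made up for illustration.

```python
import numpy as np


def pae_interaction(pae: np.ndarray, binder_len: int) -> float:
    """Mean pAE over interchain residue pairs, averaged over both directions."""
    pae = np.asarray(pae, dtype=float)
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError("pae must be a square 2D matrix")
    if not 0 < binder_len < pae.shape[0]:
        raise ValueError("binder_len must split the matrix into two chains")
    binder_to_target = pae[:binder_len, binder_len:].mean()
    target_to_binder = pae[binder_len:, :binder_len].mean()
    return float((binder_to_target + target_to_binder) / 2)


# Toy complex: 2 binder residues followed by 3 target residues
toy_pae = np.zeros((5, 5))
toy_pae[:2, 2:] = 4.0  # binder rows, target columns
toy_pae[2:, :2] = 8.0  # target rows, binder columns
toy_score = pae_interaction(toy_pae, binder_len=2)
print(toy_score)  # 6.0
```

The intrachain blocks on the diagonal are ignored on purpose: they reflect how confidently each chain folds on its own, not how confidently the chains are placed relative to each other.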

We can obtain the same metrics with ESMFold, which also runs much faster than AF2. The papers also used AF2 initial guess with templates, features that are not yet available for AF2 on our platform. This walkthrough will be updated if we add support for these features and find that the AF2 metrics perform better. The key point is that our platform allows easy drop-in replacements for the various steps in your protein design pipeline.

[20]:
proteinmpnn_models = []
for i in range(N):
    for j in range(10):
        proteinmpnn_model = Model.from_filepath(f"{OUTPUT_DIR}/design{i+1}_mpnn{j+1}.pdb")
        proteinmpnn_models.append(proteinmpnn_model)
esmfold_job = session.fold.esmfold.fold(
   proteinmpnn_models
)
esmfold_job
[20]:
FoldJob(num_records=1000, job_id='58a6a0ee-285a-40ff-bc18-32392d9f4d51', job_type=<JobType.embeddings_fold: '/embeddings/fold'>, status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2025, 12, 19, 19, 1, 16, 12126, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None)

Wait for completion. This will likely take around an hour.

[21]:
esmfold_job.wait_until_done(verbose=True, timeout=60*60)
Waiting: 100%|██████████████████████████████████████████████████| 100/100 [00:00<00:00, 566.69it/s, status=SUCCESS]
[21]:
True

Let’s retrieve and inspect the ESMFold fold results:

[22]:
esmfold_results = esmfold_job.get()
esmfold_seq, esmfold_model = esmfold_results[0] # a fold returns (seq, model) tuples
print("chains in folded model:", list(esmfold_model.proteins.keys()))
print("first fold chain A sequence:", esmfold_model.proteins["A"].sequence)
print("first fold chain B sequence:", esmfold_model.proteins["B"].sequence)
print("first fold chain A mask:", esmfold_model.proteins["A"].get_structure_mask())
print("first fold chain B mask:", esmfold_model.proteins["B"].get_structure_mask())
chains in folded model: ['A', 'B']
first fold chain A sequence: b'MKKTYTDTVRVIKTSPDTYSLSITVNLDGEKVTISMEVPNTKELTKKKTVTTSSGKKYKITLKLTLEGDEWKVEITIEEL'
first fold chain B sequence: b'DYSFSCYSQLEVNGSQHSLTCAFEDPDVNTTNLEFEICGALVEVKCLNFRKLQEIYFIETKKFLLIGKSNICVKVGEKSLTCKKIDLTTIVKPEAPFDLSVVYREGANDFVVTFNTSHLQKKYVKVLMHDVAYRQEKDENKWTHVNLSSTKLTLLQRKLQPAAMYEIKVRSIPDHYFKGFWSEWSPSYYFRTP'
first fold chain A mask: []
first fold chain B mask: []

As expected, the sequences are unchanged and the structure masks are empty, meaning the whole structure of the complex has been predicted by ESMFold.

We can also retrieve the pAE matrices predicted by ESMFold, which are useful as metrics for measuring the success of our designs. Retrieving all of the results could take a while.

[23]:
esmfold_pae_results = esmfold_job.pae
esmfold_seq, esmfold_complex_pae = esmfold_pae_results[0]
print("pae interaction shape:", esmfold_complex_pae.shape)
pae interaction shape: (273, 273)

Ranking designs by metrics#

Following Bennett et al. (2023), we’ll rank our designs based on:

  1. Monomer pLDDT (confidence that sequence folds to designed structure)

  2. Complex pAE interaction (confidence that binder forms intended interface)

  3. Complex Cα RMSD to designed structure

[24]:
import pandas as pd

design_files = []
plddt_scores = []
pae_scores = []
rmsd_scores = []
for i in range(N*10):
    # Get ESMFold predictions
    _, esmfold_model = esmfold_results[i]

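    # NOTE: these labels are zero-based, whereas the files saved earlier are
    # one-based (design{i//10 + 1}_mpnn{i%10 + 1}.pdb)
    design_files.append(f"design{i//10}_mpnn{i%10}.pdb")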

    binder = esmfold_model.proteins["A"]
    target = esmfold_model.proteins["B"]

    # Get pLDDT of binder
    plddt_score = np.mean(binder.plddt)
    plddt_scores.append(plddt_score)

    # Get pAE
    _, esmfold_complex_pae = esmfold_pae_results[i]
    binder_target_pae = esmfold_complex_pae.squeeze() # squeeze the shape
    pae_interaction_1 = np.mean(binder_target_pae[len(binder):,:len(binder)])
    pae_interaction_2 = np.mean(binder_target_pae[:len(binder),len(binder):])
    pae_interaction_total = (pae_interaction_1 + pae_interaction_2) / 2
    pae_scores.append(pae_interaction_total)

    # RMSD between designed binder and folded binder
    designed_binder = rfdiffusion_models[i//10].proteins["A"]
    folded_binder = binder

    binder_rmsd = designed_binder.rmsd(folded_binder, backbone_only=True)
    rmsd_scores.append(binder_rmsd)

df = pd.DataFrame({"design_file": design_files, "plddt": plddt_scores, "pae": pae_scores, "rmsd": rmsd_scores})
print(df.head(10))
         design_file      plddt        pae      rmsd
0  design0_mpnn0.pdb  64.985374  26.108816  2.508876
1  design0_mpnn1.pdb  60.023247  26.559431  2.313612
2  design0_mpnn2.pdb  64.042992  26.312916  3.289286
3  design0_mpnn3.pdb  65.501999  25.959415  2.709022
4  design0_mpnn4.pdb  65.479004  24.624670  2.541036
5  design0_mpnn5.pdb  72.557373  25.386769  0.839039
6  design0_mpnn6.pdb  58.761375  25.435695  3.850175
7  design0_mpnn7.pdb  69.294373  24.323982  1.071133
8  design0_mpnn8.pdb  63.944874  26.111162  3.166059
9  design0_mpnn9.pdb  61.955128  26.144578  4.354014
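
The rmsd column above comes from the client’s Protein.rmsd method. For intuition about what such a number measures, here is a minimal NumPy sketch of RMSD after optimal superposition using the Kabsch algorithm. This is a generic textbook implementation operating on plain coordinate arrays; whether the client aligns structures in the same way is not stated here, so treat kabsch_rmsd as a hypothetical stand-in rather than a reimplementation.

```python
import numpy as np


def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove translation by centering both point sets
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Covariance matrix and its SVD
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = -1.0 if np.linalg.det(Vt.T @ U.T) < 0 else 1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_aligned = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Qc) ** 2, axis=1))))


# Toy check: a rotated and translated copy should superimpose almost exactly
P_demo = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
rot_z_90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Q_demo = P_demo @ rot_z_90.T + np.array([5.0, -2.0, 1.0])
demo_rmsd = kabsch_rmsd(P_demo, Q_demo)
print(f"{demo_rmsd:.6f}")
```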

Analysis and Ranking#

Let’s rank the designs by their ESMFold metrics:

[25]:
import pandas as pd

df_sorted = df.sort_values(by=["plddt", "pae", "rmsd"], ascending=[False, True, True])

print(df_sorted.head(10))

# Save rankings
df_sorted.to_csv(OUTPUT_DIR / "rankings.csv", index=False)
            design_file      plddt        pae      rmsd
993  design99_mpnn3.pdb  85.635620  14.546918  0.576264
362  design36_mpnn2.pdb  85.087120  24.013505  0.558198
365  design36_mpnn5.pdb  84.757126  11.708263  0.376451
363  design36_mpnn3.pdb  83.749001  14.447840  0.695756
361  design36_mpnn1.pdb  83.108627  15.472516  0.395722
469  design46_mpnn9.pdb  82.964622  11.207534  0.656172
387  design38_mpnn7.pdb  82.905998  23.893167  0.725844
280  design28_mpnn0.pdb  82.414253  24.409521  0.538304
895  design89_mpnn5.pdb  82.379875  13.552719  0.630388
385  design38_mpnn5.pdb  82.042374  25.747786  0.546411
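
Sorting on three float columns at once mostly orders the table by the first column, since exact ties in pLDDT are rare. An alternative is to filter on all three metrics and then sort the survivors by interaction pAE. The sketch below shows this on three rows copied from the table above. The pAE cutoff of 10 is the value quoted earlier from Bennett et al. (2023) for AF2. The pLDDT and RMSD cutoffs are placeholders chosen for illustration, and none of these thresholds has been calibrated for ESMFold here. filter_designs is a hypothetical helper.

```python
import pandas as pd


def filter_designs(
    df: pd.DataFrame,
    plddt_min: float = 80.0,  # illustrative placeholder
    pae_max: float = 10.0,    # cutoff quoted from Bennett et al. (2023) for AF2
    rmsd_max: float = 1.0,    # illustrative placeholder
) -> pd.DataFrame:
    """Keep designs passing all three cutoffs, best interaction pAE first."""
    passed = df[
        (df["plddt"] > plddt_min) & (df["pae"] < pae_max) & (df["rmsd"] < rmsd_max)
    ]
    return passed.sort_values("pae").reset_index(drop=True)


# Three rows copied from the ranking output above
toy_df = pd.DataFrame(
    {
        "design_file": ["design36_mpnn2.pdb", "design36_mpnn5.pdb", "design46_mpnn9.pdb"],
        "plddt": [85.087120, 84.757126, 82.964622],
        "pae": [24.013505, 11.708263, 11.207534],
        "rmsd": [0.558198, 0.376451, 0.656172],
    }
)
strict = filter_designs(toy_df)
relaxed = filter_designs(toy_df, pae_max=15.0)
print(len(strict), list(relaxed["design_file"]))
```

With the default cutoffs none of these three rows pass, because their interaction pAE values are all above 10. Relaxing pae_max to 15 keeps the two designs with pAE near 11.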

Summary#

In this tutorial, we’ve demonstrated the deep learning-augmented binder design workflow using RFdiffusion, ProteinMPNN and ESMFold:

  1. Target Selection: Downloaded 3DI3 structure from RCSB PDB

  2. Hotspot Identification: Used the binding-site hotspots reported for 3DI3 in Watson et al. (2023)

  3. Structure Generation: Used RFdiffusion to generate binder backbones

  4. Sequence Design: Applied ProteinMPNN for fast, efficient sequence design

  5. Validation: Used ESMFold to rank designs based on:

    • Monomer folding confidence (pLDDT)

    • Complex formation confidence (pAE interaction)

    • Structural accuracy (RMSD)

Bennett et al. (2023) report that this kind of deep learning-augmented approach achieves ~10-fold higher success rates compared to purely physics-based methods, by using deep learning models to identify Type I failures (sequences that don’t fold as intended) and Type II failures (structures that don’t bind as intended).

Next Steps#

The top-ranked designs from this workflow can be:

  1. Expressed and purified for experimental validation

  2. Tested for binding affinity

  3. Further optimized through additional rounds of design