Library analysis, comparison, and cost efficiency calculation#
Library reagents and experiments, whether from combinatorial or bespoke libraries, are resource intensive. The OpenProtein.AI platform provides tools to make informed decisions around cost-performance trade-offs.
This walkthrough shows you how to use the OpenProtein.AI Python client to analyze and compare libraries for expected performance. In this example, we evaluate ML-designed and combinatorial variant libraries and demonstrate how a user can make data-driven decisions around library composition and size to achieve success with a given confidence level.
What you need before getting started#
You need an OpenProtein.AI account and a dataset with sequence and property measurements. This tutorial uses the antibody variable fragment mutagenesis dataset, 14H.
Setting up your environment#
You also need a Jupyter notebook environment for data analysis and visualization:
[1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import spearmanr
import json
Next, connect to the OpenProtein.AI backend using your credentials:
[2]:
import openprotein
with open('secrets.config', 'r') as f:
    config = json.load(f)
session = openprotein.connect(config['username'], config['password'])
Load data for analysis#
We’ll use an antibody variable fragment mutagenesis dataset, 14H. This dataset contains heavy chain variable fragment sequences mutagenized across all three CDRs with measured binding affinities.
Upload the dataset:
[3]:
path = 'data/14H_affinities_nonan.csv'
table = pd.read_csv(path)
print(table.shape)
table.head()
(7476, 2)
[3]:
| | sequence | log_kdnm |
|---|---|---|
| 0 | EVQLVETGGGLVQPGGSLRLSCAASPFELNSYGISWVRQAPGKGPE... | 2.696930 |
| 1 | EVQLVETGGGLVQPGGSLRLSCAASGFDLNSYGISWVRQAPGKGPE... | 1.988077 |
| 2 | EVQLVETGGGLVQPGGSLRLSCAASGFKLNSYGISWVRQAPGKGPE... | 2.975487 |
| 3 | EVQLVETGGGLVQPGGSLRLSCAASGFHLNSYGISWVRQAPGKGPE... | 2.381006 |
| 4 | EVQLVETGGGLVQPGGSLRLSCAASGFTLLSYGISWVRQAPGKGPE... | 3.481793 |
As we can see above, each row of the dataset contains a sequence and its corresponding ‘log_kdnm’ value (log10 Kd in nM).
Post the dataset to train a model for affinity prediction:
[4]:
dataset = session.data.create(table, '14H_affinities', 'Antibody heavy chain variable region mutagenesis with affinity measurements.')
dataset
[4]:
AssayMetadata(assay_name='14H_affinities', assay_description='Antibody heavy chain variable region mutagenesis with affinity measurements.', assay_id='b8bc4e73-f46c-4f37-be75-25955d68e4ef', original_filename='assay_data', created_date=datetime.datetime(2024, 10, 18, 15, 36, 36, 176156), num_rows=7476, num_entries=7476, measurement_names=['log_kdnm'], sequence_length=118)
The returned AssayMetadata object describes the uploaded assay, including the assay name, description, ID, original filename, creation date, number of rows and entries, measurement names, and sequence length.
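For example, the assay ID, which we reference later when configuring the design job, can be read directly from the returned object. A small sketch, assuming the metadata fields shown above are exposed as attributes:

# the assay ID is passed to the design job later in this walkthrough
print(dataset.id)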
Train a sequence-to-function prediction model#
We’re ready to train the model using our dataset:
[5]:
model_future = session.train.create_training_job(dataset, 'log_kdnm')
print(model_future.job)
model_future.wait_until_done(verbose=True)
job_id='45de818e-3d88-4579-85e2-626834ceb2ac' job_type=<JobType.workflow_train: '/workflow/train'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 10, 55, 39, 562977) start_date=None end_date=None prerequisite_job_id='73a52b4d-6c92-4187-81be-c2607bbdca25' progress_message=None progress_counter=None sequence_length=118 traingraph=None
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [04:58<00:00, 2.98s/it, status=SUCCESS]
[5]:
True
Run cross validation#
It’s important to evaluate the trained model’s performance through cross-validation. This estimates the model’s predictive performance by simulating how well it would generalize to sequences outside the training data.
[6]:
model_future.crossvalidate()
cv_future = model_future.crossvalidation
print(cv_future)
cv_future.wait_until_done(verbose=True)
job_id='612caf9e-e588-4dc5-bd87-926a9194556a' job_type=<JobType.workflow_crossvalidate: '/workflow/crossvalidate'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 11, 0, 40, 172174) start_date=None end_date=None prerequisite_job_id='45de818e-3d88-4579-85e2-626834ceb2ac' progress_message=None progress_counter=None sequence_length=None num_rows=None page_size=None page_offset=None result=None
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [17:17<00:00, 10.37s/it, status=SUCCESS]
[6]:
True
Retrieve the results of the cross-validation process with:
[7]:
cv_results = cv_future.get()
The next step is to process the cross-validation results and visualize the relationship between predicted and actual affinities on a scatter plot.
[8]:
y = []
y_mu = []
y_var = []
for r in cv_results:
    y.append(r.y)
    y_mu.append(r.y_mu)
    y_var.append(r.y_var)
y = np.array(y)
y_mu = np.array(y_mu)
y_var = np.array(y_var)
y.shape, y_mu.shape, y_var.shape
[8]:
((7476,), (7476,), (7476,))
[9]:
print(spearmanr(y_mu, y))
plt.scatter(y_mu, y, s=2)
plt.axis('equal')
plt.xlabel('Predicted affinity')
plt.ylabel('Actual affinity')
SignificanceResult(statistic=np.float64(0.693878735931715), pvalue=np.float64(0.0))
[9]:
Text(0, 0.5, 'Actual affinity')
Cross-validation performance is strong, as indicated by a Spearman rho of 0.694. We’re ready to start working with our data.
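Because the library analyses below use the predictive variances as well as the means, it can also be worth a quick calibration check. A minimal sketch using the cross-validation arrays built above:

# fraction of held-out measurements falling inside the model's central 95% predictive interval
z = (y - y_mu) / np.sqrt(y_var)
coverage = np.mean(np.abs(z) < 1.96)
print(f'Empirical coverage of the 95% interval: {coverage:.3f}')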
Use the design API to identify variants with enhanced binding affinity#
[10]:
# create the design objective
# look for variants that will have single digit picomolar affinity, Kd < 10 pM (log10 Kd < -2)
from openprotein.schemas import ModelCriterion, Criterion, DesignJobCreate
target = -2.0
criterion = [
    [ModelCriterion(
        criterion_type = 'model',
        model_id = model_future.job.job_id,
        measurement_name = 'log_kdnm',
        criterion = Criterion(target=target, weight=1.0, direction='<')
    )]
]
# limit mutation sites to the CDRs
sites = [ 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 49, 50, 51,
          52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
          98, 99, 100, 101, 102, 103, 104, 105, 106]
site_mask = np.ones(len(table['sequence'].iloc[0]), dtype=bool)
site_mask[sites] = False
sites = np.where(site_mask)[0].tolist()
print(sites)
# run the algorithm for 50 steps with otherwise default parameters
design_params = DesignJobCreate(
    assay_id = dataset.id,
    criteria = criterion,
    num_steps = 50,
    mutation_positions = sites,
)
design_future = session.design.create_design_job(design_params)
print(design_future.job)
design_future.wait_until_done(verbose=True)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117]
job_id='6775d60c-7916-40ae-8c3d-78ed5d10dc1c' job_type=<JobType.workflow_design: '/workflow/design'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 14, 41, 42, 427618) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [30:11<00:00, 18.12s/it, status=SUCCESS]
[10]:
True
[11]:
# get the design results and parse them into a table for analysis
design_results = design_future.get()
rows = []
for r in design_results:
    scores = r.scores[0][0]
    y_mu = scores.metadata.y_mu
    y_var = scores.metadata.y_var
    score = scores.score
    row = [r.step, score, y_mu, y_var, r.sequence]
    rows.append(row)
design_results = pd.DataFrame(rows, columns=['step', 'score', 'y_mu', 'y_var', 'sequence'])
print(design_results.shape)
design_results.sort_values('score', ascending=False).head()
(12800, 5)
[11]:
| | step | score | y_mu | y_var | sequence |
|---|---|---|---|---|---|
| 6912 | 27 | -0.894528 | -1.844848 | 0.452562 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 10496 | 41 | -0.894528 | -1.844848 | 0.452562 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 11008 | 43 | -0.894528 | -1.844848 | 0.452562 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 11264 | 44 | -0.894528 | -1.844848 | 0.452562 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 8704 | 34 | -0.894528 | -1.844848 | 0.452562 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
Analyze possible libraries from the design results#
[12]:
design_results_filtered = design_results.drop_duplicates('sequence').sort_values(['score', 'y_mu'], ascending=[False, True])
print(design_results_filtered.shape)
design_results_filtered.head()
(4735, 5)
[12]:
| | step | score | y_mu | y_var | sequence |
|---|---|---|---|---|---|
| 6656 | 26 | -0.894528 | -1.844848 | 0.452562 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 10497 | 41 | -0.902979 | -1.845308 | 0.417193 | EVRLIQTGGGLVQPGASLRLSCAASGFSLNEYGISWVRQAPGKGPD... |
| 7937 | 31 | -0.903904 | -1.845814 | 0.411142 | EVRLIQTGGGLVQPGASLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 7425 | 29 | -0.903986 | -1.836633 | 0.461234 | EVRLIQTGGGLVQPGVSLRLSCAASGFSLNEYGISWVRQAPGKGPE... |
| 10500 | 41 | -0.904092 | -1.845299 | 0.413217 | EVRLIQTGGGLVQPGASLRLSCAASGFSLNEYGISWVRQAPGKGPD... |
[13]:
library_1000 = design_results_filtered.iloc[:1000]
library_100 = design_results_filtered.iloc[:100]
pred_dist_1000 = scipy.stats.norm(library_1000.y_mu, np.sqrt(library_1000.y_var))
pred_dist_100 = scipy.stats.norm(library_100.y_mu, np.sqrt(library_100.y_var))
Examine the expected distributions over affinities in the designed libraries versus the initial dataset#
[14]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)
format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)
# plot the predictive distribution over property values
c_lo = pred_dist_1000.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_1000.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 1000)', color='lightblue', hatch='/', **format_params)
c_lo = pred_dist_100.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_100.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 100)', **format_params)
_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
Examine the predictive distribution over the best affinity in each library#
Above, we looked at the overall expected distribution of affinities, but we may be more interested in the likely affinity of the best sequence in each library. Comparing the libraries of size 100 and size 1,000, the overall affinity distributions are expected to be nearly identical, but we would expect the larger library to be more likely to contain a sequence that meets the design criterion (log Kd < -2) and to contain a best sequence with stronger affinity.
Probability of containing a success#
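Assuming the per-sequence predictive distributions are independent, the probability that a library contains at least one success is one minus the probability that every member misses the target. The cells below compute this in log space with logsf (the log survival function, i.e. the log probability that a prediction exceeds the target and therefore misses a "less than" objective), which avoids numerical underflow. A toy sketch of the calculation:

# P(at least one success) = 1 - prod_i (1 - p_i), computed in log space
p_i = np.array([0.4, 0.3, 0.1])               # toy per-sequence success probabilities
p_any = 1 - np.exp(np.sum(np.log1p(-p_i)))    # = 1 - (0.6 * 0.7 * 0.9) = 0.622
print(p_any)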
[15]:
# what is the probability that a variant meeting the design criteria is present in each library?
target = -2
print('target value <', target)
log_one_minus_p_1000 = pred_dist_1000.logsf(target)
p_1000 = 1 - np.exp(np.sum(log_one_minus_p_1000)) # the probability that at least one sequence meets the target value
print(f'Designed (Top 1000), probability of at least one success = {p_1000:.5f}')
log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed (Top 100), probability of at least one success = {p_100:.5f}')
target value < -2
Designed (Top 1000), probability of at least one success = 1.00000
Designed (Top 100), probability of at least one success = 1.00000
[16]:
# in this case, we think that both libraries are virtually guaranteed to contain a successful variant!
# let's take a more fine-grained look
log_one_minus_p = pred_dist_1000.logsf(target)
print('Top k', 'Probability', sep='\t')
for i in range(20):
    p = 1 - np.exp(np.sum(log_one_minus_p[:i+1]))
    print(f'{i+1:>2d}', f'{p:.5f}', sep='\t')
Top k Probability
1 0.40880
2 0.64845
3 0.79082
4 0.87553
5 0.92593
6 0.95590
7 0.97370
8 0.98431
9 0.99062
10 0.99439
11 0.99665
12 0.99799
13 0.99880
14 0.99928
15 0.99957
16 0.99974
17 0.99984
18 0.99991
19 0.99994
20 0.99997
As we can see, we would only need to test a few sequences to have a high probability of finding one with picomolar affinity. We can also use this kind of analysis to determine how large our library should be for a given success confidence level. For example, if we wanted a library with at least 0.95 probability of containing a variant with picomolar affinity (only a 1/20 chance of failure), we would only need to test the top 6 sequences.
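The same search can be wrapped in a small helper. This is a minimal sketch (min_library_size is a hypothetical convenience function, not part of the client), reusing the log-survival values computed above:

def min_library_size(log_one_minus_p, confidence):
    # log_one_minus_p holds per-sequence log P(miss), ordered best-first
    cumulative = np.cumsum(log_one_minus_p)   # log P(all top-k sequences miss)
    p_success = 1 - np.exp(cumulative)        # P(at least one success) for each k
    hits = np.where(p_success >= confidence)[0]
    return int(hits[0]) + 1 if len(hits) > 0 else None

print(min_library_size(pred_dist_1000.logsf(-2.0), 0.95))  # should print 6, matching the table above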
[17]:
# let's perform the same analysis but for a more aggressive affinity target value
target = -3
print('target value <', target)
log_one_minus_p_1000 = pred_dist_1000.logsf(target)
p_1000 = 1 - np.exp(np.sum(log_one_minus_p_1000)) # the probability that at least one sequence meets the target value
print(f'Designed (Top 1000), probability of at least one success = {p_1000:.5f}')
log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed (Top 100), probability of at least one success = {p_100:.5f}')
target value < -3
Designed (Top 1000), probability of at least one success = 1.00000
Designed (Top 100), probability of at least one success = 0.98015
[18]:
# now we can see clearer resolution between library sizes and would need a much larger library to be confident
# that it will include a sequence with log Kd < -3
log_one_minus_p = pred_dist_1000.logsf(target)
print('Top k', 'Probability', sep='\t')
for i in range(20):
    p = 1 - np.exp(np.sum(log_one_minus_p[:i+1]))
    print(f'{i+1:>2d}', f'{p:.5f}', sep='\t')
Top k Probability
1 0.04298
2 0.07830
3 0.11142
4 0.14994
5 0.18074
6 0.21220
7 0.24562
8 0.27759
9 0.30557
10 0.33460
11 0.36119
12 0.38779
13 0.41329
14 0.43782
15 0.46106
16 0.47998
17 0.49828
18 0.51930
19 0.53961
20 0.56054
Predicted distribution over the best affinity in each library#
Next, let’s examine the distribution over the affinity of the best sequence we expect to see in each library.
[19]:
# estimate best affinities in each library from sampling
num_samples = 10_000
r = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
sample_best_1000 = r.min(axis=-1)
r = np.random.randn(num_samples, 100)*pred_dist_100.std() + pred_dist_100.mean()
sample_best_100 = r.min(axis=-1)
[20]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)
format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)
# plot the predictive distribution over property values
_ = plt.hist(sample_best_1000, bins=bins, label='Designed (Top 1000)', color='lightblue', density=True, hatch='/', **format_params)
_ = plt.hist(sample_best_100, bins=bins, label='Designed (Top 100)', density=True, **format_params)
_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
[21]:
# plot out the expected best value versus library size
r = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
print('Top k', 'E[Best Affinity]', sep='\t')
for i in range(20):
    sample_best = r[:, :i+1].min(axis=-1)
    expected_best_value = sample_best.mean()
    print(f'{i+1:>2d}', f'{expected_best_value:.5f}', sep='\t')
Top k E[Best Affinity]
1 -1.84668
2 -2.22402
3 -2.39987
4 -2.52823
5 -2.61204
6 -2.67881
7 -2.73604
8 -2.78417
9 -2.82574
10 -2.86174
11 -2.89300
12 -2.92206
13 -2.94817
14 -2.97389
15 -2.99546
16 -3.01335
17 -3.03081
18 -3.04887
19 -3.06638
20 -3.08263
ML-designed sequences are divergent from the initial dataset#
The variants we’ve identified are divergent from anything present in the initial dataset. Let’s calculate some edit distance statistics between our top 1000 sequence library and the sequences in the initial dataset.
[22]:
x_1000 = library_1000['sequence'].values
x_1000 = np.array([[c for c in x_1000[i]] for i in range(len(x_1000))])
print(x_1000.shape)
x_init = table['sequence'].values
x_init = np.array([[c for c in x_init[i]] for i in range(len(x_init))])
print(x_init.shape)
edit_dist = 0
for i in range(x_1000.shape[1]):
    edit_dist += (x_1000[:, i][:, np.newaxis] != x_init[:, i])
edit_dist.shape
(1000, 118)
(7476, 118)
[22]:
(1000, 7476)
[23]:
min_edit_dist = edit_dist.min(axis=-1)
print(min_edit_dist.min(), min_edit_dist.mean(), min_edit_dist.max())
bins = np.arange(min_edit_dist.min(), min_edit_dist.max()+1) - 0.5
_ = plt.hist(min_edit_dist, bins=bins)
11 15.882 20
Sequences in the top 1000 library are, on average, about 16 edits away from the most similar sequence in the initial dataset; even the closest designed variant is 11 mutations from its nearest neighbor, and the most divergent is 20 mutations away from anything in the initial dataset.
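For larger libraries, the same nearest-neighbor statistics can also be computed with SciPy's pairwise distance routines. A sketch under the same assumptions (equal-length sequences, substitutions only):

from scipy.spatial.distance import cdist
# encode residues as integer codes so cdist can operate on them
a = np.array([[ord(c) for c in s] for s in library_1000['sequence']])
b = np.array([[ord(c) for c in s] for s in table['sequence']])
# 'hamming' returns the fraction of mismatched positions; scale by length to get edit counts
pairwise = cdist(a, b, metric='hamming') * a.shape[1]
print(pairwise.min(axis=-1).mean())  # should match min_edit_dist.mean() above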
Compare top sequences against alternative library designs#
Design a conventional combinatorial variant library (CVL)#
A typical library design approach is to re-randomize the substitution variants found in the initial screen. Let’s build a CVL based on the variants found in the top 50 sequences.
[24]:
# randomize the variants found in the best sequences in the initial library
# to create a CVL
amino_acids = 'ARNDCQEGHILKMFPSTWYV'
amino_acid_to_index = {amino_acids[i]: i for i in range(len(amino_acids))}
topk = 50
sequences = table.sort_values('log_kdnm').head(n=topk)['sequence'].values
sequences = np.array([[amino_acid_to_index[sequences[i][j]] for j in range(len(sequences[i]))] for i in range(len(sequences))])
sequences.shape
[24]:
(50, 118)
[25]:
counts = np.zeros((sequences.shape[1], len(amino_acids)))
for i in range(sequences.shape[1]):
    index, count = np.unique(sequences[:, i], return_counts=True)
    counts[i, index] = count
counts = (counts > 1)
cvl = counts/counts.sum(axis=-1, keepdims=True)
[26]:
# visualize the CVL
_, ax = plt.subplots(figsize=(20, 4))
ax.imshow(cvl.T)
_ = ax.set_yticks(np.arange(len(amino_acids)), [c for c in amino_acids])
_ = ax.set_xlabel('Position')
_ = ax.set_ylabel('Amino Acid')
Assess the expected performance of the conventional CVL#
[27]:
library_size = 10_000
random = np.random.RandomState(42)
cvl_sequences_sampled = np.zeros((library_size, cvl.shape[0]), dtype=int)
for i in range(len(cvl)):
    cvl_sequences_sampled[:, i] = random.choice(len(cvl[i]), p=cvl[i], size=(library_size), replace=True)
cvl_sequences_sampled = [''.join([amino_acids[i] for i in cvl_sequences_sampled[j]]) for j in range(cvl_sequences_sampled.shape[0])]
# only consider the unique sequences
cvl_sequences_sampled = np.unique(cvl_sequences_sampled).tolist()
[28]:
cvl_predict_future = model_future.predict(cvl_sequences_sampled)
print(cvl_predict_future.job)
cvl_predict_future.wait_until_done(verbose=True)
job_id='88eb91b6-58dc-4ab9-908a-23e5f061cc51' job_type=<JobType.workflow_predict: '/workflow/predict'> status=<JobStatus.PENDING: 'PENDING'> created_date=None start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None result=None
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [04:23<00:00, 2.64s/it, status=SUCCESS]
[28]:
True
[29]:
cvl_predict_results = cvl_predict_future.get()
cvl_y_mus = np.zeros(len(cvl_predict_results))
cvl_y_vars = np.zeros(len(cvl_predict_results))
for i, r in enumerate(cvl_predict_results):
    props = r.predictions[0].properties['log_kdnm']
    cvl_y_mus[i] = props['y_mu']
    cvl_y_vars[i] = props['y_var']
cvl_dists = scipy.stats.norm(cvl_y_mus, np.sqrt(cvl_y_vars))
[30]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)
format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)
# plot the predictive distribution over property values
c_lo = pred_dist_100.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_100.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 100)', **format_params)
# compare the expected performance of the 10,000 sequences from the CVL
c_lo = cvl_dists.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = cvl_dists.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='CVL (standard)', color='lightblue', hatch='/', **format_params)
_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
Sequences in the standard CVL are expected to perform better than the average sequence in the initial dataset, and may improve over the best sequences in the initial library, but they are expected to be significantly worse than the sequences in the bespoke library we created with the design API.
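A quick numeric check of this ordering, using the measured values and the predicted means already computed:

# compare mean (predicted) log Kd across the three pools discussed above
print(f"Initial dataset, mean measured log Kd:     {table['log_kdnm'].mean():.3f}")
print(f"CVL (standard), mean predicted log Kd:     {cvl_y_mus.mean():.3f}")
print(f"Designed (Top 100), mean predicted log Kd: {library_100['y_mu'].mean():.3f}")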
Compare the library success statistics#
[31]:
# what is the probability that a variant meeting the design criteria is present in each library?
# if we tested 10,000 sequences from the CVL, how would it compare with the bespoke top 100 sequences?
target = -2
print('target value <', target)
log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed, 100 sequences, probability of at least one success = {p_100:.5f}')
log_one_minus_p_cvl = cvl_dists.logsf(target)
p_cvl = 1 - np.exp(np.sum(log_one_minus_p_cvl))
print(f'CVL (standard), 10,000 sequences, probability of at least one success = {p_cvl:.5f}')
target value < -2
Designed, 100 sequences, probability of at least one success = 1.00000
CVL (standard), 10,000 sequences, probability of at least one success = 0.77343
We can see that even if we test 10,000 sequences from the CVL, the probability of finding a sequence that meets our target affinity (picomolar) is only about 77%, significantly lower than the virtual certainty that a success will be contained in the bespoke 100-sequence library. Comparing with the fine-grained results above, we would only need to test the top 3 bespoke sequence designs to reach the same probability of success.
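To make the "top 3" figure explicit, we can search for the smallest bespoke top-k whose success probability matches the full CVL. A short sketch using quantities computed above (target is still -2 here):

# smallest number of bespoke designs with success probability >= the 10,000-sequence CVL
cumulative = np.cumsum(pred_dist_1000.logsf(target))   # log P(all top-k miss), ordered best-first
p_topk = 1 - np.exp(cumulative)
k_equiv = int(np.argmax(p_topk >= p_cvl)) + 1
print(f'{k_equiv} bespoke sequences match the CVL success probability of {p_cvl:.3f}')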
Predicted distribution over the best affinity in each library#
Next, let’s examine the distribution over the affinity of the best sequence we expect to see in each library.
[32]:
cvl_dists.std().shape
[32]:
(9969,)
[33]:
# estimate best affinities in each library from sampling
num_samples = 10_000
r = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
sample_cvl = r.min(axis=-1)
r = np.random.randn(num_samples, 100)*pred_dist_100.std() + pred_dist_100.mean()
sample_best_100 = r.min(axis=-1)
[34]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)
format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)
# plot the predictive distribution over property values
_ = plt.hist(sample_best_100, bins=bins, label='Designed (Top 100)', density=True, **format_params)
_ = plt.hist(sample_cvl, bins=bins, label='CVL (standard), 10,000 sequences', color='lightblue', density=True, hatch='/', **format_params)
_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of best sequence in each library')
[35]:
# plot out the expected best value versus library size
r_des = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
r_cvl = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
print('Library Size', 'E[Best Affinity | Bespoke]', 'E[Best Affinity | CVL (standard)]', sep='\t')
for i in range(20):
    sample_best = r_des[:, :i+1].min(axis=-1)
    expected_best_value_des = sample_best.mean()
    sample_best = r_cvl[:, :i+1].min(axis=-1)
    expected_best_value_cvl = sample_best.mean()
    print(f'{i+1: 12d}', f'{expected_best_value_des:26.5f}', f'{expected_best_value_cvl:33.5f}', sep='\t')
Library Size E[Best Affinity | Bespoke] E[Best Affinity | CVL (standard)]
1 -1.84382 1.58382
2 -2.22086 1.13444
3 -2.39839 0.71612
4 -2.52292 0.36662
5 -2.61014 -0.04695
6 -2.67672 -0.07006
7 -2.73481 -0.10994
8 -2.78496 -0.12804
9 -2.82323 -0.12937
10 -2.85986 -0.13356
11 -2.89200 -0.15080
12 -2.92242 -0.20068
13 -2.94814 -0.26210
14 -2.97371 -0.26403
15 -2.99466 -0.34228
16 -3.01183 -0.35330
17 -3.02806 -0.35466
18 -3.04423 -0.35513
19 -3.06097 -0.35547
20 -3.07764 -0.35774
Design a better combinatorial variant library from the OpenProtein.AI design results#
[36]:
# randomize the variants found in the top 1000 designed sequences
# to create an optimized CVL
amino_acids = 'ARNDCQEGHILKMFPSTWYV'
amino_acid_to_index = {amino_acids[i]: i for i in range(len(amino_acids))}
sequences = library_1000['sequence'].values
sequences = np.array([[amino_acid_to_index[sequences[i][j]] for j in range(len(sequences[i]))] for i in range(len(sequences))])
sequences.shape
[36]:
(1000, 118)
[37]:
counts = np.zeros((sequences.shape[1], len(amino_acids)))
for i in range(sequences.shape[1]):
    index, count = np.unique(sequences[:, i], return_counts=True)
    counts[i, index] = count
# counts = (counts > 1)
cvl2 = counts/counts.sum(axis=-1, keepdims=True)
[38]:
# visualize the CVL
_, ax = plt.subplots(figsize=(20, 4))
ax.imshow(cvl2.T)
_ = ax.set_yticks(np.arange(len(amino_acids)), [c for c in amino_acids])
_ = ax.set_xlabel('Position')
_ = ax.set_ylabel('Amino Acid')
Assess the expected performance of the optimized CVL#
[39]:
library_size = 10_000
random = np.random.RandomState(42)
cvl2_sequences_sampled = np.zeros((library_size, cvl2.shape[0]), dtype=int)
for i in range(len(cvl2)):
    cvl2_sequences_sampled[:, i] = random.choice(len(cvl2[i]), p=cvl2[i], size=(library_size), replace=True)
cvl2_sequences_sampled = [''.join([amino_acids[i] for i in cvl2_sequences_sampled[j]]) for j in range(cvl2_sequences_sampled.shape[0])]
# only consider the unique sequences
cvl2_sequences_sampled = np.unique(cvl2_sequences_sampled).tolist()
[40]:
cvl2_predict_future = model_future.predict(cvl2_sequences_sampled)
print(cvl2_predict_future.job)
cvl2_predict_future.wait_until_done(verbose=True)
job_id='e6f2a992-72b0-4966-be48-70db344467cd' job_type=<JobType.workflow_predict: '/workflow/predict'> status=<JobStatus.PENDING: 'PENDING'> created_date=None start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None result=None
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [01:23<00:00, 1.20it/s, status=SUCCESS]
[40]:
True
[41]:
cvl2_predict_results = cvl2_predict_future.get()
cvl2_y_mus = np.zeros(len(cvl2_predict_results))
cvl2_y_vars = np.zeros(len(cvl2_predict_results))
for i, r in enumerate(cvl2_predict_results):
    props = r.predictions[0].properties['log_kdnm']
    cvl2_y_mus[i] = props['y_mu']
    cvl2_y_vars[i] = props['y_var']
cvl2_dists = scipy.stats.norm(cvl2_y_mus, np.sqrt(cvl2_y_vars))
[42]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)
format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)
# plot the predictive distribution over property values
c_lo = pred_dist_100.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = pred_dist_100.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='Designed (Top 100)', **format_params)
# compare the expected performance of the 10,000 sequences from the CVL
c_lo = cvl_dists.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = cvl_dists.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='CVL (standard)', color='lightblue', hatch='/', **format_params)
c_lo = cvl2_dists.cdf(bins[:-1][:, np.newaxis]).mean(axis=1)
c_hi = cvl2_dists.cdf(bins[1:][:, np.newaxis]).mean(axis=1)
weights = (c_hi - c_lo)/(bins[1:] - bins[:-1])
_ = plt.hist(bins[:-1], bins=bins, weights=weights, label='CVL (optimized)', color='blue', hatch='/', **format_params)
_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of designed libraries versus the initial dataset')
We can see that the CVL derived from variants identified by the design algorithm contains sequences that are expected to perform much better, on average, than the sequences in the standard CVL. Sequences in the optimized CVL are only expected to be slightly worse than the best bespoke variants we identified.
Compare the library success statistics#
[43]:
# what is the probability that a variant meeting the design criteria is present in each library?
# if we tested 10,000 sequences from the CVL, how would it compare with the bespoke top 100 sequences?
target = -2
print('target value <', target)
log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed, 100 sequences, probability of at least one success = {p_100:.5f}')
log_one_minus_p_cvl = cvl_dists.logsf(target)
p_cvl = 1 - np.exp(np.sum(log_one_minus_p_cvl))
print(f'CVL (standard), 10,000 sequences, probability of at least one success = {p_cvl:.5f}')
log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:100]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 100 sequences, probability of at least one success = {p_cvl2:.5f}')
log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:1000]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 1,000 sequences, probability of at least one success = {p_cvl2:.5f}')
log_one_minus_p_cvl2 = cvl2_dists.logsf(target)
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 10,000 sequences, probability of at least one success = {p_cvl2:.5f}')
target value < -2
Designed, 100 sequences, probability of at least one success = 1.00000
CVL (standard), 10,000 sequences, probability of at least one success = 0.77343
CVL (optimized), 100 sequences, probability of at least one success = 1.00000
CVL (optimized), 1,000 sequences, probability of at least one success = 1.00000
CVL (optimized), 10,000 sequences, probability of at least one success = 1.00000
[44]:
# what is the probability that a variant meeting the design criteria is present in each library?
# if we tested 10,000 sequences from the CVL, how would it compare with the bespoke top 100 sequences?
target = -3
print('target value <', target)
log_one_minus_p_100 = pred_dist_100.logsf(target)
p_100 = 1 - np.exp(np.sum(log_one_minus_p_100))
print(f'Designed, 100 sequences, probability of at least one success = {p_100:.5f}')
log_one_minus_p_cvl = cvl_dists.logsf(target)
p_cvl = 1 - np.exp(np.sum(log_one_minus_p_cvl))
print(f'CVL (standard), 10,000 sequences, probability of at least one success = {p_cvl:.5f}')
log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:100]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 100 sequences, probability of at least one success = {p_cvl2:.5f}')
log_one_minus_p_cvl2 = cvl2_dists.logsf(target)[:1000]
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 1,000 sequences, probability of at least one success = {p_cvl2:.5f}')
log_one_minus_p_cvl2 = cvl2_dists.logsf(target)
p_cvl2 = 1 - np.exp(np.sum(log_one_minus_p_cvl2))
print(f'CVL (optimized), 10,000 sequences, probability of at least one success = {p_cvl2:.5f}')
target value < -3
Designed, 100 sequences, probability of at least one success = 0.98015
CVL (standard), 10,000 sequences, probability of at least one success = 0.00813
CVL (optimized), 100 sequences, probability of at least one success = 0.72520
CVL (optimized), 1,000 sequences, probability of at least one success = 1.00000
CVL (optimized), 10,000 sequences, probability of at least one success = 1.00000
At this more aggressive target (log Kd < -3), the gap between strategies is stark: even 10,000 sequences from the standard CVL give less than a 1% chance of containing a success, the optimized CVL needs on the order of 1,000 sequences to approach certainty (100 sequences give only a 0.73 probability), and the 100-sequence bespoke library already achieves a 0.98 probability of success.
Predicted distribution over the best affinity in each library#
Next, let’s examine the distribution over the affinity of the best sequence we expect to see in each library.
[45]:
# estimate best affinities in each library from sampling
num_samples = 10_000
r = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
sample_cvl = r.min(axis=-1)
r = np.random.randn(num_samples, len(cvl2_y_mus))*cvl2_dists.std() + cvl2_dists.mean()
sample_cvl2 = r.min(axis=-1)
r = np.random.randn(num_samples, 100)*pred_dist_100.std() + pred_dist_100.mean()
sample_best_100 = r.min(axis=-1)
[46]:
# plot the expected results vs. the input data
bins = np.linspace(-4, 6, 101)
format_params = {'alpha': 0.5, 'histtype': 'stepfilled', 'ec': 'k'}
_ = plt.hist(table['log_kdnm'].values, bins=bins, label='Initial Dataset', color='gray', density=True, **format_params)
# plot the predictive distribution over property values
_ = plt.hist(sample_best_100, bins=bins, label='Designed (Top 100)', density=True, **format_params)
_ = plt.hist(sample_cvl, bins=bins, label='CVL (standard), 10,000 sequences', color='lightblue', density=True, hatch='/', **format_params)
_ = plt.hist(sample_cvl2, bins=bins, label='CVL (optimized), 10,000 sequences', color='blue', density=True, hatch='/', **format_params)
_ = plt.legend(loc='best')
_ = plt.xlabel('log$_{10}$ $K_D$ nM')
_ = plt.ylabel('Density')
_ = plt.title('Affinity distributions of best sequence in each library')
[47]:
# plot out the expected best value versus library size
r_des = np.random.randn(num_samples, 1000)*pred_dist_1000.std() + pred_dist_1000.mean()
r_cvl = np.random.randn(num_samples, len(cvl_y_mus))*cvl_dists.std() + cvl_dists.mean()
r_cvl2 = np.random.randn(num_samples, len(cvl2_y_mus))*cvl2_dists.std() + cvl2_dists.mean()
print('Library Size', 'E[Best Affinity | Bespoke]', 'E[Best Affinity | CVL (standard)]', 'E[Best Affinity | CVL (optimized)]', sep='\t')
for i in range(20):
    sample_best = r_des[:, :i+1].min(axis=-1)
    expected_best_value_des = sample_best.mean()
    sample_best = r_cvl[:, :i+1].min(axis=-1)
    expected_best_value_cvl = sample_best.mean()
    sample_best = r_cvl2[:, :i+1].min(axis=-1)
    expected_best_value_cvl2 = sample_best.mean()
    print(f'{i+1: 12d}', f'{expected_best_value_des:26.5f}', f'{expected_best_value_cvl:33.5f}', f'{expected_best_value_cvl2:33.5f}', sep='\t')
Library Size E[Best Affinity | Bespoke] E[Best Affinity | CVL (standard)] E[Best Affinity | CVL (optimized)]
1 -1.84565 1.59247 -1.69554
2 -2.22522 1.13991 -1.99613
3 -2.40500 0.71012 -2.10122
4 -2.52714 0.36688 -2.23626
5 -2.61457 -0.03296 -2.35743
6 -2.68090 -0.05541 -2.41145
7 -2.73863 -0.09612 -2.44894
8 -2.78669 -0.11404 -2.53044
9 -2.82268 -0.11523 -2.54199
10 -2.85709 -0.12108 -2.54264
11 -2.88716 -0.13785 -2.55852
12 -2.91473 -0.18954 -2.60263
13 -2.94081 -0.25322 -2.60596
14 -2.96515 -0.25493 -2.63060
15 -2.98712 -0.33339 -2.67014
16 -3.00522 -0.34357 -2.70753
17 -3.02198 -0.34494 -2.72773
18 -3.03734 -0.34525 -2.76505
19 -3.05492 -0.34575 -2.78563
20 -3.07325 -0.34829 -2.80568
Summary and next steps#
In this example, we evaluated:

- The initial antibody variable fragment mutagenesis dataset, 14H.
- A conventional combinatorial variant library designed using the top 50 sequences in the 14H dataset.
- A machine learning-designed library using OP Models.
- A combinatorial variant library based on our machine learning-designed library.
For each library, we predicted expected performance and evaluated the library size at which one design strategy outperforms another. This can be useful for making decisions about library design strategy and the size trade-offs between fully bespoke and partially randomized libraries. We also explored a strategy for designing semi-randomized combinatorial variant libraries using the design tool.
We encourage you to try this out on your own libraries when making choices on library size and composition. Happy modeling!