Predicting the fitness of isomerases without experimental data#

Chorismate mutase (CM; EC 5.4.99.5) plays a pivotal role in the biosynthesis of aromatic compounds, catalysing a key reaction in the shikimate pathway: the conversion of chorismate to prephenate. CM accelerates this reaction roughly a million-fold, and is therefore essential to bacterial, archaeal, fungal, and plant life.

CM is instrumental in the in vivo production of two aromatic amino acids, phenylalanine and tyrosine. It operates at a pivotal branch point in the shikimate pathway, channelling its substrate (chorismate) towards the biosynthesis of tyrosine and phenylalanine and away from the third aromatic amino acid, tryptophan. It can therefore help determine the relative balance of the aromatic amino acids in many biological systems.

Beyond this core role in aromatic amino acid synthesis, CM has several other interesting characteristics from both a biological and an industrial perspective:

  • CM is used in industrial biotechnology to develop novel antimicrobials efficiently

  • CM is the only known natural enzyme that catalyses a pericyclic reaction.

  • Certain variants of this enzyme are thought to show allosteric regulation

CM is therefore a good candidate for mutagenesis, both to improve its biotechnological applications and to study its unique biochemistry.

Here, we will use OpenProtein.AI’s PoET pipelines to explore chorismate mutase, examining its function and its potential for enhanced activity through bioengineering.

PoET#

PoET is a state-of-the-art protein language model that enables the evaluation and generation of protein sequences from scratch, without requiring any functional measurements of your protein.

PoET’s core workflow revolves around a ‘prompt’: a strategically assembled set of sequences encapsulating essential information about the local fitness landscape and co-evolutionary patterns related to your protein of interest. Prompt creation involves two steps: first, generating a multiple sequence alignment (MSA) of sequences related to your protein, spanning various evolutionary lineages, and second, applying filters to refine this MSA into a focused prompt.

The sequences, owing to their shared evolutionary ties, inherently embed information about the protein’s local fitness landscape. PoET uses this information to assess the probability of observing a specific sequence under the inferred evolutionary process. PoET can also generate novel sequences; in this tutorial we will first use its scoring function to identify promising chorismate mutase variants, and then use its generative capability to propose new ones.
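
Before stepping through the notebook, here is a compressed sketch of the PoET workflow we will follow. The same SDK calls are demonstrated one by one in the cells below; seed and candidate_sequences are placeholders for the ungapped wild-type sequence and the sequences we want to score.

msa = session.align.create_msa(seed.encode())        # 1. build an MSA around a seed sequence
prompt = msa.sample_prompt(num_ensemble_prompts=3,   # 2. filter the MSA into ensembled prompts
                           random_seed=42)
poet = session.embedding.get_model('poet')
scores = poet.score(prompt=prompt, sequences=candidate_sequences)  # 3. score candidate sequences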

[1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import pearsonr, spearmanr
import json
import seaborn as sns

from umap import UMAP
sns.set()
[2]:
import openprotein # install from pypi
import openprotein.fasta as fasta

Connect to the OpenProtein.AI API#

[3]:
with open('secrets.config', 'r') as f:
    config = json.load(f)

session = openprotein.connect(config['username'], config['password'])

Load the chorismate mutase dataset#

We have a dataset of chorismate mutase activity measurements from Russ et al. 2020.

It contains the following information (as seen below):

  • Species - the source of Chorismate Mutase (CM) sequence

  • Sequence - the amino acid sequence

  • ID to EcCM - the sequence identity to E. coli CM (EcCM).

  • top ID - the highest sequence identity to any sequence in the data (exc. EcCM).

  • norm r.e. - the normalized relative enrichment from the assay

The enrichment assay (norm r.e.) uses an E. coli strain with its CM gene knocked out, into which variant CM genes are re-introduced. Because CM is essential for E. coli growth on minimal media (i.e. media lacking aromatic amino acids), the abundance of E. coli carrying each variant can be used as a proxy for enzymatic activity.

The dataset also includes a computational measure of protein fitness introduced in the Russ et al. 2020 paper above:

  • EDCA - sequence modelling energy; lower energies are associated with higher probability

We will compare the wetlab functional assay (norm r.e.) with EDCA and with OpenProtein.AI’s proprietary model, PoET, which makes scoring and generating novel sequences easy.

[4]:
table_natural = pd.read_csv('data/chorismate_mutase_natural_sequences.csv' )
table_variant = pd.read_csv('data/chorismate_mutase_variant_sequences.csv')

table_natural.head()
[4]:
No. Species Sequence ID to EcCM top ID EDCA (rel. to EcCM) norm r.e.
0 1 Escherichia coli ... -TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPV... 1.00 0.93 0.00 1.00
1 2 Thermus thermophilus HB8 ... ---DERIQALRKEVDRVNREILRLLSERGRLVQEIGRLQTELGLPH... 0.26 0.89 28.62 -0.06
2 3 Pseudomonas aeruginosa UCBPP-PA14 ... PEDCTGLADIREAIDQIDLDIVQALGRRMDYVKAASRFKANEA-AI... 0.20 0.71 -4.19 0.04
3 4 Enterobacter cloacae subsp. dissolvens SDM ... -TPENPLLDLRVKISALDEKLLALLAERRALAIEVGKAKLDSHRPV... 0.93 0.93 8.57 0.65
4 5 Thermus aquaticus Y51MC23 ... ---DERIQALRREVDRVNRELLRLLSERGRLVQEIGRIQTELGLPH... 0.28 0.89 18.42 -0.08

Datasets#

Let’s look at the Species with the highest identity to EcCM:

  • We can see norm r.e. is benchmarked to EcCM (=1)

  • We can see the most similar sequences all score lower than EcCM

[5]:
table_natural.sort_values("ID to EcCM", ascending=False).drop(columns=["Sequence"]).head()
[5]:
No. Species ID to EcCM top ID EDCA (rel. to EcCM) norm r.e.
0 1 Escherichia coli ... 1.00 0.93 0.00 1.00
3 4 Enterobacter cloacae subsp. dissolvens SDM ... 0.93 0.93 8.57 0.65
6 7 Cronobacter sakazakii ATCC BAA-894 ... 0.91 0.90 6.42 0.95
5 6 Salmonella enterica subsp. enterica serovar Sa... 0.90 0.89 9.83 0.70
8 9 Erwinia billingiae Eb661 ... 0.80 0.85 26.25 0.84
[6]:
table_natural.sort_values("EDCA (rel. to EcCM)", ascending=False).drop(columns=["Sequence"]).head()
[6]:
No. Species ID to EcCM top ID EDCA (rel. to EcCM) norm r.e.
379 380 Arthrobacter sp. FB24 ... 0.18 0.83 54.95 0.07
380 381 Arthrobacter phenanthrenivorans Sphe3 ... 0.18 0.83 53.67 -0.03
398 399 Helicobacter pylori B38 ... 0.27 0.89 52.00 0.88
1057 1058 Caulobacter sp. K31 ... 0.19 0.30 50.81 -0.06
399 400 Helicobacter pylori SouthAfrica7 ... 0.25 0.89 49.09 0.90
[7]:
import seaborn as sns
from scipy.stats import spearmanr

_, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

natural_corr, natural_p = spearmanr(table_natural['EDCA (rel. to EcCM)'].values, table_natural['norm r.e.'].values)
variant_corr, variant_p = spearmanr(table_variant['EDCA (rel. to EcCM)'].values, table_variant['norm r.e.'].values)

sns.regplot(data=table_natural, x='EDCA (rel. to EcCM)', y='norm r.e.', ax=axs[0])
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("EDCA")
axs[0].set_ylabel("Enrichment")

sns.regplot(data=table_variant, x='EDCA (rel. to EcCM)', y='norm r.e.', ax=axs[1])
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("EDCA")

plt.tight_layout()
plt.show()

../_images/walkthroughs_predicting-fitness_11_0.png

In the plots displayed above, we examine the relationships between the EDCA and the normalized relative enrichment over EcCM (‘norm r.e.’) for both ‘Natural’ and ‘Variant’ sequences.

We can see from the ‘Natural’ plot above that there is no clear relationship between the EDCA of a natural sequence and its relative enrichment; this is confirmed by a high Spearman p-value (0.24) and a low correlation coefficient (0.04).

By contrast, the ‘Variant’ plot shows a different picture. The data here exhibits a more patterned distribution, suggesting a relationship between the two variables. The Spearman correlation coefficient for the ‘Variant’ dataset is large and negative (approximately -0.4) and the p-value is significant (<0.05). Given that a more negative EDCA indicates a more likely sequence, this amounts to a positive relationship between the statistical likelihood of a sequence and its enrichment.

Lastly, we can also see that the range of EDCA values is much larger for the variant sequences than for the natural sequences.

Interestingly, given the E. coli-based enrichment assay, we might expect identity to EcCM alone to be a reasonable predictor of enzymatic activity. We find this to be somewhat true (see below):

[8]:
from Bio import pairwise2

ECCM_sequence = table_natural.query("`ID to EcCM`==1").Sequence[0]

# raw global alignment scores (1 per matched position, no mismatch or gap penalties),
# used as a simple similarity proxy to EcCM
table_natural['EcCM_similarity'] = [pairwise2.align.globalxx(i, ECCM_sequence)[0].score for i in table_natural.Sequence]
table_variant['EcCM_similarity'] = [pairwise2.align.globalxx(i, ECCM_sequence)[0].score for i in table_variant.Sequence]
/Users/jmage/.micromamba/envs/openprotein-sdk/lib/python3.10/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
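
As the warning notes, Bio.pairwise2 is deprecated. A minimal equivalent using Bio.Align.PairwiseAligner, configured to mimic globalxx scoring (+1 per matched position, no mismatch or gap penalties), might look like the sketch below; the settings shown are our assumption for reproducing the same raw scores.

from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = 'global'
aligner.match_score = 1        # globalxx counts matches only...
aligner.mismatch_score = 0     # ...with no mismatch penalty...
aligner.open_gap_score = 0     # ...and no gap penalties
aligner.extend_gap_score = 0

table_natural['EcCM_similarity'] = [aligner.score(s, ECCM_sequence) for s in table_natural.Sequence]
table_variant['EcCM_similarity'] = [aligner.score(s, ECCM_sequence) for s in table_variant.Sequence]
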
[9]:
import seaborn as sns
from scipy.stats import spearmanr

_, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

natural_corr, natural_p = spearmanr(table_natural['EcCM_similarity'].values, table_natural['EDCA (rel. to EcCM)'].values)
variant_corr, variant_p = spearmanr(table_variant['EcCM_similarity'].values, table_variant['EDCA (rel. to EcCM)'].values)

sns.regplot(data=table_natural, x='EcCM_similarity', y='EDCA (rel. to EcCM)', ax=axs[0])
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("EcCM similarity %")
axs[0].set_ylabel("EDCA")

sns.regplot(data=table_variant, x='EcCM_similarity', y='EDCA (rel. to EcCM)', ax=axs[1])
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("EcCM similarity %")

plt.tight_layout()
plt.show()

../_images/walkthroughs_predicting-fitness_15_0.png
[10]:
import seaborn as sns
from scipy.stats import spearmanr

_, axs = plt.subplots(1, 2, figsize=(12, 4), sharex=True, sharey=True)

natural_corr, natural_p = spearmanr(table_natural['EcCM_similarity'].values, table_natural['norm r.e.'].values)
variant_corr, variant_p = spearmanr(table_variant['EcCM_similarity'].values, table_variant['norm r.e.'].values)

sns.regplot(data=table_natural, x='EcCM_similarity', y='norm r.e.', ax=axs[0])
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("EcCM similarity %")
axs[0].set_ylabel("Enrichment")

sns.regplot(data=table_variant, x='EcCM_similarity', y='norm r.e.', ax=axs[1])
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("EcCM similarity %")

plt.tight_layout()
plt.show()

../_images/walkthroughs_predicting-fitness_16_0.png

We can see above that there is a statistically significant relationship between sequence identity to EcCM and relative enrichment. The coefficients are <0.2 and therefore not particularly strong, but they do provide a simple baseline against which to judge model performance.

Let’s now use our PoET model to score these same sequences.

Generating an MSA and Prompt for PoET#

We will build a prompt around the E. coli chorismate mutase sequence, first generating an MSA and then filtering it into a curated prompt against which to evaluate our target sequences.

[11]:
seed = table_natural.iloc[0]['Sequence']
seed = seed.replace('-', '')
seed
[11]:
'TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH'
[12]:
msa = session.align.create_msa(seed.encode())
print(msa)

job_id='5548acb6-cdc0-4af0-908e-93c96012c19d' job_type=<JobType.align_align: '/align/align'> status=<JobStatus.SUCCESS: 'SUCCESS'> created_date=datetime.datetime(2024, 10, 17, 9, 47, 32, 126851) start_date=None end_date=datetime.datetime(2024, 10, 17, 9, 47, 32, 128787) prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None

Let’s examine the resulting MSA:

We can see the constructed MSA contains over 3,700 sequences, only 4 of which also appear in our natural sequence dataset:

[13]:
msa_result = pd.DataFrame(list(msa.get()), columns=['name','sequence'])
print(f"Number of sequences in the MSA: {msa_result.sequence.nunique()}")
Number of sequences in the MSA: 3741
[14]:
set(table_natural.Sequence.str.replace("-","").tolist()) & set(msa_result.sequence.str.replace("-","").tolist())
[14]:
{'EQNLVNLREKISNIDADLLDLLAKRRQLANNIAKTKLLENRPIRDKNRERELLNVLITKGKNRGLDGFYITRLFQMIIEDSVLTQQALLQQH',
 'TENSLLALRERISALDLKLLALLAERRTMAIEVAQSKMHSHRPIRDKEREAELINTLIQEGKKQGLDGHYITRVYQLIIEDSVLTQQALLQQH',
 'TPENPLLDLRVKISALDEKLLALLAERRALAIEVGKAKLDSHRPVRDIDRERDLLERLIQLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH',
 'TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH'}

We have generated a large, diverse MSA that can be filtered into a suitable prompt for our PoET Model.

Rank the natural and variant sequences#

Creating a prompt from an MSA involves filtering the ~3700 sequences down to ~100 based on a number of criteria. Our aim is for the resulting prompt to encode relevant evolutionary and fitness information that our PoET model can use as context when scoring or generating new sequences. For now, let’s try the default prompt settings (a fully explicit call is sketched after this list):

  • num_residues: This controls the maximum number of residues (tokens) in the prompt. We’ll use the default of 12288 here; the prompts sampled below each end up containing 100 sequences.

  • method: This parameter signifies the MSA sampling method to be used. The default method is NEIGHBORS_NONGAP_NORM_NO_LIMIT.

  • homology_level: Applicable for the neighbors methods only, this parameter controls the level of homology for sequences in the MSA. The value ranges between 0 and 1, with the default set at 0.8.

  • max_similarity: This defines the maximum similarity between sequences in the MSA and the seed, ranging between 0 and 1. By default, it is set to 1.0.

  • min_similarity: This determines the minimum similarity between sequences in the MSA and the seed, with values ranging from 0 to 1. The default is set at 0.0.

  • always_include_seed_sequence: This Boolean parameter controls whether to always include the seed sequence in the MSA or not. By default, it is set to False.

  • num_ensemble_prompts: This denotes the number of ensemble jobs to run; each job uses a different prompt, yielding a more diverse set of scores. We will set this value to 3.

  • random_seed: The seed allows you to make reproducible runs. We’ll set it to 42 here.
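
For reference, a fully explicit call spelling out the defaults listed above might look like the sketch below. We assume here that each of these names is accepted as a keyword argument by sample_prompt (method is left at its default); the cell that follows passes only num_ensemble_prompts and random_seed.

prompt = msa.sample_prompt(
    num_residues=12288,                  # cap on total prompt tokens
    homology_level=0.8,                  # neighbors-based homology filtering level
    max_similarity=1.0,                  # no upper similarity cutoff vs the seed
    min_similarity=0.0,                  # no lower similarity cutoff vs the seed
    always_include_seed_sequence=False,
    num_ensemble_prompts=3,              # three prompts -> an ensemble of scores
    random_seed=42,                      # reproducible sampling
)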

[15]:
prompt = msa.sample_prompt(num_ensemble_prompts=3, random_seed=42)
print(prompt)
job_id='6def2dce-944d-49bd-be8b-f8963d9fd42c' job_type=<JobType.align_prompt: '/align/prompt'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 9, 47, 37, 42067) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None
[16]:
prompt.wait() # wait for job to complete

prompt_result = []
for i in [1,2,3]:
    prompt_result.append( pd.DataFrame( list(prompt.get(i)) , columns=['name','sequence']) )

We have created 3 prompts, each of 100 sequences. We can see the prompt for the first model (in the ensemble of 3 models) below:

[17]:
prompt_result[0]
[17]:
name sequence
0 A0A1Y2L7K8 TTKNPCSDVRANVDRVDRLLVALMAERLNYIEQAARIKPTRDAVRD...
1 A0A523L5L1 SDDSLQKLRSQIDAIDDQLLELFNQRAGCAVEVAEVKRKSGGSIFR...
2 A0A7C2R2J8 DINDWRKRIDELDAELVKLLNERSRCAAEVGKLKRKAGTPLYQPER...
3 A0A2A4N5R5 DSTLGELRHLIDDFDQKLLELLENRMQVAKTIGHYKKENNITVLQN...
4 C6XN30 VRYEIDRIDRLLVEILAERTTYIKAASRIKPDREMVRDVPRIEDVV...
... ... ...
95 A0A521G9H4 LQSLRDHIDALDDQLLQLLNQRMDYVRQVGELKRSTNAIIYRPERE...
96 A0A661QGZ3 TNSNLDDLRQRIDLIDNQILGLLNRRAEVVVAVGKAKEGAEKTFYV...
97 A0A7W9KC75 SEHRLSDLRAELDRLDQVLLDTIRRRLECGVQIAEHKAANDVPMMQ...
98 A0A7Y7D3I2 LQDVRDNIDRLDETIVPLLAERIGYVLQAAQFKETRADVRIPPRIE...
99 B0NB59 LEELRIRLDDVDEQMVRLFEERMSICEEVGEYK

100 rows × 2 columns

Each of the three prompts will satisfy the criteria given above, but will use different sequences to do so. This diversity will enable us to run the model 3 times and get an ensemble of scores for a more robust comparison.

Let’s see how many sequences overlap between the prompts:

[18]:
overlap = np.zeros((3,3))
seqs = []
for i,p1 in enumerate(prompt_result):
    for j,p2 in enumerate(prompt_result):
        overlapped_seqs = list(set(p1['sequence']) & set(p2['sequence']))
        overlap[i,j] = len(overlapped_seqs)
        if i!=j:
            seqs += overlapped_seqs
seqs = list(set(seqs))
[19]:
pd.DataFrame(overlap, columns=['prompt1','prompt2','prompt3'], index=['prompt1','prompt2','prompt3'],).astype(int)
[19]:
prompt1 prompt2 prompt3
prompt1 100 2 5
prompt2 2 100 5
prompt3 5 5 100

We can see that a handful of sequences appear more than once, including TSENPLL…, which occurs under both accessions A0A379UTL6 and A0A5R1YFZ7 because it is present in two different ColabFold sources.
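
One quick way to check this, reusing the prompt_result dataframes from above, is to pool the three prompts and list the accessions attached to each unique sequence (a hypothetical snippet; any sequence present under more than one accession will be listed):

pooled = pd.concat(prompt_result)
names_per_seq = pooled.groupby('sequence')['name'].unique()
names_per_seq[names_per_seq.str.len() > 1]   # sequences that appear under multiple accessions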

[20]:
pd.concat(prompt_result).query("sequence in @seqs").drop_duplicates()
[20]:
name sequence
17 A0A3A1Y7B8 SNRKLELIREDINNIDQQLIDLLSKRRDLSAEVAQYKIKECKPIKD...
24 UPI001423B622 ENPLLVLREKISAVDEKLLTLLAERRALAVDVAQAKLATHRPIRDI...
54 A0A512MB85 DLQAIRVRIDAIDEQLIRLLNERGDLVLEVAEVKKKAGLDIYVPGR...
55 A0A4R2PQ94 DDPLVAARTRIAEIDRTLLDLLAERRAMSRRIAEHKLATGTAVRNT...
96 A0A661QGZ3 TNSNLDDLRQRIDLIDNQILGLLNRRAEVVVAVGKAKEGAEKTFYV...
97 A0A7W9KC75 SEHRLSDLRAELDRLDQVLLDTIRRRLECGVQIAEHKAANDVPMMQ...
6 A0A1E7HTR6 EQELENLRRQLDTIDQNIVSLLAKRQAEVERVVKLKKAHNLPVYHP...
74 A0A2D5QF52 KDKLIQLRDSIDSLDEELLQLINKRASFAVEVANIKEGLGDNTFKP...
89 C6CBA0 TDNPLQALRERITTLDTQLLDLLAQRRELALDVARAKQHSHRPIRD...
93 A0A2D8VTW8 TQNKLDELRSKIDKCDNKILNLLDDRFKIALKIGKLKKDKNLLIEH...

PoET Scoring#

We will now use the 3 diverse prompts above to condition our PoET model and score our database of CM sequences. To do this, we first remove any gap characters (-) and then send the sequences to the PoET scoring endpoint:

[21]:
poet = session.embedding.get_model('poet')
[22]:
sequences = table_natural['Sequence'].apply(lambda s: s.replace('-', '').encode()).values
print(sequences.shape)

future_natural = poet.score(prompt=prompt, sequences=sequences)
print(future_natural.job)
(1130,)
num_records=3390 job_id='21149f19-305d-43a5-9926-4cc1d69155a7' job_type=<JobType.poet_score: '/poet/score'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 9, 48, 22, 189239, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None
[23]:
sequences = table_variant['Sequence'].apply(lambda s: s.replace('-', '').encode()).values
print(sequences.shape)

future_variant = poet.score(prompt=prompt, sequences=sequences)

print(future_variant.job)
(1618,)
num_records=4854 job_id='cfac1d68-c607-4661-8da1-117475ee194f' job_type=<JobType.poet_score: '/poet/score'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 9, 48, 27, 908032, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None

We can wait for the jobs to complete; this will also auto-fetch the results:

[24]:
results_natural = future_natural.wait(verbose=True)
results_variant = future_variant.wait(verbose=True)

print(len(results_natural), len(results_variant))
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:21<00:00,  4.63it/s, status=SUCCESS]
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:10<00:00,  9.25it/s, status=SUCCESS]
1130 1618
[25]:
# each result is (name, sequence, scores), with one score per ensemble prompt
scores_natural = np.array([r[2] for r in results_natural])
scores_variant = np.array([r[2] for r in results_variant])
scores_natural.shape, scores_variant.shape
[25]:
((1130, 3), (1618, 3))
[26]:
score_natural = scores_natural.mean(axis=1)  # average over the 3 ensemble prompts
score_variant = scores_variant.mean(axis=1)
score_natural.shape, score_variant.shape
[26]:
((1130,), (1618,))

We can now compare the enrichment scores against the PoET likelihoods for this default prompt:

[27]:
_, axs = plt.subplots(1, 2, figsize=(12, 4))

x = score_natural
y = table_natural['norm r.e.'].values
sns.regplot(x=x, y=y, ax=axs[0])
natural_corr, natural_p = spearmanr(x, y)
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_xlabel("Score")
axs[0].set_ylabel("Enrichment")

x = score_variant
y = table_variant['norm r.e.'].values
sns.regplot(x=x, y=y, ax=axs[1])
variant_corr, variant_p = spearmanr(x, y)
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_xlabel("Score")
axs[1].set_ylabel("Enrichment")

plt.tight_layout()
plt.show()

../_images/walkthroughs_predicting-fitness_48_0.png

We can see above that enrichment for both the natural and the variant sequences correlates well with the PoET likelihood scores when conditioned on the default prompt. Both relationships have p<0.05, indicating significance; furthermore, the correlation coefficient is stronger for the variant sequences than for the natural sequences (0.5 vs 0.35).

We can also see that both these coefficients exceed the <0.2 correlation we saw with our simple EcCM identity baseline model.

Most strikingly, however, both of these results improve upon the EDCA measure from Russ et al. 2020 that we saw above, where the correlation coefficients were ~0.04 and ~0.4 (in magnitude) for natural and variant sequences respectively.

Combining these sequences we can see the trend still holds:

[28]:
x_combined = np.concatenate([score_natural, score_variant])
y_combined = np.concatenate([table_natural['norm r.e.'].values, table_variant['norm r.e.'].values])
correlation, p_value = spearmanr(x_combined, y_combined)

sns.regplot(x=x_combined, y=y_combined)
plt.title(f'All Sequences (Spearman: {correlation:.2f}, p-value: {p_value:.2f})')
plt.xlabel("Score")
plt.ylabel("Enrichment")
plt.tight_layout()
plt.show()
../_images/walkthroughs_predicting-fitness_51_0.png

Lastly, let’s compare PoET scores to EcCM identity:

[29]:
_, axs = plt.subplots(1, 2, figsize=(12, 4))

x = table_natural['EcCM_similarity'].values
y = score_natural
sns.regplot(x=x, y=y, ax=axs[0])
natural_corr, natural_p = spearmanr(x, y)
axs[0].set_title(f'Natural sequences (Coeff: {natural_corr:.2f}, p-value: {natural_p:.2f})')
axs[0].set_ylabel("Score")
axs[0].set_xlabel("EcCM similarity %")

x = table_variant['EcCM_similarity'].values
y = score_variant
sns.regplot(x=x, y=y, ax=axs[1])
variant_corr, variant_p = spearmanr(x, y)
axs[1].set_title(f'Variant sequences (Coeff: {variant_corr:.2f}, p-value: {variant_p:.2f})')
axs[1].set_ylabel("Score")
axs[1].set_xlabel("EcCM similarity %")

plt.tight_layout()
plt.show()

../_images/walkthroughs_predicting-fitness_53_0.png

We can see there is a relationship between PoET scores and EcCM similarity. However, given that the PoET score vs enrichment correlations are much stronger than the similarity vs enrichment correlations, the model is evidently learning something more fundamental about protein fitness from the MSA-based prompts.

Improving the prompt#

We have seen how PoET, conditioned on a default prompt, produces statistically significant correlations between the experimental relative enrichment and its likelihoods.

Let’s now try to improve the prompting by guiding the model with purely natural, functional chorismate mutase sequences. To accomplish this, we will skip the seed -> MSA generation step and supply a ready-made MSA instead, containing only functional natural sequences (where functional is defined as an enrichment of at least 0.45).

Providing sequences that operate within the natural functional subspace allows us to steer the PoET model’s attention towards naturally occurring, functionally viable protein sequences. This focused approach enables the model to discern evolutionary patterns, relationships and constraints more effectively. Consequently, when scoring potential sequences, PoET is better equipped to make accurate and informed predictions that are grounded in naturally observed functionality and diversity.

[30]:
table_nat_functional = table_natural.loc[table_natural['norm r.e.'] >= 0.45]
print(table_nat_functional.shape)
table_nat_functional.head()
(407, 8)
[30]:
No. Species Sequence ID to EcCM top ID EDCA (rel. to EcCM) norm r.e. EcCM_similarity
0 1 Escherichia coli ... -TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPV... 1.00 0.93 0.00 1.00 96.0
3 4 Enterobacter cloacae subsp. dissolvens SDM ... -TPENPLLDLRVKISALDEKLLALLAERRALAIEVGKAKLDSHRPV... 0.93 0.93 8.57 0.65 89.0
5 6 Salmonella enterica subsp. enterica serovar Sa... -TSENPLLALRDKISALDEELLALLAKRRALAIEVGQAKLLSHRPV... 0.90 0.89 9.83 0.70 86.0
6 7 Cronobacter sakazakii ATCC BAA-894 ... -TAENPLLALRDKISALDEQLLALLAQRRALAVEVGKAKLATHRPV... 0.91 0.90 6.42 0.95 87.0
8 9 Erwinia billingiae Eb661 ... -TPENPLLGLRDKISALDEQLLNMLAQRRALAIEVATAKMATHRPI... 0.80 0.85 26.25 0.84 77.0
[31]:
from io import BytesIO

msa_const = session.align.upload_msa(BytesIO(b'\n'.join(table_nat_functional['Sequence'].apply(lambda s: s.encode()))))
print(msa_const)


job_id='b1b00662-4aa8-4b0d-88af-46b402d46d3d' job_type=<JobType.align_align: '/align/align'> status=<JobStatus.SUCCESS: 'SUCCESS'> created_date=datetime.datetime(2024, 10, 17, 9, 49, 19, 152529) start_date=None end_date=datetime.datetime(2024, 10, 17, 9, 49, 19, 152939) prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None
[32]:
prompt_const = msa_const.sample_prompt( num_ensemble_prompts=3, random_seed=42)
print(prompt_const)
job_id='67b7ce8e-f0db-414a-af86-e6d25b70133c' job_type=<JobType.align_prompt: '/align/prompt'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 9, 49, 20, 885660) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=None sequence_length=None
[33]:
sequences = table_variant['Sequence'].apply(lambda s: s.replace('-', '').encode()).values
print(sequences.shape)


future_variant_const = poet.score(prompt=prompt_const, sequences=sequences)

print(future_variant_const.job)
(1618,)
num_records=4854 job_id='4f8282dd-9ed6-407c-8859-bbb99ddd2a95' job_type=<JobType.poet_score: '/poet/score'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 9, 49, 26, 879280, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None

We can again wait for the job to complete.

[34]:
results_variant_const = future_variant_const.wait(verbose=True)
print(len(results_variant_const))
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [00:26<00:00,  3.76it/s, status=SUCCESS]
1618
[35]:
scores_variant_const = np.array([r[2] for r in results_variant_const])
score_variant_const = scores_variant_const.mean(axis=1)
scores_variant_const.shape, score_variant_const.shape
[35]:
((1618, 3), (1618,))
[36]:
_, ax = plt.subplots(figsize=(6, 4))

x = score_variant_const
y = table_variant['norm r.e.'].values


correlation, p_value = spearmanr(x, y)

sns.regplot(x=x, y=y)
plt.title(f'Variant Sequences - specialized prompt (Spearman: {correlation:.2f}, p-value: {p_value:.2f})')
plt.xlabel("Score")
plt.ylabel("Enrichment")
plt.tight_layout()
plt.show()
../_images/walkthroughs_predicting-fitness_63_0.png

We can see above that the specialized prompt has improved the Spearman correlation for the variant dataset from 0.5 (with the default prompt parameters) to 0.55.

Sequence generation#

Given the good performance of this prompt in scoring sequences, we can re-use it to generate novel sequences de novo.

[37]:
future_gen = poet.generate(prompt_const, num_samples=1000, topp=0.9)  # top-p (nucleus) sampling with p=0.9
print(future_gen.job)
num_records=1000 job_id='894afa58-f6c4-4167-bd10-bdb5aff832e2' job_type=<JobType.poet_generate: '/poet/generate'> status=<JobStatus.PENDING: 'PENDING'> created_date=datetime.datetime(2024, 10, 17, 9, 50, 1, 236919, tzinfo=TzInfo(UTC)) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 sequence_length=None
[38]:
results_gen = future_gen.wait(verbose=True)
print(len(results_gen))
Waiting: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [09:21<00:00,  5.61s/it, status=SUCCESS]
1000

Let’s look at the top scoring sequences that were generated:

[39]:
# each generated record is (name, sequence, per-prompt scores); average scores across the ensemble
sample_names = np.array([s[0] for s in results_gen])
sample_sequences = np.array([s[1] for s in results_gen])
sample_scores = np.array([s[2] for s in results_gen]).mean(axis=1)
order = np.argsort(-sample_scores)
for i in order[:10]:
    print(sample_names[i], sample_scores[i], sample_sequences[i], sep='\t')
generated-sequence-782  -14.697639333333333     EVYEELFAMRRSIDNFDAALVHILAERFQATKRVGILKAKHNLPAGDPGREEAQIARLRAMAKSSLDPDFAEKFLNFIISEVIRHHVRIANEQ
generated-sequence-673  -22.986757333333333     DQNNLLKIREKVSELDLKLLILLAERRQLASDIAQTKLQYNRPIRDKDRERELLDVLTKKGKSYGLDGFYISRLFQMIIEDSVLTQQALLQQH
generated-sequence-499  -27.346926      MGLEAIRQDINAVDRQLIALLEKRMALVDQVAAYKQVHDVPVLDQQREAEVLRSVSQQIRQAFEPAILALFKAIMAASRRYQAEQIAVG
generated-sequence-726  -28.807790999999998     DVRAELARLRASIDNIDAAVVHMLAERFKCTQQVGHLKAQHHLPPADPAREARQIARLRQLAEAKLDPAFAEKLLNFIIAEVIRHHEQIADQP
generated-sequence-516  -29.91244766666667      MSELEAVRKEIEEIDREILSLIDKRVNLAERVLESKRINGTSINDRKQNEVVINRALHAATLNLDIGSIKDIFEILIRMSIERQNELSGKG
generated-sequence-237  -34.573171      PEKLSIDELRHEIDRLDAEILAAVKRRTEVSQAIGKVRMASGGTRLVHSREMKVIERYSELGDGKDLAILLLRLGRGRLGH
generated-sequence-987  -36.74955566666667      DIRAELEDIRASIDNIDAALVHLLAERFKCTQRVGRLKATHDLPPSDPGREERQIARLRALAGADLDPAFAEKFLTFIIDEVIRHHEAIRS
generated-sequence-77   -37.99225366666666      DTLENLRAQINVVDGQLIDLLGKRMKVADQIGELKKEQNVAVLQSRRWNEILGNMILEGQKGLSEEFVLRMFKAIHQESINHQESIIKG
generated-sequence-702  -38.767573999999996     REEQQIARLRQLAQAHLDPDFAEKFLNFIIDEVIRHHERLRD
generated-sequence-908  -41.33622733333333      SELEAVRKEIEQIDREILALIDRRVNLAERVLESKRINGTSINDQKQNEVVINRALHTATLNLDLGSIKTIFETLIRMSIERQNELSGKG

We can also compare (below) the scores of the generated sequences with the scores of the variants from Russ et al. 2020. Note that the generated sequences (orange) are generally right-shifted (i.e. higher scoring). Given the strong correlation (>0.5) we saw above between enrichment and PoET score, there is reason to believe these de novo generated sequences may perform better than the EDCA-ranked sequences.

[40]:
plt.title("PoET Generated vs EDCA Variant sequences")
mi = np.min(score_variant_const)
ma = max(np.max(score_variant_const), np.max(sample_scores))
bins = np.linspace(mi, ma, 31)

plt.xlabel("PoET Score")
plt.ylabel("Frequency")
_ = plt.hist(score_variant_const, bins=bins, alpha=0.5, label='Variants')
_ = plt.hist(sample_scores, bins=bins, alpha=0.5, label="Generated")
plt.legend();
../_images/walkthroughs_predicting-fitness_71_0.png

Generated sequences#

Let’s also compare the top-ranked generated sequences to the original EcCM sequence:

[41]:
from Bio import pairwise2
import Bio.Align.substitution_matrices
blosum62 = Bio.Align.substitution_matrices.load('BLOSUM62')
[42]:
for i in order[:5]:
    print("---"*10)
    x = sample_sequences[i]
    print(f'>{sample_names[i]} {sample_scores[i]}')
    # print only the first (highest-scoring) alignment for each generated sequence
    for a in pairwise2.align.globalds(seed, x, blosum62, -11, -1):
        print(pairwise2.format_alignment(*a))
        break
------------------------------
>generated-sequence-782 -14.697639333333333
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.....|.|.|..|...|..|...||||......||..|.....|..|..||.....||....|. .||..........||..............
EVYEELFAMRRSIDNFDAALVHILAERFQATKRVGILKAKHNLPAGDPGREEAQIARLRAMAKS-SLDPDFAEKFLNFIISEVIRHHVRIANEQ
  Score=74

------------------------------
>generated-sequence-673 -22.986757333333333
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
 ..|.||..|||.|.||.|||.||||||.||......||...||.||.||||.||..|...||...||..||.||||.||||||||||||||||
-DQNNLLKIREKVSELDLKLLILLAERRQLASDIAQTKLQYNRPIRDKDRERELLDVLTKKGKSYGLDGFYISRLFQMIIEDSVLTQQALLQQH
  Score=275

------------------------------
>generated-sequence-499 -27.346926
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
..   |.|.|..|.|.|..|.|||..|..|...|...|.....||.|..||...|.............|  |..||..|...|...|.......
MG---LEAIRQDINAVDRQLIALLEKRMALVDQVAAYKQVHDVPVLDQQREAEVLRSVSQQIRQAFEPA--ILALFKAIMAASRRYQAEQIAVG
  Score=54

------------------------------
>generated-sequence-726 -28.807790999999998
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.....|..||..|...|......||||......||..|...|.|..|..||.....||..|..| .||......|...||............|.
DVRAELARLRASIDNIDAAVVHMLAERFKCTQQVGHLKAQHHLPPADPAREARQIARLRQLAEA-KLDPAFAEKLLNFIIAEVIRHHEQIADQP
  Score=82

------------------------------
>generated-sequence-516 -29.91244766666667
TSENPLLALREKISALDEKLLALLAERRELAVEVGKAKLLSHRPVRDIDRERDLLERLITLGKAHHLDAHYITRLFQLIIEDSVLTQQALLQQH
.||  |.|.|..|...|...|.|...|..||..|...|........|.........|....... .||...|...|...|..|...|..|....
MSE--LEAVRKEIEEIDREILSLIDKRVNLAERVLESKRINGTSINDRKQNEVVINRALHAATL-NLDIGSIKDIFEILIRMSIERQNELSGKG
  Score=69

We can see there is significant sequence diversity in the generated sequences vs EcCM. Given the histogram above, we might expect some of these newly generated sequences to outperform the original variants from Russ et al. 2020.
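
To put rough numbers on that diversity, we can compute a simple percent identity between the seed and each top generated sequence. This is a quick sketch reusing pairwise2; the identity definition used here (identical aligned positions divided by the longer sequence length) is one of several reasonable choices.

def pct_identity(a, b):
    # identical aligned positions divided by the longer sequence length
    aln = pairwise2.align.globalxx(a, b)[0]
    matches = sum(x == y and x != '-' for x, y in zip(aln.seqA, aln.seqB))
    return matches / max(len(a), len(b))

for i in order[:5]:
    print(sample_names[i], f"{pct_identity(seed, sample_sequences[i]):.0%}")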

Sequence UMAP#

We can gain more insight into the sequence space occupied by our generated sequences vs the variant and natural sequences in our original dataset by running a UMAP dimension reduction and visualization pipeline.

To visualize the generated sequences in the same space as the natural and variant sequences from the paper, we need to use an external multiple sequence alignment tool: we’ll output a FASTA file, run the alignment elsewhere, and save the resulting alignment to data/chorismate_mutase_combined_seqs.csv for use in this notebook.

[43]:
seqs = np.stack(
    [np.frombuffer(s.encode().upper(), dtype=np.uint8) for s in table_natural['Sequence']] + [np.frombuffer(s.encode().upper(), dtype=np.uint8) for s in table_variant['Sequence']]
)
seqs.shape
[43]:
(2748, 96)
[44]:
with open('data/chorismate_mutase_combined_seqs.fasta', 'wb') as f:
    for i in range(len(table_natural)):
        name = f'>Natural_{i+1}'.encode()
        f.write(name)
        f.write(b'\n')
        f.write(table_natural.iloc[i]['Sequence'].encode().replace(b'-', b''))
        f.write(b'\n')
    for i in range(len(table_variant)):
        name = f'>Variant_{i+1}'.encode()
        f.write(name)
        f.write(b'\n')
        f.write(table_variant.iloc[i]['Sequence'].encode().replace(b'-', b''))
        f.write(b'\n')
    for i in range(len(sample_sequences)):
        name = f'>poet_generated_{i+1}'.encode()
        f.write(name)
        f.write(b'\n')
        f.write(str(sample_sequences[i]).encode())
        f.write(b'\n')
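
The alignment itself happens outside this notebook (for example with Clustal Omega or MAFFT). As an illustration, assuming the external tool wrote an aligned FASTA to a hypothetical data/chorismate_mutase_combined_seqs.aln.fasta, converting it into the two-column name,sequence CSV read by the next cell could look like this:

import csv
from Bio import SeqIO

# Convert the externally aligned FASTA (hypothetical filename) into the
# name,sequence CSV consumed by csv_to_alignment below.
with open('data/chorismate_mutase_combined_seqs.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for record in SeqIO.parse('data/chorismate_mutase_combined_seqs.aln.fasta', 'fasta'):
        writer.writerow([record.id, str(record.seq)])
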
[45]:
import csv
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio.Align import MultipleSeqAlignment

def csv_to_alignment(csv_file):
    # Open the CSV and read the sequences
    alignment = MultipleSeqAlignment([])
    with open(csv_file, mode='r') as file:
        reader = csv.reader(file)
        for row in reader:
            # Create a SeqRecord for each row
            record = SeqRecord(Seq(row[1]), id=row[0])
            alignment.append(record)
    return alignment

alignment = csv_to_alignment("data/chorismate_mutase_combined_seqs.csv")
[46]:
aln = np.stack([np.frombuffer(str(s.seq).encode().upper(), dtype=np.uint8) for s in alignment])

groups = []
measure = []
for i in range(len(alignment)):
    tokens = alignment[i].id.split('_')
    g = '_'.join(tokens[:-1])
    j = int(tokens[-1]) - 1
    v = np.nan
    if g == 'Variant':
        v = table_variant.iloc[j]['norm r.e.']
    elif g == 'Natural':
        v = table_natural.iloc[j]['norm r.e.']

    groups.append(g)
    measure.append(v)
measure = np.array(measure)
groups[:10], measure[:10]
[46]:
(['Natural',
  'Natural',
  'Natural',
  'Natural',
  'Natural',
  'Natural',
  'Natural',
  'Natural',
  'Natural',
  'Natural'],
 array([ 1.  , -0.06,  0.04,  0.65, -0.08,  0.7 ,  0.95, -0.06,  0.84,
        -0.05]))
[47]:
umap = UMAP(metric='hamming')
z = umap.fit_transform(aln)
/Users/jmage/.micromamba/envs/openprotein-sdk/lib/python3.10/site-packages/umap/umap_.py:1880: UserWarning: gradient function is not yet implemented for hamming distance metric; inverse_transform will be unavailable
  warn(

We can see below a UMAP representation of the natural CM sequences (blue) and the subset of sequences that are functional (defined as an enrichment >0.45). While there is good overall overlap, we can see that some non-functional natural sequences clearly cluster together:

[48]:
fig = plt.figure(figsize=(8,8))
s = 20

loc = ~np.isnan(measure)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], data=None, alpha=0.5, s=s, label='Natural sequence space')

loc = ~np.isnan(measure) & (measure > 0.45)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], data=None, alpha=0.5, s=s, label='Functional subspace')

plt.legend(loc='best')
plt.title("Natural and Functional sequences")
plt.xticks([])
plt.yticks([]);
plt.xlabel("UMAP1")
plt.ylabel("UMAP2");
../_images/walkthroughs_predicting-fitness_83_0.png

We can now add in our generated PoET sequences (green):

[49]:
fig = plt.figure(figsize=(8,8))
s = 20

loc = ~np.isnan(measure)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], s=s, label='Natural sequence space')

loc = ~np.isnan(measure) & (measure > 0.45)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], s=s, label='Functional subspace')

loc = np.isnan(measure)
sns.scatterplot(x=z[loc, 0], y=z[loc, 1], s=s, label='PoET Generated')

plt.legend(loc='best')
plt.xticks([])
plt.yticks([])
plt.xlabel("UMAP1")
plt.ylabel("UMAP2");
plt.title("PoET, natural and functional sequences");
../_images/walkthroughs_predicting-fitness_85_0.png

We can see above that the PoET-generated sequences cover the functional subspace well (shown by the overlap of the green and orange points) and rarely fall among the natural but non-functional sequences. This suggests PoET is correctly inferring functional properties from our CM prompt, and again that these sequences might be fruitful candidates for further wetlab analysis!

Lastly, we can make a gif of these results to overlay them one-by-one:

[50]:
from matplotlib.animation import FuncAnimation

fig = plt.figure(figsize=(8,8))
s = 2.0

loc = ~np.isnan(measure)
plt.scatter(z[loc, 0], z[loc, 1], s=s, label='Natural sequence space')
plt.legend(loc='best')
plt.xticks([])
plt.yticks([])
plt.tight_layout()

def animate(step):
  if step == 1:
    loc = ~np.isnan(measure) & (measure > 0.45)
    plt.scatter(z[loc, 0], z[loc, 1], s=s, label='Functional subspace')
  if step == 2:
    loc = np.isnan(measure)
    plt.scatter(z[loc, 0], z[loc, 1], s=s, label='PoET Generated')
  plt.legend(loc='best')


animation = FuncAnimation(
  fig,
  animate,
  frames=[0, 1, 2],
  interval=5.0,
)

animation.save(
  'chorismate_umap.gif',
  dpi=300,
  fps=1.0,
)
plt.title("UMAP sequence space");
plt.xlabel("UMAP1")
plt.ylabel("UMAP2");
../_images/walkthroughs_predicting-fitness_88_0.png

Summary#

In this tutorial we have taken a dataset of chorismate mutase proteins with associated wetlab measurements (relative enrichment). We have shown that PoET can accurately score both CM variants and natural sequences, and that our off-the-shelf solution performs better than the EDCA-based method of Russ et al. 2020. Specifically, we achieve Spearman correlations between PoET scores and enrichment of 0.35 and 0.5, versus 0.04 and 0.41 (in magnitude) for EDCA, for natural and variant sequences respectively.

Furthermore, we generated novel CM sequences that may display superior enrichment compared with the sequences proposed by Russ et al. 2020.

Overall, we have used PoET to generate new sequences without costly wetlab experimentation and without prior measurements. We could take these sequences for downstream wetlab assays to kickstart a productive protein engineering cycle!