SVD and embeddings

This tutorial shows you how to fit a truncated SVD on protein sequence embeddings and use it to produce reduced-size embeddings.

What you need before getting started

You need a protein sequence of interest and a connected session object, referred to as session in the examples below.

Using SVD to get high-fidelity custom-sized embeddings

Truncated SVD can be used to compute reduced-size protein embeddings that retain as much information as possible. It does this by finding the vectors that best explain the local sequence space covered by the sequences you fit it on.
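
To make the idea concrete, here is a minimal NumPy sketch of truncated SVD on synthetic data. It is not the platform's implementation: the array sizes, the random data, and the helper names fit_truncated_svd and svd_embed are assumptions chosen for illustration.

```python
import numpy as np

def fit_truncated_svd(X, n_components):
    """Return the top right-singular vectors of X, shape (k, n_features)."""
    # Thin SVD: X = U @ diag(S) @ Vt, with singular values in descending order.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    # No more than min(n_samples, n_features) singular vectors exist.
    k = min(n_components, Vt.shape[0])
    return Vt[:k]

def svd_embed(X, components):
    """Project each row of X onto the retained singular vectors."""
    return X @ components.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1024))  # synthetic stand-in: 50 embeddings of width 1024
components = fit_truncated_svd(X, n_components=8)
Z = svd_embed(X, components)
print(Z.shape)  # (50, 8)
```

In this sketch the number of available components is bounded by min(n_samples, n_features), so fitting on only two rows can yield at most two components regardless of the value requested.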

This is useful when you need embeddings for large numbers of sequences, for example in mutagenesis analysis. You can fit the SVD on any set of sequences with the model.fit_svd() method, which returns a new SVD model object that can be used to embed new sequences.

The default, reduction = none, can only be used when all sequences have the same length. Use reduction = mean to return the mean embedding over the sequence length.
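
The difference between the two settings can be illustrated with a small NumPy sketch. The shapes below are made up, and flattening the per-residue embeddings into one vector per sequence is an assumption about how the unreduced case is arranged, not a description of the service internals.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic per-residue embeddings: 3 sequences, length 238, width 16 (all assumed).
per_residue = rng.normal(size=(3, 238, 16))

# No reduction: every residue position is kept, so each sequence becomes one
# long vector. The vectors only stack into a matrix if all lengths are equal.
flat = per_residue.reshape(len(per_residue), -1)

# Mean reduction: average over the sequence-length axis, leaving one vector
# per sequence whose size does not depend on the sequence length.
mean_reduced = per_residue.mean(axis=1)

print(flat.shape, mean_reduced.shape)  # (3, 3808) (3, 16)
```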

Existing SVD models can be listed and retrieved with session.embedding.list_svd(), and deleted with svd.delete().
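
As a rough sketch of how these two calls might be combined, the helpers below filter the listed SVDs by base model and delete a chosen set. The helper names svds_for_model and delete_svds are hypothetical, and reading metadata.model_id as an attribute is an assumption based on the metadata fields printed later in this tutorial.

```python
def svds_for_model(session, model_id):
    """Return the listed SVDs whose metadata reports the given base model ID."""
    # session.embedding.list_svd() is the listing call named in this tutorial.
    # Accessing metadata.model_id as an attribute is an assumption.
    return [
        svd for svd in session.embedding.list_svd()
        if svd.metadata.model_id == model_id
    ]

def delete_svds(svds):
    """Call delete() on each SVD and return how many were deleted."""
    for svd in svds:
        svd.delete()
    return len(svds)
```

With a connected session, a call such as delete_svds(svds_for_model(session, "prot-seq")) would remove every SVD fit on that model, so review the filtered list before deleting.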

Fitting an SVD

The example uses green fluorescent protein (GFP) and a variant that differs from it by a single substitution (F to Y at position 100):

[ ]:
variants = [
    "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPD" +
    "HMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYI" +
    "MADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAG" +
    "ITHGMDELYK",
    "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPD" +
    "HMKQHDFFKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYI" +
    "MADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAG" +
    "ITHGMDELYK",
]

Retrieve the embedding model by its ID through the session's embedding attribute, then call fit_svd() on it to fit the SVD:

[ ]:
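# Retrieve the base embedding model by its ID.
model = session.embedding.get_model("prot-seq")
# Start fitting an SVD on the variants; this returns an SVD model object.
svd = model.fit_svd(variants, n_components=256)
print(svd.metadata.json(indent=4))
# Block until the fit completes.
svd.wait_until_done(verbose=True)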
{
    "id": "c320a9f2-661a-4a1c-a518-f235042168e9",
    "status": "PENDING",
    "created_date": "2024-06-13T03:13:23.794642+00:00",
    "model_id": "prot-seq",
    "n_components": 2,
    "reduction": null,
    "sequence_length": 238
}
Waiting: 100%|██████████| 100/100 [06:58<00:00,  4.18s/it, status=SUCCESS]
True

Embed the variants. Wait for the SVD model to finish fitting before calling embed(); otherwise you will get an error.

[ ]:
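# Submit the embedding job for the variants.
svd_embed_future = svd.embed(variants)
print(svd_embed_future.job)
# Block until the job completes.
svd_embed_future.wait_until_done(verbose=True)

# Fetch the results: one (sequence, embedding) pair per input sequence.
svd_embed_results = svd_embed_future.get()
len(svd_embed_results)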
status=<JobStatus.PENDING: 'PENDING'> job_id='3f76bc7f-077d-4ab5-8499-57927bd5e85d' job_type=<JobType.svd_embed: '/svd/embed'> created_date=datetime.datetime(2024, 6, 13, 3, 20, 23, 3133, tzinfo=datetime.timezone.utc) start_date=None end_date=None prerequisite_job_id=None progress_message=None progress_counter=0 num_records=2 sequence_length=None
Waiting: 100%|██████████| 100/100 [01:59<00:00,  1.20s/it, status=SUCCESS]
2
[ ]:
svd_embed_results
[('MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK',
  array([-1527.4696  ,   -38.118176], dtype=float32)),
 ('MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFYKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK',
  array([-1533.2668 ,    37.97419], dtype=float32))]
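
Results in this form, a list of (sequence, embedding) pairs, are easy to stack into a matrix for downstream analysis. The following is a small sketch rather than part of the client API: the sequences are replaced by the placeholder labels variant_1 and variant_2, and the embedding values are copied from the output above.

```python
import numpy as np

# Two (sequence, embedding) pairs shaped like the results above. The labels are
# placeholders for the full sequences; the values are copied from the output.
results = [
    ("variant_1", np.array([-1527.4696, -38.118176], dtype=np.float32)),
    ("variant_2", np.array([-1533.2668, 37.97419], dtype=np.float32)),
]

sequences = [seq for seq, _ in results]
X = np.stack([emb for _, emb in results])  # shape: (n_sequences, n_components)

# Euclidean distance between the two variants in the reduced space.
distance = float(np.linalg.norm(X[0] - X[1]))
print(X.shape, round(distance, 2))
```

A matrix like X can then be passed to clustering, visualization, or regression code that expects one row per sequence.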

Next steps

For more information, visit the Embeddings API reference.