openprotein.embeddings#

Create embeddings for your protein sequences using open-source and proprietary models!

Note that for PoET models, you will also need to use our align workflow.

Interface#

class openprotein.embeddings.EmbeddingsAPI(session)[source]#

Embeddings API providing the interface for creating embeddings using protein language models.

You can access all our models either via get_model() or directly through the session’s embedding attribute using the model’s ID and the desired method. For example, to use the attention method on the protein sequence model, you would use session.embedding.prot_seq.attn().

Examples

Accessing a model’s method:

# To call the attention method on the protein sequence model:
import openprotein
session = openprotein.connect(username="user", password="password")
session.embedding.prot_seq.attn()

Using the get_model method:

# Get a model instance by name:
import openprotein
session = openprotein.connect(username="user", password="password")
# list available models:
print(session.embedding.list_models())
# init model by name
model = session.embedding.get_model('prot-seq')
Parameters:

session (APISession)
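
The two access paths described above (attribute access and get_model()) can be wrapped in one helper. This is a minimal sketch, not part of the openprotein package: resolve_model is a hypothetical name, and the hyphen-to-underscore mapping is only inferred from the prot_seq attribute and the 'prot-seq' name used in the examples. FakeEmbeddingsAPI is a stand-in so the snippet runs offline; with a real session you would pass session.embedding instead.

```python
class FakeEmbeddingsAPI:
    """Offline stand-in for session.embedding (hypothetical, illustration only)."""

    def __init__(self):
        # Pretend this model is exposed as an attribute, like session.embedding.prot_seq.
        self.prot_seq = "model:prot-seq"

    def get_model(self, name):
        # Pretend this performs the lookup that EmbeddingsAPI.get_model() does.
        return f"model:{name}"


def resolve_model(embedding_api, model_id):
    """Return a model via attribute access if available, else via get_model()."""
    # Assumption: attribute names are the model ID with '-' replaced by '_'.
    attr_name = model_id.replace("-", "_")
    model = getattr(embedding_api, attr_name, None)
    if model is not None:
        return model
    return embedding_api.get_model(model_id)


api = FakeEmbeddingsAPI()
by_attribute = resolve_model(api, "prot-seq")
by_lookup = resolve_model(api, "some-other-model")
print(by_attribute, by_lookup)
```

Falling back to get_model() keeps the helper usable for model IDs that have no dedicated attribute on the API object.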

poet2: PoET2Model#

PoET-2 model

poet: PoETModel#

PoET model

prot_seq: OpenProteinModel#

Prot-seq model

rotaprot_large_uniref50w: OpenProteinModel#

Rotaprot model trained on UniRef50

rotaprot_large_uniref90_ft: OpenProteinModel#

Rotaprot model trained on UniRef90

esm1b: ESMModel#

ESM1b model

esm1v: ESMModel#

ESM1v model

esm2: ESMModel#

ESM2 model

list_models()[source]#

List the models available for creating embeddings of your sequences.

Return type:

list[EmbeddingModel]

get_model(name)[source]#

Get model by model_id.

ProtembedModel allows all the usual job manipulation: e.g. making POST and GET requests for this model specifically.

Parameters:
  • name (str) – the model identifier (model_id)

Returns:

The model

Return type:

ProtembedModel

Raises:

HTTPError – If the GET request does not succeed.

Models#

class openprotein.embeddings.PoET2Model(session, model_id, metadata=None)[source]#

Class for OpenProtein’s foundation model PoET 2.

PoET functions are dependent on a prompt supplied via the prompt endpoints.

Examples

View specific model details (including supported tokens) with the ? operator.

>>> import openprotein
>>> session = openprotein.connect(username="user", password="password")
>>> session.embedding.poet2?
Parameters:
  • session (APISession)

  • model_id (list[str] | str)

  • metadata (ModelMetadata | None)

embed(sequences, reduction=ReductionType.MEAN, prompt=None, query=None, use_query_structure_in_decoder=True, decoder_type=None)[source]#

Embed sequences using this model.

Parameters:
  • sequences (list of bytes) – Sequences to embed.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean). Default is ReductionType.MEAN.

  • prompt (str or Prompt or None, optional) – Prompt or prompt_id (for example, a prompt created from an align workflow) used to condition the PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • decoder_type ({'mlm', 'clm'} or None, optional) – Decoder type. Default is None.

Returns:

A future object that returns the embeddings of the submitted sequences.

Return type:

EmbeddingsResultFuture
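
To make the reduction parameter concrete: a mean reduction collapses per-residue vectors into one fixed-length vector per sequence. The sketch below assumes that ReductionType.MEAN averages over the sequence-length axis; the function name mean_reduce and the toy numbers are illustrative and not part of the openprotein API.

```python
def mean_reduce(per_residue):
    """Average a (length x dim) list of vectors over the length axis."""
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    length = len(per_residue)
    dim = len(per_residue[0])
    return [sum(vec[d] for vec in per_residue) / length for d in range(dim)]


# Three residues, each with a 2-dimensional embedding (made-up values).
toy = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
reduced = mean_reduce(toy)
print(reduced)  # one 2-dimensional vector for the whole sequence
```

With reduction=None, the per-residue vectors would presumably be returned unreduced, so downstream code needs to handle variable-length outputs.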

logits(sequences, prompt=None, query=None, use_query_structure_in_decoder=True, decoder_type=None)[source]#

Compute logit embeddings for sequences using this model.

Parameters:
  • sequences (list of bytes) – Sequences to analyze.

  • prompt (str or Prompt or None, optional) – Prompt or prompt_id (for example, a prompt created from an align workflow) used to condition the PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • decoder_type ({'mlm', 'clm'} or None, optional) – Decoder type. Default is None.

Returns:

A future object that returns the logits of the submitted sequences.

Return type:

EmbeddingsResultFuture

score(sequences, prompt=None, query=None, use_query_structure_in_decoder=True, decoder_type=None)[source]#

Score query sequences using the specified prompt.

Parameters:
  • sequences (list of bytes) – Sequences to score.

  • prompt (str or Prompt or None, optional) – Prompt or prompt_id (for example, a prompt created from an align workflow) used to condition the PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • decoder_type ({'mlm', 'clm'} or None, optional) – Decoder type. Default is None.

Returns:

A future object that returns the scores of the submitted sequences.

Return type:

EmbeddingsScoreFuture

indel(sequence, prompt=None, query=None, use_query_structure_in_decoder=True, decoder_type=None, insert=None, delete=None, **kwargs)[source]#

Score all indels of the query sequence using the specified prompt.

Parameters:
  • sequence (bytes) – Sequence to analyze.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • decoder_type ({'mlm', 'clm'} or None, optional) – Decoder type. Default is None.

  • insert (str or None, optional) – Insertion fragment at each site.

  • delete (list of int or None, optional) – Range of fragment sizes to delete at each site.

  • **kwargs – Additional keyword arguments.

Returns:

A future object that returns the scores of the indel-ed sequence.

Return type:

EmbeddingsScoreFuture

Raises:

ValueError – If neither insert nor delete is provided.
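
As a rough picture of what "all indels" could mean, the sketch below enumerates variants locally. It is an assumption-laden illustration, not the service's algorithm: enumerate_indels is a hypothetical helper, delete is treated as an explicit list of fragment sizes, and plain str is used instead of bytes for readability.

```python
def enumerate_indels(sequence, insert=None, delete=None):
    """List indel variants of `sequence` (a str).

    insert: fragment inserted at every position, including both ends.
    delete: iterable of fragment sizes removed at every valid start site.
    """
    if insert is None and delete is None:
        # Mirrors the documented ValueError when neither option is given.
        raise ValueError("provide insert, delete, or both")
    variants = []
    if insert is not None:
        for i in range(len(sequence) + 1):
            variants.append(sequence[:i] + insert + sequence[i:])
    if delete is not None:
        for size in delete:
            for i in range(len(sequence) - size + 1):
                variants.append(sequence[:i] + sequence[i + size:])
    return variants


insertions = enumerate_indels("MKV", insert="A")
deletions = enumerate_indels("MKV", delete=[1, 2])
print(insertions)
print(deletions)
```

The number of variants grows with both sequence length and the number of deletion sizes, which is worth keeping in mind before submitting long sequences.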

single_site(sequence, prompt=None, query=None, use_query_structure_in_decoder=True, decoder_type=None)[source]#

Score all single substitutions of the query sequence using the specified prompt.

Parameters:
  • sequence (bytes) – Sequence to analyze.

  • prompt (str or Prompt or None, optional) – Prompt or prompt_id (for example, a prompt created from an align workflow) used to condition the PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • decoder_type ({'mlm', 'clm'} or None, optional) – Decoder type. Default is None.

Returns:

A future object that returns the scores of the mutated sequence.

Return type:

EmbeddingsScoreFuture
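
For intuition about the size of a single-site scan, the following sketch enumerates every single substitution of a short sequence. It assumes the 20 standard amino acids; the tokens a given model actually supports may differ, and single_site_variants is a hypothetical helper rather than a library function.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def single_site_variants(sequence):
    """Yield (position, wild_type, mutant, variant) for each single substitution."""
    for pos, wild_type in enumerate(sequence):
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                yield pos, wild_type, mutant, sequence[:pos] + mutant + sequence[pos + 1:]


variants = list(single_site_variants("MKV"))
print(len(variants))  # 19 substitutions at each of the 3 positions
print(variants[0])
```

A sequence of length L therefore yields 19 × L variants under this assumption.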

generate(prompt, query=None, use_query_structure_in_decoder=True, num_samples=100, temperature=1.0, topk=None, topp=None, max_length=1000, seed=None, ensemble_weights=None, ensemble_method=None)[source]#

Generate protein sequences conditioned on a prompt.

Parameters:
  • prompt (str or Prompt) – Prompt from an align workflow to condition PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • num_samples (int, optional) – The number of samples to generate. Default is 100.

  • temperature (float, optional) – The temperature for sampling. Higher values produce more random outputs. Default is 1.0.

  • topk (float or None, optional) – The number of top-k residues to consider during sampling. Default is None.

  • topp (float or None, optional) – The cumulative probability threshold for top-p sampling. Default is None.

  • max_length (int, optional) – The maximum length of generated proteins. Default is 1000.

  • seed (int or None, optional) – Seed for random number generation. Default is None.

  • ensemble_weights (Sequence of float or None, optional) – Weights for combining likelihoods from multiple prompts in the ensemble. The length of this sequence must match the number of prompts. All weights must be finite. If ensemble_method is “arithmetic”, then weights must also be non-negative, and have a non-zero sum.

  • ensemble_method ({'arithmetic', 'geometric'} or None, optional) – Method used to combine likelihoods from multiple prompts in the ensemble. If “arithmetic”, the weighted mean is used; if “geometric”, the weighted geometric mean is used. If None (default), the method defaults to “arithmetic”, but this behavior may change in the future.

Returns:

A future object representing the status and information about the generation job.

Return type:

EmbeddingsGenerateFuture
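
The difference between the two ensemble_method options can be illustrated with a small calculation. This is a sketch under stated assumptions, not the service's implementation: combine_likelihoods is a hypothetical helper, both means are normalized by the sum of the weights, and for simplicity the geometric branch also rejects a zero weight sum, which is stricter than the requirement documented above.

```python
import math


def combine_likelihoods(likelihoods, weights, method="arithmetic"):
    """Combine per-prompt likelihoods with a weighted arithmetic or geometric mean."""
    if len(likelihoods) != len(weights):
        raise ValueError("one weight per prompt is required")
    if not all(math.isfinite(w) for w in weights):
        raise ValueError("all weights must be finite")
    total = sum(weights)
    if method == "arithmetic":
        if any(w < 0 for w in weights) or total == 0:
            raise ValueError("arithmetic weights must be non-negative with a non-zero sum")
        return sum(w * p for w, p in zip(weights, likelihoods)) / total
    if method == "geometric":
        if total == 0:
            # Assumption of this sketch only: normalizing requires a non-zero sum.
            raise ValueError("this sketch needs a non-zero weight sum")
        log_mean = sum(w * math.log(p) for w, p in zip(weights, likelihoods)) / total
        return math.exp(log_mean)
    raise ValueError(f"unknown ensemble method: {method}")


arith = combine_likelihoods([0.2, 0.8], [1.0, 1.0], "arithmetic")
geom = combine_likelihoods([0.2, 0.8], [1.0, 1.0], "geometric")
print(arith, geom)
```

The geometric mean is pulled toward the smaller likelihood, so a residue that any one prompt considers unlikely is penalized more strongly than under the arithmetic mean.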

fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, prompt=None, query=None, use_query_structure_in_decoder=True)[source]#

Fit an SVD on the embedding results of PoET.

This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the arguments.

Parameters:
  • sequences (list of bytes or list of str or None, optional) – Sequences to fit SVD. If None, assay must be provided.

  • assay (AssayDataset or None, optional) – Assay containing sequences to fit SVD. Ignored if sequences are provided.

  • n_components (int, optional) – Number of components in SVD. Determines output shapes. Default is 1024.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean).

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

Returns:

A future that represents the fitted SVD model.

Return type:

SVDModel

fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, prompt=None, query=None, use_query_structure_in_decoder=True)[source]#

Fit a UMAP on sequences or an assay using PoET embeddings and the given hyperparameters.

This function will create a UMAP based on the embeddings from this PoET model as well as the hyperparameters specified in the arguments.

Parameters:
  • sequences (list of bytes or list of str or None, optional) – Sequences to fit UMAP. If None, assay must be provided.

  • assay (AssayDataset or None, optional) – Assay containing sequences to fit UMAP. Ignored if sequences are provided.

  • n_components (int, optional) – Number of components in UMAP fit. Determines output shapes. Default is 2.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean). Default is ReductionType.MEAN.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

Returns:

A future that represents the fitted UMAP model.

Return type:

UMAPModel

fit_gp(assay, properties, prompt=None, query=None, use_query_structure_in_decoder=True, **kwargs)[source]#

Fit a Gaussian Process (GP) on an assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata or AssayDataset or str) – Assay to fit GP on.

  • properties (list of str) – Properties in the assay to fit the GP on.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition PoET model.

  • query (str or bytes or Protein or Query or None, optional) – Query to use with prompt.

  • use_query_structure_in_decoder (bool, optional) – Whether to use query structure in decoder. Default is True.

  • **kwargs – Additional keyword arguments.

Returns:

A future that represents the trained predictor model.

Return type:

PredictorModel

classmethod create(session, model_id, default=None, **kwargs)#

Create and return an instance of the appropriate EmbeddingModel subclass based on the model_id.

Parameters:
  • session (APISession) – The API session to use.

  • model_id (str) – The model identifier.

  • default (type variable of EmbeddingModel or None, optional) – Default EmbeddingModel subclass to use if no match is found.

  • kwargs – Additional keyword arguments to pass to the model constructor.

Returns:

An instance of the appropriate EmbeddingModel subclass.

Return type:

EmbeddingModel

Raises:

ValueError – If no suitable EmbeddingModel subclass is found and no default is provided.

get_metadata()#

Get model metadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

classmethod get_model()#

Get the model_id(s) for this EmbeddingModel subclass.

Returns:

List of model_id strings associated with this class.

Return type:

list of str

property metadata#

ModelMetadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

class openprotein.embeddings.PoETModel(session, model_id, metadata=None)[source]#

Class for OpenProtein’s foundation model PoET.

Note

PoET functions are dependent on a prompt supplied via the prompt endpoints.

Examples

View specific model details (including supported tokens) with the ? operator.

>>> import openprotein
>>> session = openprotein.connect(username="user", password="password")
>>> session.embedding.poet?
Parameters:
  • session (APISession)

  • model_id (list[str] | str)

  • metadata (ModelMetadata | None)

embed(sequences, prompt=None, reduction=ReductionType.MEAN, **kwargs)[source]#

Embed sequences using the PoET model.

Parameters:
  • sequences (list of bytes) – Sequences to embed.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g., mean). Default is ReductionType.MEAN.

  • **kwargs – Additional keyword arguments.

Returns:

Future object that returns the embeddings of the submitted sequences.

Return type:

EmbeddingsResultFuture

logits(sequences, prompt=None, **kwargs)[source]#

Compute logits for sequences using the PoET model.

Parameters:
  • sequences (list of bytes) – Sequences to analyze.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • **kwargs – Additional keyword arguments.

Returns:

Future object that returns the logits of the submitted sequences.

Return type:

EmbeddingsResultFuture

score(sequences, prompt=None, **kwargs)[source]#

Score query sequences using the specified prompt.

Parameters:
  • sequences (list of bytes) – Sequences to score.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • **kwargs – Additional keyword arguments.

Returns:

Future object that returns the scores of the submitted sequences.

Return type:

EmbeddingsScoreFuture

indel(sequence, prompt=None, insert=None, delete=None, **kwargs)[source]#

Score all indels of the query sequence using the specified prompt.

Parameters:
  • sequence (bytes) – Sequence to analyze.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • insert (str or None, optional) – Insertion fragment at each site.

  • delete (list of int or None, optional) – Range of fragment sizes to delete at each site.

  • **kwargs – Additional keyword arguments.

Returns:

Future object that returns the scores of the indel-ed sequence.

Return type:

EmbeddingsScoreFuture

Raises:

ValueError – If neither insert nor delete is provided.

single_site(sequence, prompt=None, **kwargs)[source]#

Score all single substitutions of the query sequence using the specified prompt.

Parameters:
  • sequence (bytes) – Sequence to analyze.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • **kwargs – Additional keyword arguments.

Returns:

Future object that returns the scores of the mutated sequence.

Return type:

EmbeddingsScoreFuture

generate(prompt, num_samples=100, temperature=1.0, topk=None, topp=None, max_length=1000, seed=None, **kwargs)[source]#

Generate protein sequences conditioned on a prompt.

Parameters:
  • prompt (str or Prompt) – Prompt from an align workflow to condition the PoET model.

  • num_samples (int, optional) – Number of samples to generate. Default is 100.

  • temperature (float, optional) – Temperature for sampling. Higher values produce more random outputs. Default is 1.0.

  • topk (float or None, optional) – Number of top-k residues to consider during sampling. Default is None.

  • topp (float or None, optional) – Cumulative probability threshold for top-p sampling. Default is None.

  • max_length (int, optional) – Maximum length of generated proteins. Default is 1000.

  • seed (int or None, optional) – Seed for random number generation. Default is None.

  • **kwargs – Additional keyword arguments.

Returns:

Future object representing the status and information about the generation job.

Return type:

EmbeddingsGenerateFuture
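
The sampling parameters temperature, topk, and topp follow the usual language-model conventions. The sketch below shows one common way they shape a next-residue distribution; the order of operations, the function name filter_distribution, and the toy logits are assumptions for illustration, and the service may apply them differently.

```python
import math


def filter_distribution(logits, temperature=1.0, topk=None, topp=None):
    """Return renormalized probabilities after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    ranked = sorted(
        ((tok, e / norm) for tok, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if topk is not None:
        # Keep only the k most probable residues.
        ranked = ranked[: int(topk)]
    if topp is not None:
        # Keep the smallest prefix whose cumulative probability reaches topp.
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= topp:
                break
        ranked = kept
    total = sum(prob for _, prob in ranked)
    return {tok: prob / total for tok, prob in ranked}


toy_logits = {"A": 2.0, "G": 1.0, "L": 0.0, "W": -1.0}
top2 = filter_distribution(toy_logits, topk=2)
nucleus = filter_distribution(toy_logits, topp=0.9)
print(sorted(top2), sorted(nucleus))
```

Lower temperatures and smaller topk or topp values concentrate sampling on the most likely residues, which typically trades diversity for sequences closer to the prompt family.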

fit_svd(prompt=None, sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)[source]#

Fit an SVD on the embedding results of PoET.

This function creates an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the arguments.

Parameters:
  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • sequences (list of bytes or list of str or None, optional) – Sequences to use for SVD.

  • assay (AssayDataset or None, optional) – Assay dataset to use for SVD.

  • n_components (int, optional) – Number of components in SVD. Determines output shapes. Default is 1024.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g., mean).

  • **kwargs – Additional keyword arguments.

Returns:

Future that represents the fitted SVD model.

Return type:

SVDModel

fit_umap(prompt=None, sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)[source]#

Fit a UMAP on sequences or an assay using PoET embeddings and the given hyperparameters.

This function creates a UMAP based on the embeddings from this PoET model as well as the hyperparameters specified in the arguments.

Parameters:
  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • sequences (list of bytes or list of str or None, optional) – Optional sequences to fit UMAP with. Either use sequences or assay. Sequences is preferred.

  • assay (AssayDataset or None, optional) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.

  • n_components (int, optional) – Number of components in UMAP fit. Determines output shapes. Default is 2.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g., mean). Default is ReductionType.MEAN.

  • **kwargs – Additional keyword arguments.

Returns:

Future that represents the fitted UMAP model.

Return type:

UMAPModel

fit_gp(assay, properties, prompt=None, **kwargs)[source]#

Fit a Gaussian Process (GP) on an assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata or AssayDataset or str) – Assay to fit GP on.

  • properties (list of str) – Properties in the assay to fit the GP on.

  • prompt (str or Prompt or None, optional) – Prompt from an align workflow to condition the PoET model.

  • **kwargs – Additional keyword arguments.

Returns:

Future that represents the trained predictor model.

Return type:

PredictorModel

classmethod create(session, model_id, default=None, **kwargs)#

Create and return an instance of the appropriate EmbeddingModel subclass based on the model_id.

Parameters:
  • session (APISession) – The API session to use.

  • model_id (str) – The model identifier.

  • default (type variable of EmbeddingModel or None, optional) – Default EmbeddingModel subclass to use if no match is found.

  • kwargs – Additional keyword arguments to pass to the model constructor.

Returns:

An instance of the appropriate EmbeddingModel subclass.

Return type:

EmbeddingModel

Raises:

ValueError – If no suitable EmbeddingModel subclass is found and no default is provided.

get_metadata()#

Get model metadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

classmethod get_model()#

Get the model_id(s) for this EmbeddingModel subclass.

Returns:

List of model_id strings associated with this class.

Return type:

list of str

property metadata#

ModelMetadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

class openprotein.embeddings.OpenProteinModel(session, model_id, metadata=None)[source]#

Proprietary protein embedding models served by OpenProtein.

Examples

View specific model details (including supported tokens) with the ? operator.

>>> import openprotein
>>> session = openprotein.connect(username="user", password="password")
>>> session.embedding.prot_seq?
Parameters:
  • session (APISession)

  • model_id (list[str] | str)

  • metadata (ModelMetadata | None)

attn(sequences, **kwargs)#

Compute attention embeddings for sequences using this model.

Parameters:
  • sequences (list of bytes or list of str) – Sequences to compute attention embeddings for.

  • kwargs – Additional keyword arguments to be used from foundational models.

Returns:

Future object representing the attention result.

Return type:

EmbeddingsResultFuture

classmethod create(session, model_id, default=None, **kwargs)#

Create and return an instance of the appropriate EmbeddingModel subclass based on the model_id.

Parameters:
  • session (APISession) – The API session to use.

  • model_id (str) – The model identifier.

  • default (type variable of EmbeddingModel or None, optional) – Default EmbeddingModel subclass to use if no match is found.

  • kwargs – Additional keyword arguments to pass to the model constructor.

Returns:

An instance of the appropriate EmbeddingModel subclass.

Return type:

EmbeddingModel

Raises:

ValueError – If no suitable EmbeddingModel subclass is found and no default is provided.

embed(sequences, reduction=ReductionType.MEAN, **kwargs)#

Embed sequences using this model.

Parameters:
  • sequences (list of bytes or list of str) – Sequences to embed.

  • reduction (ReductionType or None, optional) – Reduction to use (e.g. mean). Defaults to mean embedding.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

Future object representing the embedding result.

Return type:

EmbeddingsResultFuture

fit_gp(assay, properties, reduction, name=None, description=None, **kwargs)#

Fit a Gaussian Process (GP) on an assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata, AssayDataset, or str) – Assay to fit GP on.

  • properties (list of str) – Properties in the assay to fit the GP on.

  • reduction (ReductionType) – Type of embedding reduction to use for computing features. A reduction is required for protein language models (PLMs).

  • name (str or None, optional) – Optional name for the predictor model.

  • description (str or None, optional) – Optional description for the predictor model.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

The fitted predictor model.

Return type:

PredictorModel

Raises:

InvalidParameterError – If no properties are provided, properties are not a subset of assay measurements, or multitask GP is requested.

fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)#

Fit an SVD on the embedding results of this model.

This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the arguments.

Parameters:
  • sequences (list of bytes or list of str or None, optional) – Sequences to fit SVD on.

  • assay (AssayDataset or None, optional) – Assay containing sequences to fit SVD on.

  • n_components (int, optional) – Number of components in SVD. Determines output shapes. Default is 1024.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean).

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

The fitted SVD model.

Return type:

SVDModel

Raises:

InvalidParameterError – If neither or both of assay and sequences are provided.
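
The "neither or both" rule above can be checked client-side before submitting a job. This is a minimal sketch: pick_fit_source is a hypothetical helper, and it raises the built-in ValueError as a stand-in for the library's InvalidParameterError so that the snippet runs without openprotein installed.

```python
def pick_fit_source(sequences=None, assay=None):
    """Return ('sequences', data) or ('assay', data); reject neither or both."""
    if (sequences is None) == (assay is None):
        # Stand-in for InvalidParameterError in the real library.
        raise ValueError("provide exactly one of sequences or assay")
    if sequences is not None:
        return "sequences", list(sequences)
    return "assay", assay


source = pick_fit_source(sequences=[b"MKV", b"MKL"])
try:
    pick_fit_source()
    rejected = False
except ValueError:
    rejected = True
print(source, rejected)
```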

fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)#

Fit a UMAP on the embedding results of this model.

This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the arguments.

Parameters:
  • sequences (list of bytes or list of str or None, optional) – Optional sequences to fit UMAP with. Either use sequences or assay. Sequences is preferred.

  • assay (AssayDataset or None, optional) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.

  • n_components (int, optional) – Number of components in UMAP fit. Determines output shapes. Default is 2.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

The fitted UMAP model.

Return type:

UMAPModel

Raises:

InvalidParameterError – If neither or both of assay and sequences are provided.

get_metadata()#

Get model metadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

classmethod get_model()#

Get the model_id(s) for this EmbeddingModel subclass.

Returns:

List of model_id strings associated with this class.

Return type:

list of str

logits(sequences, **kwargs)#

Compute logit embeddings for sequences using this model.

Parameters:
  • sequences (list of bytes or list of str) – Sequences to compute logits for.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

Future object representing the logits result.

Return type:

EmbeddingsResultFuture

property metadata#

ModelMetadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

class openprotein.embeddings.ESMModel(session, model_id, metadata=None)[source]#

Class providing inference endpoints for Facebook’s ESM protein language models.

Examples

View specific model details (including supported tokens) with the ? operator.

>>> import openprotein
>>> session = openprotein.connect(username="user", password="password")
>>> session.embedding.esm2_t12_35M_UR50D?
Parameters:
  • session (APISession)

  • model_id (list[str] | str)

  • metadata (ModelMetadata | None)

attn(sequences, **kwargs)#

Compute attention embeddings for sequences using this model.

Parameters:
  • sequences (list of bytes or list of str) – Sequences to compute attention embeddings for.

  • kwargs – Additional keyword arguments to be used from foundational models.

Returns:

Future object representing the attention result.

Return type:

EmbeddingsResultFuture

classmethod create(session, model_id, default=None, **kwargs)#

Create and return an instance of the appropriate EmbeddingModel subclass based on the model_id.

Parameters:
  • session (APISession) – The API session to use.

  • model_id (str) – The model identifier.

  • default (type variable of EmbeddingModel or None, optional) – Default EmbeddingModel subclass to use if no match is found.

  • kwargs – Additional keyword arguments to pass to the model constructor.

Returns:

An instance of the appropriate EmbeddingModel subclass.

Return type:

EmbeddingModel

Raises:

ValueError – If no suitable EmbeddingModel subclass is found and no default is provided.

embed(sequences, reduction=ReductionType.MEAN, **kwargs)#

Embed sequences using this model.

Parameters:
  • sequences (list of bytes or list of str) – Sequences to embed.

  • reduction (ReductionType or None, optional) – Reduction to use (e.g. mean). Defaults to mean embedding.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

Future object representing the embedding result.

Return type:

EmbeddingsResultFuture

fit_gp(assay, properties, reduction, name=None, description=None, **kwargs)#

Fit a Gaussian Process (GP) on an assay using this embedding model and hyperparameters.

Parameters:
  • assay (AssayMetadata, AssayDataset, or str) – Assay to fit GP on.

  • properties (list of str) – Properties in the assay to fit the GP on.

  • reduction (ReductionType) – Type of embedding reduction to use for computing features. A reduction is required for protein language models (PLMs).

  • name (str or None, optional) – Optional name for the predictor model.

  • description (str or None, optional) – Optional description for the predictor model.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

The fitted predictor model.

Return type:

PredictorModel

Raises:

InvalidParameterError – If no properties are provided, properties are not a subset of assay measurements, or multitask GP is requested.

fit_svd(sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)#

Fit an SVD on the embedding results of this model.

This function will create an SVDModel based on the embeddings from this model as well as the hyperparameters specified in the arguments.

Parameters:
  • sequences (list of bytes or list of str or None, optional) – Sequences to fit SVD on.

  • assay (AssayDataset or None, optional) – Assay containing sequences to fit SVD on.

  • n_components (int, optional) – Number of components in SVD. Determines output shapes. Default is 1024.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean).

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

The fitted SVD model.

Return type:

SVDModel

Raises:

InvalidParameterError – If neither or both of assay and sequences are provided.

fit_umap(sequences=None, assay=None, n_components=2, reduction=ReductionType.MEAN, **kwargs)#

Fit a UMAP on the embedding results of this model.

This function will create a UMAPModel based on the embeddings from this model as well as the hyperparameters specified in the arguments.

Parameters:
  • sequences (list of bytes or list of str or None, optional) – Optional sequences to fit UMAP with. Either use sequences or assay. Sequences is preferred.

  • assay (AssayDataset or None, optional) – Optional assay containing sequences to fit UMAP with. Either use sequences or assay. Ignored if sequences are provided.

  • n_components (int, optional) – Number of components in UMAP fit. Determines output shapes. Default is 2.

  • reduction (ReductionType or None, optional) – Embeddings reduction to use (e.g. mean). Defaults to MEAN.

  • kwargs – Additional keyword arguments to be used from foundational models, e.g. prompt_id for PoET models.

Returns:

The fitted UMAP model.

Return type:

UMAPModel

Raises:

InvalidParameterError – If neither or both of assay and sequences are provided.

get_metadata()#

Get model metadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

classmethod get_model()#

Get the model_id(s) for this EmbeddingModel subclass.

Returns:

List of model_id strings associated with this class.

Return type:

list of str

logits(sequences, **kwargs)#

Compute logit embeddings for sequences using this model.

Parameters:
  • sequences (list of bytes or list of str) – Sequences to compute logits for.

  • kwargs – Additional keyword arguments passed to the underlying foundation model, e.g. prompt_id for PoET models.

Returns:

Future object representing the logits result.

Return type:

EmbeddingsResultFuture
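
Logits are unnormalized scores, so a common follow-up step is a softmax to turn them into probabilities. The sketch below assumes each sequence's logits form a 2-D array-like of shape (positions, vocabulary); that shape is an assumption, not something this reference states, and softmax_rows is a hypothetical helper.

```python
import math

def softmax_rows(logits):
    """Convert each row of logits into probabilities that sum to 1.

    Assumes `logits` is a 2-D array-like of shape (positions, vocabulary);
    that shape is an assumption made for this sketch.
    """
    probabilities = []
    for row in logits:
        row = [float(x) for x in row]
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        probabilities.append([e / total for e in exps])
    return probabilities
```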

property metadata#

ModelMetadata for this model.

Returns:

The metadata associated with this model.

Return type:

ModelMetadata

Transform models#

These models are fitted on top of the base embedding models to produce reduced or transformed embeddings. See openprotein.svd and openprotein.umap for their detailed documentation.

class openprotein.svd.SVDModel(session, job=None, metadata=None)[source]#

SVD model that can be used to create reduced embeddings.

The model is also implemented as a Future to allow waiting for a fit job.

class openprotein.umap.UMAPModel(session, job=None, metadata=None)[source]#

UMAP model that can be used to create projected embeddings.

The model is also implemented as a Future to allow waiting for a fit job. The projected embeddings of the sequences used to fit the UMAP can be accessed using embeddings.

Results#

class openprotein.embeddings.EmbeddingsResultFuture(session, job, sequences=None, max_workers=10)[source]#

Future for manipulating results for embeddings-related requests.

Parameters:
  • session (APISession)

  • job (EmbeddingsJob | AttnJob | LogitsJob)

  • sequences (list[bytes] | list[str] | None)

  • max_workers (int)

stream()[source]#

Retrieve results for this job as a stream.

Returns:

A generator that yields (key, value) tuples.

Return type:

Generator
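
Because stream() yields (key, value) tuples, the results can be gathered into a dictionary, optionally stopping early. A minimal sketch; collect_results and its limit parameter are hypothetical and not part of the library:

```python
from itertools import islice

def collect_results(future, limit=None):
    """Gather (key, value) pairs from `future.stream()` into a dict.

    `limit` caps how many results are consumed, which helps when taking a
    quick look at a large job. The helper and `limit` are hypothetical.
    """
    return dict(islice(future.stream(), limit))
```

Streaming with a limit avoids materializing every result, whereas get() consumes the whole stream.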

get(verbose=False)[source]#

Return all results from the job by consuming the stream.

Parameters:
  • verbose (bool, optional) – If True, display a progress bar. Defaults to False.

  • **kwargs – Keyword arguments passed to the stream method.

Returns:

A list containing all results from the job.

Return type:

list

property id#

The unique identifier of the job.

get_item(sequence)[source]#

Get embedding results for specified sequence.

Parameters:

sequence (bytes) – Sequence to fetch results for.

Returns:

The embeddings for the specified sequence.

Return type:

np.ndarray
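
Since get_item is documented as taking bytes, a caller holding str sequences needs to encode them first. A small sketch of that conversion; embedding_for is a hypothetical helper, and ASCII encoding is an assumption that holds for standard amino-acid letters:

```python
def embedding_for(future, sequence):
    """Fetch the result for one sequence, accepting either str or bytes.

    `get_item` is documented as taking bytes, so a str is encoded first.
    ASCII is assumed, which covers standard amino-acid letters.
    """
    if isinstance(sequence, str):
        sequence = sequence.encode("ascii")
    return future.get_item(sequence)
```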

cancelled()#

Check if the job has been cancelled.

Returns:

True if the job is cancelled, False otherwise.

Return type:

bool

property created_date: datetime#

The creation timestamp of the job.

done()#

Check if the job has completed.

Returns:

True if the job is done, False otherwise.

Return type:

bool

property end_date: datetime | None#

The end timestamp of the job.

property job_id: str#

The unique identifier of the job.

property job_type: str#

The type of the job.

property progress_counter: int#

The progress counter of the job.

refresh()#

Refresh the job status and internal job object.

property start_date: datetime | None#

The start timestamp of the job.

property status: JobStatus#

The current status of the job.

wait(interval=5, timeout=None, verbose=False)#

Wait for the job to complete, then fetch results.

Parameters:
  • interval (int, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int | None, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

The results of the job.

Return type:

Any

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for the job to complete.

Parameters:
  • interval (float, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

True if the job completed successfully.

Return type:

bool

Notes

This method does not fetch the job results, unlike wait().
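
The polling pattern behind wait_until_done() can be sketched with only the documented refresh() and done() methods. This is an illustration of the idea, not the library's implementation; poll_until_done is a hypothetical helper, and returning False on timeout is a choice made for this sketch.

```python
import time

def poll_until_done(job, interval=5.0, timeout=None):
    """Poll `job` until `done()` is true; return False if `timeout` elapses.

    Uses only the documented `refresh()` and `done()` methods. Returning
    False on timeout is a design choice of this sketch.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while not job.done():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(interval)
        job.refresh()  # update the job status before checking again
    return True
```

Like wait_until_done(), this only waits; fetching results is a separate step via wait() or get().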

class openprotein.embeddings.EmbeddingsScoreFuture(session, job, sequences=None)[source]#

Future for manipulating results for embeddings score-related requests.

Parameters:
  • session (APISession)

  • job (ScoreJob | ScoreIndelJob | ScoreSingleSiteJob)

  • sequences (list[bytes] | list[str] | None)

stream()[source]#

Return the results from this job as a generator.

Parameters:

**kwargs – Keyword arguments passed to the streaming implementation.

Returns:

A generator that yields job results.

Return type:

Generator

Raises:

NotImplementedError – This is an abstract method and must be implemented by a subclass.

cancelled()#

Check if the job has been cancelled.

Returns:

True if the job is cancelled, False otherwise.

Return type:

bool

property created_date: datetime#

The creation timestamp of the job.

done()#

Check if the job has completed.

Returns:

True if the job is done, False otherwise.

Return type:

bool

property end_date: datetime | None#

The end timestamp of the job.

get(verbose=False, **kwargs)#

Return all results from the job by consuming the stream.

Parameters:
  • verbose (bool, optional) – If True, display a progress bar. Defaults to False.

  • **kwargs – Keyword arguments passed to the stream method.

Returns:

A list containing all results from the job.

Return type:

list

property id: str#

The unique identifier of the job.

property job_id: str#

The unique identifier of the job.

property job_type: str#

The type of the job.

property progress_counter: int#

The progress counter of the job.

refresh()#

Refresh the job status and internal job object.

property start_date: datetime | None#

The start timestamp of the job.

property status: JobStatus#

The current status of the job.

wait(interval=5, timeout=None, verbose=False)#

Wait for the job to complete, then fetch results.

Parameters:
  • interval (int, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int | None, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

The results of the job.

Return type:

Any

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for the job to complete.

Parameters:
  • interval (float, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

True if the job completed successfully.

Return type:

bool

Notes

This method does not fetch the job results, unlike wait().

class openprotein.embeddings.EmbeddingsGenerateFuture(session, job, sequences=None)[source]#

Future for manipulating results for embeddings generate-related requests.

Parameters:
  • session (APISession)

  • job (GenerateJob)

  • sequences (list[bytes] | list[str] | None)

cancelled()#

Check if the job has been cancelled.

Returns:

True if the job is cancelled, False otherwise.

Return type:

bool

property created_date: datetime#

The creation timestamp of the job.

done()#

Check if the job has completed.

Returns:

True if the job is done, False otherwise.

Return type:

bool

property end_date: datetime | None#

The end timestamp of the job.

get(verbose=False, **kwargs)#

Return all results from the job by consuming the stream.

Parameters:
  • verbose (bool, optional) – If True, display a progress bar. Defaults to False.

  • **kwargs – Keyword arguments passed to the stream method.

Returns:

A list containing all results from the job.

Return type:

list

property id: str#

The unique identifier of the job.

property job_id: str#

The unique identifier of the job.

property job_type: str#

The type of the job.

property progress_counter: int#

The progress counter of the job.

refresh()#

Refresh the job status and internal job object.

property start_date: datetime | None#

The start timestamp of the job.

property status: JobStatus#

The current status of the job.

stream()#

Return the results from this job as a generator.

Parameters:

**kwargs – Keyword arguments passed to the streaming implementation.

Returns:

A generator that yields job results.

Return type:

Generator

Raises:

NotImplementedError – This is an abstract method and must be implemented by a subclass.

wait(interval=5, timeout=None, verbose=False)#

Wait for the job to complete, then fetch results.

Parameters:
  • interval (int, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int | None, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

The results of the job.

Return type:

Any

wait_until_done(interval=5, timeout=None, verbose=False)#

Wait for the job to complete.

Parameters:
  • interval (float, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

True if the job completed successfully.

Return type:

bool

Notes

This method does not fetch the job results, unlike wait().

Base model#

EmbeddingModel is the base class that all embedding models inherit from.

class openprotein.embeddings.EmbeddingModel(session, model_id, metadata=None)[source]#

Base embeddings model used to understand and provide embeddings from sequences.

Parameters:
  • session (APISession)

  • model_id (list[str] | str)

  • metadata (ModelMetadata | None)