openprotein.umap

Fit UMAP models on your sequence embeddings and project them into a low-dimensional space for visualization.

Interface

class openprotein.umap.UMAPAPI(session)

UMAP API providing the interface to fit and run UMAP visualizations.

Parameters:

session (APISession)

fit_umap(model, feature_type=None, sequences=None, assay=None, n_components=2, n_neighbors=15, min_dist=0.1, reduction=None, **kwargs)

Fit a UMAP on the sequences using the specified model and hyperparameters (n_components, n_neighbors, min_dist).

Parameters:
  • sequences (list of bytes or None, optional) – Sequences to fit the UMAP with. Provide either sequences or assay; sequences is preferred.

  • assay (AssayMetadata or AssayDataset or str or None, optional) – Assay containing the sequences to fit the UMAP with, or its assay_id. Provide either sequences or assay. Ignored if sequences is provided.

  • model (EmbeddingModel or SVDModel or str) – Instance of either EmbeddingModel or SVDModel to use, depending on the feature type. Can also be a str specifying the model id, in which case feature_type must be specified.

  • feature_type (FeatureType or None, optional) – Type of features to use for encoding sequences: “SVD” or “PLM”. If None, model must be an EmbeddingModel or SVDModel instance.

  • n_components (int, optional) – Number of UMAP components to fit. Defaults to 2.

  • n_neighbors (int, optional) – Number of neighbors to use for fitting. Defaults to 15.

  • min_dist (float, optional) – Minimum distance in UMAP fitting. Defaults to 0.1.

  • reduction (str or None, optional) – Type of embedding reduction to use for computing features, e.g. “MEAN” or “SUM”. Useful when dealing with variable-length sequences. Defaults to None.

  • kwargs – Additional keyword arguments passed to the foundation model, e.g. prompt_id for PoET models.

Returns:

The UMAP model being fit.

Return type:

UMAPModel
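As an illustration of the parameter notes above, here is a minimal sketch of fitting from a model id string, which requires feature_type to be given explicitly. The helper name fit_plm_umap is hypothetical, umap_api stands for an already constructed UMAPAPI instance, and passing the plain string "PLM" for feature_type is an assumption based on the values listed above.

```python
def fit_plm_umap(umap_api, model_id, sequences, n_components=2):
    """Fit a UMAP on PLM embeddings of `sequences` (hypothetical helper).

    `umap_api` is assumed to be a UMAPAPI instance and `model_id` the id of
    an embedding model. Because the model is given as a str, feature_type
    is passed explicitly.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    return umap_api.fit_umap(
        model=model_id,
        feature_type="PLM",  # assumption: accepted as a plain string
        sequences=list(sequences),
        n_components=n_components,
        n_neighbors=15,  # documented default
        min_dist=0.1,  # documented default
        reduction="MEAN",  # pool variable-length sequences to one vector
    )
```

The returned object is the UMAPModel being fit; the fit runs as a job, so its results may not be available immediately.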

get_umap(umap_id)

Get UMAP job results, including the UMAP dimensions and sequence lengths.

Requires a successful UMAP job from fit_umap.

Parameters:

umap_id (str) – The ID of the UMAP job.

Returns:

The fitted UMAP model.

Return type:

UMAPModel
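A short sketch of how a fitted UMAP might be looked up and inspected. The helper name summarize_umap is hypothetical; it relies only on get_umap and on the UMAPModel properties documented on this page, and umap_api is assumed to be a UMAPAPI instance.

```python
def summarize_umap(umap_api, umap_id):
    """Return the hyperparameters of a fitted UMAP as a plain dict.

    Hypothetical helper: reads only properties documented for UMAPModel.
    """
    umap = umap_api.get_umap(umap_id)
    return {
        "id": umap.id,
        "n_components": umap.n_components,
        "n_neighbors": umap.n_neighbors,
        "min_dist": umap.min_dist,
        "reduction": umap.reduction,
        "sequence_length": umap.sequence_length,
    }
```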

list_umap()

List the UMAP models created by the user.

Takes no arguments.

Returns:

The user's UMAP models.

Return type:

list[UMAPModel]
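Since list_umap returns every model the user has created, a common follow-up is to filter the list locally. The sketch below uses a hypothetical helper name, umaps_with_components, and assumes umap_api is a UMAPAPI instance; it reads only the documented n_components property.

```python
def umaps_with_components(umap_api, n_components=2):
    """Return the user's UMAP models that have `n_components` dimensions.

    Hypothetical helper: filters the result of list_umap() client-side.
    """
    return [
        umap
        for umap in umap_api.list_umap()
        if umap.n_components == n_components
    ]
```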

Results

class openprotein.umap.UMAPModel(session, job=None, metadata=None)

UMAP model that can be used to create projected embeddings.

The model is also implemented as a Future, so you can wait for the fit job to finish. The projected embeddings of the sequences used to fit the UMAP are available through the embeddings property.

Parameters:
property id

UMAP unique identifier.

property n_components

Number of components specified for the UMAP.

property n_neighbors

Number of neighbors specified for the UMAP.

property min_dist

Minimum distance specified for the UMAP.

property sequence_length

Sequence length constraint of the UMAP.

property reduction

Reduction used to fit the UMAP.

property metadata

Metadata of the UMAP.

property sequences

The sequences used to fit the UMAP.

property embeddings

The projected embeddings of the sequences used to fit the UMAP.

property model: EmbeddingModel

Base embeddings model used for the UMAP.

delete()

Delete this UMAP model.

Return type:

bool

get(verbose=False)

Retrieve this UMAP model itself.

Parameters:

verbose (bool)

get_inputs()

Get the sequences used for the UMAP job.

Returns:

List of sequences.

Return type:

list[bytes]

embed(sequences, **kwargs)

Use this UMAP model to get projected embeddings from input sequences.

Parameters:

sequences (List[bytes]) – List of protein sequences.

Returns:

Future result containing the projected embeddings.

Return type:

UMAPEmbeddingsResultFuture
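To show how embed fits together with the future it returns, here is a minimal sketch that projects new sequences and blocks until the coordinates are available. The helper name project_sequences is hypothetical, and umap_model is assumed to be a fitted UMAPModel; wait is called with the timeout parameter documented for UMAPEmbeddingsResultFuture.

```python
def project_sequences(umap_model, sequences, timeout=None):
    """Project `sequences` with a fitted UMAP and wait for the result.

    Hypothetical helper: submits an embed job, then blocks on the returned
    future until the projected embeddings can be retrieved.
    """
    future = umap_model.embed(sequences=list(sequences))
    # wait() both waits for the job and fetches the embeddings.
    return future.wait(timeout=timeout)
```

Passing a timeout (in seconds) may be useful in scripts that should not block indefinitely on a long-running job.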

class openprotein.umap.UMAPEmbeddingsResultFuture(session, job, sequences=None, max_workers=10)

UMAP embeddings results represented as a future.

Parameters:
wait(interval=5, timeout=None, verbose=False)

Wait for the UMAP embeddings job and retrieve the embeddings.

Parameters:
  • interval (int)

  • timeout (int | None)

  • verbose (bool)

Return type:

list[ndarray]

get(verbose=False)

Get all the UMAP projected embeddings from the job.

Return type:

list[ndarray]

get_item(sequence)

Get the UMAP embeddings for the specified sequence.

Parameters:

sequence (bytes) – Sequence to fetch UMAP embeddings for.

Returns:

UMAP embeddings represented as a NumPy array.

Return type:

np.ndarray
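When the coordinates need to be looked up by sequence, for example to label points in a plot, get_item can be called once per sequence. The sketch below uses a hypothetical helper name, coordinates_by_sequence, and assumes future is a UMAPEmbeddingsResultFuture whose job has already finished.

```python
def coordinates_by_sequence(future, sequences):
    """Map each sequence to its projected UMAP coordinates.

    Hypothetical helper: refuses to run until the job reports done(), then
    fetches one array per sequence with get_item().
    """
    if not future.done():
        raise RuntimeError("embeddings job has not finished yet")
    return {seq: future.get_item(seq) for seq in sequences}
```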

cancelled()

Check if the job has been cancelled.

Returns:

True if the job is cancelled, False otherwise.

Return type:

bool

property created_date: datetime

The creation timestamp of the job.

done()

Check if the job has completed.

Returns:

True if the job is done, False otherwise.

Return type:

bool

property end_date: datetime | None

The end timestamp of the job.

property id

The unique identifier of the job.

property job_id: str

The unique identifier of the job.

property job_type: str

The type of the job.

property progress_counter: int

The progress counter of the job.

refresh()

Refresh the job status and internal job object.

property start_date: datetime | None

The start timestamp of the job.

property status: JobStatus

The current status of the job.

stream()

Retrieve results for this job as a stream.

Returns:

A generator that yields (key, value) tuples.

Return type:

Generator
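Because stream returns a generator, results can be consumed lazily instead of loading everything at once. The following sketch, with the hypothetical helper name first_results, takes only the first few (key, value) tuples; what the keys and values contain is not specified here, so the helper treats them as opaque.

```python
from itertools import islice


def first_results(future, limit):
    """Return at most `limit` (key, value) tuples from future.stream().

    Hypothetical helper: islice stops pulling from the generator once
    `limit` items have been read, so the rest is never consumed.
    """
    return list(islice(future.stream(), limit))
```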

wait_until_done(interval=5, timeout=None, verbose=False)

Wait for the job to complete.

Parameters:
  • interval (float, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.

  • timeout (int, optional) – Maximum time in seconds to wait. Defaults to None.

  • verbose (bool, optional) – Verbosity flag. Defaults to False.

Returns:

True if the job completed successfully.

Return type:

bool

Notes

This method does not fetch the job results, unlike wait().
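The difference noted above can be sketched as follows: wait_until_done only reports completion, so the results have to be fetched separately with get. The helper name fetch_when_done is hypothetical, and returning None when the job does not complete successfully is a choice made for this example, not documented behavior.

```python
def fetch_when_done(future, timeout=None):
    """Wait for the job, then fetch its results in a separate step.

    Hypothetical helper: wait_until_done() returns a bool and does not
    retrieve anything, so get() is called only after it returns True.
    """
    if future.wait_until_done(timeout=timeout):
        return future.get()
    return None  # example convention for "no results available"
```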

Classes

class openprotein.umap.UMAPMetadata(*, id, status, created_date=None, model_id, feature_type, n_components=2, n_neighbors=15, min_dist=0.1, reduction=None, sequence_length=None)
Parameters:
  • id (str)

  • status (JobStatus)

  • created_date (datetime | None)

  • model_id (str)

  • feature_type (FeatureType)

  • n_components (int)

  • n_neighbors (int)

  • min_dist (float)

  • reduction (str | None)

  • sequence_length (int | None)

class openprotein.umap.UMAPFitJob(*, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)
Parameters:
  • job_id (str)

  • job_type (Literal[JobType.umap_fit])

  • status (JobStatus)

  • created_date (datetime)

  • start_date (datetime | None)

  • end_date (datetime | None)

  • prerequisite_job_id (str | None)

  • progress_message (str | None)

  • progress_counter (int | None)

  • sequence_length (int | None)

  • extra_data (Any)

class openprotein.umap.UMAPEmbeddingsJob(*, num_records=None, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)
Parameters:
  • num_records (int | None)

  • job_id (str)

  • job_type (Literal[JobType.umap_embed])

  • status (JobStatus)

  • created_date (datetime)

  • start_date (datetime | None)

  • end_date (datetime | None)

  • prerequisite_job_id (str | None)

  • progress_message (str | None)

  • progress_counter (int | None)

  • sequence_length (int | None)

  • extra_data (Any)