openprotein.umap#
Fit UMAP on your embeddings and project them into a low-dimensional space for visualization.
Interface#
- class openprotein.umap.UMAPAPI(session)[source]#
UMAP API providing the interface to fit and run UMAP visualizations.
- Parameters:
session (APISession)
- fit_umap(model, feature_type=None, sequences=None, assay=None, n_components=2, n_neighbors=15, min_dist=0.1, reduction=None, **kwargs)[source]#
Fit a UMAP on the sequences with the specified model and hyperparameters (n_components, n_neighbors, min_dist).
- Parameters:
sequences (list of bytes or None, optional) – Optional sequences to fit UMAP with. Either use sequences or assay. sequences is preferred.
assay (AssayMetadata or AssayDataset or str or None, optional) – Optional assay containing sequences to fit UMAP with, or its assay id. Either use sequences or assay. Ignored if sequences are provided.
model (EmbeddingModel or SVDModel or str) – Instance of either EmbeddingModel or SVDModel to use, depending on feature type. Can also be a str specifying the model id, in which case feature_type must be specified.
feature_type (FeatureType or None, optional) – Type of features to use for encoding sequences: “SVD” or “PLM”. If None, model must be an EmbeddingModel or SVDModel instance.
n_components (int, optional) – Number of UMAP components to fit. Defaults to 2.
n_neighbors (int, optional) – Number of neighbors to use for fitting. Defaults to 15.
min_dist (float, optional) – Minimum distance in UMAP fitting. Defaults to 0.1.
reduction (str or None, optional) – Type of embedding reduction to use for computing features, e.g. “MEAN” or “SUM”. Useful when dealing with variable-length sequences. Defaults to None.
kwargs – Additional keyword arguments to be passed to foundational models, e.g. prompt_id for PoET models.
- Returns:
The UMAP model being fit.
- Return type:
UMAPModel
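A minimal sketch of a fit_umap call, using only the parameters documented above. The helper name fit_umap_2d is hypothetical, and umap_api is assumed to be a UMAPAPI instance; how you obtain one from your session is not covered on this page.

```python
def fit_umap_2d(umap_api, model, sequences, n_neighbors=15, min_dist=0.1):
    """Fit a 2-component UMAP on `sequences` (hypothetical helper).

    Assumes `umap_api` is a UMAPAPI instance and `model` is an
    EmbeddingModel or SVDModel instance, so feature_type can stay None.
    """
    # sequences is documented as a list of bytes, so encode any str inputs.
    encoded = [s.encode() if isinstance(s, str) else s for s in sequences]
    return umap_api.fit_umap(
        model=model,
        sequences=encoded,
        n_components=2,
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        reduction="MEAN",  # documented as useful for variable-length sequences
    )
```

If model is passed as a model id string instead of an instance, feature_type would also need to be supplied, as described in the parameter list.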
Results#
- class openprotein.umap.UMAPModel(session, job=None, metadata=None)[source]#
UMAP model that can be used to create projected embeddings.
The model is also implemented as a Future to allow waiting for a fit job. The projected embeddings of the sequences used to fit the UMAP can be accessed using embeddings.
- Parameters:
session (APISession)
job (UMAPFitJob | None)
metadata (UMAPMetadata | None)
- property id#
UMAP unique identifier.
- property n_components#
Number of components specified for the UMAP.
- property n_neighbors#
Number of neighbors specified for the UMAP.
- property min_dist#
Minimum distance specified for the UMAP.
- property sequence_length#
Sequence length constraint of the UMAP.
- property reduction#
Reduction used to fit the UMAP.
- property metadata#
Metadata of the UMAP.
- property sequences#
The sequences used to fit the UMAP.
- property embeddings#
The projected embeddings of the sequences used to fit the UMAP.
- property model: EmbeddingModel#
Base embeddings model used for the UMAP.
- get_inputs()[source]#
Get the sequences used for the UMAP fit job.
- Returns:
List of sequences.
- Return type:
list[bytes]
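As an illustration of the properties listed above, the following sketch gathers a fitted model's hyperparameters into a plain dict, for example to log them next to a plot. The helper name summarize_umap is hypothetical; it reads only attributes documented on UMAPModel.

```python
def summarize_umap(umap_model):
    """Collect documented UMAPModel properties into a dict (hypothetical helper)."""
    fields = (
        "id",
        "n_components",
        "n_neighbors",
        "min_dist",
        "sequence_length",
        "reduction",
    )
    # Each name is a property documented on UMAPModel.
    return {name: getattr(umap_model, name) for name in fields}
```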
- class openprotein.umap.UMAPEmbeddingsResultFuture(session, job, sequences=None, max_workers=10)[source]#
UMAP embeddings results represented as a future.
- Parameters:
session (APISession)
job (UMAPEmbeddingsJob)
sequences (list[bytes] | list[str] | None)
max_workers (int)
- wait(interval=5, timeout=None, verbose=False)[source]#
Wait for the UMAP embeddings job and retrieve the embeddings.
- Parameters:
interval (int)
timeout (int | None)
verbose (bool)
- Return type:
list[ndarray]
- get(verbose=False)[source]#
Get all the UMAP projected embeddings from the job.
- Return type:
list[ndarray]
- get_item(sequence)[source]#
Get UMAP embeddings for specified sequence.
- Parameters:
sequence (bytes) – Sequence to fetch UMAP embeddings for.
- Returns:
UMAP embeddings represented as a NumPy array.
- Return type:
np.ndarray
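Because get_item documents a bytes argument while the constructor also accepts str sequences, a small wrapper can normalize the input before the lookup. This is a sketch only; get_projection is a hypothetical name, and future is assumed to be a UMAPEmbeddingsResultFuture.

```python
def get_projection(future, sequence):
    """Fetch the UMAP projection for one sequence (hypothetical helper)."""
    # get_item documents a bytes argument, so encode str inputs first.
    key = sequence.encode() if isinstance(sequence, str) else sequence
    return future.get_item(key)
```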
- cancelled()#
Check if the job has been cancelled.
- Returns:
True if the job is cancelled, False otherwise.
- Return type:
bool
- property created_date: datetime#
The creation timestamp of the job.
- done()#
Check if the job has completed.
- Returns:
True if the job is done, False otherwise.
- Return type:
bool
- property end_date: datetime | None#
The end timestamp of the job.
- property id#
The unique identifier of the job.
- property job_id: str#
The unique identifier of the job.
- property job_type: str#
The type of the job.
- property progress_counter: int#
The progress counter of the job.
- refresh()#
Refresh the job status and internal job object.
- property start_date: datetime | None#
The start timestamp of the job.
- property status: JobStatus#
The current status of the job.
- stream()#
Retrieve results for this job as a stream.
- Returns:
A generator that yields (key, value) tuples.
- Return type:
Generator
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for the job to complete.
- Parameters:
interval (float, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – Maximum time in seconds to wait. Defaults to None.
verbose (bool, optional) – Verbosity flag. Defaults to False.
- Returns:
True if the job completed successfully.
- Return type:
bool
Notes
This method does not fetch the job results, unlike wait().
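The distinction in the note above can be sketched as follows: wait_until_done only blocks until the job finishes, so the results are fetched with a separate get() call. The helper name wait_then_get is hypothetical, and raising RuntimeError when the job does not report success is an assumption about how a caller might want to handle that case.

```python
def wait_then_get(future, interval=5, timeout=None):
    """Block until the job finishes, then fetch all projections (hypothetical helper)."""
    # wait_until_done polls for completion but does not fetch results.
    succeeded = future.wait_until_done(interval=interval, timeout=timeout)
    if not succeeded:
        # Assumption: treat a falsy return as a failed job and surface it.
        raise RuntimeError(f"UMAP embeddings job {future.job_id} did not succeed")
    # get() is documented to return the projected embeddings as list[ndarray].
    return future.get()
```

Calling wait() directly combines both steps; the split form is useful when you want to check the job outcome before retrieving results.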
Classes#
- class openprotein.umap.UMAPMetadata(*, id, status, created_date=None, model_id, feature_type, n_components=2, n_neighbors=15, min_dist=0.1, reduction=None, sequence_length=None)[source]#
- Parameters:
id (str)
status (JobStatus)
created_date (datetime | None)
model_id (str)
feature_type (FeatureType)
n_components (int)
n_neighbors (int)
min_dist (float)
reduction (str | None)
sequence_length (int | None)
- class openprotein.umap.UMAPFitJob(*, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)[source]#
- Parameters:
job_id (str)
job_type (Literal[JobType.umap_fit])
status (JobStatus)
created_date (datetime)
start_date (datetime | None)
end_date (datetime | None)
prerequisite_job_id (str | None)
progress_message (str | None)
progress_counter (int | None)
sequence_length (int | None)
extra_data (Any)
- class openprotein.umap.UMAPEmbeddingsJob(*, num_records=None, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)[source]#
- Parameters:
num_records (int | None)
job_id (str)
job_type (Literal[JobType.umap_embed])
status (JobStatus)
created_date (datetime)
start_date (datetime | None)
end_date (datetime | None)
prerequisite_job_id (str | None)
progress_message (str | None)
progress_counter (int | None)
sequence_length (int | None)
extra_data (Any)