openprotein.svd#
Fit SVD models on top of our protein language models to produce reduced embeddings, which can be used to train predictors!
Interface#
- class openprotein.svd.SVDAPI(session)[source]#
SVD API providing the interface for creating and using SVD models.
- Parameters:
session (APISession)
- fit_svd(model_id, sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)[source]#
Fit an SVD on the sequences with the specified model_id and hyperparameters (n_components).
- Parameters:
model_id (str) – ID of embeddings model to use.
sequences (list of bytes or None, optional) – Sequences to fit the SVD with. Provide either sequences or assay; sequences takes precedence if both are given.
assay (AssayMetadata or AssayDataset or str or None, optional) – Assay containing the sequences to fit the SVD with, or its assay_id. Provide either sequences or assay. Ignored if sequences are provided.
n_components (int, optional) – The number of components for the SVD. Defaults to 1024.
reduction (str or None, optional) – Type of embedding reduction to use for computing features, e.g. “MEAN” or “SUM”. Useful when dealing with variable-length sequences. Defaults to None.
kwargs – Additional keyword arguments to be passed to foundational models, e.g. prompt_id for PoET models.
- Returns:
The SVD model being fit.
- Return type:
SVDModel
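As a rough usage sketch of the call documented above: the helper below assumes `svd_api` is an already constructed SVDAPI instance and that the caller supplies a valid embeddings model ID. It also assumes the returned SVDModel exposes `wait_until_done`, as the job futures documented further down do. The name `fit_svd_and_wait` is hypothetical and not part of the library.

```python
from typing import Any, Sequence


def fit_svd_and_wait(
    svd_api: Any,
    model_id: str,
    sequences: Sequence[bytes],
    n_components: int = 1024,
) -> Any:
    """Fit an SVD on `sequences` and block until the fit job finishes.

    `svd_api` is expected to behave like an SVDAPI instance; how it is
    obtained (for example from a connected session) is outside this sketch.
    """
    svd = svd_api.fit_svd(
        model_id=model_id,
        sequences=list(sequences),
        n_components=n_components,
        reduction="MEAN",  # pool embeddings so sequences may differ in length
    )
    # Assumption: SVDModel, being a Future, offers wait_until_done().
    svd.wait_until_done(verbose=False)
    return svd
```

Passing `reduction="MEAN"` here is only one choice; leaving it as None keeps the documented default.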
Results#
- class openprotein.svd.SVDModel(session, job=None, metadata=None)[source]#
SVD model that can be used to create reduced embeddings.
The model is also implemented as a Future to allow waiting for a fit job.
- Parameters:
session (APISession)
job (SVDFitJob)
metadata (SVDMetadata | None)
- property id#
The unique identifier of the job.
- property n_components#
Number of components of the SVD.
- property sequence_length#
Sequence length constraint of the SVD.
- property reduction#
Reduction of embeddings used to fit the SVD.
- property metadata#
Metadata of the SVD.
- property model: EmbeddingModel#
Base embeddings model used for the SVD.
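The properties above can be gathered into a single record for logging or comparison. This is a minimal sketch that reads only the attributes documented here; `summarize_svd` is a hypothetical helper name, and `svd_model` is assumed to be a fitted SVDModel.

```python
from typing import Any, Dict


def summarize_svd(svd_model: Any) -> Dict[str, Any]:
    """Collect the documented SVDModel properties into a plain dict."""
    return {
        "id": svd_model.id,
        "n_components": svd_model.n_components,
        "sequence_length": svd_model.sequence_length,
        "reduction": svd_model.reduction,
    }
```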
- get_inputs()[source]#
Get the sequences used for the SVD fit job.
- Returns:
List of sequences
- Return type:
list[bytes]
- embed(sequences, **kwargs)[source]#
Use this SVD model to get reduced embeddings from input sequences.
- Parameters:
sequences (List[bytes]) – List of protein sequences.
- Returns:
Future result containing the reduced embeddings.
- Return type:
SVDEmbeddingsResultFuture
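A short sketch of how embed might be combined with the future's wait method, which is documented later in this section. `embed_and_wait` is a hypothetical helper name, and `svd_model` is assumed to be an SVDModel whose fit job has already finished.

```python
from typing import Any, List, Optional, Sequence


def embed_and_wait(
    svd_model: Any,
    sequences: Sequence[bytes],
    timeout: Optional[int] = None,
) -> List[Any]:
    """Request reduced embeddings and block until they are available."""
    future = svd_model.embed(sequences=list(sequences))
    # wait() polls the job, then returns the embeddings (list[ndarray]).
    return future.wait(interval=5, timeout=timeout, verbose=False)
```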
- fit_umap(sequences=None, assay=None, n_components=2, **kwargs)[source]#
Fit a UMAP on the embedding results of this model.
This function creates a UMAPModel based on the embeddings from this model and the hyperparameters specified in the arguments.
- Parameters:
sequences (List[bytes]) – Sequences to fit the UMAP on.
n_components (int) – Number of UMAP components. Determines the output shape.
reduction (ReductionType | None) – Embedding reduction to use (e.g. mean).
assay (AssayDataset | None)
- Returns:
UMAP model fitted on the reduced embeddings from provided sequences or assay.
- Return type:
UMAPModel
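The following sketch shows one way to call fit_umap with either sequences or an assay. The check that exactly one of the two is given is this sketch's own design choice, not a documented requirement of the library, and `fit_umap_projection` is a hypothetical name.

```python
from typing import Any, Optional, Sequence


def fit_umap_projection(
    svd_model: Any,
    sequences: Optional[Sequence[bytes]] = None,
    assay: Any = None,
    n_components: int = 2,
) -> Any:
    """Fit a UMAP on top of an SVD model from sequences or from an assay."""
    # Sketch-level guard: avoid ambiguity about which input is used.
    if (sequences is None) == (assay is None):
        raise ValueError("provide exactly one of sequences or assay")
    if sequences is not None:
        return svd_model.fit_umap(
            sequences=list(sequences), n_components=n_components
        )
    return svd_model.fit_umap(assay=assay, n_components=n_components)
```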
- fit_gp(assay, properties, name=None, description=None, **kwargs)[source]#
Fit a Gaussian process (GP) on an assay using this embedding model and the given hyperparameters.
- Parameters:
assay (AssayMetadata or AssayDataset or str) – Assay to fit the GP on, or its assay_id.
properties (list of str) – Properties in the assay to fit the GP on.
name (str | None)
description (str | None)
- Returns:
Property predictor model trained using the reduced embeddings with provided assay and properties.
- Return type:
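A minimal sketch of training a property predictor on the reduced embeddings, using only the fit_gp parameters listed above. `fit_property_gp` is a hypothetical helper name, and the empty-list check is this sketch's own precaution rather than documented library behavior.

```python
from typing import Any, Optional, Sequence


def fit_property_gp(
    svd_model: Any,
    assay: Any,
    properties: Sequence[str],
    name: Optional[str] = None,
) -> Any:
    """Fit a GP predictor for `properties` of `assay` on reduced embeddings."""
    if not properties:
        raise ValueError("at least one property name is required")
    # `assay` may be an assay object or its assay_id string, per the docs.
    return svd_model.fit_gp(assay=assay, properties=list(properties), name=name)
```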
- class openprotein.svd.SVDEmbeddingsResultFuture(session, job, sequences=None, max_workers=10)[source]#
SVD embeddings results represented as a future.
- Parameters:
session (APISession)
job (SVDEmbeddingsJob)
sequences (list[bytes] | list[str] | None)
max_workers (int)
- wait(interval=5, timeout=None, verbose=False)[source]#
Wait for the SVD embeddings job and retrieve the embeddings.
- Parameters:
interval (int)
timeout (int | None)
verbose (bool)
- Return type:
list[ndarray]
- get(verbose=False)[source]#
Get all the SVD reduced embeddings from the job.
- Return type:
list[ndarray]
- get_item(sequence)[source]#
Get SVD embeddings for specified sequence.
- Parameters:
sequence (bytes) – Sequence to fetch SVD embeddings for.
- Returns:
SVD embeddings represented as a NumPy array.
- Return type:
np.ndarray
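When results are needed per sequence rather than as one list, get_item can be called for each sequence once the job has finished. The sketch below assumes `future` is an SVDEmbeddingsResultFuture; `embeddings_by_sequence` is a hypothetical helper name.

```python
from typing import Any, Dict, Sequence


def embeddings_by_sequence(
    future: Any, sequences: Sequence[bytes]
) -> Dict[bytes, Any]:
    """Map each sequence to its reduced embedding via get_item()."""
    # Block until the job completes; this does not fetch results itself.
    future.wait_until_done()
    return {seq: future.get_item(seq) for seq in sequences}
```

Keying the result by sequence avoids relying on the order in which embeddings are returned.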
- cancelled()#
Check if the job has been cancelled.
- Returns:
True if the job is cancelled, False otherwise.
- Return type:
bool
- property created_date: datetime#
The creation timestamp of the job.
- done()#
Check if the job has completed.
- Returns:
True if the job is done, False otherwise.
- Return type:
bool
- property end_date: datetime | None#
The end timestamp of the job.
- property id#
The unique identifier of the job.
- property job_id: str#
The unique identifier of the job.
- property job_type: str#
The type of the job.
- property progress_counter: int#
The progress counter of the job.
- refresh()#
Refresh the job status and internal job object.
- property start_date: datetime | None#
The start timestamp of the job.
- property status: JobStatus#
The current status of the job.
- stream()#
Retrieve results for this job as a stream.
- Returns:
A generator that yields (key, value) tuples.
- Return type:
Generator
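Streaming lets results be processed one at a time instead of being held in a single list. A small sketch, assuming `future` is a finished job future and that each yielded item is a (key, value) tuple as described above; `consume_stream` and `handle` are hypothetical names.

```python
from typing import Any, Callable


def consume_stream(future: Any, handle: Callable[[Any, Any], None]) -> int:
    """Pass every (key, value) pair from stream() to `handle`.

    Returns the number of pairs processed.
    """
    count = 0
    for key, value in future.stream():
        handle(key, value)
        count += 1
    return count
```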
- wait_until_done(interval=5, timeout=None, verbose=False)#
Wait for the job to complete.
- Parameters:
interval (float, optional) – Time in seconds between polling. Defaults to config.POLLING_INTERVAL.
timeout (int, optional) – Maximum time in seconds to wait. Defaults to None.
verbose (bool, optional) – Verbosity flag. Defaults to False.
- Returns:
True if the job completed successfully.
- Return type:
bool
Notes
This method does not fetch the job results, unlike wait().
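For callers who want control over the polling loop, the documented refresh, done, and cancelled methods can be combined by hand. This is an illustrative sketch rather than the library's own implementation of wait_until_done; `poll_until_finished` is a hypothetical name.

```python
import time
from typing import Any, Optional


def poll_until_finished(
    future: Any, interval: float = 5.0, timeout: Optional[float] = None
) -> bool:
    """Poll a job future until it is done or cancelled.

    Returns True if the job reports done, False if it was cancelled.
    Raises TimeoutError if `timeout` seconds elapse first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        future.refresh()  # update the job status from the server
        if future.cancelled():
            return False
        if future.done():
            return True
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("job did not finish within the timeout")
        time.sleep(interval)
```

As with wait_until_done, this only waits for completion; results still need to be fetched with get() or wait().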
Classes#
- class openprotein.svd.SVDMetadata(*, id, status, created_date=None, model_id, n_components, reduction=None, sequence_length=None)[source]#
- Parameters:
id (str)
status (JobStatus)
created_date (datetime | None)
model_id (str)
n_components (int)
reduction (str | None)
sequence_length (int | None)
- class openprotein.svd.SVDFitJob(*, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)[source]#
- Parameters:
job_id (str)
job_type (Literal[JobType.svd_fit])
status (JobStatus)
created_date (datetime)
start_date (datetime | None)
end_date (datetime | None)
prerequisite_job_id (str | None)
progress_message (str | None)
progress_counter (int | None)
sequence_length (int | None)
extra_data (Any)
- class openprotein.svd.SVDEmbeddingsJob(*, num_records=None, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)[source]#
- Parameters:
num_records (int | None)
job_id (str)
job_type (Literal[JobType.svd_embed])
status (JobStatus)
created_date (datetime)
start_date (datetime | None)
end_date (datetime | None)
prerequisite_job_id (str | None)
progress_message (str | None)
progress_counter (int | None)
sequence_length (int | None)
extra_data (Any)