openprotein.svd
Fit SVD models on top of our protein language models to produce reduced embeddings, which can be used to train predictors!
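To make "reduced embeddings" concrete, the following is a minimal local sketch of the idea, assuming a mean reduction over the sequence length followed by a plain truncated SVD. The random arrays are hypothetical stand-ins for language-model output, and nothing here calls the OpenProtein service or is claimed to mirror its internal implementation (for example, whether features are centered first is not stated in this reference).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for language-model output: one (length, hidden_dim)
# array per sequence, with lengths that differ between sequences.
hidden_dim = 16
lengths = [5, 8, 6, 7, 9, 4]
per_residue = [rng.normal(size=(n, hidden_dim)) for n in lengths]

# Mean reduction: average over the length axis so that every sequence
# becomes one fixed-size vector, whatever its length.
features = np.stack([emb.mean(axis=0) for emb in per_residue])  # (6, 16)

# Truncated SVD: keep the top n_components right singular vectors.
n_components = 3
_, _, vt = np.linalg.svd(features, full_matrices=False)
components = vt[:n_components]  # (3, 16)


def project(x):
    """Map (n, hidden_dim) features to (n, n_components) reduced embeddings."""
    return x @ components.T


reduced = project(features)
print(reduced.shape)  # (6, 3)
```

The reduced vectors are the kind of fixed-size features a downstream predictor can be trained on.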
Interface
- class openprotein.svd.SVDAPI(session)[source]#
- SVD API providing the interface for creating and using SVD models. - Parameters:
- session (APISession) 
 - fit_svd(model_id, sequences=None, assay=None, n_components=1024, reduction=None, **kwargs)[source]#
- Fit an SVD on the sequences with the specified model_id and hyperparameters (n_components). - Parameters:
- model_id (str or EmbeddingModel) – ID of embeddings model to use. 
- sequences (list of bytes or None, optional) – Sequences to fit the SVD with. Provide either sequences or assay; sequences takes precedence. 
- assay (AssayMetadata or AssayDataset or str or None, optional) – Assay containing the sequences to fit the SVD with, or its assay_id. Ignored if sequences is provided. 
- n_components (int, optional) – The number of components for the SVD. Defaults to 1024. 
- reduction (str or ReductionType or None, optional) – Type of embedding reduction to use for computing features, e.g. “MEAN” or “SUM”. Useful when dealing with variable-length sequences. Defaults to None. 
- kwargs – Additional keyword arguments to be passed to foundational models, e.g. prompt_id for PoET models. 
 
- Returns:
- The SVD model being fit. 
- Return type:
- SVDModel 
 
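As a hedged illustration of the fit_svd call described above, the sketch below wraps it in a small helper. The helper name fit_reduced_model and its validation are hypothetical; only the fit_svd keyword arguments come from the signature in this reference. The svd_api argument is assumed to be an SVDAPI instance built from an authenticated APISession.

```python
def fit_reduced_model(svd_api, model_id, sequences, n_components=1024, reduction="MEAN"):
    """Fit an SVD through the documented fit_svd call (hypothetical helper).

    svd_api is assumed to be an openprotein.svd.SVDAPI instance, e.g.
    SVDAPI(session) where session is an authenticated APISession.
    """
    # The reference documents sequences as a list of bytes, so str inputs
    # are encoded here for convenience.
    seqs = [s.encode() if isinstance(s, str) else s for s in sequences]
    if not seqs:
        raise ValueError("at least one sequence is required")
    return svd_api.fit_svd(
        model_id=model_id,
        sequences=seqs,
        n_components=n_components,
        reduction=reduction,
    )


# Usage, with a placeholder model ID:
# svd_model = fit_reduced_model(svd_api, "your-model-id", [b"MKTAYIAK", b"MKTAYLAK"],
#                               n_components=32)
```

Passing a reduction such as "MEAN" is a choice made for this sketch because the example sequences could differ in length; the documented default is None.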
Results
- class openprotein.svd.SVDModel(session, job=None, metadata=None)[source]#
- SVD model that can be used to create reduced embeddings. - The model is also implemented as a Future to allow waiting for a fit job. - Parameters:
- session (APISession) 
- job (SVDFitJob) 
- metadata (SVDMetadata | None) 
 
 - property id#
- The unique identifier of the job. 
 - property n_components#
- Number of components of the SVD. 
 - property sequence_length#
- Sequence length constraint of the SVD. 
 - property reduction#
- Reduction of embeddings used to fit the SVD. 
 - property metadata#
- Metadata of the SVD. 
 - property model: EmbeddingModel#
- Base embeddings model used for the SVD. 
 - get_inputs()[source]#
- Get the sequences used for the SVD fit job. - Returns:
- List of sequences 
- Return type:
- list[bytes] 
 
 - embed(sequences, **kwargs)[source]#
- Use this SVD model to get reduced embeddings from input sequences. - Parameters:
- sequences (List[bytes]) – List of protein sequences. 
- Returns:
- Future result containing the reduced embeddings. 
- Return type:
- SVDEmbeddingsResultFuture 
 - fit_umap(sequences=None, assay=None, n_components=2, **kwargs)[source]#
- Fit a UMAP on the embedding results of this model. - This function creates a UMAPModel from this model's embeddings and the hyperparameters specified in the arguments. - Parameters:
- sequences (List[bytes]) – Sequences to fit the UMAP on. 
- n_components (int) – Number of UMAP components; determines the output shape. Defaults to 2. 
- reduction (ReductionType | None) – Embedding reduction to use (e.g. mean). 
- assay (AssayDataset | None) – Optional assay containing the sequences to fit the UMAP on. 
 
- Returns:
- UMAP model fitted on the reduced embeddings from provided sequences or assay. 
- Return type:
- UMAPModel 
 - fit_gp(assay, properties, name=None, description=None, **kwargs)[source]#
- Fit a GP on assay using this embedding model and hyperparameters. - Parameters:
- assay (AssayMetadata or AssayDataset or str) – Assay to fit the GP on, or its assay_id. 
- properties (list of str) – Properties in the assay to fit the GP on. 
- name (str | None) 
- description (str | None) 
 
- Returns:
- Property predictor model trained using the reduced embeddings with provided assay and properties. 
- Return type:
 
 
- class openprotein.svd.SVDEmbeddingsResultFuture(session, job, sequences=None, max_workers=10)[source]#
- SVD embeddings results represented as a future. - Parameters:
- session (APISession) 
- job (SVDEmbeddingsJob) 
- sequences (list[bytes] | list[str] | None) 
- max_workers (int) 
 
 - wait(interval=5, timeout=None, verbose=False)[source]#
- Wait for the SVD embeddings job and retrieve the embeddings. - Parameters:
- interval (int) 
- timeout (int | None) 
- verbose (bool) 
 
- Return type:
- list[ndarray] 
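Because embed() on SVDModel returns a future rather than the arrays themselves, a typical flow is to submit the sequences and then call wait(). The helper below is a minimal sketch of that flow; embed_and_wait is a hypothetical name, and svd_model is assumed to be an already fitted SVDModel.

```python
def embed_and_wait(svd_model, sequences, timeout=None, verbose=False):
    """Submit sequences to an SVD model and block until the results arrive.

    Hypothetical helper: uses only the documented embed() and wait() calls.
    """
    # The reference documents sequences as List[bytes]; encode str inputs.
    seqs = [s.encode() if isinstance(s, str) else s for s in sequences]
    future = svd_model.embed(seqs)  # a future, not the embeddings themselves
    # wait() polls the job and returns the reduced embeddings as a list of arrays.
    return future.wait(timeout=timeout, verbose=verbose)


# Usage, assuming svd_model came from a completed fit:
# embeddings = embed_and_wait(svd_model, [b"MKTAYIAK", b"MKTAYLAK"], timeout=600)
```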
 
 - get(verbose=False)[source]#
- Get all the SVD reduced embeddings from the job. - Return type:
- list[ndarray] 
 
 - get_item(sequence)[source]#
- Get SVD embeddings for specified sequence. - Parameters:
- sequence (bytes) – Sequence to fetch SVD embeddings for. 
- Returns:
- SVD embeddings represented as a NumPy array. 
- Return type:
- np.ndarray 
 
 - cancelled()#
- Check if the job has been cancelled. - Returns:
- True if the job is cancelled, False otherwise. 
- Return type:
- bool 
 
 - property created_date: datetime#
- The creation timestamp of the job. 
 - done()#
- Check if the job has completed. - Returns:
- True if the job is done, False otherwise. 
- Return type:
- bool 
 
 - property end_date: datetime | None#
- The end timestamp of the job. 
 - property id#
- The unique identifier of the job. 
 - property job_id: str#
- The unique identifier of the job. 
 - property job_type: str#
- The type of the job. 
 - property progress_counter: int#
- The progress counter of the job. 
 - refresh()#
- Refresh the job status and internal job object. 
 - property start_date: datetime | None#
- The start timestamp of the job. 
 - property status: JobStatus#
- The current status of the job. 
 - stream()#
- Retrieve results for this job as a stream. - Returns:
- A generator that yields (key, value) tuples. 
- Return type:
- Generator 
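Since stream() yields (key, value) tuples, results can be consumed one at a time instead of all at once. The sketch below gathers them into a dictionary; collect_stream and the on_item callback are hypothetical names, and the keys are assumed to be hashable, which this reference does not specify.

```python
def collect_stream(future, on_item=None):
    """Consume future.stream() and return its (key, value) pairs as a dict.

    Hypothetical helper; assumes the yielded keys are hashable.
    """
    results = {}
    for key, value in future.stream():
        if on_item is not None:
            on_item(key, value)  # e.g. log progress or write to disk
        results[key] = value
    return results


# Usage, assuming `future` is an SVDEmbeddingsResultFuture for a finished job:
# by_key = collect_stream(future, on_item=lambda k, v: print(k))
```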
 
 - wait_until_done(interval=5, timeout=None, verbose=False)#
- Wait for the job to complete. - Parameters:
- interval (float, optional) – Time in seconds between polls. Defaults to config.POLLING_INTERVAL (shown as 5 in the signature above). 
- timeout (int, optional) – Maximum time in seconds to wait. Defaults to None. 
- verbose (bool, optional) – Verbosity flag. Defaults to False. 
 
- Returns:
- True if the job completed successfully. 
- Return type:
- bool 
 - Notes - This method does not fetch the job results, unlike wait(). 
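To illustrate what interval and timeout mean for this kind of waiting, here is a minimal polling sketch built only on the documented refresh() and done() calls. It is not the library's implementation: poll_until_done, the injected sleep and clock parameters, and the choice to raise TimeoutError are all assumptions made for this example.

```python
import time


def poll_until_done(job, interval=5.0, timeout=None, sleep=time.sleep, clock=time.monotonic):
    """Poll a job until done() is True (hypothetical helper, not library code).

    Raises TimeoutError if `timeout` seconds pass first. Like
    wait_until_done(), it does not fetch the job results.
    """
    start = clock()
    while True:
        job.refresh()  # update the job status from the service
        if job.done():
            return True
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError(f"job not done after {timeout} seconds")
        sleep(interval)  # wait before the next poll


# Usage, assuming `future` is a job future from this module:
# poll_until_done(future, interval=5, timeout=600)
# embeddings = future.get()
```

Injecting sleep and clock keeps the loop easy to test without real delays.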
 
Classes
- class openprotein.svd.SVDMetadata(*, id, status, created_date=None, model_id, n_components, reduction=None, sequence_length=None)[source]#
- Parameters:
- id (str) 
- status (JobStatus) 
- created_date (datetime | None) 
- model_id (str) 
- n_components (int) 
- reduction (str | None) 
- sequence_length (int | None) 
 
 
- class openprotein.svd.SVDFitJob(*, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)[source]#
- Parameters:
- job_id (str) 
- job_type (Literal[JobType.svd_fit]) 
- status (JobStatus) 
- created_date (datetime) 
- start_date (datetime | None) 
- end_date (datetime | None) 
- prerequisite_job_id (str | None) 
- progress_message (str | None) 
- progress_counter (int | None) 
- sequence_length (int | None) 
- extra_data (Any) 
 
 
- class openprotein.svd.SVDEmbeddingsJob(*, num_records=None, job_id, job_type, status, created_date, start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=None, sequence_length=None, **extra_data)[source]#
- Parameters:
- num_records (int | None) 
- job_id (str) 
- job_type (Literal[JobType.svd_embed]) 
- status (JobStatus) 
- created_date (datetime) 
- start_date (datetime | None) 
- end_date (datetime | None) 
- prerequisite_job_id (str | None) 
- progress_message (str | None) 
- progress_counter (int | None) 
- sequence_length (int | None) 
- extra_data (Any)