Foundation Models
The Foundation Models API provided by OpenProtein.AI allows you to generate state-of-the-art protein sequence embeddings from both proprietary and open-source models.
You can list the available models with /embeddings/models and view a model summary (including output dimensions, citations, and more) with /embeddings/model/{model_id}.
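As a rough illustration of the two endpoints above, the following sketch builds the request URLs and fetches the JSON responses using only the Python standard library. The base URL is a placeholder, and the bearer-token header, the helper names, and the assumption that responses are JSON are not taken from this page; check them against the actual API before use.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote

# Assumption: placeholder API root; replace with the real base URL.
BASE_URL = "https://example.invalid/api/v1"


def models_url(base_url: str = BASE_URL) -> str:
    """URL of the endpoint that lists the available models."""
    return f"{base_url.rstrip('/')}/embeddings/models"


def model_url(model_id: str, base_url: str = BASE_URL) -> str:
    """URL of the endpoint that returns the summary for one model."""
    # Percent-encode the ID so it is always a single path segment.
    return f"{base_url.rstrip('/')}/embeddings/model/{quote(model_id, safe='')}"


def get_json(url: str, token: Optional[str] = None):
    """GET a URL and decode the body as JSON.

    Assumption: the API accepts a bearer token and returns JSON;
    neither detail is confirmed by this page.
    """
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With a real base URL and credentials, `get_json(models_url())` would return the model list, and `get_json(model_url(model_id))` the summary for a single model.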
Currently, we support the following models:
PoET: An OpenProtein.AI conditional protein language model that enables embedding, scoring, and generating sequences conditioned on an input protein family of interest. Reference
Prot-seq: An OpenProtein.AI masked protein language model (~300M parameters) trained on UniRef50 with contact and secondary structure prediction as secondary objectives. This model utilizes random Fourier position embeddings and FlashAttention to enable fast inference. It has a max sequence length of 1024, with dimension 1024. It supports attn, embed, logits as output types.
Rotaprot-large-uniref50w: An OpenProtein.AI masked protein language model (~900M parameters) trained on UniRef100 with sequences weighted inversely proportional to the number of UniRef50 homologs. This model uses rotary relative position embeddings and FlashAttention to enable fast inference. It has a max sequence length of 1024, with dimension 1536. It supports attn, embed, logits as output types.
Rotaprot-large-uniref90-ft: A version of our proprietary rotaprot-large-uniref50w finetuned on UniRef100 with sequences weighted inversely proportional to the number of UniRef90 cluster members. It has a max sequence length of 1024, with dimension 1536. It supports attn, embed, logits as output types.
ESM1 Models: Community-based ESM1 models, including: esm1b_t33_650M_UR50S, esm1v_t33_650M_UR90S_1, esm1v_t33_650M_UR90S_2, esm1v_t33_650M_UR90S_3, esm1v_t33_650M_UR90S_4, esm1v_t33_650M_UR90S_5. These are based on the ESM1 language model, with different versions having different model parameters and training data. GitHub link, ESM1b reference, ESM1v reference. Licensed under the MIT License.
ESM2 Models: Community-based ESM2 models, including: esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D. These models are based on the ESM2 language model, with different versions having different model parameters and training data. GitHub link, Reference. Licensed under the MIT License.
ProtTrans Models: Transformer-based models from RostLab, including: prot_t5_xl_half_uniref50-enc. These models are based on the ProtTrans models, with different versions having different transformer-based architectures, model parameters and precisions, as well as different training datasets. GitHub link, Reference. Licensed under the Academic Free License v3.0.
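The limits quoted above for the proprietary masked language models can be checked on the client before a request is sent. The sketch below is only an illustration: the lowercase model IDs, the `ModelSpec` and `check_request` names, and the choice to count the maximum length in residues (with no allowance for special tokens) are assumptions. The summary returned by /embeddings/model/{model_id} remains the authoritative source for these values.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ModelSpec:
    max_length: int               # maximum sequence length stated above
    dimension: int                # "dimension" stated above
    output_types: Tuple[str, ...]


_OUTPUTS = ("attn", "embed", "logits")

# Values copied from the model descriptions above.
# Assumption: model IDs are the lowercase forms of the listed names.
MODEL_SPECS = {
    "prot-seq": ModelSpec(1024, 1024, _OUTPUTS),
    "rotaprot-large-uniref50w": ModelSpec(1024, 1536, _OUTPUTS),
    "rotaprot-large-uniref90-ft": ModelSpec(1024, 1536, _OUTPUTS),
}


def check_request(model_id: str, sequence: str, output_type: str = "embed") -> ModelSpec:
    """Raise if the request exceeds the limits listed for the model."""
    spec = MODEL_SPECS.get(model_id)
    if spec is None:
        raise KeyError(f"no local spec for model {model_id!r}")
    if output_type not in spec.output_types:
        raise ValueError(f"{model_id} does not list output type {output_type!r}")
    # Assumption: the limit is counted in residues.
    if len(sequence) > spec.max_length:
        raise ValueError(
            f"sequence length {len(sequence)} exceeds the limit of {spec.max_length}"
        )
    return spec
```

For example, `check_request("prot-seq", "MKTAYIAKQR")` returns the spec for prot-seq, while a 1025-residue sequence raises `ValueError`.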