Foundation Models#

The Foundation Models API provided by OpenProtein.AI allows you to generate state-of-the-art protein sequence embeddings from both proprietary and open-source models.

You can list the available models with /embeddings/models and view a model summary (including output dimensions, citations, and more) with /embeddings/models/{model_id}/metadata.
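
For example, here is a minimal sketch using Python's requests library. The endpoint paths are the documented ones above; the base URL and bearer-token header are assumptions, so substitute the values for your account:

```python
# Minimal sketch: list the available models, then fetch one model's metadata.
import requests

BASE_URL = "https://api.openprotein.ai/api/v1"      # assumed base URL
HEADERS = {"Authorization": "Bearer <YOUR_TOKEN>"}  # assumed auth scheme

# List the available embedding models.
models = requests.get(f"{BASE_URL}/embeddings/models", headers=HEADERS).json()
print(models)

# View a model summary (output dimensions, citations, and more).
meta = requests.get(
    f"{BASE_URL}/embeddings/models/prot-seq/metadata", headers=HEADERS
).json()
print(meta)
```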

Currently, we support the following models:

  • PoET: An OpenProtein.AI conditional protein language model that enables embedding, scoring, and generating sequences conditioned on an input protein family of interest. Reference

  • PoET-2: An OpenProtein.AI conditional and multi-modal protein language model that enables embedding, scoring, and generating sequences conditioned on an input protein family of interest.

  • Prot-seq: An OpenProtein.AI masked protein language model (~300M parameters) trained on UniRef50 with contact and secondary structure prediction as secondary objectives. This model uses random Fourier position embeddings and FlashAttention to enable fast inference. It has a maximum sequence length of 1024 and an embedding dimension of 1024. It supports attn, embed, and logits as output types (an example request is sketched after this list).

  • Rotaprot-large-uniref50w: An OpenProtein.AI masked protein language model (~900M parameters) trained on UniRef100 with sequences weighted inversely proportional to the number of UniRef50 homologs. This model uses rotary relative position embeddings and FlashAttention to enable fast inference. It has a maximum sequence length of 1024 and an embedding dimension of 1536. It supports attn, embed, and logits as output types.

  • Rotaprot-large-uniref90-ft: A version of our proprietary rotaprot-large-uniref50w fine-tuned on UniRef100 with sequences weighted inversely proportional to the number of UniRef90 cluster members. It has a maximum sequence length of 1024 and an embedding dimension of 1536. It supports attn, embed, and logits as output types.

  • ESM1 Models: Community-based ESM1 models, including esm1b_t33_650M_UR50S, esm1v_t33_650M_UR90S_1, esm1v_t33_650M_UR90S_2, esm1v_t33_650M_UR90S_3, esm1v_t33_650M_UR90S_4, and esm1v_t33_650M_UR90S_5. These are based on the ESM1 language model, with different versions having different model parameters and training data. GitHub link, ESM1b reference, ESM1v reference. Licensed under the MIT License.

  • ESM2 Models: Community-based ESM2 models, including esm2_t6_8M_UR50D, esm2_t12_35M_UR50D, esm2_t30_150M_UR50D, esm2_t33_650M_UR50D, and esm2_t36_3B_UR50D. These models are based on the ESM2 language model, with different versions having different model parameters and training data. GitHub link, Reference. Licensed under the MIT License.

  • ProtTrans Models: Transformer-based models from RostLab, including prot_t5_xl_half_uniref50-enc. These models are based on the ProtTrans models, with different versions having different transformer-based architectures, model parameters, and precisions, as well as different training datasets. GitHub link, Reference. Licensed under the Academic Free License v3.0.
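
Requesting embeddings follows the same pattern as the metadata calls above, but note that the submit endpoint and request body are not documented in this section; the path and JSON shape in the sketch below are assumptions, not documented API surface:

```python
# Hypothetical sketch of submitting sequences for embedding. The endpoint
# path and request body below are assumptions, not documented API surface.
import requests

BASE_URL = "https://api.openprotein.ai/api/v1"      # assumed base URL
HEADERS = {"Authorization": "Bearer <YOUR_TOKEN>"}  # assumed auth scheme

resp = requests.post(
    f"{BASE_URL}/embeddings/models/prot-seq/embed",  # hypothetical path
    headers=HEADERS,
    json={"sequences": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]},  # hypothetical body
)
print(resp.json())  # embedding computations likely run as asynchronous jobs
```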

Endpoints#

embeddings

Run computations with our embedding models.

svd

Fit SVDs to produce reduced embeddings.
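
To illustrate what an SVD buys you here, the sketch below is a local illustration of the concept (not an API call): a fitted truncated SVD projects full-dimension embeddings down to a smaller, cheaper representation.

```python
# Local illustration of SVD-reduced embeddings (not an API call): fit a
# truncated SVD on full-dimension embeddings, then project them down.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 1024))  # e.g. 100 sequences x 1024 dims

svd = TruncatedSVD(n_components=64)        # target reduced dimension
reduced = svd.fit_transform(embeddings)    # shape: (100, 64)
print(reduced.shape)
```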

openprotein

Proprietary protein language models developed in-house.

poet

OpenProtein-developed conditional protein language model that enables embedding, scoring, and generating sequences conditioned on an input protein family of interest.

Maximum Sequence Length: 4096

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V
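\
Since the input alphabet and maximum sequence length are fixed, sequences can be validated client-side before submission. A minimal sketch follows; the helper function is ours, not part of the API:

```python
# Sketch: validate a sequence against poet's documented input alphabet
# and maximum sequence length before submitting it.
POET_INPUT_TOKENS = set("ARNDCQEGHILKMFPSTWYVXOUBZ-")
POET_MAX_LEN = 4096

def check_sequence(seq: str) -> None:
    # Hypothetical helper, not part of the API.
    bad = set(seq.upper()) - POET_INPUT_TOKENS
    if bad:
        raise ValueError(f"unsupported tokens: {sorted(bad)}")
    if len(seq) > POET_MAX_LEN:
        raise ValueError(f"sequence exceeds {POET_MAX_LEN} residues")

check_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")  # passes silently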

poet-2

OpenProtein-developed conditional and multi-modal protein language model that enables embedding, scoring, and generating sequences conditioned on an input protein family of interest.

Maximum Sequence Length: 4096

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V

prot-seq

Masked protein language model (~300M parameters) trained on UniRef50 with contact and secondary structure prediction as secondary objectives. Uses random Fourier position embeddings and FlashAttention to enable fast inference.

Maximum Sequence Length: 4096

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V

rotaprot-large-uniref50w

Masked protein language model (~900M parameters) trained on UniRef100 with sequences weighted inversely proportional to the number of UniRef50 homologs. Uses rotary relative position embeddings and FlashAttention to enable fast inference.

Maximum Sequence Length: 4096

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V

rotaprot-large-uniref90-ft

rotaprot-large-uniref50w finetuned on UniRef100 with sequences weighted inversely proportional to the number of UniRef90 cluster members.

Maximum Sequence Length: 4096

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V

esm1

Community-based ESM1 models, with different versions having different model parameters and training data.

esm1b_t33_650M_UR50S

ESM1b model with 650M parameters

Maximum Sequence Length: 1022

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X
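
If you request logits from an ESM model, the vocabulary listed above gives one way to decode them. The sketch below assumes the last dimension of the logits is ordered exactly as the output tokens above; that ordering is an assumption:

```python
# Sketch: map argmax logits back to tokens, assuming the logits' last
# dimension is ordered exactly as the vocabulary listed above.
import numpy as np

ESM_VOCAB = [
    "<cls>", "<pad>", "<eos>", "<unk>", "L", "A", "G", "V", "S", "E",
    "R", "T", "I", "D", "P", "K", "Q", "N", "F", "Y", "M", "H", "W",
    "C", "<null_0>", "B", "U", "Z", "O", ".", "-", "<null_1>", "X",
]

logits = np.random.randn(5, len(ESM_VOCAB))  # (positions, vocab) placeholder
tokens = [ESM_VOCAB[i] for i in logits.argmax(-1)]
print(tokens)
```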

esm1v_t33_650M_UR90S_1

ESM1v model with 650M parameters

Maximum Sequence Length: 1022

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm1v_t33_650M_UR90S_2

ESM1v model with 650M parameters

Maximum Sequence Length: 1022

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm1v_t33_650M_UR90S_3

ESM1v model with 650M parameters

Maximum Sequence Length: 1022

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm1v_t33_650M_UR90S_4

ESM1v model with 650M parameters

Maximum Sequence Length: 1022

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm1v_t33_650M_UR90S_5

ESM1v model with 650M parameters

Maximum Sequence Length: 1022

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm2

Community-based ESM2 models, with different versions having different model parameters and training data.

esm2_t6_8M_UR50D

ESM2 model with 8M parameters

Maximum Sequence Length: 4094

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm2_t12_35M_UR50D

ESM2 model with 35M parameters

Maximum Sequence Length: 4094

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm2_t30_150M_UR50D

ESM2 model with 150M parameters

Maximum Sequence Length: 4094

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm2_t33_650M_UR50D

ESM2 model with 650M parameters

Maximum Sequence Length: 4094

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

esm2_t36_3B_UR50D

ESM2 model with 3B parameters

Maximum Sequence Length: 4094

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: <cls>,<pad>,<eos>,<unk>,L,A,G,V,S,E,R,T,I,D,P,K,Q,N,F,Y,M,H,W,C,<null_0>,B,U,Z,O,.,-,<null_1>,X

prottrans

Community-based ProtTrans models.

prott5-xl

ProtTrans T5-XL encoder model (prot_t5_xl_half_uniref50-enc) from RostLab, served in half precision.

Maximum Sequence Length: 4096

Input Tokens: A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,X,O,U,B,Z,-

Output Tokens: none (encoder-only model)