Using AbLang2#
This tutorial shows you how to use the community AbLang2 model (AbLang2Model) to embed antibody sequences and recover residue-level likelihoods. AbLang2 is an antibody-specific protein language model trained on paired heavy and light variable domains, so it produces representations that are tuned to antibody biology rather than generic protein space.
AbLang2 is most useful when you are:
working with paired antibody variable regions (VH + VL) and want a representation that respects chain context,
restoring or scoring missing or uncertain residues (e.g. residues that came in as ambiguous from sequencing),
ranking candidate mutations for antibody design where germline bias in generic PLMs would otherwise dominate.
Upstream reference: oxpig/AbLang2. Olsen, Moal, and Deane, Addressing the antibody germline bias and its effect on language models for improved antibody design, bioRxiv 2024 (10.1101/2024.02.02.578678).
Input format: <VH>:<VL>#
AbLang2 always expects a paired-chain input. Heavy and light chains are concatenated with a single colon (:) delimiter:
<VH-sequence>:<VL-sequence>
The colon is a real token in the AbLang2 vocabulary — it is not a residue and it is not stripped by the backend. Two consequences fall out of this:
Single-chain inputs still need the colon. If you only have a heavy chain, send <VH>: (trailing colon, empty light side). If you only have a light chain, send :<VL> (leading colon, empty heavy side). Submitting a bare sequence without a colon will be rejected as malformed antibody input.
Order matters. The heavy chain always goes on the left of the colon, the light chain on the right. Swapping them will silently produce embeddings that are biologically meaningless, because the chain-position priors AbLang2 learned during training no longer line up.
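Before submitting, you can guard against malformed input with a small client-side check. This is a minimal sketch, not part of the SDK — validate_paired is a hypothetical helper, and the allowed character set assumes the vocabulary listed later in this tutorial (20 amino acids plus -, X, and the colon delimiter):

```python
def validate_paired(seq: str) -> tuple[str, str]:
    """Check the <VH>:<VL> format and return (heavy, light).

    Either side may be empty (single-chain input), but the colon
    delimiter must appear exactly once.
    """
    if seq.count(":") != 1:
        raise ValueError("expected exactly one ':' separating VH and VL")
    heavy, light = seq.split(":")
    allowed = set("ACDEFGHIKLMNPQRSTVWY-X")  # assumed token set
    for name, chain in (("heavy", heavy), ("light", light)):
        bad = set(chain) - allowed
        if bad:
            raise ValueError(f"invalid characters in {name} chain: {bad}")
    return heavy, light

vh, vl = validate_paired("EVQLV:DIQMT")  # both chains present
validate_paired("EVQLV:")                # heavy-only is also valid
```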
Masking residues with X#
Across the OpenProtein platform, X in an input sequence means “mask this position” — the model treats it as a position to predict rather than as an unknown amino acid. For AbLang2 this is what you want when computing residue likelihoods: X is rewritten to the ablang mask token (*) before tokenization, so logits at that position are masked-LM predictions over the full vocabulary, matching what AbLang2’s pseudo_log_likelihood mode does locally.
masked = f"X{vh[1:]}:{vl}" # mask the first heavy-chain residue
This convention is consistent across embedding models on the platform — you don’t need to look up each model’s mask character.
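For masking an arbitrary position rather than just the first residue, a small helper keeps the indexing honest. This is a sketch — mask_position is a hypothetical convenience function, and the index here runs over the full <VH>:<VL> string, colon included:

```python
def mask_position(paired: str, i: int) -> str:
    """Replace the residue at index i of a <VH>:<VL> string with X.

    The index counts over the whole paired string, so the colon
    delimiter itself is off-limits.
    """
    if paired[i] == ":":
        raise ValueError("cannot mask the chain delimiter")
    return paired[:i] + "X" + paired[i + 1 :]

masked = mask_position("EVQLV:DIQMT", 2)  # -> "EVXLV:DIQMT"
```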
What you need before getting started#
You need one or more antibody sequences expressed in the <VH>:<VL> format described above. The example below uses trastuzumab (Herceptin) variable regions:
[1]:
import openprotein
# Login to your session
session = openprotein.connect()
# Trastuzumab VH and VL.
vh = "EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCSRWGGDGFYAMDYWGQGTLVTVSS"
vl = "DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSRSGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGQGTKVEIK"
paired = f"{vh}:{vl}"
heavy_only = f"{vh}:" # trailing colon — heavy chain alone
light_only = f":{vl}" # leading colon — light chain alone
Getting the model#
Create the AbLang2Model object from your session:
[2]:
ablang2 = session.embedding.ablang2
help(ablang2)
Help on AbLang2Model in module openprotein.embeddings.ablang:
ablang2
AbLang2 foundational model for antibodies.
max_sequence_length = 4096
supported outputs = ['embed', 'logits']
supported tokens = ['M', 'R', 'H', 'K', 'D', 'E', 'S', 'T', 'N', 'Q', 'C', 'G', 'P', 'A', 'V', 'I', 'F', 'Y', 'W', 'L', '-', 'X', ':']
The model object exposes its metadata: maximum sequence length, the supported output types (embed(), logits(), etc.), and the full vocabulary — note the : token in supported tokens, which is what makes paired-chain input work.
Embedding a paired antibody#
Submit your <VH>:<VL> sequence to embed(). As with other embedding models you can pass a list of sequences to batch the request, and an optional reduction (defaulting to MEAN over residues):
[3]:
future = ablang2.embed(sequences=[paired.encode()])
future
[3]:
EmbeddingsJob(num_records=1, job_id='f95c3911-7329-4db4-b34b-e0c150f3e424', job_type=<JobType.embeddings_embed: '/embeddings/embed'>, status=<JobStatus.PENDING: 'PENDING'>, created_date=datetime.datetime(2026, 5, 7, 17, 28, 59, 495307, tzinfo=TzInfo(0)), start_date=None, end_date=None, prerequisite_job_id=None, progress_message=None, progress_counter=0, sequence_length=None, failure_message=None)
Wait for the job to complete:
[4]:
future.wait_until_done(verbose=True)
Waiting: 100%|██████████| 100/100 [00:00<00:00, 866.85it/s, status=SUCCESS]
[4]:
True
Fetch the results. The result is a list of (sequence, embedding) tuples. The list is not guaranteed to be in the same order as the input — check the returned sequence:
[5]:
results = future.get()
returned_sequence, embedding = results[0]
print(returned_sequence)
print("Embedding shape:", embedding.shape)
b'EVQLVESGGGLVQPGGSLRLSCAASGFNIKDTYIHWVRQAPGKGLEWVARIYPTNGYTRYADSVKGRFTISADTSKNTAYLQMNSLRAEDTAVYYCSRWGGDGFYAMDYWGQGTLVTVSS:DIQMTQSPSSLSASVGDRVTITCRASQDVNTAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSRSGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGQGTKVEIK'
Embedding shape: (480,)
With the default mean reduction the embedding is a single vector per input. If you want per-residue embeddings instead, pass reduction=None:
[6]:
per_residue = ablang2.embed(
    sequences=[paired.encode()],
    reduction=None,
).wait()
seq, emb = per_residue[0]
print("Per-residue shape:", emb.shape)
Per-residue shape: (228, 480)
The length axis includes the : position, so per-residue outputs align 1:1 with the input string (heavy residues, the chain delimiter, then light residues).
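Because the rows line up 1:1 with the input string, you can split a per-residue matrix back into heavy- and light-chain blocks by locating the colon. A minimal sketch, using a zero-filled dummy array in place of a real embedding result:

```python
import numpy as np

def split_per_residue(paired: str, emb: np.ndarray):
    """Split an (L, D) per-residue embedding at the colon position.

    Returns (heavy_block, light_block); the delimiter row is dropped.
    """
    i = paired.index(":")
    return emb[:i], emb[i + 1 :]

paired = "EVQLV:DIQMT"                   # toy 5+5 residue pair
emb = np.zeros((len(paired), 480))        # stand-in for a real (228, 480) result
heavy_emb, light_emb = split_per_residue(paired, emb)
print(heavy_emb.shape, light_emb.shape)   # (5, 480) (5, 480)
```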
Embedding a single chain#
For heavy-only or light-only inputs you keep the colon as a positional marker for the missing chain:
[7]:
heavy_future = ablang2.embed(sequences=[heavy_only.encode()])
light_future = ablang2.embed(sequences=[light_only.encode()])
heavy_future.wait_until_done(verbose=True)
light_future.wait_until_done(verbose=True)
heavy_seq, heavy_emb = heavy_future.get()[0]
light_seq, light_emb = light_future.get()[0]
print("Heavy-only embedding:", heavy_emb.shape)
print("Light-only embedding:", light_emb.shape)
Waiting: 100%|██████████| 100/100 [00:00<00:00, 949.20it/s, status=SUCCESS]
Waiting: 100%|██████████| 100/100 [00:00<00:00, 941.23it/s, status=SUCCESS]
Heavy-only embedding: (480,)
Light-only embedding: (480,)
Note that heavy-only and light-only embeddings live in the same embedding space as paired embeddings, but they are not directly comparable to a paired embedding for the same antibody — the model sees a different conditioning context in each case.
Residue likelihoods (logits)#
AbLang2 also exposes per-position logits via logits(), which are useful for scoring mutations or restoring uncertain residues. Logits are returned as raw (un-normalised) values; pass them through softmax / log_softmax when you need probabilities.
[8]:
logits_future = ablang2.logits(sequences=[paired.encode()])
logits_future.wait_until_done(verbose=True)
returned_sequence, logits = logits_future.get()[0]
print("Logits shape:", logits.shape)
print("Vocabulary:", ablang2.metadata.output_tokens)
Waiting: 100%|██████████| 100/100 [00:00<00:00, 932.01it/s, status=SUCCESS]
Logits shape: (228, 26)
Vocabulary: ['<', 'M', 'R', 'H', 'K', 'D', 'E', 'S', 'T', 'N', 'Q', 'C', 'G', 'P', 'A', 'V', 'I', 'F', 'Y', 'W', 'L', '-', 'X', ':', 'X', ':']
The first axis runs over positions in the input — including the : position — and the second axis runs over the AbLang2 vocabulary.
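To turn a row of logits into residue probabilities, apply a log-softmax over the vocabulary axis and index into it with the vocabulary list. The sketch below uses a random dummy logits array and a placeholder alphabet; in practice the real vocabulary order comes from ablang2.metadata.output_tokens and the logits from logits_future.get():

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

# Dummy stand-ins: 4 positions, 26-token vocabulary.
vocab = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")   # placeholder ordering only
logits = np.random.randn(4, len(vocab))

log_probs = log_softmax(logits)               # (4, 26); rows sum to 1 in prob space
pos, residue = 2, "K"
score = log_probs[pos, vocab.index(residue)]  # log-likelihood of K at position 2
```

Summing the log-probabilities of the wild-type residues at masked positions gives a pseudo-log-likelihood style score you can use to rank candidate mutations.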
Next steps#
Fit an SVD on AbLang2 embeddings to get a compact feature for downstream regression.
Train a property regression model on antibody assay data using AbLang2 as the feature backbone.
Browse other foundation models in the Foundation Models index.
For more information, visit the Embeddings API reference.