openprotein.protein.Protein

class openprotein.protein.Protein(sequence=None, coordinates=None, plddt=None, name=None)

Represents a protein with optional sequence, atomic coordinates, per-residue confidence scores (pLDDT), and name.

This class supports partial or complete information: users may initialize a Protein with only a sequence, only a structure, or both. The class ensures that all provided fields have consistent residue-level lengths and provides convenient methods for indexing, masking, and structural comparisons.

sequence

Amino acid sequence as bytes. Unknown or masked residues are represented as b"X".

coordinates

An array containing the 3D coordinates of the protein's heavy atoms in atom37 format. It has shape (L, 37, 3), where L is the number of residues, 37 is the number of heavy-atom slots per residue, and 3 is the number of coordinates (x, y, and z).

plddt

An array of shape (L,). For predicted structures, this contains the pLDDT of each residue, which is a measure of prediction confidence. For experimental structures, each entry should be set to 100 if the coordinates of that residue's alpha carbon are known, and NaN otherwise.

name

Optional identifier for the protein as a string.
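
The pLDDT convention for experimental structures can be sketched with NumPy. This is a minimal illustration, not the library's own code: experimental_plddt is a hypothetical helper, and placing the alpha carbon in slot 1 of the atom37 axis is an assumption based on the common AlphaFold ordering (N, CA, C, CB, O, ...).

```python
import numpy as np

# Assumption: atom37 slots are ordered N, CA, C, CB, O, ... so CA is slot 1.
CA_INDEX = 1


def experimental_plddt(coordinates: np.ndarray) -> np.ndarray:
    """Return 100 where the alpha carbon is resolved and NaN elsewhere.

    Hypothetical helper; `coordinates` has shape (L, 37, 3).
    """
    # A residue counts as "known" when none of its CA x/y/z values are NaN.
    ca_known = ~np.isnan(coordinates[:, CA_INDEX, :]).any(axis=-1)
    return np.where(ca_known, 100.0, np.nan).astype(np.float32)


# Three residues; only the first two have a resolved alpha carbon.
coords = np.full((3, 37, 3), np.nan, dtype=np.float32)
coords[0, CA_INDEX] = [0.0, 0.0, 0.0]
coords[1, CA_INDEX] = [3.8, 0.0, 0.0]

plddt = experimental_plddt(coords)
```

Starting from an all-NaN array keeps every unresolved atom marked as missing by default, which matches the NaN convention for missing structural data.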

Conventions:
  • Missing or unknown residues in the sequence are denoted by b"X".

  • Missing structural data (coordinates or pLDDT) are represented by NaN.

  • Residue indices are 1-based for user-facing methods (e.g., mask_sequence_at), but internally stored as 0-based arrays.
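
As a rough illustration of the indexing and masking conventions above, the sketch below converts 1-based positions to 0-based offsets and writes b"X" at those offsets. The function mask_sequence is hypothetical and written for this example only; it is not the implementation of mask_sequence_at, and the out-of-range check is an assumption about reasonable behavior.

```python
def mask_sequence(sequence: bytes, positions: list[int]) -> bytes:
    """Replace the residues at the given 1-based positions with b"X".

    Hypothetical helper that mirrors the documented conventions only.
    """
    residues = bytearray(sequence)
    for pos in positions:
        # Assumption: positions outside 1..L are treated as an error.
        if not 1 <= pos <= len(residues):
            raise IndexError(f"position {pos} is outside 1..{len(residues)}")
        # User-facing position (1-based) -> storage offset (0-based).
        residues[pos - 1] = ord("X")
    return bytes(residues)


# Mask the first and third residues of a nine-residue sequence.
masked = mask_sequence(b"ACDEFGHIK", [1, 3])
```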

Examples

Create a Protein from sequence only:

Protein(sequence="ACDEFGHIK")

Create a Protein from sequence and name:

Protein(sequence="ACDEFGHIK", name="my_protein")

Create a Protein with sequence and structure:

Protein(sequence="ACD", coordinates=coords_array, plddt=plddt_array)
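
The names coords_array and plddt_array in the last example are not defined on this page. One way to build arrays with the documented shapes and dtype is sketched below. The numeric values are made up for illustration, and writing the alpha carbon into slot 1 assumes the common atom37 ordering (N, CA, C, CB, O, ...).

```python
import numpy as np

sequence = "ACD"
length = len(sequence)

# Start from NaN so that every atom and score is "missing" until filled in.
coords_array = np.full((length, 37, 3), np.nan, dtype=np.float32)
plddt_array = np.full((length,), np.nan, dtype=np.float32)

# Illustrative values only. Assumption: slot 1 of the atom37 axis is CA.
CA_INDEX = 1
coords_array[:, CA_INDEX, :] = [
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.0, 0.0],
]
plddt_array[:] = [92.5, 88.0, 71.3]

# These arrays could then be passed to the constructor shown above:
# Protein(sequence=sequence, coordinates=coords_array, plddt=plddt_array)
```

All three inputs share the same residue-level length (3), which is the consistency the constructor requires.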

Raises:
  • ValueError – If sequence, coordinates, or pLDDT are specified with inconsistent lengths.

  • ValueError – If none of sequence, coordinates, or pLDDT are provided.
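
The two error conditions above amount to a length-consistency check across whichever fields are supplied. The sketch below shows one way such a check might look. check_lengths is a hypothetical function and the error messages are invented; the actual constructor may validate differently.

```python
from typing import Optional

import numpy as np


def check_lengths(
    sequence: Optional[bytes],
    coordinates: Optional[np.ndarray],
    plddt: Optional[np.ndarray],
) -> int:
    """Return the shared residue count, or raise ValueError.

    Hypothetical helper mirroring the documented Raises conditions.
    """
    lengths = {}
    if sequence is not None:
        lengths["sequence"] = len(sequence)
    if coordinates is not None:
        lengths["coordinates"] = coordinates.shape[0]
    if plddt is not None:
        lengths["plddt"] = plddt.shape[0]

    if not lengths:
        raise ValueError("provide at least one of sequence, coordinates, or plddt")
    if len(set(lengths.values())) > 1:
        raise ValueError(f"inconsistent lengths: {lengths}")
    return next(iter(lengths.values()))


# Consistent inputs: a 3-residue sequence with matching coordinates.
n_residues = check_lengths(b"ACD", np.zeros((3, 37, 3), dtype=np.float32), None)
```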

Parameters:
  • sequence (bytes | str | None)

  • coordinates (ndarray[tuple[Any, ...], dtype[float32]] | None)

  • plddt (ndarray[tuple[Any, ...], dtype[float32]] | None)

  • name (bytes | str | None)

__init__(sequence=None, coordinates=None, plddt=None, name=None)
Parameters:
  • sequence (bytes | str | None)

  • coordinates (ndarray[tuple[Any, ...], dtype[float32]] | None)

  • plddt (ndarray[tuple[Any, ...], dtype[float32]] | None)

  • name (bytes | str | None)

Methods

__init__([sequence, coordinates, plddt, name])

at(positions)

Return a new Protein object containing the residues at the given 1-indexed positions.

from_filepath(path, chain_id[, ...])

Create a Protein from a structure file.

from_string(filestring, format, chain_id[, ...])

from_structure(structure, chain_id[, ...])

make_cif_string()

make_fasta_bytes()

make_pdb_string()

mask_sequence_at(positions)

Mask the sequence at the given 1-indexed positions.

mask_sequence_except_at(positions)

Mask sequence at all positions except the given 1-indexed positions.

mask_structure_at(positions)

Mask the structure at the given 1-indexed positions.

mask_structure_except_at(positions)

Mask structure at all positions except the given 1-indexed positions.

rmsd(tgt[, backbone_only])

Compute the root-mean-square deviation (RMSD) between this Protein and a target Protein.
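
For orientation, a plain (unaligned) coordinate RMSD over two atom37 arrays can be sketched as follows. This is not the library's rmsd method: this page does not say whether rmsd superimposes the structures first or which atoms backbone_only selects. The function coordinate_rmsd is hypothetical, and treating slots 0, 1, 2, and 4 as the backbone atoms N, CA, C, and O is an assumption based on the common atom37 ordering.

```python
import numpy as np

# Assumption: backbone atoms N, CA, C, O occupy atom37 slots 0, 1, 2, 4.
BACKBONE_SLOTS = [0, 1, 2, 4]


def coordinate_rmsd(a: np.ndarray, b: np.ndarray, backbone_only: bool = False) -> float:
    """Unaligned RMSD over atoms present (non-NaN) in both (L, 37, 3) arrays.

    Hypothetical helper; no superposition is performed.
    """
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    if backbone_only:
        a = a[:, BACKBONE_SLOTS]
        b = b[:, BACKBONE_SLOTS]

    # An atom is usable only if all of x, y, z are known in both structures.
    present = ~(np.isnan(a).any(axis=-1) | np.isnan(b).any(axis=-1))
    if not present.any():
        return float("nan")

    squared_dist = ((a[present] - b[present]) ** 2).sum(axis=-1)
    return float(np.sqrt(squared_dist.mean()))


# Two residues with only CA resolved; the second structure is shifted 3 units in x.
ref = np.full((2, 37, 3), np.nan, dtype=np.float32)
ref[:, 1, :] = 0.0
shifted = ref.copy()
shifted[:, 1, 0] = 3.0

value = coordinate_rmsd(ref, shifted, backbone_only=True)
```

Skipping atoms that are NaN in either structure follows the NaN convention for missing structural data; a superposition step (for example, the Kabsch algorithm) would be needed before this calculation if the two structures are not already in the same frame.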

Attributes

chain_id

coordinates

cyclic

has_structure

Whether or not the structure is known at any position in the protein.

msa

name

plddt

sequence