Aligning variable length sequences using Python

This walkthrough shows how to convert a dataset of unaligned sequences into a dataset of aligned sequences using a Python script. We’ll use conda or mamba to set up the Python environment, and MAFFT (Multiple Alignment using Fast Fourier Transform) to create the multiple sequence alignment.

Note

You’ll need to install either conda or mamba.

For this walkthrough, we recommend using conda. If you’re using mamba, replace instances of conda below with mamba.
You’ll also need a dataset in CSV format. This guide uses angle brackets < > to mark where you should insert your own values.
We’ll use the file example_dataset.csv, which contains a fabricated dataset of unaligned sequences. The column containing sequences is named sequence, and the dataset includes three sequences along with measurements for three properties.
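
Before aligning, it can help to confirm that the CSV has the expected sequence column and that the sequences really do differ in length. The following is a minimal sketch using only the Python standard library; the helper name summarize_sequences is hypothetical and is not part of the repository.

```python
import csv
from pathlib import Path


def summarize_sequences(csv_path, sequence_column="sequence"):
    """Return (row_count, sorted distinct sequence lengths) for a CSV dataset.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    with Path(csv_path).open(newline="") as handle:
        reader = csv.DictReader(handle)
        # Fail early if the column name does not match the CSV header
        if sequence_column not in (reader.fieldnames or []):
            raise KeyError(f"column {sequence_column!r} not found in {csv_path}")
        lengths = [len(row[sequence_column]) for row in reader]
    return len(lengths), sorted(set(lengths))
```

Calling summarize_sequences("example_dataset.csv") returns the number of rows and the distinct sequence lengths; more than one distinct length means the dataset is unaligned.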

Creating a Conda Environment and Installing MAFFT

First, clone the GitHub repository:

git clone https://github.com/OpenProteinAI/tool-make-aligned-dataset.git
cd tool-make-aligned-dataset

Now, create a conda environment with the necessary dependencies.

For Linux or non-Apple Silicon Macs:

conda env create -n tool-make-aligned-dataset -f environment.yml

MAFFT will be automatically installed through conda.

For Windows (non-WSL):

conda env create -n tool-make-aligned-dataset -f environment-no-MAFFT.yml
# Then install MAFFT using the instructions on the MAFFT website.

For Apple Silicon:

conda env create -n tool-make-aligned-dataset -f environment-no-MAFFT.yml
brew install mafft
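
Whichever installation route you used, you may want to confirm that MAFFT is reachable before running the script. This sketch assumes the executable is named mafft and is on your PATH, which may not hold for every Windows installation; the helper name find_mafft is hypothetical.

```python
import shutil


def find_mafft(executable="mafft"):
    """Return the full path to the MAFFT executable, or None if it is not on PATH.

    The default name "mafft" is an assumption; pass a different name if your
    installation uses one.
    """
    return shutil.which(executable)


def describe_mafft(executable="mafft"):
    """Build a short human-readable status message about the MAFFT lookup."""
    path = find_mafft(executable)
    if path is None:
        return f"{executable} was not found on PATH"
    return f"{executable} found at {path}"
```

Run print(describe_mafft()) inside the activated environment to see which installation, if any, will be picked up.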

Activating Your Environment and Aligning Your Sequences

Activate the conda environment:

conda activate tool-make-aligned-dataset

Run the make_aligned_dataset.py script on your dataset:

python make_aligned_dataset.py \
    --dataset <path-to-dataset> \
    --sequence_column_name <name-of-column-with-sequences>

For our example_dataset.csv, the command looks like:

python make_aligned_dataset.py \
    --dataset example_dataset.csv \
    --sequence_column_name sequence
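
If you prefer to launch the script from Python, for example to process several datasets in a loop, the same command can be assembled as an argument list. This is a sketch that uses only the two flags shown above; the function names are hypothetical, and it assumes you run it from the repository directory with the environment activated.

```python
import subprocess
import sys


def build_command(dataset, sequence_column, script="make_aligned_dataset.py"):
    """Assemble the argument list for the alignment script."""
    return [
        sys.executable,  # the Python interpreter of the active environment
        script,
        "--dataset", str(dataset),
        "--sequence_column_name", sequence_column,
    ]


def run_alignment(dataset, sequence_column):
    """Run the script and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(dataset, sequence_column), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a dataset path contains spaces.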

The script writes a new file named after the original dataset, with _aligned appended to the base name. In our example, the new file is named example_dataset_aligned.csv.

The aligned dataset is written to the same directory as the input file. It contains the same data as the original example_dataset.csv, except that the sequence column now holds the aligned sequences.
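
The naming rule and a basic sanity check on the result can be sketched as follows. The check relies on a general property of multiple sequence alignments: every aligned sequence has the same length once gap characters are included. Both helper names are hypothetical and are not part of the repository.

```python
import csv
from pathlib import Path


def aligned_path(dataset_path):
    """Apply the naming rule described above: <name>_aligned<ext>, same directory."""
    path = Path(dataset_path)
    return path.with_name(f"{path.stem}_aligned{path.suffix}")


def is_aligned(csv_path, sequence_column="sequence"):
    """Return True if every sequence in the column has the same length."""
    with Path(csv_path).open(newline="") as handle:
        lengths = {len(row[sequence_column]) for row in csv.DictReader(handle)}
    # An empty file or a single distinct length both count as aligned
    return len(lengths) <= 1
```

For the example, is_aligned(aligned_path("example_dataset.csv")) should return True after the script has run, while calling is_aligned on the original file should return False.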