multiple sequence alignment

abutils can perform multiple sequence alignments using MAFFT, MUSCLE, or FAMSA (via the pyfamsa package). By default, all multiple sequence alignment functions return an abutils.tl.MultipleSequenceAlignment object, which builds on the Bio.Align.MultipleSeqAlignment class from Biopython. This object provides convenient methods for working with the alignment, including writing to file, trimming, and calculating consensus sequences. Because the same object is returned regardless of the alignment method used, switching between alignment methods requires only minimal changes to user code.


alignment method    function
MAFFT               abutils.tl.mafft()
MUSCLE              abutils.tl.muscle()
FAMSA               abutils.tl.famsa()

examples

multiple sequence alignment with MAFFT

Each of the multiple sequence alignment functions accepts a path to a FASTA file, a FASTA-formatted string, a list of abutils.Sequence objects, or a list of anything accepted by abutils.Sequence. By default, the alignment function returns an abutils.tl.MultipleSequenceAlignment object.

import abutils

msa = abutils.tl.mafft('path/to/sequences.fasta')
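Two of the accepted input forms are easy to build by hand. The sketch below uses only the standard library to turn a list of [sequence_id, sequence] pairs into a FASTA-formatted string. The helper name to_fasta and the example sequences are illustrative and are not part of abutils. Based on the input types listed in this section, either the list of pairs or the resulting string should be usable as the sequences argument.

```python
def to_fasta(pairs):
    """Join (sequence_id, sequence) pairs into a FASTA-formatted string."""
    return "\n".join(f">{seq_id}\n{seq}" for seq_id, seq in pairs)


# hypothetical example data: a list of [sequence_id, sequence] pairs
pairs = [
    ["seq1", "ATGCATGC"],
    ["seq2", "ATGCTTGC"],
]

fasta_string = to_fasta(pairs)
print(fasta_string)
```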

multiple sequence alignment with MUSCLE, with the results written to file

Rather than returning an abutils.tl.MultipleSequenceAlignment object, the alignment functions can write the alignment to a file. This is done by passing a file path to the alignment_file argument and setting as_file=True, in which case the function returns the path to the alignment file.

import abutils

# read in a fasta file
seqs = abutils.io.read_fasta('path/to/sequences.fasta')

# align the sequences using MUSCLE and write the results to a file
abutils.tl.muscle(
    seqs,
    alignment_file='path/to/alignment.fasta',
    as_file=True
)

using a custom binary for multiple sequence alignment

abutils packages binaries for both MAFFT and MUSCLE, meaning these packages don’t need to be separately installed by the user. However, both abutils.tl.mafft() and abutils.tl.muscle() allow the user to specify the path to a custom binary if desired. For MAFFT, this is done using the mafft_bin argument, and for MUSCLE, the muscle_bin argument. Both accept a path to the binary as a string.

import abutils

# align the sequences using a custom MAFFT binary
mafft_msa = abutils.tl.mafft(
    'path/to/sequences.fasta',
    mafft_bin='path/to/mafft'
)

# align the sequences using a custom MUSCLE binary
muscle_msa = abutils.tl.muscle(
    'path/to/sequences.fasta',
    muscle_bin='path/to/muscle'
)

api

abutils.tl.mafft(sequences: str | Iterable, alignment_file: str | None = None, fmt: str = 'fasta', threads: int = -1, as_file: bool = False, as_string: bool = False, reorder: bool = True, mafft_bin: str | None = None, id_key: str | None = None, seq_key: str | None = None, debug: bool = False, fasta: str | None = None) MultipleSeqAlignment | str

Performs multiple sequence alignment with MAFFT.

Parameters:
  • sequences ((str, iterable)) –

    Can be one of several things:
    1. path to a FASTA-formatted file

    2. a FASTA-formatted string

    3. a list of BioPython SeqRecord objects

    4. a list of abutils Sequence objects

    5. a list of lists/tuples, of the format [sequence_id, sequence]

  • alignment_file (str, optional) – Path for the output alignment file. Required if as_file is True.

  • fmt (str, default='fasta') – Format of the output alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’. Default is ‘fasta’.

  • threads (int, default=-1) – Number of threads for MAFFT to use. Default is -1, which uses all available CPUs.

  • as_file (bool, default=False) – If True, returns the path to the alignment file. If False, returns either a BioPython MultipleSeqAlignment object or the alignment output as a str, depending on as_string. If alignment_file is not provided, a temporary file will be created with tempfile.NamedTemporaryFile. Note that this temporary file is created in "/tmp" and may be removed by the operating system without notice.

  • as_string (bool, default=False) – If True, returns the alignment output as a string. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).

  • mafft_bin (str, optional) – Path to a MAFFT executable. Default is None, which results in using the default system MAFFT binary (just calling 'mafft').

  • id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.

  • seq_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.

  • debug (bool, default=False) – If True, prints MAFFT’s standard output and standard error. Default is False.

  • fasta (str, optional) – Path to a FASTA-formatted input file. Deprecated (use sequences for all types of input), but retained for backwards compatibility.

Returns:

alignment – If as_file is True, returns a path to the output alignment file. If as_string is True, returns the alignment as a string. Otherwise, returns a BioPython MultipleSeqAlignment object.

Return type:

str or MultipleSeqAlignment
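The interaction between as_file and as_string can be summarized in a few lines. The following is only a sketch of the precedence as read from the parameter descriptions (as_string is consulted when as_file is False). It is not code from abutils, and the function name return_kind and the labels it returns are made up for illustration.

```python
def return_kind(as_file=False, as_string=False):
    """Describe what an alignment function returns for a flag combination."""
    if as_file:
        # per the as_file description, the path to the alignment file is returned
        return "path to alignment file (str)"
    if as_string:
        return "alignment text (str)"
    return "alignment object"


for flags in [(False, False), (False, True), (True, False)]:
    print(flags, "->", return_kind(*flags))
```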

abutils.tl.muscle(sequences: str | Iterable, alignment_file: str | None = None, as_file: bool = False, as_string: bool = False, muscle_bin: str | None = None, threads: int | None = None, id_key: str | None = None, seq_key: str | None = None, debug: bool = False, fasta: str | None = None) MultipleSeqAlignment | str

Performs multiple sequence alignment with MUSCLE (version 5).

Parameters:
  • sequences ((str, iterable)) –

    Can be one of several things:
    1. path to a FASTA-formatted file

    2. a FASTA-formatted string

    3. a list of BioPython SeqRecord objects

    4. a list of abutils Sequence objects

    5. a list of lists/tuples, of the format [sequence_id, sequence]

  • alignment_file (str, optional) – Path for the output alignment file. Required if as_file is True.

  • as_file (bool, default=False) – If True, returns the path to the alignment file. If False, returns either a BioPython MultipleSeqAlignment object or the alignment output as a str, depending on as_string. If alignment_file is not provided, a temporary file will be created with tempfile.NamedTemporaryFile. Note that this temporary file is created in "/tmp" and may be removed by the operating system without notice.

  • as_string (bool, default=False) – If True, returns the alignment output as a string. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).

  • muscle_bin (str, optional) – Path to a MUSCLE executable. If not provided, the MUSCLE binary bundled with abutils will be used.

  • threads (int, default=None) – Number of threads for MUSCLE to use. If not provided, MUSCLE uses all available CPUs.

  • id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.

  • seq_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.

  • debug (bool, default=False) – If True, prints MUSCLE’s standard output and standard error. Default is False.

  • fasta (str, optional) – Path to a FASTA-formatted input file. Deprecated (use sequences for all types of input), but retained for backwards compatibility.

Returns:

alignment – If as_file is True, returns a path to the output alignment file. If as_string is True, returns the alignment as a string. Otherwise, returns a BioPython MultipleSeqAlignment object.

Return type:

str or MultipleSeqAlignment

abutils.tl.famsa(sequences: str | Iterable, alignment_file: str | None = None, fmt: str = 'fasta', as_file: bool = False, as_string: bool = False, threads: int = 0, guide_tree: str = 'sl', tree_heuristic: str | None = None, medoid_threshold: int = 0, n_refinements: int = 100, keep_duplicates: bool = False, refine: bool | None = None, id_key: str | None = None, seq_key: str | None = None, debug: bool = False, fasta: str | None = None) MultipleSeqAlignment | str

Performs multiple sequence alignment with FAMSA using the pyfamsa package.

Parameters:
  • sequences ((str, iterable)) –

    Can be one of several things:
    1. path to a FASTA-formatted file

    2. a FASTA-formatted string

    3. a list of BioPython SeqRecord objects

    4. a list of abutils Sequence objects

    5. a list of lists/tuples, of the format [sequence_id, sequence]

  • alignment_file (str, optional) – Path for the output alignment file. Required if as_file is True.

  • fmt (str, default='fasta') – Format of the output alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.

  • as_file (bool, default=False) – If True, returns the path to the alignment file. If False, returns either a BioPython MultipleSeqAlignment object or the alignment output as a str, depending on as_string. If alignment_file is not provided, a temporary file will be created with tempfile.NamedTemporaryFile. Note that this temporary file is created in "/tmp" and may be removed by the operating system without notice.

  • as_string (bool, default=False) – If True, returns the alignment output as a string. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).

  • threads (int, default=0) – Number of threads for FAMSA to use. Default is 0, which uses all available CPUs.

  • guide_tree (str, default='sl') – Method for building the guide tree. Choices are ‘sl’, ‘slink’, ‘nj’, and ‘upgma’.

  • tree_heuristic (str, default=None) – The heuristic to use for constructing the tree. Supported values are: ‘medoid’, ‘part’, or None to disable heuristics.

  • medoid_threshold (int, default=0) – Minimum number of sequences a set must contain for medoid trees to be used, if enabled with tree_heuristic. Default is 0.

  • n_refinements (int, default=100) – Number of refinement iterations to perform. Default is 100.

  • keep_duplicates (bool, default=False) – If True, duplicate sequences are kept in the alignment. Default is False.

  • refine (bool, default=None) – If True, the alignment is refined. If False, the alignment is not refined. If None, the alignment is refined if the number of sequences is less than 1000.

  • id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.

  • seq_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.

  • debug (bool, default=False) – If True, prints FAMSA’s standard output and standard error. Default is False.

  • fasta (str, optional) – Path to a FASTA-formatted input file. Deprecated (use sequences for all types of input), but retained for backwards compatibility.

Returns:

alignment – If as_file is True, returns a path to the output alignment file. If as_string is True, returns the alignment as a string. Otherwise, returns a BioPython MultipleSeqAlignment object.

Return type:

str or MultipleSeqAlignment
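The three alignment functions signal "use all available CPUs" with different values of threads: -1 for mafft(), None for muscle(), and 0 for famsa(). This matters when switching between methods. A small helper can hide the difference. The names threads_argument and ALL_CPUS below are hypothetical, and the values are taken from the parameter descriptions in this section rather than from the abutils source.

```python
# value of `threads` that means "all available CPUs" for each function,
# as stated in the parameter descriptions
ALL_CPUS = {"mafft": -1, "muscle": None, "famsa": 0}


def threads_argument(method, n_threads=None):
    """Return the value to pass as `threads` for the given alignment method.

    n_threads=None requests all available CPUs; otherwise it must be >= 1.
    """
    if method not in ALL_CPUS:
        raise ValueError(f"unknown alignment method: {method!r}")
    if n_threads is None:
        return ALL_CPUS[method]
    if n_threads < 1:
        raise ValueError("n_threads must be a positive integer or None")
    return n_threads


for method in ("mafft", "muscle", "famsa"):
    print(method, "->", threads_argument(method))
```

The result could be passed as the threads keyword argument of the corresponding function, so that one setting applies regardless of the alignment method chosen.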


class abutils.tl.MultipleSequenceAlignment(input_alignment: str | Iterable | MultipleSeqAlignment, fmt: str = 'fasta')

Class for working with multiple sequence alignments.

Parameters:
  • input_alignment ((str, iterable, MultipleSeqAlignment)) –

    Can be one of several things:
    1. path to an alignment file

    2. an alignment string (for example, the result of calling read() on an alignment file)

    3. a BioPython MultipleSeqAlignment object

    4. a list of aligned BioPython SeqRecord objects

    5. a list of aligned abutils.Sequence objects

  • fmt (str, default='fasta') – Format of the input alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.

property aln_string: str

Returns the alignment as a string.

property sequences: Iterable[Sequence]

Sequences in the alignment. Note that this is a list of Sequence objects, not BioPython SeqRecord objects, and that the sequences are aligned, meaning they are all the same length and may have gaps.

Return type:

list of Sequence objects
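Because the sequences are aligned, two simple operations come up often: checking that every row has the same length, and stripping gaps to recover the unaligned sequence. The sketch below works on plain strings and assumes "-" is the gap character. The function names are illustrative and not part of abutils. With Sequence objects, the aligned string would presumably be read from each object's sequence attribute.

```python
def is_aligned(rows):
    """Return True if all rows have the same length (an empty list counts as aligned)."""
    return len({len(row) for row in rows}) <= 1


def ungap(row, gap="-"):
    """Remove gap characters to recover the unaligned sequence."""
    return row.replace(gap, "")


# hypothetical aligned rows, as they might appear in an alignment
rows = ["ATG-CATGC", "ATGTCA-GC", "ATG-CA-GC"]

print(is_aligned(rows))
print([ungap(row) for row in rows])
```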

format(format) str

Format the alignment as a string.

Parameters:

format (str) – Format of the output alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.

make_consensus(name: str | None = None, threshold: float = 0.51, ambiguous: str | None = None, as_string: bool = False) str | Sequence

Make a consensus sequence from the multiple sequence alignment.

Parameters:
  • name (str, optional) – Name of the consensus sequence. If not provided, a random UUID will be used.

  • threshold (float, default=0.51) – Threshold for calling a consensus base. Default is 0.51, meaning that a base will be called if it is present in at least 51% of the sequences.

  • ambiguous (str, optional) – Character to use for ambiguous bases. If not provided, the default is “N” if the sequences contain only standard nucleotides (A, T, G, C), and “X” otherwise.

  • as_string (bool, default=False) – If True, returns the consensus sequence as a string. If False, returns a Sequence object.

Returns:

consensus – If as_string is True, returns the consensus sequence as a string. If False, returns a Sequence object.

Return type:

str or Sequence
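To make the roles of threshold and ambiguous concrete, here is a minimal pure-Python sketch of threshold-based consensus calling on aligned strings. It is not the abutils implementation. In particular, it treats the gap character like any other symbol, which is an assumption, and the function name consensus is chosen only for this example.

```python
from collections import Counter


def consensus(aligned, threshold=0.51, ambiguous=None):
    """Call a consensus from equal-length aligned strings.

    A column's most common character is used if its frequency is at least
    `threshold`; otherwise the `ambiguous` character is used.
    """
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    if len({len(row) for row in aligned}) != 1:
        raise ValueError("aligned sequences must all be the same length")
    if ambiguous is None:
        # "N" for standard nucleotides only, "X" otherwise (gaps are ignored here)
        residues = set("".join(aligned).upper()) - {"-"}
        ambiguous = "N" if residues <= set("ATGC") else "X"
    called = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        called.append(residue if count / len(aligned) >= threshold else ambiguous)
    return "".join(called)


nucleotides = ["ATGC", "ATGA", "ATCA", "TTGG"]
print(consensus(nucleotides))                 # last column has no 51% majority
print(consensus(nucleotides, threshold=0.5))  # "A" at 50% now passes
```

In the last column, the most common base (A) appears in 2 of 4 sequences, so it is replaced by the ambiguous character at the default threshold of 0.51 but called when the threshold is lowered to 0.5.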