multiple sequence alignment
abutils can perform multiple sequence alignments using MAFFT, MUSCLE, or FAMSA (via the pyfamsa package).
All multiple sequence alignment functions return an abutils.tl.MultipleSequenceAlignment object, which builds on
the Bio.Align.MultipleSeqAlignment class from Biopython. This object provides a number of
convenient methods for working with the alignment, including writing to file, trimming, and calculating
consensus sequences. Additionally, because the same object is returned regardless of the alignment method used,
the user can easily switch between alignment methods with minimal changes to their code.
| alignment method | function |
|---|---|
| MAFFT | abutils.tl.mafft() |
| MUSCLE | abutils.tl.muscle() |
| FAMSA | abutils.tl.famsa() |
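Because the three functions share the same calling pattern, the alignment method can be chosen at runtime. The following is a minimal sketch, assuming abutils is installed; the `align` helper and the `ALIGNERS` tuple are hypothetical names used for illustration and are not part of abutils.

```python
import importlib

# Names of the three alignment functions listed in the table above.
ALIGNERS = ("mafft", "muscle", "famsa")


def align(sequences, method="mafft", **kwargs):
    """Hypothetical helper: run the chosen abutils.tl alignment function."""
    if method not in ALIGNERS:
        raise ValueError(f"unknown alignment method: {method!r}")
    # Imported lazily so the method name is validated before abutils is needed.
    tl = importlib.import_module("abutils").tl
    return getattr(tl, method)(sequences, **kwargs)


# Usage (requires abutils):
# msa = align("path/to/sequences.fasta", method="famsa")
```

Switching aligners then only requires changing the `method` string; method-specific options such as `mafft_bin` would be passed through `**kwargs`.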
examples
multiple sequence alignment with MAFFT
Each of the multiple sequence alignment functions can accept a path to a FASTA file, a FASTA-formatted string,
a list of abutils.Sequence objects, or a list of anything accepted by abutils.Sequence. By default,
calling the alignment function with a list of sequences will return an abutils.tl.MultipleSequenceAlignment.
import abutils
msa = abutils.tl.mafft('path/to/sequences.fasta')
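The other input types follow the same pattern. The sketch below builds a FASTA-formatted string from [sequence_id, sequence] pairs. `to_fasta` is a hypothetical helper, and the sequences are made-up placeholders. The abutils calls are left as comments because they require abutils to be installed.

```python
def to_fasta(pairs):
    """Join (sequence_id, sequence) pairs into a FASTA-formatted string."""
    return "\n".join(f">{seq_id}\n{seq}" for seq_id, seq in pairs)


# Placeholder sequences, for illustration only.
pairs = [
    ["seq1", "ATGCATGCATGC"],
    ["seq2", "ATGCATCCATGC"],
    ["seq3", "ATGATGCATGC"],
]
fasta_string = to_fasta(pairs)

# Per the input types listed in this document, either form should be accepted:
# import abutils
# msa_from_string = abutils.tl.mafft(fasta_string)
# msa_from_pairs = abutils.tl.mafft(pairs)
```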
multiple sequence alignment with MUSCLE, with the results written to file
Rather than returning an abutils.tl.MultipleSequenceAlignment object, the alignment functions can write the
alignment to a file. This is done by passing a file path to the alignment_file argument and setting as_file=True, in which case the path to the alignment file is returned.
import abutils
# read in a fasta file
seqs = abutils.io.read_fasta('path/to/sequences.fasta')
# align the sequences using MUSCLE and write the results to a file
abutils.tl.muscle(
    seqs,
    alignment_file='path/to/alignment.fasta',
    as_file=True
)
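One way to sanity-check an alignment written to file is to confirm that every aligned sequence has the same length. This is a minimal sketch using only the standard library. `read_aligned_fasta` is a hypothetical helper, not an abutils function, and the demonstration writes a small placeholder alignment to a temporary file rather than using real MUSCLE output.

```python
import os
import tempfile


def read_aligned_fasta(path):
    """Parse a FASTA alignment file into {sequence_id: aligned_sequence}."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-delimited token as the ID.
                current_id = line[1:].split()[0]
                records[current_id] = ""
            elif current_id is not None:
                records[current_id] += line
    return records


# Placeholder alignment standing in for 'path/to/alignment.fasta'.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nATGC-ATGC\n>seq2\nATGCAATGC\n")
    alignment_path = tmp.name

aligned = read_aligned_fasta(alignment_path)
os.remove(alignment_path)

# In a valid alignment, all rows share a single length.
lengths = {len(seq) for seq in aligned.values()}
is_rectangular = len(lengths) == 1
```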
using a custom binary for multiple sequence alignment
abutils packages binaries for both MAFFT and MUSCLE, meaning these packages don’t need to be separately installed
by the user. However, both abutils.tl.mafft() and abutils.tl.muscle() allow the
user to specify the path to a custom binary if desired. For MAFFT, this is done using the mafft_bin argument, and
for MUSCLE, the muscle_bin argument. Both accept a path to the binary as a string.
import abutils
# align the sequences using a custom MAFFT binary
mafft_msa = abutils.tl.mafft(
    'path/to/sequences.fasta',
    mafft_bin='path/to/mafft'
)
# align the sequences using a custom MUSCLE binary
muscle_msa = abutils.tl.muscle(
    'path/to/sequences.fasta',
    muscle_bin='path/to/muscle'
)
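When a custom binary is optional, one pattern is to look for it on the PATH and fall back to the default behavior otherwise. This sketch assumes that passing None for mafft_bin or muscle_bin selects the default binary, as the API reference below indicates. `find_binary` is a hypothetical helper.

```python
import shutil


def find_binary(name):
    """Return the full path to an executable on PATH, or None if not found."""
    return shutil.which(name)


# None means "no custom binary", matching the default of mafft_bin / muscle_bin.
mafft_bin = find_binary("mafft")
muscle_bin = find_binary("muscle")

# Usage (requires abutils):
# import abutils
# msa = abutils.tl.mafft('path/to/sequences.fasta', mafft_bin=mafft_bin)
```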
api
- abutils.tl.mafft(sequences: str | Iterable, alignment_file: str | None = None, fmt: str = 'fasta', threads: int = -1, as_file: bool = False, as_string: bool = False, reorder: bool = True, mafft_bin: str | None = None, id_key: str | None = None, seq_key: str | None = None, debug: bool = False, fasta: str | None = None) → MultipleSeqAlignment | str
Performs multiple sequence alignment with MAFFT.
- Parameters:
sequences ((str, iterable)) –
- Can be one of several things:
path to a FASTA-formatted file
a FASTA-formatted string
a list of BioPython SeqRecord objects
a list of abutils Sequence objects
a list of lists/tuples, of the format [sequence_id, sequence]
alignment_file (str, optional) – Path for the output alignment file. Required if as_file is True.
fmt (str, default='fasta') – Format of the output alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.
threads (int, default=-1) – Number of threads for MAFFT to use. Default is -1, which uses all available CPUs.
as_file (bool, default=False) – If True, returns the path to the alignment file. If False, returns either a BioPython MultipleSeqAlignment object or the alignment output as a str, depending on as_string. If alignment_file is not provided, a temporary file will be created with tempfile.NamedTemporaryFile. Note that this temporary file is created in "/tmp" and may be removed by the operating system without notice.
as_string (bool, default=False) – If True, returns the alignment output as a string. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).
mafft_bin (str, optional) – Path to a MAFFT executable. Default is None, which results in using the default system MAFFT binary (just calling 'mafft').
id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.
seq_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.
debug (bool, default=False) – If True, prints MAFFT’s standard output and standard error.
fasta (str, optional) – Path to a FASTA-formatted input file. Deprecated (use sequences for all types of input), but retained for backwards compatibility.
- Returns:
alignment – If as_file is True, returns the path to the output alignment file. Otherwise, returns a BioPython MultipleSeqAlignment object.
- Return type:
str or MultipleSeqAlignment
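The interaction between as_file and as_string described above can be summarized as a small decision function. This is an illustrative sketch of the documented behavior, not abutils source code, and `return_kind` is a hypothetical name.

```python
def return_kind(as_file=False, as_string=False):
    """Describe what an alignment call returns for a given flag combination."""
    if as_file:
        # as_file takes priority: the path to the alignment file is returned.
        return "path to alignment file (str)"
    if as_string:
        return "alignment text (str)"
    return "MultipleSeqAlignment object"
```

Under this reading, as_string only matters when as_file is False.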
- abutils.tl.muscle(sequences: str | Iterable, alignment_file: str | None = None, as_file: bool = False, as_string: bool = False, muscle_bin: str | None = None, threads: int | None = None, id_key: str | None = None, seq_key: str | None = None, debug: bool = False, fasta: str | None = None) → MultipleSeqAlignment | str
Performs multiple sequence alignment with MUSCLE (version 5).
- Parameters:
sequences ((str, iterable)) –
- Can be one of several things:
path to a FASTA-formatted file
a FASTA-formatted string
a list of BioPython SeqRecord objects
a list of abutils Sequence objects
a list of lists/tuples, of the format [sequence_id, sequence]
alignment_file (str, optional) – Path for the output alignment file. Required if as_file is True.
as_file (bool, default=False) – If True, returns the path to the alignment file. If False, returns either a BioPython MultipleSeqAlignment object or the alignment output as a str, depending on as_string. If alignment_file is not provided, a temporary file will be created with tempfile.NamedTemporaryFile. Note that this temporary file is created in "/tmp" and may be removed by the operating system without notice.
as_string (bool, default=False) – If True, returns the alignment output as a string. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).
muscle_bin (str, optional) – Path to a MUSCLE executable. If not provided, the MUSCLE binary bundled with abutils will be used.
threads (int, default=None) – Number of threads for MUSCLE to use. If not provided, MUSCLE uses all available CPUs.
id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.
seq_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.
debug (bool, default=False) – If True, prints MUSCLE’s standard output and standard error.
fasta (str, optional) – Path to a FASTA-formatted input file. Deprecated (use sequences for all types of input), but retained for backwards compatibility.
- Returns:
alignment – If as_file is True, returns the path to the output alignment file. Otherwise, returns a BioPython MultipleSeqAlignment object.
- Return type:
str or MultipleSeqAlignment
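The "if not provided or missing" fallback described for id_key and the sequence key can be pictured as a simple lookup with a default. The sketch below uses plain dicts as stand-ins for abutils Sequence objects. `resolve_field` and the `clone_id` key are hypothetical, and the actual lookup inside abutils may differ.

```python
def resolve_field(record, key, default_key):
    """Return record[key] when key is given and present, else record[default_key]."""
    if key is not None and key in record:
        return record[key]
    return record[default_key]


# Plain dict standing in for an abutils Sequence object (assumption).
record = {"id": "seq1", "sequence": "ATGC", "clone_id": "cloneA"}

default_id = resolve_field(record, None, "id")            # no key given
custom_id = resolve_field(record, "clone_id", "id")       # key present
fallback_id = resolve_field(record, "missing_key", "id")  # key missing
```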
- abutils.tl.famsa(sequences: str | Iterable, alignment_file: str | None = None, fmt: str = 'fasta', as_file: bool = False, as_string: bool = False, threads: int = 0, guide_tree: str = 'sl', tree_heuristic: str | None = None, medoid_threshold: int = 0, n_refinements: int = 100, keep_duplicates: bool = False, refine: bool | None = None, id_key: str | None = None, seq_key: str | None = None, debug: bool = False, fasta: str | None = None) → MultipleSeqAlignment | str
Performs multiple sequence alignment with FAMSA using the pyfamsa package.
- Parameters:
sequences ((str, iterable)) –
- Can be one of several things:
path to a FASTA-formatted file
a FASTA-formatted string
a list of BioPython SeqRecord objects
a list of abutils Sequence objects
a list of lists/tuples, of the format [sequence_id, sequence]
alignment_file (str, optional) – Path for the output alignment file. Required if as_file is True.
fmt (str, default='fasta') – Format of the output alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.
as_file (bool, default=False) – If True, returns the path to the alignment file. If False, returns either a BioPython MultipleSeqAlignment object or the alignment output as a str, depending on as_string. If alignment_file is not provided, a temporary file will be created with tempfile.NamedTemporaryFile. Note that this temporary file is created in "/tmp" and may be removed by the operating system without notice.
as_string (bool, default=False) – If True, returns the alignment output as a string. If False, returns a BioPython MultipleSeqAlignment object (obtained by calling Bio.AlignIO.read() on the alignment file).
threads (int, default=0) – Number of threads for FAMSA to use. Default is 0, which uses all available CPUs.
guide_tree (str, default='sl') – Method for building the guide tree. Choices are ‘sl’, ‘slink’, ‘nj’, and ‘upgma’.
tree_heuristic (str, default=None) – The heuristic to use for constructing the tree. Supported values are ‘medoid’, ‘part’, or None to disable heuristics.
medoid_threshold (int, default=0) – Minimum number of sequences a set must contain for medoid trees to be used, if enabled with tree_heuristic.
n_refinements (int, default=100) – Number of refinement iterations to perform.
keep_duplicates (bool, default=False) – If True, duplicate sequences are kept in the alignment.
refine (bool, default=None) – If True, the alignment is refined. If False, the alignment is not refined. If None, the alignment is refined if the number of sequences is less than 1000.
id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.
seq_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.
debug (bool, default=False) – If True, prints FAMSA’s standard output and standard error.
fasta (str, optional) – Path to a FASTA-formatted input file. Deprecated (use sequences for all types of input), but retained for backwards compatibility.
- Returns:
alignment – If as_file is True, returns the path to the output alignment file. Otherwise, returns a BioPython MultipleSeqAlignment object.
- Return type:
str or MultipleSeqAlignment
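The three-state refine parameter can be read as the following rule. This is a sketch of the documented behavior only; `should_refine` is a hypothetical name, and the 1000-sequence cutoff is taken from the parameter description above.

```python
def should_refine(refine, n_sequences):
    """Decide whether refinement runs, following the documented refine rule."""
    if refine is None:
        # Unset: refine only when there are fewer than 1000 sequences.
        return n_sequences < 1000
    return bool(refine)
```

Passing True or False therefore overrides the size-based default in either direction.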
- class abutils.tl.MultipleSequenceAlignment(input_alignment: str | Iterable | MultipleSeqAlignment, fmt: str = 'fasta')
Class for working with multiple sequence alignments.
- Parameters:
input_alignment ((str, iterable, MultipleSeqAlignment)) –
- Can be one of several things:
path to an alignment file
an alignment string (for example, the result of calling read() on an alignment file)
a BioPython MultipleSeqAlignment object
a list of aligned BioPython SeqRecord objects
a list of aligned abutils.Sequence objects
fmt (str, default='fasta') – Format of the input alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.
- property aln_string: str
Returns the alignment as a string.
- property sequences: Iterable[Sequence]
Sequences in the alignment. Note that this is a list of Sequence objects, not BioPython SeqRecord objects, and that the sequences are aligned, meaning they are all the same length and may contain gaps.
- Return type:
list of Sequence objects
- format(format) → str
Format the alignment as a string.
- Parameters:
format (str) – Format of the output alignment. Choices are ‘fasta’, ‘phylip’, and ‘clustal’.
- make_consensus(name: str | None = None, threshold: float = 0.51, ambiguous: str | None = None, as_string: bool = False) → str | Sequence
Make a consensus sequence from the multiple sequence alignment.
- Parameters:
name (str, optional) – Name of the consensus sequence. If not provided, a random UUID will be used.
threshold (float, default=0.51) – Threshold for calling a consensus base. Default is 0.51, meaning that a base will be called if it is present in at least 51% of the sequences.
ambiguous (str, optional) – Character to use for ambiguous bases. If not provided, the default is “N” if the sequences contain only standard nucleotides (A, T, G, C), and “X” otherwise.
as_string (bool, default=False) – If True, returns the consensus sequence as a string. If False, returns a Sequence object.
- Returns:
consensus – If as_string is True, returns the consensus sequence as a string. If False, returns a Sequence object.
- Return type:
str or Sequence
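To illustrate how the threshold and ambiguous parameters interact, here is a minimal pure-Python sketch of threshold-based consensus calling. It is not the abutils implementation. `threshold_consensus` is a hypothetical function, and its handling of gaps (columns where a gap is the majority character are dropped) is an assumption made for this example.

```python
from collections import Counter


def threshold_consensus(aligned, threshold=0.51, ambiguous=None):
    """Threshold consensus of equal-length aligned sequences (illustrative only)."""
    if ambiguous is None:
        # "N" when only standard nucleotides are present, "X" otherwise.
        residues = set("".join(aligned).upper()) - {"-"}
        ambiguous = "N" if residues <= set("ATGC") else "X"
    consensus = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if count / len(column) < threshold:
            # No character reaches the threshold in this column.
            consensus.append(ambiguous)
        elif residue != "-":
            # Assumption: columns where a gap wins are dropped.
            consensus.append(residue)
    return "".join(consensus)


# Placeholder aligned sequences, for illustration only.
aligned = ["ATGC-A", "ATGCTA", "ATCC-T", "AAGC-T"]
consensus = threshold_consensus(aligned)
```

In the last column, A and T each appear in half of the sequences, so neither reaches the 0.51 threshold and the ambiguous character is used instead.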