clustering¶
The primary clustering function is abutils.tl.cluster(), which can cluster sequences
using CD-HIT, VSEARCH, or MMseqs2. By default, the clustering function will use CD-HIT for inputs with fewer than
1000 sequences or MMseqs2 for inputs with 1000 or more sequences, but this can be overridden in cases where a different
clustering algorithm is desired.
abutils.tl.cluster can accept a variety of inputs, including:
a path to a FASTA file
a FASTA-formatted string
a list of
abutils.Sequenceobjectsa list of anything accepted by
abutils.Sequence
The threshold argument is the sequence identity threshold for clustering, and should be between 0.0 and 1.0.
The algo argument selects the clustering algorithm to use. It can be 'cdhit', 'vsearch', or 'mmseqs2'.
If algo is not provided, abutils.tl.cluster will use CD-HIT for inputs with fewer than 1000 sequences or MMseqs2 for inputs with 1000 or more sequences.
abutils.tl.cluster returns a abutils.tl.Clusters object, which contains the clustering results as one or more abutils.tl.Cluster objects.
By default, the abutils.tl.Clusters object is sorted by cluster size in descending order, so the first cluster is the largest.
examples¶
clustering with CD-HIT at 90% identity, using a FASTA file as input
The sequences argument can be a variety of things, but in this cases is being supplied
as the path to a FASTA-formatted file.. The algo argument can be 'cdhit', 'vsearch', or 'mmseqs2'.
If algo is not provided, cluster() will use CD-HIT for inputs with fewer than 1000
sequences or MMseqs2 for inputs with 1000 or more sequences. The threshold argument
is the sequence identity threshold for clustering.
import abutils
clusters = abutils.tl.cluster(
sequences='path/to/sequences.fasta',
algo='cdhit',
threshold=0.9
)
get the largest cluster from VSEARCH clustering of a list of Sequence objects
Iterating over a abutils.tl.Clusters object iterates over all of the clusters it contains,
which are themselves abutils.tl.Cluster objects. By default, they are sorted by size in
descending order, so the first cluster is the largest.
import abutils
sequences = abutils.io.read_fasta('path/to/sequences.fasta')
clusters = abutils.tl.cluster(
sequences=sequences,
algo='vsearch',
threshold=0.9
)
largest_cluster = clusters[0]
calculate the consensus sequence of the largest cluster, using MMSeqs2
abutils.tl.Cluster objects have a consensus property that returns an abutils.Sequence object
representing the consensus sequence of the cluster. The consensus property is lazy,
meaning it is not calculated until it is accessed, and once calculated, it is cached.
Under the hood, the consensus sequence is calculated using the make_consensus()
method, which can also be called directly to provides more control over the consensus
sequence generation process. Calling make_consensus() automatically saves the consensus
sequence to the consensus property (and overwrites any cached consensus sequence).
import abutils
clusters = abutils.tl.cluster(
sequences='path/to/sequences.fasta',
algo='mmseqs2',
threshold=0.9
)
largest_cluster = clusters[0]
consensus = largest_cluster.consensus
# to overwrite the cached consensus sequence
largest_cluster.make_consensus()
api¶
- abutils.tl.cluster(sequences: Iterable | str, threshold: float = 0.975, algo: str = 'auto', temp_dir: str = '/tmp', iddef: int = 0, cluster_mode: str = '2', cov_mode: str = '0', coverage: float = 0.8, alignment_mode: str = '3', seq_id_mode: str = '1', vsearch_bin: str | None = None, mmseqs_bin: str | None = None, cdhit_bin: str | None = None, id_key: str | None = None, seq_key: str | None = None, threads: int | None = None, strand: str = 'plus', as_dict: bool = False, debug: bool = False) dict | Clusters¶
Clusters sequences using CD-HIT, VSEARCH or MMseqs2. By default, sequences will be clustered with VSEARCH if there are fewer than 10,000 nucleotide sequences, with CD-HIT if there are fewer than 10,000 amino acide sequences, and with MMseqs2 if there are more than 10,000 sequences (nucleotide or amino acid). These defaults can be overridden with algo.
- Parameters:
sequences (iterable or string) –
- Input sequences in any of the following formats:
list of abutils
SequenceobjectsFASTA-formatted string
path to a FASTA-formatted file
list of BioPython
SeqRecordobjectslist of lists/tuples, of the format
[sequence_id, sequence]list of strings, with each string being a separate sequence.
Required.
note:: (..) – If
sequencesis a list ofstrobjects, they will be assigned random IDs.threshold (float, default=0.975) – Identity threshold for clustering. Must be between 0 and 1.
algo (float, default="auto") – Algorithm to be used for clustering. Options are
"vsearch","mmseqs","cdhit", or"auto". By default ("auto"), VSEARCH will be used if there are fewer than 10,000 nucleotide sequences, CD-HIT will be used if there are fewer than 10,000 amino acid sequences, and MMseqs2 will be used for 10,000 sequences or more. Providing"vsearch","cdhit", or"mmseqs"will force the use of the desired clustering algorithm regardless of the number or sequences to be clustered.temp_dir (str, default="/tmp") – Path to a directory for temporary storage of clustering files.
iddef (int, default=1) –
- Identity definition, as implemented by VSEARCH. Options are:
CD-HIT definition: (matching columns) / (shortest sequence length).
edit distance: (matching columns) / (alignment length).
edit distance excluding terminal gaps (same as –id).
Marine Biological Lab definition counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0 - [(mismatches + gap openings)/(longest sequence length)]
BLAST definition, equivalent to –iddef 1 in a context of global pairwise alignment.
cluster_mode (str, default="2") –
- Clustering mode, as implemented by MMseqs2. Options are:
greedy set cover
connected component
greedy incremental (CD-HIT-like)
cov_mode (str, default="0") –
- Coverage mode, as implemented by MMseqs2. Options are:
bidirectional
target coverage
query coverage
target-in-query length coverage
coverage (float, default=0.8) – Coverage threshold for clustering with MMseqs2. Must be between 0 and 1.
alignment_mode (str, default="3") –
- Alignment mode, as implemented by MMseqs2. Options are:
automatic
only score and end_pos
also start_pos and cov
also seq.id
only ungapped alignment
seq_id_mode (str, default="1") –
- Sequence ID mode, as implemented by MMseqs2. Options are:
alignment length
shorter sequence length
longer sequence length
vsearch_bin (str, optional) – Path to a VSEARCH executable. If not provided, the VSEARCH binary bundled with
abutilswill be used.mmseqs_bin (str, optional) – Path to a MMseqs2 executable. If not provided, the MMseqs2 binary bundled with
abutilswill be used.cdhit_bin (str, optional) – Path to a CD-HIT executable. If not provided, the CD-HIT binary bundled with
abutilswill be used.id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing,
Sequence.idis used.sequence_key (str, default=None) – Key to retrieve the sequence. If not provided or missing,
Sequence.sequenceis used.strand (str, default="plus") – Strand of the sequences to align. Options are
"plus"and"both".as_dict (bool, default=False) –
If
True, return clustering results as adictrather than aClustersobject. thedictis of the format:- {“cluster1_name”: {“centroid”: cluster1_centroid,
”seqs”: [seq1, seq2, seq3, …]},
- ”cluster2_name”: {“centroid”: cluster2_centroid,
”seqs”: [seq4, seq5, seq6, …]},
}
debug (bool, default=False) – If
True, prints MAFFT’s standard output and standard error. Default isFalse.
- Returns:
clusters
- Return type:
Clustersordict
- class abutils.tl.Clusters(clusters=None)¶
Class for a collection of clusters.
- clusters¶
List of clusters. Clusters are sorted in descending order by size, with the largest cluster first.
- Type:
list
- centroids¶
List of cluster centroids.
- Type:
list
- largest_cluster¶
The largest cluster in the collection.
- Type:
Cluster
- count¶
Number of clusters in the collection.
- Type:
int
- class abutils.tl.Cluster(name, sequences, centroid=None)¶
Class for a sequence cluster.
- name¶
Name of the cluster.
- Type:
str
- sequences¶
List of sequences in the cluster.
- Type:
list
- size¶
Number of sequences in the cluster.
- Type:
int
- seq_ids¶
List of sequence IDs in the cluster.
- Type:
list
- centroid¶
Centroid sequence of the cluster.
- Type:
Sequence
- consensus¶
Consensus sequence of the cluster.
- Type:
Sequence
- make_consensus(threshold: float = 0.51, algo: str = 'mafft', ambiguous: str | None = None) Sequence¶
Makes a consensus sequence from the cluster’s sequences.
- Parameters:
threshold (float, default=0.51) – Threshold for calling a consensus base. Must be between 0 and 1.
algo (str, default="mafft") – Algorithm to be used for generating the consensus sequence. Options are
"mafft","famsa", or"muscle".ambiguous (str, default=None) – Character to use for ambiguous bases. If not provided, the default ambiguous base for the sequence type will be used.
- Returns:
consensus
- Return type:
Sequence