clustering

The primary clustering function is abutils.tl.cluster(), which can cluster sequences using CD-HIT, VSEARCH, or MMseqs2. By default, the clustering function will use CD-HIT for inputs with fewer than 1000 sequences or MMseqs2 for inputs with 1000 or more sequences, but this can be overridden in cases where a different clustering algorithm is desired.

abutils.tl.cluster can accept a variety of inputs, including:

  • a path to a FASTA file

  • a FASTA-formatted string

  • a list of abutils.Sequence objects

  • a list of anything accepted by abutils.Sequence

The threshold argument is the sequence identity threshold for clustering, and should be between 0.0 and 1.0.

The algo argument selects the clustering algorithm to use. It can be 'cdhit', 'vsearch', or 'mmseqs2'. If algo is not provided, abutils.tl.cluster will use CD-HIT for inputs with fewer than 1000 sequences or MMseqs2 for inputs with 1000 or more sequences.

abutils.tl.cluster returns a abutils.tl.Clusters object, which contains the clustering results as one or more abutils.tl.Cluster objects. By default, the abutils.tl.Clusters object is sorted by cluster size in descending order, so the first cluster is the largest.


examples


clustering with CD-HIT at 90% identity, using a FASTA file as input

The sequences argument can be a variety of things, but in this cases is being supplied as the path to a FASTA-formatted file.. The algo argument can be 'cdhit', 'vsearch', or 'mmseqs2'. If algo is not provided, cluster() will use CD-HIT for inputs with fewer than 1000 sequences or MMseqs2 for inputs with 1000 or more sequences. The threshold argument is the sequence identity threshold for clustering.

import abutils

clusters = abutils.tl.cluster(
    sequences='path/to/sequences.fasta',
    algo='cdhit',
    threshold=0.9
)

get the largest cluster from VSEARCH clustering of a list of Sequence objects

Iterating over a abutils.tl.Clusters object iterates over all of the clusters it contains, which are themselves abutils.tl.Cluster objects. By default, they are sorted by size in descending order, so the first cluster is the largest.

import abutils

sequences = abutils.io.read_fasta('path/to/sequences.fasta')
clusters = abutils.tl.cluster(
    sequences=sequences,
    algo='vsearch',
    threshold=0.9
)
largest_cluster = clusters[0]

calculate the consensus sequence of the largest cluster, using MMSeqs2

abutils.tl.Cluster objects have a consensus property that returns an abutils.Sequence object representing the consensus sequence of the cluster. The consensus property is lazy, meaning it is not calculated until it is accessed, and once calculated, it is cached. Under the hood, the consensus sequence is calculated using the make_consensus() method, which can also be called directly to provides more control over the consensus sequence generation process. Calling make_consensus() automatically saves the consensus sequence to the consensus property (and overwrites any cached consensus sequence).

import abutils

clusters = abutils.tl.cluster(
    sequences='path/to/sequences.fasta',
    algo='mmseqs2',
    threshold=0.9
)
largest_cluster = clusters[0]
consensus = largest_cluster.consensus

# to overwrite the cached consensus sequence
largest_cluster.make_consensus()

api

abutils.tl.cluster(sequences: Iterable | str, threshold: float = 0.975, algo: str = 'auto', temp_dir: str = '/tmp', iddef: int = 0, cluster_mode: str = '2', cov_mode: str = '0', coverage: float = 0.8, alignment_mode: str = '3', seq_id_mode: str = '1', vsearch_bin: str | None = None, mmseqs_bin: str | None = None, cdhit_bin: str | None = None, id_key: str | None = None, seq_key: str | None = None, threads: int | None = None, strand: str = 'plus', as_dict: bool = False, debug: bool = False) dict | Clusters

Clusters sequences using CD-HIT, VSEARCH or MMseqs2. By default, sequences will be clustered with VSEARCH if there are fewer than 10,000 nucleotide sequences, with CD-HIT if there are fewer than 10,000 amino acide sequences, and with MMseqs2 if there are more than 10,000 sequences (nucleotide or amino acid). These defaults can be overridden with algo.

Parameters:
  • sequences (iterable or string) –

    Input sequences in any of the following formats:
    1. list of abutils Sequence objects

    2. FASTA-formatted string

    3. path to a FASTA-formatted file

    4. list of BioPython SeqRecord objects

    5. list of lists/tuples, of the format [sequence_id, sequence]

    6. list of strings, with each string being a separate sequence.

    Required.

  • note:: (..) – If sequences is a list of str objects, they will be assigned random IDs.

  • threshold (float, default=0.975) – Identity threshold for clustering. Must be between 0 and 1.

  • algo (float, default="auto") – Algorithm to be used for clustering. Options are "vsearch", "mmseqs", "cdhit", or "auto". By default ("auto"), VSEARCH will be used if there are fewer than 10,000 nucleotide sequences, CD-HIT will be used if there are fewer than 10,000 amino acid sequences, and MMseqs2 will be used for 10,000 sequences or more. Providing "vsearch", "cdhit", or "mmseqs" will force the use of the desired clustering algorithm regardless of the number or sequences to be clustered.

  • temp_dir (str, default="/tmp") – Path to a directory for temporary storage of clustering files.

  • iddef (int, default=1) –

    Identity definition, as implemented by VSEARCH. Options are:
    1. CD-HIT definition: (matching columns) / (shortest sequence length).

    2. edit distance: (matching columns) / (alignment length).

    3. edit distance excluding terminal gaps (same as –id).

    4. Marine Biological Lab definition counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0 - [(mismatches + gap openings)/(longest sequence length)]

    5. BLAST definition, equivalent to –iddef 1 in a context of global pairwise alignment.

  • cluster_mode (str, default="2") –

    Clustering mode, as implemented by MMseqs2. Options are:
    1. greedy set cover

    2. connected component

    3. greedy incremental (CD-HIT-like)

  • cov_mode (str, default="0") –

    Coverage mode, as implemented by MMseqs2. Options are:
    1. bidirectional

    2. target coverage

    3. query coverage

    4. target-in-query length coverage

  • coverage (float, default=0.8) – Coverage threshold for clustering with MMseqs2. Must be between 0 and 1.

  • alignment_mode (str, default="3") –

    Alignment mode, as implemented by MMseqs2. Options are:
    1. automatic

    2. only score and end_pos

    3. also start_pos and cov

    4. also seq.id

    5. only ungapped alignment

  • seq_id_mode (str, default="1") –

    Sequence ID mode, as implemented by MMseqs2. Options are:
    1. alignment length

    2. shorter sequence length

    3. longer sequence length

  • vsearch_bin (str, optional) – Path to a VSEARCH executable. If not provided, the VSEARCH binary bundled with abutils will be used.

  • mmseqs_bin (str, optional) – Path to a MMseqs2 executable. If not provided, the MMseqs2 binary bundled with abutils will be used.

  • cdhit_bin (str, optional) – Path to a CD-HIT executable. If not provided, the CD-HIT binary bundled with abutils will be used.

  • id_key (str, default=None) – Key to retrieve the sequence ID. If not provided or missing, Sequence.id is used.

  • sequence_key (str, default=None) – Key to retrieve the sequence. If not provided or missing, Sequence.sequence is used.

  • strand (str, default="plus") – Strand of the sequences to align. Options are "plus" and "both".

  • as_dict (bool, default=False) –

    If True, return clustering results as a dict rather than a Clusters object. the dict is of the format:

    {“cluster1_name”: {“centroid”: cluster1_centroid,

    ”seqs”: [seq1, seq2, seq3, …]},

    ”cluster2_name”: {“centroid”: cluster2_centroid,

    ”seqs”: [seq4, seq5, seq6, …]},

    }

  • debug (bool, default=False) – If True, prints MAFFT’s standard output and standard error. Default is False.

Returns:

clusters

Return type:

Clusters or dict


class abutils.tl.Clusters(clusters=None)

Class for a collection of clusters.

clusters

List of clusters. Clusters are sorted in descending order by size, with the largest cluster first.

Type:

list

centroids

List of cluster centroids.

Type:

list

largest_cluster

The largest cluster in the collection.

Type:

Cluster

count

Number of clusters in the collection.

Type:

int


class abutils.tl.Cluster(name, sequences, centroid=None)

Class for a sequence cluster.

name

Name of the cluster.

Type:

str

sequences

List of sequences in the cluster.

Type:

list

size

Number of sequences in the cluster.

Type:

int

seq_ids

List of sequence IDs in the cluster.

Type:

list

centroid

Centroid sequence of the cluster.

Type:

Sequence

consensus

Consensus sequence of the cluster.

Type:

Sequence

make_consensus(threshold: float = 0.51, algo: str = 'mafft', ambiguous: str | None = None) Sequence

Makes a consensus sequence from the cluster’s sequences.

Parameters:
  • threshold (float, default=0.51) – Threshold for calling a consensus base. Must be between 0 and 1.

  • algo (str, default="mafft") – Algorithm to be used for generating the consensus sequence. Options are "mafft", "famsa", or "muscle".

  • ambiguous (str, default=None) – Character to use for ambiguous bases. If not provided, the default ambiguous base for the sequence type will be used.

Returns:

consensus

Return type:

Sequence