abutils: utilities for AIRR analysis
Antibody repertoire sequencing is an increasingly important tool for detailed characterization of the immune response to infection and immunization. We built abutils to provide a cohesive set of tools designed for the specific challenges inherent in working with antibody repertoire data. The components in abutils were designed to be flexible: equally at home when used interactively (in a Jupyter Notebook, for example) or when integrated into more complex programs and pipelines (such as abstar, which is capable of annotating billions of antibody sequences).
core models
To represent antibody repertoire data at varying levels of granularity, abutils provides three core models:
Sequence: model for representing a single antibody sequence (either heavy or light chain). Provides a means to store and access abstar annotations. Includes common methods of sequence manipulation, including slicing, reverse-complement, and conversion to FASTA format. The Sequence object is used extensively throughout the ab[x] toolkit.
Pair: model for representing paired (heavy and light) antibody sequences. Composed of one or more Sequence objects. Heavily used in scab, which is our toolkit for analyzing adaptive immune single cell datasets.
Lineage: model for representing an antibody clonal lineage. Composed of one or more Pair objects. Includes methods for lineage manipulation, including generating dot alignments and UCA calculation.
These models are hierarchical – a Lineage is composed of one
or more Pair objects, a Pair is composed of one or more Sequence objects – and contain methods
appropriate for each level of granularity.
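The containment relationship described above can be sketched with plain dataclasses. This is an illustration only: the names `ToySequence`, `ToyPair`, and `ToyLineage`, along with every attribute and method on them, are hypothetical stand-ins chosen for this sketch and are not the abutils classes or their signatures.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass
class ToySequence:
    """Hypothetical stand-in for a single heavy or light chain."""
    id: str
    sequence: str
    chain: str = "heavy"  # assumed convention: "heavy" or "light"

    def reverse_complement(self) -> str:
        # Complement each base, then reverse the string.
        return self.sequence.translate(_COMPLEMENT)[::-1]

    def fasta(self) -> str:
        return f">{self.id}\n{self.sequence}"


@dataclass
class ToyPair:
    """Hypothetical stand-in for a paired heavy/light record."""
    name: str
    sequences: List[ToySequence] = field(default_factory=list)

    def _first(self, chain: str) -> Optional[ToySequence]:
        return next((s for s in self.sequences if s.chain == chain), None)

    @property
    def heavy(self) -> Optional[ToySequence]:
        return self._first("heavy")

    @property
    def light(self) -> Optional[ToySequence]:
        return self._first("light")


@dataclass
class ToyLineage:
    """Hypothetical stand-in for a clonal lineage of pairs."""
    name: str
    pairs: List[ToyPair] = field(default_factory=list)

    def heavies(self) -> List[ToySequence]:
        # Collect the heavy chain of every pair that has one.
        return [p.heavy for p in self.pairs if p.heavy is not None]


heavy = ToySequence("mAb1_H", "ATGGCC", chain="heavy")
light = ToySequence("mAb1_L", "GGTACC", chain="light")
pair = ToyPair("mAb1", [heavy, light])
lineage = ToyLineage("lineage_1", [pair])
```

Each level only exposes operations that make sense at that level: sequence manipulation on the sequence, chain lookup on the pair, and collection-wide queries on the lineage.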
data (abutils.io)
To simplify data manipulation and facilitate the integration of abutils into
existing pipelines, abutils provides a set of functions for reading and writing
sequence data:
read: read sequences from a variety of file formats, including FASTA, FASTQ, AIRR-C, and others.
write: write sequences to a variety of file formats, including FASTA, FASTQ, AIRR-C, and others.
convert: convert between Sequence or Pair objects and Pandas or Polars DataFrames.
paths: functions for working with file paths and directories.
All of the IO functions are accessible via abutils.io.
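To make the read/write round trip concrete, here is a minimal standard-library sketch of FASTA parsing and formatting. The function names `parse_fasta` and `format_fasta` and the `width` parameter are assumptions made for this example; they are not the abutils.io functions, whose exact signatures should be taken from the abutils API reference.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def parse_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) tuples from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Return FASTA text, wrapping sequence lines at `width` characters."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


records = [("seq1", "ATGC" * 20), ("seq2", "GGCCTTAA")]
text = format_fasta(records, width=60)
round_trip = list(parse_fasta(io.StringIO(text)))
```

Here `round_trip` contains the same records that were written, which is the property a reader/writer pair needs to preserve.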
tools (abutils.tl)
In addition to the core models, abutils provides a number of commonly used functions. These functions are widely used throughout the ab[x] toolkit and can be easily integrated into custom pipelines or used in interactive analyses:
pairwise alignment: local (Smith-Waterman), global (Needleman-Wunsch) and semi-global pairwise sequence alignment using parasail.
multiple sequence alignment using MAFFT, MUSCLE, or FAMSA
clustering: identity-based sequence clustering with VSEARCH, CD-HIT, or MMseqs2
preprocessing: preprocessing functions for sequence data, including merging paired-end reads
clonify: assigning antibody sequences to clonal lineages using the clonify algorithm
phylogeny: computing phylogenies with FastTree or IgPhyML, tree drawing with baltic
All of the tool functions are accessible via abutils.tl.
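As background for the pairwise alignment entry above, the following is a minimal pure-Python sketch of global (Needleman-Wunsch) alignment scoring. It is a simplified stand-in, not the parasail-backed implementation: it uses a linear gap penalty and returns only the score, and the function name `nw_score` and its default scoring parameters are assumptions made for this example.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Return the optimal global alignment score of `a` and `b`.

    Uses the Needleman-Wunsch recurrence with a linear gap penalty and
    keeps only two rows of the dynamic programming matrix in memory.
    """
    # Row 0: aligning a prefix of `b` against an empty string costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        # Column 0: aligning a prefix of `a` against an empty string.
        curr = [i * gap]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            up = prev[j] + gap        # gap inserted in `b`
            left = curr[j - 1] + gap  # gap inserted in `a`
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


identical = nw_score("GATTACA", "GATTACA")
one_deletion = nw_score("GATTACA", "GATACA")
```

Local (Smith-Waterman) scoring differs mainly in clamping each cell at zero and taking the maximum over the whole matrix; semi-global alignment leaves end gaps unpenalized.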
plots (abutils.pl)
abutils provides a number of plotting functions for visualizing antibody repertoire data.
These functions are built on top of matplotlib and seaborn and are designed to be easily
integrated into custom analyses or pipelines. Plotting functions are designed to work with
Sequence, Pair, and Lineage objects, and fully support AIRR-C annotation formats for
plotting adaptive immune receptor features like CDR3 length distributions and germline gene usage.
All of the plotting functions are accessible via abutils.pl.
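The two features mentioned above, CDR3 length distributions and germline gene usage, reduce to simple counts over annotated records. The sketch below computes both with the standard library from AIRR-style dictionaries. The example records are made up for illustration, the helper names are hypothetical, and the plotting step is left out; this is not the abutils.pl API.

```python
from collections import Counter
from typing import Dict, Iterable

# Made-up example records using AIRR-C rearrangement field names.
records = [
    {"sequence_id": "s1", "v_call": "IGHV1-69*01", "cdr3_aa": "ARDYYGSSYFDY"},
    {"sequence_id": "s2", "v_call": "IGHV1-69*06", "cdr3_aa": "ARDRGYSYGFDY"},
    {"sequence_id": "s3", "v_call": "IGHV3-23*01", "cdr3_aa": "ARGGWELLDY"},
    {"sequence_id": "s4", "v_call": "IGHV4-34*01", "cdr3_aa": "AKDLGYSSGWYFDV"},
]


def cdr3_length_counts(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count records by CDR3 amino acid length, skipping missing values."""
    return Counter(len(r["cdr3_aa"]) for r in rows if r.get("cdr3_aa"))


def v_gene_usage(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count records by V gene, collapsing alleles (text after '*')."""
    return Counter(r["v_call"].split("*")[0] for r in rows if r.get("v_call"))


length_counts = cdr3_length_counts(records)
gene_counts = v_gene_usage(records)
```

Either counter can then be handed to a bar plot, for example by passing the sorted keys and their counts to matplotlib.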
colors (abutils.cl)
abutils provides utility functions that are generally useful when working with antibody repertoire data.
These include functions for monitoring multiprocessing jobs, creating and manipulating color palettes, and more.
colors: comprehensive functions for working with colors and color palettes, accessible through the abutils.cl module
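As an illustration of the kind of palette manipulation described here, the sketch below converts between hex strings and RGB tuples and builds a linear gradient between two colors. The function names `hex_to_rgb`, `rgb_to_hex`, and `gradient` are assumptions made for this example and are not taken from abutils.cl.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def hex_to_rgb(color: str) -> RGB:
    """Convert '#rrggbb' (leading '#' optional) to an (r, g, b) tuple."""
    color = color.lstrip("#")
    if len(color) != 6:
        raise ValueError("expected a 6-digit hex color")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb: RGB) -> str:
    """Convert an (r, g, b) tuple of 0-255 integers to '#rrggbb'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def gradient(start: str, end: str, n: int) -> List[str]:
    """Return `n` hex colors linearly interpolated from `start` to `end`."""
    if n < 2:
        raise ValueError("a gradient needs at least two colors")
    s, e = hex_to_rgb(start), hex_to_rgb(end)
    palette = []
    for i in range(n):
        fraction = i / (n - 1)
        # Interpolate each channel independently and round to an integer.
        palette.append(rgb_to_hex(tuple(round(a + (b - a) * fraction) for a, b in zip(s, e))))
    return palette


palette = gradient("#000000", "#646464", 3)
```

Interpolating in RGB is the simplest option; perceptually uniform palettes generally interpolate in a different color space.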