abutils: utilities for AIRR analysis

Antibody repertoire sequencing is an increasingly important tool for detailed characterization of the immune response to infection and immunization. We built abutils to provide a cohesive set of tools designed for the specific challenges inherent in working with antibody repertoire data. The components in abutils were designed to be flexible: equally at home when used interactively (in a Jupyter Notebook, for example) or when integrated into more complex programs and pipelines (such as abstar, which is capable of annotating billions of antibody sequences).


core models

To represent antibody repertoire data at varying levels of granularity, abutils provides three core models:

  • Sequence: model for representing a single antibody sequence (either heavy or light chain). Provides a means to store and access abstar annotations. Includes common methods of sequence manipulation, including slicing, reverse-complement, and conversion to FASTA format. The Sequence object is used extensively throughout the ab[x] toolkit.

  • Pair: model for representing paired (heavy and light) antibody sequences. Composed of one or more Sequence objects. Heavily used in scab, our toolkit for analyzing adaptive immune single cell datasets.

  • Lineage: model for representing an antibody clonal lineage. Composed of one or more Pair objects. Includes methods for lineage manipulation, including generating dot alignments and UCA calculation.

These models are hierarchical – a Lineage is composed of one or more Pair objects, a Pair is composed of one or more Sequence objects – and contain methods appropriate for each level of granularity.
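The nesting described above can be pictured with a minimal sketch. The class names (SketchSequence, SketchPair, SketchLineage), their attributes, and the use of a "locus" annotation to tell heavy from light chains are all hypothetical stand-ins chosen for illustration. They are not the abutils classes or their API, which may differ substantially.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical illustration only: these are NOT the abutils models.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


@dataclass
class SketchSequence:
    """A single heavy or light chain with free-form annotations."""
    id: str
    sequence: str
    annotations: Dict[str, str] = field(default_factory=dict)

    def __getitem__(self, key):
        # Slicing is delegated to the underlying nucleotide string.
        return self.sequence[key]

    def reverse_complement(self) -> str:
        # Complement each base, then reverse the result.
        return self.sequence.translate(_COMPLEMENT)[::-1]

    def fasta(self) -> str:
        return f">{self.id}\n{self.sequence}"


@dataclass
class SketchPair:
    """One or more sequences from the same cell."""
    sequences: List[SketchSequence]

    def _first(self, loci) -> Optional[SketchSequence]:
        # Assumes each sequence carries a "locus" annotation (IGH, IGK, IGL).
        return next(
            (s for s in self.sequences if s.annotations.get("locus") in loci),
            None,
        )

    @property
    def heavy(self) -> Optional[SketchSequence]:
        return self._first({"IGH"})

    @property
    def light(self) -> Optional[SketchSequence]:
        return self._first({"IGK", "IGL"})


@dataclass
class SketchLineage:
    """One or more pairs that belong to the same clonal lineage."""
    pairs: List[SketchPair]

    def heavies(self) -> List[SketchSequence]:
        return [p.heavy for p in self.pairs if p.heavy is not None]


# Toy usage with made-up sequences.
heavy = SketchSequence("h1", "ATGCCG", {"locus": "IGH"})
light = SketchSequence("l1", "GGATCC", {"locus": "IGK"})
lineage = SketchLineage([SketchPair([heavy, light])])
print(lineage.heavies()[0].fasta())
```

Each level only knows about the level directly beneath it, which is what lets per-sequence methods (slicing, reverse-complement) and per-lineage methods live on separate objects.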


data (abutils.io)

To simplify data manipulation and facilitate the integration of abutils into existing pipelines, abutils provides a set of functions for reading and writing sequence data:

  • read: read sequences from a variety of file formats, including FASTA, FASTQ, AIRR-C, and others.

  • write: write sequences to a variety of file formats, including FASTA, FASTQ, AIRR-C, and others.

  • convert: convert between Sequence or Pair objects and Pandas or Polars DataFrames.

  • paths: functions for working with file paths and directories.

All of the IO functions are accessible via abutils.io.
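To make the read, write, and convert steps concrete, here is a minimal standard-library sketch of a FASTA round trip followed by conversion to row dictionaries (the shape a DataFrame constructor typically accepts). The function names read_fasta, write_fasta, and to_rows are hypothetical and written for this example. They are not the abutils.io functions, whose names, signatures, and supported formats should be taken from the abutils documentation.

```python
import io
from typing import Dict, Iterable, Iterator, List, TextIO, Tuple

# Hypothetical helpers for illustration; NOT the abutils.io API.


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (id, sequence) tuples from a FASTA-formatted text handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Keep only the first token of the header as the identifier.
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            # Sequence lines may wrap, so collect and join them.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def write_fasta(records: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (id, sequence) tuples as unwrapped FASTA."""
    for seq_id, seq in records:
        handle.write(f">{seq_id}\n{seq}\n")


def to_rows(records: Iterable[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Convert records to a list of dicts, one per sequence."""
    return [{"sequence_id": i, "sequence": s} for i, s in records]


# Toy usage with made-up sequences.
text = ">seq1 example\nACGT\nTT\n>seq2\nGG\n"
records = list(read_fasta(io.StringIO(text)))
buffer = io.StringIO()
write_fasta(records, buffer)
rows = to_rows(records)
```

The column names sequence_id and sequence follow the AIRR-C rearrangement naming convention, which keeps the rows compatible with tools that expect that schema.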


tools (abutils.tl)

In addition to the core models, abutils provides a number of commonly used functions. These functions are widely used throughout the ab[x] toolkit and can be easily integrated into custom pipelines or used in interactive analyses.

All of the tool functions are accessible via abutils.tl.


plots (abutils.pl)

abutils provides a number of plotting functions for visualizing antibody repertoire data. These functions are built on top of matplotlib and seaborn and are designed to be easily integrated into custom analyses or pipelines. Plotting functions are designed to work with Sequence, Pair, and Lineage objects, and fully support AIRR-C annotation formats for plotting adaptive immune receptor features like CDR3 length distributions and germline gene usage.

  • bar: plot categorical data or frequency distributions as a bar plot

  • scatter: plot two-dimensional data as a scatter plot

  • kde: plot one- or two-dimensional data as a kernel density estimate

  • donut: plot categorical data (such as lineages or germline genes) as a donut plot

All of the plotting functions are accessible via abutils.pl.
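As a rough picture of the data behind plots such as CDR3 length distributions and germline gene usage, the sketch below computes both summaries from AIRR-style records using only the standard library. The function names are hypothetical and the input values are toy data invented for the example. This is not how abutils.pl is implemented, only an illustration of the quantities a bar or donut plot would display.

```python
from collections import Counter
from typing import Dict, Iterable, List, Mapping, Tuple

# Hypothetical helpers for illustration; NOT part of abutils.pl.


def cdr3_length_distribution(
    records: Iterable[Mapping[str, str]], field: str = "cdr3_aa"
) -> Dict[int, float]:
    """Return the fraction of sequences at each CDR3 amino acid length."""
    lengths = Counter(len(r[field]) for r in records if r.get(field))
    total = sum(lengths.values())
    return {length: count / total for length, count in sorted(lengths.items())}


def germline_gene_usage(
    records: Iterable[Mapping[str, str]], field: str = "v_call"
) -> List[Tuple[str, int]]:
    """Count gene-level usage, dropping the allele suffix (e.g. '*02')."""
    genes = Counter(
        # Take the first call if several are listed, then strip the allele.
        r[field].split(",")[0].split("*")[0]
        for r in records
        if r.get(field)
    )
    return genes.most_common()


# Toy records (made-up values) using AIRR-C field names.
toy_records = [
    {"v_call": "IGHV1-2*02", "cdr3_aa": "ARDY"},
    {"v_call": "IGHV1-2*04", "cdr3_aa": "ARGGY"},
    {"v_call": "IGHV3-23*01", "cdr3_aa": "AKDY"},
]
length_fractions = cdr3_length_distribution(toy_records)
gene_counts = germline_gene_usage(toy_records)
```

The resulting mapping of lengths to fractions, or the list of gene counts, can be passed to any plotting library as the categories and heights of a bar plot.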

colors (abutils.cl)

abutils provides utility functions that are generally useful when working with antibody repertoire data. These include functions for monitoring multiprocessing jobs, creating and manipulating color palettes, and more.

  • colors: comprehensive functions for working with colors and color palettes, accessible through the abutils.cl module
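For a sense of what creating and manipulating color palettes involves, here is a small standard-library sketch that converts between hex and RGB and builds a palette of evenly spaced hues. The function names and default saturation and brightness values are assumptions made for this example. They do not describe the functions actually exposed by abutils.cl.

```python
import colorsys
from typing import List, Sequence, Tuple

# Hypothetical helpers for illustration; NOT the abutils.cl API.


def hex_to_rgb(hex_color: str) -> Tuple[float, ...]:
    """Convert '#rrggbb' to an (r, g, b) tuple of floats in [0, 1]."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) / 255 for i in (0, 2, 4))


def rgb_to_hex(rgb: Sequence[float]) -> str:
    """Convert an (r, g, b) sequence of floats in [0, 1] to '#rrggbb'."""
    return "#" + "".join(f"{round(c * 255):02x}" for c in rgb)


def hue_palette(n: int, saturation: float = 0.65, value: float = 0.9) -> List[str]:
    """Return n hex colors with hues spaced evenly around the color wheel."""
    return [
        rgb_to_hex(colorsys.hsv_to_rgb(i / n, saturation, value))
        for i in range(n)
    ]


# Example: one color per category, such as the five largest lineages.
palette = hue_palette(5)
```

Spacing hues evenly keeps neighboring categories visually distinct, which is useful when a plot assigns one color per lineage or germline gene.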

