read sequence data

abutils provides functions for reading/parsing sequence data from a variety of commonly used file formats. This includes raw sequence data in FASTA or FASTQ format as well as annotated sequence data in AIRR-C, CSV, or Parquet formats.

format	function	notes
FASTA/Q	abutils.io.read_fastx()	returns a list of `Sequence` objects
FASTA/Q	abutils.io.parse_fastx()	yields single `Sequence` objects
FASTA	abutils.io.read_fasta()	returns a list of `Sequence` objects
FASTA	abutils.io.parse_fasta()	yields single `Sequence` objects
FASTQ	abutils.io.read_fastq()	returns a list of `Sequence` objects
FASTQ	abutils.io.parse_fastq()	yields single `Sequence` objects
AIRR	abutils.io.read_airr()	only supports `Sequence` objects
Parquet	abutils.io.read_parquet()	supports `Sequence` or `Pair` objects
CSV	abutils.io.read_csv()	supports `Sequence` or `Pair` objects

fasta/q files

The primary differences between read and parse functions are:

read functions read an entire file into memory and return a list of Sequence objects.
parse functions yield Sequence objects one at a time.

parse functions are generally more memory efficient for large files, but read functions may be more convenient for smaller files or when sequences need to be processed as a group rather than one at a time:

import abutils

# read entire file into memory
sequences = abutils.io.read_fasta("sequences.fasta")

# parse file one record at a time
for sequence in abutils.io.parse_fastq("sequences.fastq"):
    print(sequence)
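
The read/parse distinction can be illustrated with a minimal standard-library sketch. This is not how abutils is implemented; `parse_fasta_records` and `read_fasta_records` are hypothetical helpers that yield plain `(name, sequence)` tuples rather than Sequence objects, and they only show why a generator keeps one record in memory at a time while a list holds all of them:

```python
import io
from typing import Iterator, List, TextIO, Tuple


def parse_fasta_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) tuples one record at a time (hypothetical helper)."""
    name = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # emit the final record
    if name is not None:
        yield name, "".join(chunks)


def read_fasta_records(handle: TextIO) -> List[Tuple[str, str]]:
    """Materialize every record in a list (the "read" pattern)."""
    return list(parse_fasta_records(handle))


example = ">seq1\nACGT\nTTGA\n>seq2\nGGCC\n"
records = read_fasta_records(io.StringIO(example))       # all records at once
first = next(parse_fasta_records(io.StringIO(example)))  # only the first record
```

In this sketch the read function is just the parse function wrapped in `list()`, which is one common way to offer both interfaces.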

read_fastx() and parse_fastx() are the most flexible and can read/parse either FASTA or FASTQ files. This is particularly useful when building pipelines in which users may want to process both file types, or when the file type is not known in advance:

# FASTA file
sequences = abutils.io.read_fastx("sequences.fasta")

# FASTQ file
for sequence in abutils.io.parse_fastx("sequences.fastq"):
    print(sequence)
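
One common way to tell the two formats apart is to look at the first character of the first record: FASTA headers start with ">" and FASTQ headers start with "@". The sketch below assumes that convention; `detect_fastx_format` is a hypothetical helper, and whether abutils detects the format this way is not stated here:

```python
import io
from typing import TextIO


def detect_fastx_format(handle: TextIO) -> str:
    """Return "fasta" or "fastq" based on the first non-blank line (hypothetical helper)."""
    for line in handle:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("@"):
            return "fastq"
        raise ValueError(f"unrecognized record start: {stripped[:20]!r}")
    raise ValueError("input is empty")


fasta_kind = detect_fastx_format(io.StringIO(">seq1\nACGT\n"))
fastq_kind = detect_fastx_format(io.StringIO("@seq1\nACGT\n+\nIIII\n"))
```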

All of the FASTA/Q/X read and parse functions can handle gzip-compressed files automatically:

# FASTA file
sequences = abutils.io.read_fastx("sequences.fasta.gz")

# FASTQ file
for sequence in abutils.io.parse_fastx("sequences.fastq.gz"):
    print(sequence)
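
Transparent gzip handling is often done by checking the two-byte gzip magic number rather than trusting the file extension. The following is a sketch of that general technique using only the standard library; `open_maybe_gzipped` is a hypothetical helper and is not taken from abutils:

```python
import gzip
import os
import tempfile
from typing import TextIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzipped(path: str) -> TextIO:
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    with open(path, "rb") as raw:
        is_gzipped = raw.read(2) == GZIP_MAGIC
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "rt")


# write the same record to a plain file and a gzipped file, then read both back
with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "sequences.fasta")
    gz_path = os.path.join(tmp, "sequences.fasta.gz")
    with open(plain_path, "wt") as f:
        f.write(">seq1\nACGT\n")
    with gzip.open(gz_path, "wt") as f:
        f.write(">seq1\nACGT\n")
    with open_maybe_gzipped(plain_path) as f:
        plain_text = f.read()
    with open_maybe_gzipped(gz_path) as f:
        gz_text = f.read()
```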

annotated sequence files

read_airr() can read AIRR-C formatted sequence data from a tab-delimited file, returning a list of Sequence objects:

sequences = abutils.io.read_airr("sequences.tsv")

read_parquet() and read_csv() read annotated sequence data from Parquet and CSV files, respectively. Both expect the annotations to follow the AIRR-C format; the only difference from a standard AIRR-C file is that the data is stored as Parquet or CSV rather than as tab-delimited text:

# read CSV file of annotated sequences
sequences = abutils.io.read_csv("sequences.csv")

# read Parquet file of annotated paired sequences
pairs = abutils.io.read_parquet("pairs.parquet")

Both read_csv() and read_parquet() support reading annotations from paired sequences, which is a custom extension of the AIRR-C format. Each row in the CSV or Parquet file contains annotations for both heavy and light chains. All annotation fields in the AIRR-C format are retained for each chain, with ":0" appended to each heavy chain field name and ":1" appended to each light chain field name. The row also contains a "name" field so that the name of the paired sequence can be distinct from the names of the individual chains.

Note

read_parquet() and read_csv() will automatically detect whether the input file contains Sequence or Pair objects based on the file schema.

# read CSV file of annotated paired sequences
pairs = abutils.io.read_csv("pairs.csv")

# read Parquet file of annotated paired sequences
pairs = abutils.io.read_parquet("pairs.parquet")
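
To make the paired naming convention concrete, the sketch below splits one paired row into its heavy chain, light chain, and shared fields using the ":0" and ":1" suffixes described above. `split_pair_row` and `is_paired` are hypothetical helpers operating on plain dicts; in particular, checking for suffixed column names is only one plausible way to recognize a paired schema, and how abutils actually does it is not described here:

```python
from typing import Any, Dict, Iterable, Tuple

HEAVY_SUFFIX = ":0"
LIGHT_SUFFIX = ":1"


def is_paired(columns: Iterable[str]) -> bool:
    """Guess whether column names describe paired records (assumed heuristic)."""
    return any(c.endswith((HEAVY_SUFFIX, LIGHT_SUFFIX)) for c in columns)


def split_pair_row(
    row: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split a paired row into (heavy, light, shared) dicts with suffixes removed."""
    heavy: Dict[str, Any] = {}
    light: Dict[str, Any] = {}
    shared: Dict[str, Any] = {}
    for key, value in row.items():
        if key.endswith(HEAVY_SUFFIX):
            heavy[key[: -len(HEAVY_SUFFIX)]] = value
        elif key.endswith(LIGHT_SUFFIX):
            light[key[: -len(LIGHT_SUFFIX)]] = value
        else:
            # fields without a chain suffix, such as "name", belong to the pair
            shared[key] = value
    return heavy, light, shared


# illustrative row; the IDs and gene names are made-up example values
row = {
    "name": "pair1",
    "sequence_id:0": "pair1_heavy",
    "v_gene:0": "IGHV1-2",
    "sequence_id:1": "pair1_light",
    "v_gene:1": "IGKV3-20",
}
heavy, light, shared = split_pair_row(row)
paired = is_paired(row)
unpaired = is_paired(["sequence_id", "v_gene"])
```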

All of the functions for reading annotated sequence data include a match parameter that can be used to filter the sequences or pairs that are read from the file. This is useful when only a fraction of the sequences or pairs in the file are desired:

# read an AIRR file of sequences and return only those that use IGHV1-2
sequences = abutils.io.read_airr(
    "sequences.tsv",
    match={"v_gene": "IGHV1-2"},
)

# read Parquet file of paired sequences and return only those
# that have a productive heavy chain and light chain
pairs = abutils.io.read_parquet(
    "pairs.parquet",
    match={
        "productive:0": True,
        "productive:1": True,
    },
)
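
The semantics of match (a record is kept only if it satisfies every condition) can be sketched on plain dicts as follows. `filter_rows` is a hypothetical helper that assumes simple equality tests; it is not the library's implementation:

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_rows(
    rows: Iterable[Dict[str, Any]],
    match: Optional[Dict[str, Any]] = None,
) -> List[Dict[str, Any]]:
    """Keep only rows whose fields equal every value in `match` (hypothetical helper)."""
    if not match:
        return list(rows)
    return [
        row
        for row in rows
        if all(key in row and row[key] == value for key, value in match.items())
    ]


# illustrative paired rows using the ":0" (heavy) and ":1" (light) suffixes
rows = [
    {"name": "pair1", "productive:0": True, "productive:1": True},
    {"name": "pair2", "productive:0": True, "productive:1": False},
    {"name": "pair3", "productive:0": False, "productive:1": True},
]
matched = filter_rows(rows, match={"productive:0": True, "productive:1": True})
matched_names = [row["name"] for row in matched]
```

Only the first row satisfies both conditions, so it is the only one retained.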

api

abutils.io.read_fastx(fastx: str) → Iterable[Sequence]

Reads FASTA or FASTQ-formatted sequence data and returns Sequence objects. Gzipped files are supported.

Parameters: fastx (str) – Path to a FASTA or FASTQ-formatted file, optionally gzip-compressed. Required.
Returns: sequences
Return type: list of Sequences

abutils.io.parse_fastx(fastx: str) → Sequence

Parses FASTA or FASTQ-formatted sequence data and yields Sequence objects. This differs from read_fastx in that it yields sequences one at a time rather than reading all into memory and returning a list. This makes it suitable for files that are too large to fit into memory.

Parameters: fastx (str) – Path to a FASTA or FASTQ-formatted file, optionally gzip-compressed. Required.
Yields: sequences (Sequence)

abutils.io.read_fasta(fasta: str) → Iterable[Sequence]

Reads FASTA-formatted sequence data and returns Sequence objects. Gzipped files are supported.

Parameters: fasta (str) – Either a FASTA-formatted string or the path to a FASTA file. FASTA files can be gzip-compressed. Required.
Returns: sequences
Return type: list of Sequences

abutils.io.parse_fasta(fasta: str) → Sequence

Parses FASTA-formatted sequence data and yields Sequence objects. This differs from read_fasta in that it yields sequences one at a time rather than reading all into memory and returning a list. This makes it suitable for files that are too large to fit into memory.

Parameters: fasta (str) – Path to a FASTA-formatted file, optionally gzip-compressed. Required.
Yields: sequences (Sequence)

abutils.io.read_fastq(fastq: str) → Iterable[Sequence]

Reads FASTQ-formatted sequence data and returns Sequence objects. Gzipped files are supported.

Parameters: fastq (str) – Either a FASTQ-formatted string or the path to a FASTQ file. FASTQ files can be gzip-compressed. Required.
Returns: sequences
Return type: list of Sequences

abutils.io.parse_fastq(fastq: str) → Sequence

Parses FASTQ-formatted sequence data and yields Sequence objects. This differs from read_fastq in that it yields sequences one at a time rather than reading all into memory and returning a list. This makes it suitable for files that are too large to fit into memory.

Parameters: fastq (str) – Path to a FASTQ-formatted file, optionally gzip-compressed. Required.
Yields: sequences (Sequence)

abutils.io.read_airr(airr_file: str, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence', as_dataframe: bool = False) → Iterable[Sequence | Pair]

Reads an AIRR-C formatted, tab-delimited file and returns a list of Sequence or Pair objects.

Parameters:

airr_file (str) – Path to the AIRR-C formatted file to be read.
match (dict, optional) –
A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
```
{'v_gene:0': 'IGHV1-2'}
```
fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
as_dataframe (bool, default=False) –
If True, the function will return a polars.DataFrame object.

Note

If as_dataframe is True, fields will be used to select columns from the input file, but match will be ignored.

Returns:

A list of Sequence or Pair objects or a polars.DataFrame object.

Return type:

Iterable[Union[Sequence, Pair]] or polars.DataFrame

abutils.io.read_parquet(parquet_file: str, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence', as_dataframe: bool = False) → Iterable[Sequence | Pair]

Reads a Parquet file and returns a list of Sequence or Pair objects.

Parameters:

parquet_file (str) – Path to the Parquet file to be read.
match (dict, optional) –
A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
```
{'v_gene:0': 'IGHV1-2'}
```
fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
as_dataframe (bool, default=False) –
If True, the function will return a polars.DataFrame object.

Note

If as_dataframe is True, fields will be used to select columns from the input file, but match will be ignored.

Returns:

A list of Sequence or Pair objects or a polars.DataFrame object.

Return type:

Iterable[Union[Sequence, Pair]] or polars.DataFrame

abutils.io.read_csv(csv_file: str, separator: str = ',', match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence', as_dataframe: bool = False) → Iterable[Sequence | Pair]

Reads a CSV file and returns a list of Sequence or Pair objects.

Parameters:

csv_file (str) – Path to the CSV file to be read.
separator (str, default=",") – Column separator. Default is ",".
match (dict, optional) –
A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
```
{'v_gene:0': 'IGHV1-2'}
```
fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
as_dataframe (bool, default=False) –
If True, the function will return a polars.DataFrame object.

Note

If as_dataframe is True, fields will be used to select columns from the input file, but match will be ignored.

Returns:

A list of Sequence or Pair objects or a polars.DataFrame object.

Return type:

Iterable[Union[Sequence, Pair]] or polars.DataFrame
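
As a rough illustration of how the separator, fields, id_key, and sequence_key parameters relate to the columns of a delimited file, the sketch below reads a small semicolon-delimited table with the standard-library csv module. `read_delimited` is a hypothetical stand-in that returns plain dicts; it is an assumption about the general idea, not a reproduction of read_csv(), and details such as whether the ID and sequence columns must also be listed in fields may differ in abutils:

```python
import csv
import io
from typing import Any, Dict, List, Optional, TextIO


def read_delimited(
    handle: TextIO,
    separator: str = ",",
    fields: Optional[List[str]] = None,
    id_key: str = "sequence_id",
    sequence_key: str = "sequence",
) -> List[Dict[str, Any]]:
    """Read delimited rows into simple records (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter=separator)
    records = []
    for row in reader:
        # keep every column unless a subset of fields was requested
        annotations = dict(row) if fields is None else {k: row[k] for k in fields}
        records.append(
            {
                "id": row[id_key],
                "sequence": row[sequence_key],
                "annotations": annotations,
            }
        )
    return records


# illustrative input with non-default column names and separator
text = "name;seq;v_gene\nseq1;ACGT;IGHV1-2\nseq2;GGCC;IGHV3-23\n"
records = read_delimited(
    io.StringIO(text),
    separator=";",
    fields=["v_gene"],
    id_key="name",
    sequence_key="seq",
)
```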