read sequence data
abutils provides functions for reading/parsing sequence data from a variety of commonly
used file formats. This includes raw sequence data in FASTA or FASTQ format as well as
annotated sequence data in AIRR-C, CSV, or Parquet formats.
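| format | function | notes |
|---|---|---|
| FASTA/Q | read_fastx() | returns a list of Sequence objects |
| | parse_fastx() | yields single Sequence objects |
| FASTA | read_fasta() | returns a list of Sequence objects |
| | parse_fasta() | yields single Sequence objects |
| FASTQ | read_fastq() | returns a list of Sequence objects |
| | parse_fastq() | yields single Sequence objects |
| AIRR | read_airr() | only supports Sequence objects |
| Parquet | read_parquet() | supports Sequence and Pair objects |
| CSV | read_csv() | supports Sequence and Pair objects |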
fasta/q files
The primary differences between read and parse functions are:
- read functions read an entire file into memory and return a list of Sequence objects.
- parse functions yield Sequence objects one at a time.
parse functions are generally more memory efficient for large files, but read
functions may be more convenient for smaller files or when sequences need to be processed
as a group rather than one at a time:
# read entire file into memory
sequences = abutils.io.read_fasta("sequences.fasta")
# parse file one record at a time
for sequence in abutils.io.parse_fastq("sequences.fastq"):
    print(sequence)
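The trade-off between the two styles can be illustrated with a minimal, self-contained sketch. The helpers parse_fasta_sketch and read_fasta_sketch are hypothetical and are not part of abutils; they only mimic the list-versus-generator behavior described above, using plain (name, sequence) tuples in place of Sequence objects.

```python
import os
import tempfile
from typing import Iterator, List, Tuple


def parse_fasta_sketch(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) tuples one record at a time (hypothetical helper)."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # a new header closes the previous record, if there is one
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    # emit the final record once the file is exhausted
    if name is not None:
        yield name, "".join(chunks)


def read_fasta_sketch(path: str) -> List[Tuple[str, str]]:
    """Load every record into a list, trading memory for convenience."""
    return list(parse_fasta_sketch(path))


# write a tiny two-record FASTA file to demonstrate both styles
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\nACGT\n>seq2\nTTGA\n")
    fasta_path = tmp.name

records = read_fasta_sketch(fasta_path)  # whole file held in memory

parser = parse_fasta_sketch(fasta_path)
first = next(parser)  # only the first record has been consumed so far
parser.close()        # closing the generator also closes the file handle

os.remove(fasta_path)
```

Because the generator holds only one record at a time, its memory use stays roughly constant regardless of file size, whereas the list grows with the number of records.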
read_fastx() and parse_fastx() are the most flexible and can read/parse either
FASTA or FASTQ files. This is particularly useful when building pipelines in which users
may want to process both file types or when the file format is not known in advance:
# FASTA file
sequences = abutils.io.read_fastx("sequences.fasta")
# FASTQ file
for sequence in abutils.io.parse_fastx("sequences.fastq"):
    print(sequence)
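One way a combined reader could tell the two formats apart is by inspecting the first record marker: FASTA headers begin with ">" and FASTQ headers begin with "@". The sketch below assumes this approach. detect_fastx_format is a hypothetical helper, and how abutils itself distinguishes the formats is not described here.

```python
def detect_fastx_format(text: str) -> str:
    """Guess the format from the first non-blank line (hypothetical helper)."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        break  # first non-blank line is neither kind of header
    raise ValueError("input does not look like FASTA or FASTQ")


fasta_text = ">seq1\nACGT\n"
fastq_text = "@seq1\nACGT\n+\nIIII\n"

fasta_format = detect_fastx_format(fasta_text)
fastq_format = detect_fastx_format(fastq_text)
```

Checking only the first record keeps detection cheap, which matters when the input is later streamed one record at a time.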
All of the FASTA/Q/X read and parse functions can handle gzip-compressed files automatically:
# FASTA file
sequences = abutils.io.read_fastx("sequences.fasta.gz")
# FASTQ file
for sequence in abutils.io.parse_fastx("sequences.fastq.gz"):
    print(sequence)
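Transparent gzip handling of this kind is commonly implemented by checking the two-byte gzip magic number before choosing how to open the file. The following is a minimal sketch under that assumption; open_maybe_gzipped is a hypothetical helper and does not reflect the abutils implementation.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzipped(path: str):
    """Open a text file, decompressing gzip transparently (hypothetical helper)."""
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == GZIP_MAGIC
    if is_gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")


# write the same FASTA content twice: once plain, once gzip-compressed
content = ">seq1\nACGT\n"
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "sequences.fasta")
gz_path = os.path.join(tmp_dir, "sequences.fasta.gz")
with open(plain_path, "w") as handle:
    handle.write(content)
with gzip.open(gz_path, "wt") as handle:
    handle.write(content)

# both files yield identical text through the same helper
with open_maybe_gzipped(plain_path) as handle:
    plain_text = handle.read()
with open_maybe_gzipped(gz_path) as handle:
    gz_text = handle.read()

shutil.rmtree(tmp_dir)
```

Inspecting the file contents rather than the ".gz" extension means misnamed files are still handled correctly.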
annotated sequence files
read_airr() can read AIRR-C formatted sequence data from a tab-delimited file,
returning a list of Sequence objects:
sequences = abutils.io.read_airr("sequences.tsv")
read_parquet() and read_csv() can read Parquet and CSV formatted annotated sequence data,
and expect the annotations to be in AIRR-C format – the only difference is in the file format,
which can be either Parquet or CSV instead of the AIRR-C tab-delimited format:
# read CSV file of annotated sequences
sequences = abutils.io.read_csv("sequences.csv")
# read Parquet file of annotated sequences
sequences = abutils.io.read_parquet("sequences.parquet")
Both read_csv() and read_parquet() support reading annotations from paired sequences,
which is a custom extension of the AIRR-C format. Each row in the CSV or Parquet file
contains annotations for both heavy and light chains. All annotation
fields in the AIRR-C format are conserved for each chain, with heavy chains appending ":0"
to the end of each annotation field name and light chains appending ":1". The row also contains
a "name" field so that the name of the paired sequence can be distinct from the names of the
individual chains.
Note
read_parquet() and read_csv() will automatically detect whether the input file
contains Sequence or Pair objects based on the file schema.
# read CSV file of annotated paired sequences
pairs = abutils.io.read_csv("pairs.csv")
# read Parquet file of annotated paired sequences
pairs = abutils.io.read_parquet("pairs.parquet")
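The ":0" and ":1" column convention described above can be made concrete with a small sketch that splits one paired row into its heavy and light chain annotations. Both helpers are hypothetical: split_paired_row is for illustration only, and looks_paired shows one assumed way a schema check might work, not necessarily how abutils detects paired files.

```python
from typing import Dict, Iterable, Tuple


def split_paired_row(row: Dict[str, object]) -> Tuple[object, Dict[str, object], Dict[str, object]]:
    """Split one paired row into (name, heavy, light) (hypothetical helper)."""
    heavy, light = {}, {}
    for column, value in row.items():
        if column.endswith(":0"):
            heavy[column[:-2]] = value  # strip the ":0" heavy chain suffix
        elif column.endswith(":1"):
            light[column[:-2]] = value  # strip the ":1" light chain suffix
    return row.get("name"), heavy, light


def looks_paired(columns: Iterable[str]) -> bool:
    """Assumed schema check: paired files carry ':0' / ':1' suffixed columns."""
    return any(column.endswith((":0", ":1")) for column in columns)


# illustrative row; the values are made up for the example
row = {
    "name": "mAb1",
    "sequence_id:0": "mAb1_heavy",
    "v_gene:0": "IGHV1-2",
    "sequence_id:1": "mAb1_light",
    "v_gene:1": "IGKV3-20",
}
name, heavy, light = split_paired_row(row)
```

After splitting, each chain's dictionary uses the standard AIRR-C field names, so downstream code written for single sequences can be reused per chain.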
All of the functions for reading annotated sequence data include a match parameter that
can be used to filter the sequences or pairs that are read from the file. This is useful
when only a fraction of the sequences or pairs in the file are desired:
# read an AIRR file of sequences and return only those that use IGHV1-2
sequences = abutils.io.read_airr(
    "sequences.tsv",
    match={"v_gene": "IGHV1-2"},
)
# read Parquet file of paired sequences and return only those
# that have a productive heavy chain and light chain
pairs = abutils.io.read_parquet(
    "pairs.parquet",
    match={
        "productive:0": True,
        "productive:1": True,
    },
)
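The filtering semantics of match (a record is returned only if it satisfies every condition) can be sketched with plain dictionaries. filter_rows is a hypothetical helper, and the use of simple equality for each condition is an assumption made for illustration.

```python
from typing import Dict, Iterable, List


def filter_rows(rows: Iterable[Dict[str, object]], match: Dict[str, object]) -> List[Dict[str, object]]:
    """Keep rows for which every key in `match` equals the row's value."""
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in match.items())
    ]


# illustrative paired records; the values are made up for the example
rows = [
    {"name": "pair1", "productive:0": True, "productive:1": True},
    {"name": "pair2", "productive:0": True, "productive:1": False},
    {"name": "pair3", "productive:0": False, "productive:1": True},
]

kept = filter_rows(rows, {"productive:0": True, "productive:1": True})
kept_names = [row["name"] for row in kept]
```

Only the first record survives, because the other two each fail one of the two conditions.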
api
- abutils.io.read_fastx(fastx: str) -> Iterable[Sequence]
Reads FASTA or FASTQ-formatted sequence data and returns a list of Sequence objects. Gzipped files are supported.
- Parameters:
fastx (str) – Path to a FASTA or FASTQ-formatted file, optionally gzip-compressed. Required.
- Returns:
sequences
- Return type:
list of Sequence objects
- abutils.io.parse_fastx(fastx: str) -> Sequence
Parses FASTA or FASTQ-formatted sequence data and yields Sequence objects. This differs from read_fastx in that it yields sequences one at a time rather than reading all into memory and returning a list. This makes it safe for extremely large files that may not fit into memory.
- Parameters:
fastx (str) – Path to a FASTA or FASTQ-formatted file, optionally gzip-compressed. Required.
- Yields:
sequences (Sequence)
- abutils.io.read_fasta(fasta: str) -> Iterable[Sequence]
Reads FASTA-formatted sequence data and returns a list of Sequence objects. Gzipped files are supported.
- Parameters:
fasta (str) – Either a FASTA-formatted string or the path to a FASTA file. FASTA files can be gzip-compressed. Required.
- Returns:
sequences
- Return type:
list of Sequence objects
- abutils.io.parse_fasta(fasta: str) -> Sequence
Parses FASTA-formatted sequence data and yields Sequence objects. This differs from read_fasta in that it yields sequences one at a time rather than reading all into memory and returning a list. This makes it safe for extremely large files that may not fit into memory.
- Parameters:
fasta (str) – Path to a FASTA-formatted file, optionally gzip-compressed. Required.
- Yields:
sequences (Sequence)
- abutils.io.read_fastq(fastq: str) -> Iterable[Sequence]
Reads FASTQ-formatted sequence data and returns a list of Sequence objects. Gzipped files are supported.
- Parameters:
fastq (str) – Either a FASTQ-formatted string or the path to a FASTQ file. FASTQ files can be gzip-compressed. Required.
- Returns:
sequences
- Return type:
list of Sequence objects
- abutils.io.parse_fastq(fastq: str) -> Sequence
Parses FASTQ-formatted sequence data and yields Sequence objects. This differs from read_fastq in that it yields sequences one at a time rather than reading all into memory and returning a list. This makes it safe for extremely large files that may not fit into memory.
- Parameters:
fastq (str) – Path to a FASTQ-formatted file, optionally gzip-compressed. Required.
- Yields:
sequences (Sequence)
- abutils.io.read_airr(airr_file: str, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence', as_dataframe: bool = False) -> Iterable[Sequence | Pair]
Reads an AIRR-C formatted, tab-delimited file and returns a list of Sequence or Pair objects.
- Parameters:
airr_file (str) – Path to the AIRR file to be read.
match (dict, optional) – A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2': {'v_gene:0': 'IGHV1-2'}
fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
as_dataframe (bool, default=False) – If True, the function will return a polars.DataFrame object.
Note
If as_dataframe is True, fields will be used to select columns from the input file, but match will be ignored.
- Returns:
A list of Sequence or Pair objects or a polars.DataFrame object.
- abutils.io.read_parquet(parquet_file: str, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence', as_dataframe: bool = False) -> Iterable[Sequence | Pair]
Reads a Parquet file and returns a list of Sequence or Pair objects.
- Parameters:
parquet_file (str) – Path to the Parquet file to be read.
match (dict, optional) – A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2': {'v_gene:0': 'IGHV1-2'}
fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
as_dataframe (bool, default=False) – If True, the function will return a polars.DataFrame object.
Note
If as_dataframe is True, fields will be used to select columns from the input file, but match will be ignored.
- Returns:
A list of Sequence or Pair objects or a polars.DataFrame object.
- abutils.io.read_csv(csv_file: str, separator: str = ',', match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence', as_dataframe: bool = False) -> Iterable[Sequence | Pair]
Reads a CSV file and returns a list of Sequence or Pair objects.
- Parameters:
csv_file (str) – Path to the CSV file to be read.
separator (str, default=",") – Column separator. Default is ",".
match (dict, optional) – A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2': {'v_gene:0': 'IGHV1-2'}
fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
as_dataframe (bool, default=False) – If True, the function will return a polars.DataFrame object.
Note
If as_dataframe is True, fields will be used to select columns from the input file, but match will be ignored.
- Returns:
A list of Sequence or Pair objects or a polars.DataFrame object.