sequence I/O
abutils provides a set of functions for reading and writing
sequence data to and from various file formats. Additionally, we can convert lists of
Pair and Sequence objects to and from Pandas or Polars DataFrames.
sequence annotations
abutils follows the AIRR-C standard for sequence annotations. In tabular format, such as
tab-delimited (the official AIRR format), CSV, or Parquet, sequence annotations appear as follows,
with one sequence per row:
| sequence_id | sequence | sequence_aa | … |
|---|---|---|---|
| sequence1 | ATCG… | EVQLVE… | … |
| sequence2 | ATCG… | QVQLVE… | … |
| sequence3 | ATCG… | EVQLVE… | … |
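As a minimal sketch of the one-sequence-per-row layout, the snippet below parses a small tab-delimited table with Python's standard `csv` module rather than with abutils itself. The sequence values are made-up placeholders, and only the three column names shown in the table above are used.

```python
import csv
import io

# Hypothetical AIRR-style table: tab-delimited, one sequence per row.
# The nucleotide and amino acid strings are placeholders, not real annotations.
airr_tsv = (
    "sequence_id\tsequence\tsequence_aa\n"
    "sequence1\tGAGGTGCAGCTGGTGGAG\tEVQLVE\n"
    "sequence2\tCAGGTGCAGCTGGTGGAG\tQVQLVE\n"
)

# DictReader maps each row to a dict keyed by the header fields.
records = list(csv.DictReader(io.StringIO(airr_tsv), delimiter="\t"))

for record in records:
    print(record["sequence_id"], record["sequence_aa"])
```

Because every row is a flat mapping from field name to value, the same records could equally be stored as CSV or Parquet without changing their shape.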
pair annotations
abutils uses a custom extension of the AIRR-C standard for pair annotations. Each row contains
a heavy/light chain pair. All AIRR-C fields are supported for each chain: heavy chain annotation fields
carry the suffix ":0" and light chain annotation fields carry the suffix ":1". Additionally, a name field
is included so that the pair can be named independently of either chain:
| name | sequence_id:0 | sequence:0 | … | sequence_id:1 | sequence:1 | … |
|---|---|---|---|---|---|---|
| pair1 | sequence1_heavy | ATGC… | … | sequence1_light | ATGC… | … |
| pair2 | sequence2_heavy | ATGC… | … | sequence2_light | ATGC… | … |
| pair3 | sequence3_heavy | ATGC… | … | sequence3_light | ATGC… | … |
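The suffix convention can be illustrated with plain dictionaries. The sketch below is not part of abutils: `split_pair_row` is a hypothetical helper, and the row values are placeholders. It only shows how the ":0" and ":1" suffixes separate heavy chain fields from light chain fields within one flat row.

```python
def split_pair_row(row):
    """Split a flat pair row into (name, heavy, light) using the ":0"/":1" suffixes.

    Hypothetical helper for illustration; not an abutils function.
    """
    heavy, light = {}, {}
    for key, value in row.items():
        if key.endswith(":0"):
            heavy[key[:-2]] = value  # heavy chain field, suffix removed
        elif key.endswith(":1"):
            light[key[:-2]] = value  # light chain field, suffix removed
    return row.get("name"), heavy, light


# Hypothetical pair row; the sequences are placeholders.
pair_row = {
    "name": "pair1",
    "sequence_id:0": "sequence1_heavy",
    "sequence:0": "ATGC",
    "sequence_id:1": "sequence1_light",
    "sequence:1": "ATGC",
}

name, heavy, light = split_pair_row(pair_row)
print(name, heavy["sequence_id"], light["sequence_id"])
```

After splitting, each chain's dictionary uses the unsuffixed AIRR-C field names, so it has the same shape as a single sequence annotation row.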
read
| format | function | notes |
|---|---|---|
| FASTA/Q | | returns a list of |
| | | yields single |
| FASTA | | returns a list of |
| | | yields single |
| FASTQ | | returns a list of |
| | | yields single |
| AIRR | | only supports |
| Parquet | | supports |
| CSV | | supports |
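The table distinguishes readers that return a whole list from readers that yield one record at a time. The sketch below illustrates that difference with a small pure-Python FASTA parser. It is not the abutils implementation: `parse_fasta_sketch` and `read_fasta_sketch` are hypothetical names, and records are represented as plain `(id, sequence)` tuples rather than abutils objects.

```python
import io


def parse_fasta_sketch(handle):
    """Yield (sequence_id, sequence) tuples one record at a time.

    Hypothetical helper for illustration; not an abutils function.
    """
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # Keep only the first whitespace-delimited token as the ID.
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def read_fasta_sketch(handle):
    """Return all records at once by exhausting the generator."""
    return list(parse_fasta_sketch(handle))


fasta_text = ">sequence1\nATGC\nATGC\n>sequence2\nGGCC\n"

all_records = read_fasta_sketch(io.StringIO(fasta_text))
first_record = next(parse_fasta_sketch(io.StringIO(fasta_text)))
print(all_records)
print(first_record)
```

The list form is convenient for small files, while the generator form avoids holding every record in memory at once.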
write
| format | function | notes |
|---|---|---|
| FASTA | | supports |
| FASTQ | | supports |
| AIRR | | only supports |
| Parquet | | supports |
| CSV | | supports |
convert
| format | function | notes |
|---|---|---|
| Pandas | to_pandas() | supports |
| | from_pandas() | supports |
| Polars | to_polars() | supports |
| | from_polars() | supports |