write sequence data
abutils provides functions for writing sequence data to a variety of commonly
used file formats. This includes raw sequence data in FASTA or FASTQ format as well as
annotated sequence data in AIRR-C, CSV, or Parquet formats.
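| format | function | notes |
|---|---|---|
| FASTA | to_fasta() | supports Sequence or Pair objects |
| FASTQ | to_fastq() | supports Sequence or Pair objects |
| AIRR | to_airr() | only supports Sequence objects |
| Parquet | to_parquet() | supports Sequence or Pair objects |
| CSV | to_csv() | supports Sequence or Pair objects |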
fasta/q files
abutils can write lists of Sequence or Pair objects to FASTA or FASTQ files:
Warning
While abutils can write both Sequence and Pair objects to various formats, the input must contain
only one type of object. For example, you cannot mix Sequence and Pair objects in the
same list.
# write list of sequences to FASTA file
abutils.io.to_fasta(
sequences,
"my-output-file.fasta"
)
# write list of pairs to FASTQ file
abutils.io.to_fastq(
pairs,
"my-paired-output-file.fastq"
)
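The warning above about mixing object types can be checked before calling a writer. The following is a minimal sketch, not part of abutils: the `Sequence` and `Pair` classes are empty stand-ins for the real abutils classes, and `ensure_single_type` is a hypothetical helper name.

```python
from typing import Iterable, List


class Sequence:
    """Hypothetical stand-in for the abutils Sequence class (illustration only)."""


class Pair:
    """Hypothetical stand-in for the abutils Pair class (illustration only)."""


def ensure_single_type(items: Iterable[object]) -> List[object]:
    """Return the items as a list, raising if more than one type is present."""
    items = list(items)
    kinds = {type(item).__name__ for item in items}
    if len(kinds) > 1:
        raise TypeError(f"mixed object types: {sorted(kinds)}")
    return items


# A homogeneous list passes through unchanged.
only_sequences = ensure_single_type([Sequence(), Sequence()])

# A mixed list is rejected before any file is written.
try:
    ensure_single_type([Sequence(), Pair()])
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```

Running the check first keeps a mixed list from producing a partially written output file.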
annotated sequence files
to_airr() can write Sequence objects to AIRR-C formatted (tab-delimited) files:
# write list of sequences to AIRR-C file
abutils.io.to_airr(
sequences,
"my-airr-output-file.tsv"
)
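Because the AIRR-C output is tab-delimited, it can be read back with Python's standard `csv` module. The sketch below writes a small stand-in file itself rather than calling abutils. The records are made up, and the three column names come from the AIRR rearrangement schema, so which columns `to_airr()` actually emits is an assumption here.

```python
import csv
import os
import tempfile

# Hypothetical records using field names from the AIRR rearrangement schema.
records = [
    {"sequence_id": "seq1", "sequence": "CAGGTGCAGCTG", "v_call": "IGHV1-2*02"},
    {"sequence_id": "seq2", "sequence": "GACATCCAGATG", "v_call": "IGKV1-39*01"},
]

airr_path = os.path.join(tempfile.mkdtemp(), "example-airr.tsv")

# Write a header row plus one tab-delimited row per record.
with open(airr_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(records[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(records)

# Read the file back: each row becomes a dict keyed by column name.
with open(airr_path, newline="") as handle:
    rows = list(csv.DictReader(handle, delimiter="\t"))

print(rows[0]["sequence_id"], rows[0]["v_call"])
```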
to_parquet() can write Sequence or Pair objects to Parquet files:
# write list of sequences to Parquet file
abutils.io.to_parquet(
sequences,
"my-parquet-output-file.parquet"
)
# write list of pairs to Parquet file
abutils.io.to_parquet(
pairs,
"my-paired-parquet-output-file.parquet"
)
to_csv() can write Sequence or Pair objects to CSV files:
# write list of sequences to CSV file
abutils.io.to_csv(
sequences,
"my-csv-output-file.csv"
)
# write list of pairs to CSV file
abutils.io.to_csv(
pairs,
"my-paired-csv-output-file.csv"
)
api
- abutils.io.to_fasta(sequences: str | Iterable, fasta_file: str | None = None, id_key: str | None = None, sequence_key: str | None = None, tempfile_dir: str | None = None, append_chain: bool = True, as_string: bool = False) -> str
Writes sequences to a FASTA-formatted file or returns a FASTA-formatted string.
- Parameters:
sequences (str or Iterable) –
- Accepts any of the following:
  - list of abutils Sequence and/or Pair objects
  - FASTA/Q-formatted string
  - path to a FASTA/Q-formatted file
  - list of BioPython SeqRecord objects
  - list of lists/tuples, of the format [sequence_id, sequence]
Required.
Note
Processing a list containing a mixture of Sequence and/or Pair objects is supported.
fasta_file (str, default=None) – Path to the output FASTA file. If neither fasta_file nor tempfile_dir is provided, a FASTA-formatted string will be returned.
id_key (str, default=None) – Name of the annotation field containing the sequence ID. If not provided, sequence.id is used.
sequence_key (str, default=None) – Name of the annotation field containing the sequence. If not provided, sequence.sequence is used.
tempfile_dir (str, optional) – If fasta_file is not provided, directory in which the tempfile should be created. If the directory does not exist, it will be created.
append_chain (bool, default=True) – If True, the chain (heavy or light) will be appended to the sequence name: >MySequence_heavy.
Note
This option is ignored unless a list containing Pair objects is provided.
as_string (bool, default=False) – Deprecated. Kept for backwards compatibility.
- Returns:
fasta – Path to a FASTA file or a FASTA-formatted string
- Return type:
str
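To illustrate the `[sequence_id, sequence]` input format and the naming pattern described for `append_chain`, here is a minimal standard-library sketch. It is not the abutils implementation: `format_fasta` and `chain_name` are hypothetical helpers, the sequences are made up, and the `_light` suffix is assumed by analogy with the documented `>MySequence_heavy`.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (sequence_id, sequence) pairs as a FASTA string."""
    return "\n".join(f">{seq_id}\n{seq}" for seq_id, seq in records)


def chain_name(name: str, chain: str) -> str:
    """Append the chain to a name, mirroring the `>MySequence_heavy` pattern."""
    return f"{name}_{chain}"


# One heavy and one light chain from the same (hypothetical) pair.
records = [
    (chain_name("MySequence", "heavy"), "CAGGTGCAGCTGGTG"),
    (chain_name("MySequence", "light"), "GACATCCAGATGACC"),
]
fasta_string = format_fasta(records)
print(fasta_string)
```

Appending the chain keeps the two records of a pair distinguishable once they are flattened into a single FASTA file.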
- abutils.io.to_fastq(sequences: str | Iterable[Sequence] | Iterable[SeqRecord] | Iterable[Iterable], fastq_file: str | None = None, as_string: bool = False, id_key: str | None = None, sequence_key: str | None = None, tempfile_dir: str = '/tmp') -> str
Writes sequences to a FASTQ-formatted file or returns a FASTQ-formatted string.
- Parameters:
sequences (str or Iterable) –
- Accepts any of the following:
  - list of abutils Sequence objects
  - FASTQ-formatted string
  - path to a FASTQ-formatted file
  - list of BioPython SeqRecord objects
  - list of lists/tuples, of the format [sequence_id, sequence]
Required.
fastq_file (str, default=None) – Path to the output FASTQ file. If not provided and as_string is False, a file will be created using tempfile.NamedTemporaryFile().
as_string (bool, default=False) – Return a FASTQ-formatted string rather than writing to file.
id_key (str, default=None) – Name of the annotation field containing the sequence ID. If not provided, sequence.id is used.
sequence_key (str, default=None) – Name of the annotation field containing the sequence. If not provided, sequence.sequence is used.
tempfile_dir (str, default="/tmp") – If fastq_file is not provided, directory in which the tempfile should be created. If the directory does not exist, it will be created. Default is "/tmp".
- Returns:
fastq – Path to a FASTQ file or a FASTQ-formatted string
- Return type:
str
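For reference, a FASTQ record has four lines: an `@`-prefixed ID, the sequence, a `+` separator, and a quality string with one character per base. The sketch below formats a single record with the standard library only. `format_fastq_record` is a hypothetical helper, and since this page does not say how abutils obtains quality strings, the quality is passed in explicitly.

```python
def format_fastq_record(seq_id: str, sequence: str, quality: str) -> str:
    """Format one FASTQ record: @id, sequence, '+' separator, quality string."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return f"@{seq_id}\n{sequence}\n+\n{quality}"


# A made-up four-base read with a uniform quality string.
record = format_fastq_record("read1", "ACGT", "IIII")
print(record)
```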
- abutils.io.to_airr(sequences: Iterable[Sequence | Pair], airr_file: str, annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) -> None
Saves annotated sequence data to an AIRR-formatted (tab-delimited) file.
- Parameters:
sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to an AIRR file. Required.
airr_file (str) – Path to the output AIRR file. Required.
annotations (list, default=None) – A list of annotation fields to be included in the AIRR file. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output AIRR file. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
properties (list, default=None) – A list of properties to be included in the AIRR file. If not provided, everything in the annotations field of each heavy/light chain will be included.
sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the AIRR file. Default is True.
order (list, default=None) – A list of fields in the order they should appear in the AIRR file.
exclude (str or list, default=None) – Field or list of fields to be excluded from the AIRR file.
leading (str or list, default=None) – Field or list of fields to appear first in the AIRR file. Supersedes order, so if both are provided, fields in leading will appear first in the AIRR file and remaining fields will appear in the order provided in order.
- Return type:
None
- abutils.io.to_parquet(sequences: Iterable[Sequence | Pair], parquet_file: str, annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) -> None
Saves a list of Sequence or Pair objects to a Parquet file.
- Parameters:
sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a Parquet file. Required.
parquet_file (str) – Path to the output Parquet file. Required.
annotations (list, default=None) – A list of annotation fields to be included in the Parquet file. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output Polars DataFrame. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
properties (list, default=None) – A list of properties to be included in the Parquet file. If not provided, everything in the annotations field of each heavy/light chain will be included.
sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the Parquet file. Default is True.
order (list, default=None) – A list of fields in the order they should appear in the Parquet file.
exclude (str or list, default=None) – Field or list of fields to be excluded from the Parquet file.
leading (str or list, default=None) – Field or list of fields to appear first in the Parquet file. Supersedes order, so if both are provided, fields in leading will appear first in the Parquet file and remaining fields will appear in the order provided in order.
- Return type:
None
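The effect described for `drop_na_columns` can be sketched in plain Python on a list of row dictionaries. This is an illustration only, not how abutils implements it: `drop_all_missing_columns` is a hypothetical helper, the rows are made up, and treating `None` the same as `NaN` is an assumption.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_all_missing_columns(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove every column whose value is missing in all rows."""
    columns = list(rows[0]) if rows else []
    keep = [c for c in columns if not all(is_missing(r.get(c)) for r in rows)]
    return [{c: r.get(c) for c in keep} for r in rows]


rows = [
    {"sequence_id": "seq1", "v_call": "IGHV1-2*02", "d_call": None, "c_call": float("nan")},
    {"sequence_id": "seq2", "v_call": None, "d_call": None, "c_call": float("nan")},
]
cleaned = drop_all_missing_columns(rows)
print(list(cleaned[0]))  # ['sequence_id', 'v_call']
```

A column with at least one real value (here `v_call`) is kept; only columns that are empty in every row are removed.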
- abutils.io.to_csv(sequences: Iterable[Sequence | Pair], csv_file: str, separator: str = ',', header: bool = True, annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) -> None
Saves a list of Sequence or Pair objects to a CSV file.
- Parameters:
sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a CSV file. Required.
csv_file (str) – Path to the output CSV file. Required.
separator (str, default=",") – Column separator. Default is ",".
header (bool, default=True) – If True, the CSV file will contain a header row. Default is True.
annotations (list, default=None) – A list of annotation fields to be included in the CSV file. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output CSV file. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
properties (list, default=None) – A list of properties to be included in the CSV file. If not provided, everything in the annotations field of each heavy/light chain will be included.
sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the CSV file. Default is True.
order (list, default=None) – A list of fields in the order they should appear in the CSV file.
exclude (str or list, default=None) – Field or list of fields to be excluded from the CSV file.
leading (str or list, default=None) – Field or list of fields to appear first in the CSV file. Supersedes order, so if both are provided, fields in leading will appear first in the CSV file and remaining fields will appear in the order provided in order.
- Return type:
None
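The interaction of `order`, `exclude`, and `leading` described above can be sketched as a small pure-Python function. This is one reading of the documented behavior, not the abutils implementation: `arrange_columns` and `as_list` are hypothetical helpers, and appending columns not named in `leading` or `order` at the end, in their original order, is an assumption.

```python
from typing import Iterable, List, Optional, Union

StrOrList = Optional[Union[str, Iterable[str]]]


def as_list(value: StrOrList) -> List[str]:
    """Normalize None, a single string, or an iterable of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def arrange_columns(
    columns: Iterable[str],
    order: StrOrList = None,
    exclude: StrOrList = None,
    leading: StrOrList = None,
) -> List[str]:
    """Order columns: `leading` first, then `order`, then whatever is left."""
    excluded = set(as_list(exclude))
    available = [c for c in columns if c not in excluded]
    arranged: List[str] = []
    # Candidates are visited by priority; duplicates and unknown names are skipped.
    for name in as_list(leading) + as_list(order) + available:
        if name in available and name not in arranged:
            arranged.append(name)
    return arranged


columns = ["sequence", "v_call", "name", "j_call", "raw_input"]
result = arrange_columns(
    columns,
    order=["v_call", "j_call", "name"],
    exclude="raw_input",
    leading="name",
)
print(result)  # ['name', 'v_call', 'j_call', 'sequence']
```

Here `name` appears in both `leading` and `order`, and `leading` wins, which matches the statement that `leading` supersedes `order`.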