write sequence data

abutils provides functions for writing sequence data to a variety of commonly used file formats. This includes raw sequence data in FASTA or FASTQ format as well as annotated sequence data in AIRR-C, CSV, or Parquet formats.


format     function                  notes
FASTA      abutils.io.to_fasta()     supports Sequence or Pair objects
FASTQ      abutils.io.to_fastq()     supports Sequence or Pair objects
AIRR       abutils.io.to_airr()      only supports Sequence objects
Parquet    abutils.io.to_parquet()   supports Sequence or Pair objects
CSV        abutils.io.to_csv()       supports Sequence or Pair objects


fasta/q files

abutils can write lists of Sequence or Pair objects to FASTA or FASTQ files:

Warning

While abutils can write both Sequence and Pair objects to various formats, the input must contain only one type of object. For example, you cannot mix Sequence and Pair objects in the same list.

import abutils

# write list of sequences to FASTA file
abutils.io.to_fasta(
    sequences,
    "my-output-file.fasta"
)

# write list of pairs to FASTQ file
abutils.io.to_fastq(
    pairs,
    "my-paired-output-file.fastq"
)
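The single-type requirement described in the warning above can be checked before calling any of the writers. The following is a minimal sketch and not part of abutils: `Sequence` and `Pair` are placeholder classes standing in for the real abutils objects, and `single_type()` is a hypothetical helper introduced only for illustration.

```python
from typing import Iterable


class Sequence:
    """Placeholder standing in for an abutils Sequence (assumption)."""


class Pair:
    """Placeholder standing in for an abutils Pair (assumption)."""


def single_type(items: Iterable[object]) -> type:
    """Return the one type shared by all items, or raise if types are mixed."""
    types = {type(item) for item in items}
    if len(types) != 1:
        names = sorted(t.__name__ for t in types)
        raise ValueError(f"expected exactly one object type, found: {names}")
    return types.pop()


# a homogeneous list passes the check
print(single_type([Sequence(), Sequence()]).__name__)
```

With the real classes, a check like this could run on `sequences` or `pairs` before the list is handed to a writer, so that a mixed list fails early with a clear message.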

annotated sequence files

to_airr() can write Sequence objects to AIRR-C formatted (tab-delimited) files:

# write list of sequences to AIRR-C file
abutils.io.to_airr(
    sequences,
    "my-airr-output-file.tsv"
)
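Because AIRR-C files are tab-delimited, the output can be inspected with the Python standard library alone. The sketch below does not call abutils: it writes a small stand-in TSV file to a temporary directory and reads it back. The column names (`sequence_id`, `v_call`, `productive`) are taken from the AIRR rearrangement schema, the values are invented, and `read_tsv()` is a hypothetical helper.

```python
import csv
import os
import tempfile
from typing import Dict, List

# Illustrative rows only: the column names come from the AIRR rearrangement
# schema, but the values are invented for this example.
fieldnames = ["sequence_id", "v_call", "productive"]
rows = [
    {"sequence_id": "seq1", "v_call": "IGHV1-2*02", "productive": "T"},
    {"sequence_id": "seq2", "v_call": "IGKV3-20*01", "productive": "T"},
]


def read_tsv(path: str) -> List[Dict[str, str]]:
    """Read a tab-delimited file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


with tempfile.TemporaryDirectory() as tmpdir:
    # stand-in for a file produced by a writer such as to_airr()
    path = os.path.join(tmpdir, "my-airr-output-file.tsv")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
    records = read_tsv(path)

print(records[0]["v_call"])
```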

to_parquet() can write Sequence or Pair objects to Parquet files:

# write list of sequences to Parquet file
abutils.io.to_parquet(
    sequences,
    "my-parquet-output-file.parquet"
)

# write list of pairs to Parquet file
abutils.io.to_parquet(
    pairs,
    "my-paired-parquet-output-file.parquet"
)

to_csv() can write Sequence or Pair objects to CSV files:

# write list of sequences to CSV file
abutils.io.to_csv(
    sequences,
    "my-csv-output-file.csv"
)

# write list of pairs to CSV file
abutils.io.to_csv(
    pairs,
    "my-paired-csv-output-file.csv"
)

api

abutils.io.to_fasta(sequences: str | Iterable, fasta_file: str | None = None, id_key: str | None = None, sequence_key: str | None = None, tempfile_dir: str | None = None, append_chain: bool = True, as_string: bool = False) str

Writes sequences to a FASTA-formatted file or returns a FASTA-formatted string.

Parameters:
  • sequences (str or Iterable) –

    Accepts any of the following:
    1. list of abutils Sequence and/or Pair objects

    2. FASTA/Q-formatted string

    3. path to a FASTA/Q-formatted file

    4. list of BioPython SeqRecord objects

    5. list of lists/tuples, of the format [sequence_id, sequence]

    Required.

    Note

    Processing a list containing a mixture of Sequence and/or Pair objects is supported.

  • fasta_file (str, default=None) – Path to the output FASTA file. If neither fasta_file nor tempfile_dir is provided, a FASTA-formatted string will be returned.

  • id_key (str, default=None) – Name of the annotation field containing the sequence ID. If not provided, sequence.id is used.

  • sequence_key (str, default=None) – Name of the annotation field containing the sequence. If not provided, sequence.sequence is used.

  • tempfile_dir (str, optional) – If fasta_file is not provided, directory in which the tempfile should be created. If the directory does not exist, it will be created.

  • append_chain (bool, default=True) –

    If True, the chain (heavy or light) will be appended to the sequence name: >MySequence_heavy.

    Note

    This option is ignored unless a list containing Pair objects is provided.

  • as_string (bool, default=False) – Deprecated. Kept for backwards compatibility.

Returns:

fasta – Path to a FASTA file or a FASTA-formatted string

Return type:

str
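The effect of append_chain on record names can be illustrated without abutils. The sketch below is not the library's implementation: `fasta_record()` and `pair_to_fasta()` are hypothetical helpers, and the `_light` suffix is an assumption made by analogy with the documented `>MySequence_heavy` form.

```python
from typing import Optional


def fasta_record(seq_id: str, sequence: str, chain: Optional[str] = None) -> str:
    """Format one FASTA record, optionally suffixing the ID with the chain."""
    name = f"{seq_id}_{chain}" if chain else seq_id
    return f">{name}\n{sequence}"


def pair_to_fasta(
    name: str,
    heavy: Optional[str],
    light: Optional[str],
    append_chain: bool = True,
) -> str:
    """Format the chains of a heavy/light pair as FASTA records."""
    records = []
    for chain, sequence in (("heavy", heavy), ("light", light)):
        if sequence is None:
            continue  # skip a chain that is not present
        suffix = chain if append_chain else None
        records.append(fasta_record(name, sequence, suffix))
    return "\n".join(records)


print(pair_to_fasta("MySequence", "CAGGTG", "GACATC"))
```

With `append_chain=False` in this sketch, both records share the same name, which shows why appending the chain is useful when heavy and light chains are written to one file.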

abutils.io.to_fastq(sequences: str | Iterable[Sequence] | Iterable[SeqRecord] | Iterable[Iterable], fastq_file: str | None = None, as_string: bool = False, id_key: str | None = None, sequence_key: str | None = None, tempfile_dir: str = '/tmp') str

Writes sequences to a FASTQ-formatted file or returns a FASTQ-formatted string.

Parameters:
  • sequences (str or Iterable) –

    Accepts any of the following:
    1. list of abutils Sequence objects

    2. FASTQ-formatted string

    3. path to a FASTQ-formatted file

    4. list of BioPython SeqRecord objects

    5. list of lists/tuples, of the format [sequence_id, sequence]

    Required.

  • fastq_file (str, default=None) – Path to the output FASTQ file. If not provided and as_string is False, a file will be created using tempfile.NamedTemporaryFile().

  • as_string (bool, default=False) – Return a FASTQ-formatted string rather than writing to file.

  • id_key (str, default=None) – Name of the annotation field containing the sequence ID. If not provided, sequence.id is used.

  • sequence_key (str, default=None) – Name of the annotation field containing the sequence. If not provided, sequence.sequence is used.

  • tempfile_dir (str, default="/tmp") – If fastq_file is not provided, directory in which the tempfile should be created. If the directory does not exist, it will be created.

Returns:

fastq – Path to a FASTQ file or a FASTQ-formatted string

Return type:

str

abutils.io.to_airr(sequences: Iterable[Sequence | Pair], airr_file: str, annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) None

Saves a list of Sequence or Pair objects to an AIRR file.

Parameters:
  • sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to an AIRR file. Required.

  • airr_file (str) – Path to the output AIRR file. Required.

  • annotations (list, default=None) – A list of annotation fields to be included in the AIRR file. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.

  • columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output AIRR file. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.

  • properties (list, default=None) – A list of properties to be included in the AIRR file. If not provided, everything in the annotations field of each heavy/light chain will be included.

  • sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.

  • drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the AIRR file.

  • order (list, default=None) – A list of fields in the order they should appear in the AIRR file.

  • exclude (str or list, default=None) – Field or list of fields to be excluded from the AIRR file.

  • leading (str or list, default=None) – Field or list of fields to appear first in the AIRR file. Supersedes order, so if both are provided, fields in leading will appear first in the AIRR file and remaining fields will appear in the order provided in order.

Return type:

None

abutils.io.to_parquet(sequences: Iterable[Sequence | Pair], parquet_file: str, annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) None

Saves a list of Sequence or Pair objects to a Parquet file.

Parameters:
  • sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a Parquet file. Required.

  • parquet_file (str) – Path to the output Parquet file. Required.

  • annotations (list, default=None) – A list of annotation fields to be included in the Parquet file. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.

  • columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output Parquet file. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.

  • properties (list, default=None) – A list of properties to be included in the Parquet file. If not provided, everything in the annotations field of each heavy/light chain will be included.

  • sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.

  • drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the Parquet file.

  • order (list, default=None) – A list of fields in the order they should appear in the Parquet file.

  • exclude (str or list, default=None) – Field or list of fields to be excluded from the Parquet file.

  • leading (str or list, default=None) – Field or list of fields to appear first in the Parquet file. Supersedes order, so if both are provided, fields in leading will appear first in the Parquet file and remaining fields will appear in the order provided in order.

Return type:

None
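The drop_na_columns behavior described above can be pictured with plain Python rows. This is a minimal sketch under stated assumptions, not the abutils implementation: rows are modeled as dicts, both `None` and float NaN are treated as missing, and `is_missing()` and `drop_na_columns()` are hypothetical helpers.

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_na_columns(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop every column whose value is missing in all rows."""
    # collect column names in first-seen order
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    # keep a column if at least one row has a non-missing value for it
    keep = [c for c in columns if any(not is_missing(r.get(c)) for r in rows)]
    return [{c: row.get(c) for c in keep} for row in rows]


rows = [
    {"name": "a", "cdr3": None, "v_gene": "IGHV1-2"},
    {"name": "b", "cdr3": float("nan"), "v_gene": None},
]
cleaned = drop_na_columns(rows)
print(list(cleaned[0].keys()))
```

In this sketch the `cdr3` column is removed because it is missing in every row, while `v_gene` is kept because one row has a value.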

abutils.io.to_csv(sequences: Iterable[Sequence | Pair], csv_file: str, separator: str = ',', header: bool = True, annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) None

Saves a list of Sequence or Pair objects to a CSV file.

Parameters:
  • sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a CSV file. Required.

  • csv_file (str) – Path to the output CSV file. Required.

  • separator (str, default=",") – Column separator.

  • header (bool, default=True) – If True, the CSV file will contain a header row.

  • annotations (list, default=None) – A list of annotation fields to be included in the CSV file. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.

  • columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output CSV file. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.

  • properties (list, default=None) – A list of properties to be included in the CSV file. If not provided, everything in the annotations field of each heavy/light chain will be included.

  • sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.

  • drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the CSV file.

  • order (list, default=None) – A list of fields in the order they should appear in the CSV file.

  • exclude (str or list, default=None) – Field or list of fields to be excluded from the CSV file.

  • leading (str or list, default=None) – Field or list of fields to appear first in the CSV file. Supersedes order, so if both are provided, fields in leading will appear first in the CSV file and remaining fields will appear in the order provided in order.

Return type:

None
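The interaction of order, exclude, and leading described above can be sketched as a small field-arranging function. This is an illustration only, not the abutils implementation: `arrange_fields()` is a hypothetical helper, and two behaviors are assumptions of the sketch: fields named in neither leading nor order keep their original relative order at the end, and names that are not present in the data are ignored.

```python
from typing import Iterable, List, Optional, Union

StrOrList = Optional[Union[str, Iterable[str]]]


def as_list(value: StrOrList) -> List[str]:
    """Normalize None, a single string, or an iterable of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def arrange_fields(
    fields: Iterable[str],
    order: StrOrList = None,
    exclude: StrOrList = None,
    leading: StrOrList = None,
) -> List[str]:
    """Arrange fields: leading first, then order, then whatever remains."""
    excluded = set(as_list(exclude))
    available = [f for f in fields if f not in excluded]
    arranged: List[str] = []
    # leading takes precedence over order; leftovers keep their original order
    for group in (as_list(leading), as_list(order), available):
        for field in group:
            if field in available and field not in arranged:
                arranged.append(field)
    return arranged


print(
    arrange_fields(
        ["name", "v_gene", "cdr3", "junction"],
        order=["cdr3", "v_gene"],
        exclude="junction",
        leading="name",
    )
)
```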