convert sequence data

abutils provides functions for converting annotated sequence data between lists of Sequence or Pair objects and Pandas or Polars DataFrames. All annotations are assumed to be in AIRR-C format.

format	function	notes
Pandas	abutils.io.to_pandas()	returns a Pandas DataFrame
Pandas	abutils.io.from_pandas()	returns a list of `Sequence` or `Pair` objects
Polars	abutils.io.to_polars()	returns a Polars DataFrame
Polars	abutils.io.from_polars()	returns a list of `Sequence` or `Pair` objects

Pandas

abutils can convert lists of Sequence or Pair objects to and from Pandas DataFrames:

Warning

While abutils can convert between Sequence and Pair objects and Pandas DataFrames, the input must contain only one type of object. For example, you cannot mix Sequence and Pair objects in the same list.

import abutils

# `sequences` is an existing list of Sequence objects
# convert list of sequences to Pandas DataFrame
sequences_df = abutils.io.to_pandas(
    sequences,
)

# convert Pandas DataFrame back to list of sequences
sequences = abutils.io.from_pandas(sequences_df)

# convert list of pairs to Pandas DataFrame
pairs_df = abutils.io.to_pandas(
    pairs,
)

# convert Pandas DataFrame back to list of pairs
pairs = abutils.io.from_pandas(pairs_df)
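
Because a mixed list is not supported, it can help to check the input before converting. The sketch below is only an illustration of that check: the `Sequence` and `Pair` classes are hypothetical stand-ins (not the abutils classes), and `single_type` is a hypothetical helper that is not part of abutils.

```python
class Sequence:  # hypothetical stand-in, not abutils.Sequence
    pass


class Pair:  # hypothetical stand-in, not abutils.Pair
    pass


def single_type(items):
    """Return the one type shared by all items, or raise if the list is mixed or empty."""
    types = {type(item) for item in items}
    if len(types) != 1:
        names = sorted(t.__name__ for t in types)
        raise ValueError(f"expected exactly one object type, found: {names}")
    return types.pop()


homogeneous = [Sequence(), Sequence()]
mixed = [Sequence(), Pair()]

print(single_type(homogeneous).__name__)  # Sequence

try:
    single_type(mixed)
except ValueError as err:
    # a mixed list is rejected before any conversion is attempted
    print(err)
```

With the real classes, the same check could be run on a list just before it is passed to a conversion function.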

Polars

abutils can convert lists of Sequence or Pair objects to and from Polars DataFrames:

# convert list of sequences to Polars DataFrame
sequences_df = abutils.io.to_polars(
    sequences,
)

# convert Polars DataFrame back to list of sequences
sequences = abutils.io.from_polars(sequences_df)

# convert list of pairs to Polars DataFrame
pairs_df = abutils.io.to_polars(
    pairs,
)

# convert Polars DataFrame back to list of pairs
pairs = abutils.io.from_polars(pairs_df)

api

abutils.io.to_pandas() saves a list of Sequence or Pair objects to a Pandas DataFrame.

Parameters:

sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a Pandas DataFrame. Required.
annotations (list, default=None) – A list of annotation fields to be included in the Pandas DataFrame. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output Pandas DataFrame. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
properties (list, default=None) – A list of properties to be included in the Pandas DataFrame. If not provided, everything in the annotations field of each heavy/light chain will be included.
sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the Pandas DataFrame. Default is True.
order (list, default=None) – A list of fields in the order they should appear in the Pandas DataFrame.
exclude (str or list, default=None) – Field or list of fields to be excluded from the Pandas DataFrame.
leading (str or list, default=None) – Field or list of fields to appear first in the Pandas DataFrame. Supersedes order, so if both are provided, fields in leading will appear first in the Pandas DataFrame and remaining fields will appear in the order provided in order.

Returns:

A pandas.DataFrame object.

Return type:

pd.DataFrame
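
The interaction between `leading`, `order`, and `exclude` can be sketched in plain Python. This is not the abutils implementation; `arrange_columns` and `as_list` are hypothetical helpers, and placing fields that appear in neither `leading` nor `order` at the end, in their original order, is an assumption made for the example.

```python
def as_list(value):
    """Normalize None, a single string, or an iterable of strings to a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def arrange_columns(columns, order=None, exclude=None, leading=None):
    """Illustrative ordering: leading fields first, then order, then the rest."""
    excluded = set(as_list(exclude))
    available = [c for c in columns if c not in excluded]
    arranged = []
    # leading takes priority over order; anything left keeps its original position
    for name in as_list(leading) + as_list(order) + available:
        if name in available and name not in arranged:
            arranged.append(name)
    return arranged


columns = ["sequence", "v_gene", "sequence_id", "cdr3", "j_gene"]
arranged = arrange_columns(
    columns,
    order=["v_gene", "j_gene", "sequence_id"],
    exclude="cdr3",
    leading="sequence_id",
)
print(arranged)  # ['sequence_id', 'v_gene', 'j_gene', 'sequence']
```

Here `sequence_id` is listed in both `leading` and `order`, and `leading` wins, which matches the precedence described for the two parameters.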

abutils.io.from_pandas(df: pd.DataFrame, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence') → Iterable[Sequence | Pair]

Reads a Pandas DataFrame and returns a list of Sequence or Pair objects.

Parameters:

df (pd.DataFrame) – The input Pandas DataFrame.
match (dict, optional) –
A dict for filtering sequences from the input DataFrame. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
```
{'v_gene:0': 'IGHV1-2'}
```
fields (list, optional) – A list of fields to be read from the input DataFrame. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".

Returns:

A list of Sequence or Pair objects.

Return type:

Iterable[Union[Sequence, Pair]]
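
The "must match all conditions" rule for `match` can be illustrated with plain dicts standing in for DataFrame rows. This is a sketch rather than the abutils implementation: `matches` is a hypothetical helper, the rows are made-up example data, and comparing values by simple equality is an assumption.

```python
def matches(row, match=None):
    """Return True if the row satisfies every field/value condition in match."""
    if not match:
        return True  # no filter supplied, so every row is kept
    return all(row.get(field) == value for field, value in match.items())


# made-up rows for illustration only
rows = [
    {"sequence_id": "a", "v_gene:0": "IGHV1-2", "v_gene:1": "IGKV3-20"},
    {"sequence_id": "b", "v_gene:0": "IGHV3-23", "v_gene:1": "IGKV3-20"},
    {"sequence_id": "c", "v_gene:0": "IGHV1-2", "v_gene:1": "IGLV2-14"},
]

one_condition = {"v_gene:0": "IGHV1-2"}
two_conditions = {"v_gene:0": "IGHV1-2", "v_gene:1": "IGKV3-20"}

kept_one = [r["sequence_id"] for r in rows if matches(r, one_condition)]
kept_two = [r["sequence_id"] for r in rows if matches(r, two_conditions)]

print(kept_one)  # ['a', 'c']
print(kept_two)  # ['a']
```

Adding a second condition narrows the result, since a row is kept only when every condition holds.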

abutils.io.to_polars() saves a list of Sequence or Pair objects to a Polars DataFrame.

Parameters:

sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a Polars DataFrame. Required.
annotations (list, default=None) – A list of annotation fields to be included in the Polars DataFrame. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output Polars DataFrame. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
properties (list, default=None) – A list of properties to be included in the Polars DataFrame. If not provided, everything in the annotations field of each heavy/light chain will be included.
sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the Polars DataFrame. Default is True.
order (list, default=None) – A list of fields in the order they should appear in the Polars DataFrame.
exclude (str or list, default=None) – Field or list of fields to be excluded from the Polars DataFrame.
leading (str or list, default=None) – Field or list of fields to appear first in the Polars DataFrame. Supersedes order, so if both are provided, fields in leading will appear first in the Polars DataFrame and remaining fields will appear in the order provided in order.

Returns:

A polars.DataFrame object.

Return type:

pl.DataFrame
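
The effect of `drop_na_columns` can be sketched with a dict of column lists standing in for a DataFrame. This is only an illustration, not the abutils implementation: `is_missing` and `drop_all_missing` are hypothetical helpers, the table is made-up data, and treating both `None` and float NaN as missing is an assumption.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_all_missing(table):
    """Drop columns in which every value is missing; keep partially filled ones."""
    return {
        name: values
        for name, values in table.items()
        if not all(is_missing(v) for v in values)
    }


# made-up table: column name -> list of values
table = {
    "sequence_id": ["a", "b"],
    "cdr3": ["ARDY", None],          # partially filled, so it is kept
    "notes": [None, float("nan")],   # entirely missing, so it is dropped
}

cleaned = drop_all_missing(table)
print(list(cleaned))  # ['sequence_id', 'cdr3']
```

A column with at least one real value survives, so only columns that carry no information are removed.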

abutils.io.from_polars(df: DataFrame | LazyFrame, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence') → Iterable[Sequence | Pair]

Reads a Polars DataFrame and returns a list of Sequence or Pair objects.

Parameters:

df (polars.DataFrame or polars.LazyFrame) – The input Polars DataFrame or LazyFrame.
match (dict, optional) –
A dict for filtering sequences from the input DataFrame or LazyFrame. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
```
{'v_gene:0': 'IGHV1-2'}
```
fields (list, optional) – A list of fields to be read from the input DataFrame or LazyFrame. If not provided, all fields will be read.
id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".

Returns:

A list of Sequence or Pair objects.

Return type:

Iterable[Union[Sequence, Pair]]
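
The roles of `fields`, `id_key`, and `sequence_key` can be pictured with a single row represented as a dict. The sketch below is a rough illustration under stated assumptions, not a description of how abutils builds Sequence objects: `row_to_record` is a hypothetical helper, the output dict layout is invented for the example, and the column names `name` and `nt` are made up.

```python
def row_to_record(row, fields=None, id_key="sequence_id", sequence_key="sequence"):
    """Pick the ID and sequence out of a row using configurable column names."""
    # when fields is None, every column is carried over
    selected = row if fields is None else {k: row[k] for k in fields if k in row}
    return {
        "id": row.get(id_key),
        "sequence": row.get(sequence_key),
        "annotations": dict(selected),
    }


# made-up row whose ID and sequence columns use non-default names
row = {"name": "seq1", "nt": "ATGC", "v_gene": "IGHV1-2", "j_gene": "IGHJ4"}

record = row_to_record(row, fields=["name", "nt", "v_gene"], id_key="name", sequence_key="nt")
print(record["id"], record["sequence"])
print(sorted(record["annotations"]))
```

When the columns already use the default names, `id_key` and `sequence_key` can be left unset.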