convert sequence data¶
abutils provides functions for converting annotated sequence data between Sequence and
Pair objects and Polars and Pandas DataFrames. All annotations are assumed to be in
AIRR-C format.
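| format | function | notes |
|---|---|---|
| Pandas | abutils.io.to_pandas() | returns a Pandas DataFrame |
| | abutils.io.from_pandas() | returns a list of Sequence or Pair objects |
| Polars | abutils.io.to_polars() | returns a Polars DataFrame |
| | abutils.io.from_polars() | returns a list of Sequence or Pair objects |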
Pandas¶
abutils can convert lists of Sequence or Pair objects to and from Pandas DataFrames:
Warning
While abutils can convert between Sequence and Pair objects and Pandas DataFrames, the input must contain
only one type of object. For example, you cannot mix Sequence and Pair objects in the
same list.
import abutils

# convert list of sequences to Pandas DataFrame
sequences_df = abutils.io.to_pandas(
    sequences,
)
# convert Pandas DataFrame back to list of sequences
sequences = abutils.io.from_pandas(sequences_df)
# convert list of pairs to Pandas DataFrame
pairs_df = abutils.io.to_pandas(
    pairs,
)
# convert Pandas DataFrame back to list of pairs
pairs = abutils.io.from_pandas(pairs_df)
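The single-type requirement from the warning above can be checked before calling a converter. The following is a minimal sketch, not part of abutils: the `Sequence` and `Pair` classes are hypothetical stand-ins so the block runs without abutils installed, and `check_single_type` is an assumed helper name.

```python
class Sequence:
    """Hypothetical stand-in for the abutils Sequence class."""


class Pair:
    """Hypothetical stand-in for the abutils Pair class."""


def check_single_type(items):
    """Return the one type shared by all items, or raise TypeError if mixed."""
    types = {type(item) for item in items}
    if len(types) > 1:
        names = ", ".join(sorted(t.__name__ for t in types))
        raise TypeError(f"input mixes object types: {names}")
    # An empty input has no type to report.
    return types.pop() if types else None
```

Calling `check_single_type(items)` first turns a mixed list into an explicit error rather than relying on whatever the converter does with it.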
Polars¶
abutils can convert lists of Sequence or Pair objects to and from Polars DataFrames:
import abutils

# convert list of sequences to Polars DataFrame
sequences_df = abutils.io.to_polars(
    sequences,
)
# convert Polars DataFrame back to list of sequences
sequences = abutils.io.from_polars(sequences_df)
# convert list of pairs to Polars DataFrame
pairs_df = abutils.io.to_polars(
    pairs,
)
# convert Polars DataFrame back to list of pairs
pairs = abutils.io.from_polars(pairs_df)
api¶
- abutils.io.to_pandas(sequences: Iterable[Sequence | Pair], annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) DataFrame | None¶
Saves a list of Sequence or Pair objects to a Pandas DataFrame.
- Parameters:
  - sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a Pandas DataFrame. Required.
  - annotations (list, default=None) – A list of annotation fields to be included in the Pandas DataFrame. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
  - columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output Pandas DataFrame. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
  - properties (list, default=None) – A list of properties to be included in the Pandas DataFrame. If not provided, everything in the annotations field of each heavy/light chain will be included.
  - sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
  - drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the Pandas DataFrame. Default is True.
  - order (list, default=None) – A list of fields in the order they should appear in the Pandas DataFrame.
  - exclude (str or list, default=None) – Field or list of fields to be excluded from the Pandas DataFrame.
  - leading (str or list, default=None) – Field or list of fields to appear first in the Pandas DataFrame. Supersedes order, so if both are provided, fields in leading will appear first in the Pandas DataFrame and remaining fields will appear in the order provided in order.
- Returns:
  A pandas.DataFrame object.
- Return type:
  pd.DataFrame
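The interaction of order, exclude, and leading described above can be sketched in plain Python. This is an illustration of the documented rules only, not the abutils implementation: `arrange_columns` is a hypothetical helper, and keeping columns that are named in neither leading nor order at the end of the result is an assumption.

```python
def arrange_columns(columns, order=None, exclude=None, leading=None):
    """Sketch of the documented column-ordering rules (not abutils' own code)."""

    def as_list(value):
        # Accept None, a single field name, or an iterable of field names.
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    excluded = set(as_list(exclude))
    available = [c for c in columns if c not in excluded]

    result = []
    # Fields in `leading` come first, then fields in `order`, then any
    # remaining columns in their original order (the last part is assumed).
    for name in as_list(leading) + as_list(order) + available:
        if name in available and name not in result:
            result.append(name)
    return result


columns = ["sequence_id", "sequence", "v_gene", "j_gene", "cdr3"]
arranged = arrange_columns(
    columns,
    order=["v_gene", "j_gene", "sequence_id"],
    exclude="sequence",
    leading="cdr3",
)
```

Here `leading` wins over `order`, so `cdr3` is placed first even though `order` does not mention it, and the excluded `sequence` column never appears.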
- abutils.io.from_pandas(df: DataFrame, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence') Iterable[Sequence | Pair]¶
Reads a Pandas DataFrame and returns a list of Sequence or Pair objects.
- Parameters:
  - df (pd.DataFrame) – The input Pandas DataFrame.
  - match (dict, optional) – A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
    {'v_gene:0': 'IGHV1-2'}
  - fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
  - id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
  - sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
- Returns:
  A list of Sequence or Pair objects.
- Return type:
  Iterable[Sequence | Pair]
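The "must match all conditions" behavior of match can be illustrated without a DataFrame. In this sketch plain dicts stand in for rows, `matches` is a hypothetical helper, and exact equality is an assumption about how conditions are compared; the real function may support other condition forms.

```python
def matches(record, match=None):
    """Return True if the record satisfies every condition in `match`."""
    if not match:
        # No filter supplied: every record is kept.
        return True
    return all(record.get(field) == value for field, value in match.items())


# Example rows using the "field:0" / "field:1" column names from the text.
records = [
    {"sequence_id": "pair1", "v_gene:0": "IGHV1-2", "v_gene:1": "IGKV3-20"},
    {"sequence_id": "pair2", "v_gene:0": "IGHV3-23", "v_gene:1": "IGKV3-20"},
    {"sequence_id": "pair3", "v_gene:0": "IGHV1-2", "v_gene:1": "IGLV2-14"},
]

one_condition = {"v_gene:0": "IGHV1-2"}
two_conditions = {"v_gene:0": "IGHV1-2", "v_gene:1": "IGKV3-20"}

kept_one = [r["sequence_id"] for r in records if matches(r, one_condition)]
kept_two = [r["sequence_id"] for r in records if matches(r, two_conditions)]
```

Adding a second condition narrows the result, since a record failing any single condition is dropped.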
- abutils.io.to_polars(sequences: Iterable[Sequence | Pair], annotations: Iterable[str] | None = None, columns: Iterable | None = None, properties: Iterable[str] | None = None, sequence_properties: Iterable[str] | None = None, drop_na_columns: bool = True, order: Iterable[str] | None = None, exclude: str | Iterable[str] | None = None, leading: str | Iterable[str] | None = None) DataFrame | None¶
Saves a list of Sequence or Pair objects to a Polars DataFrame.
- Parameters:
  - sequences (Iterable[Sequence, Pair]) – List of Sequence or Pair objects to be saved to a Polars DataFrame. Required.
  - annotations (list, default=None) – A list of annotation fields to be included in the Polars DataFrame. For Sequence objects, this refers to fields in the annotations field. For Pair objects, this refers to fields in the heavy and light chain annotations.
  - columns (list, default=None) – Used only for Pair objects. A list of fields to be retained in the output Polars DataFrame. Fields should be column names in the input file, such as "sequence:0", "sequence:1", "name", etc. This option is provided to allow differential selection of heavy and light chain fields. For cases in which the same fields will be selected for both chains, it is recommended to use annotations instead.
  - properties (list, default=None) – A list of properties to be included in the Polars DataFrame. If not provided, everything in the annotations field of each heavy/light chain will be included.
  - sequence_properties (list, default=None) – A list of sequence properties to be included. Differs from properties, which refers to properties of the Pair object. These properties are those of the heavy/light Sequence objects. Ignored if the input is a list of Sequence objects.
  - drop_na_columns (bool, default=True) – If True, columns with all NaN values will be dropped from the Polars DataFrame. Default is True.
  - order (list, default=None) – A list of fields in the order they should appear in the Polars DataFrame.
  - exclude (str or list, default=None) – Field or list of fields to be excluded from the Polars DataFrame.
  - leading (str or list, default=None) – Field or list of fields to appear first in the Polars DataFrame. Supersedes order, so if both are provided, fields in leading will appear first in the Polars DataFrame and remaining fields will appear in the order provided in order.
- Returns:
  A polars.DataFrame object.
- Return type:
  pl.DataFrame
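The effect of drop_na_columns can be sketched on plain rows. This is an illustration only and not how abutils implements it: dicts stand in for DataFrame rows so the block runs without Polars, `drop_all_null_columns` is a hypothetical helper, and treating `None` as the missing value is an assumption.

```python
def drop_all_null_columns(rows):
    """Drop every column whose value is None in all rows (rows: list of dicts)."""
    # Collect column names in first-seen order.
    keys = []
    for row in rows:
        for key in row:
            if key not in keys:
                keys.append(key)

    # Keep a column if at least one row has a non-missing value for it.
    keep = [k for k in keys if any(row.get(k) is not None for row in rows)]
    return [{k: row.get(k) for k in keep} for row in rows]


rows = [
    {"sequence_id": "s1", "cdr3": None, "v_gene": "IGHV1-2"},
    {"sequence_id": "s2", "cdr3": None, "v_gene": None},
]
cleaned = drop_all_null_columns(rows)
```

The `cdr3` column is removed because it is empty in every row, while `v_gene` is kept because only some of its values are missing.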
- abutils.io.from_polars(df: DataFrame | LazyFrame, match: dict | None = None, fields: Iterable | None = None, id_key: str = 'sequence_id', sequence_key: str = 'sequence') Iterable[Sequence | Pair]¶
Reads a Polars DataFrame and returns a list of Sequence or Pair objects.
- Parameters:
  - df (polars.DataFrame or polars.LazyFrame) – The input Polars DataFrame or LazyFrame.
  - match (dict, optional) – A dict for filtering sequences from the input file. Sequences must match all conditions to be returned. For example, the following dict will filter out all sequences for which the 'v_gene:0' field is not 'IGHV1-2':
    {'v_gene:0': 'IGHV1-2'}
  - fields (list, optional) – A list of fields to be read from the input file. If not provided, all fields will be read.
  - id_key (str, default="sequence_id") – The name of the column that contains the sequence IDs. Default is "sequence_id".
  - sequence_key (str, default="sequence") – The name of the column that contains the sequence data. Default is "sequence".
- Returns:
  A list of Sequence or Pair objects.
- Return type:
  Iterable[Sequence | Pair]