sequence

Sequence objects are the fundamental building block for much of the abutils package. Virtually all functions and methods that operate on one or more sequences will accept Sequence objects as input. Sequence objects can be created from a variety of inputs, including strings, lists, dictionaries, and BioPython SeqRecord objects. Below are some brief examples of how to create and use Sequence objects.


instantiation

abutils has a number of convenience functions for batch creation of Sequence objects from common file formats, including FASTA, FASTQ, AIRR-C, and Parquet. Details and examples of these functions can be found in the sequence I/O section.

Individual Sequence objects can be created from a string:

import abutils

# create a sequence from a string
sequence = abutils.Sequence("ATCG")

Note

If a string is provided and no sequence ID is specified, the ID will be randomly generated. To specify the sequence ID, pass a string to the id argument:

# create a sequence from a string, with a user-specified ID
sequence = abutils.Sequence("ATCG", id="my_sequence")

Sequence objects can also be created from a list, of the form [id, sequence]:

# create a sequence from a list
sequence = abutils.Sequence(["my_sequence", "ATCG"])

Sequence objects can also be created from a dictionary, which provides a means of including additional annotations beyond just the sequence and ID:

# create a sequence from a dictionary
sequence = abutils.Sequence(
    {
        "sequence_id": "my_sequence",
        "sequence": "ATCG",
        "productive": True,
        "v_call": "IGHV1-2*02",
        "d_call": "IGHD3-3*01",
        "j_call": "IGHJ6*02",
    }
)

# all annotations can be accessed using dictionary-style indexing
sequence["v_call"]  # "IGHV1-2*02"

Note

Dictionary keys are expected to follow the naming conventions of the AIRR-C rearrangement schema. The Sequence object will automatically populate the special properties id and sequence from the provided dictionary if the correct key names ("sequence_id" and "sequence", respectively) are used.


usage

Sequence objects provide convenient properties and methods for common sequence manipulations:

# reverse complement (a property, so no call parentheses)
rc = sequence.reverse_complement

# translate
aa = sequence.translate()

api

class abutils.core.sequence.Sequence(sequence: str | Iterable | dict | SeqRecord, id: str | None = None, qual: str | None = None, annotations: dict | None = None, id_key: str | None = None, seq_key: str | None = None)

Container for biological (RNA, DNA, or protein) sequences.

The sequence argument (referred to below as seq) can be one of several things:

  1. a raw sequence, as a string

  2. an iterable, formatted as [seq_id, sequence]

  3. a dict, containing at least the sequence ID and a sequence. Alternate id_key and seq_key can be provided at instantiation.

  4. a Biopython SeqRecord object

  5. an abutils Sequence object

If seq is provided as a string, the sequence ID can optionally be provided via the id keyword argument. If seq is a string and id is not provided, a random sequence ID will be generated with uuid.uuid4().

Quality scores can be supplied with qual or as part of a SeqRecord object. If providing both a SeqRecord object with quality scores and quality scores via qual, the qual scores will override the SeqRecord quality scores.

If seq is a dictionary, typically the result of a MongoDB query, the dictionary can be accessed directly from the Sequence instance (via the annotations property). To retrieve the value for 'junc_aa' in the instantiating dictionary, you would simply index the Sequence object:

# seq_dict is the instantiating dictionary (named to avoid shadowing the built-in dict)
s = Sequence(seq_dict)
junc = s['junc_aa']

If seq is a dictionary, an optional id_key and seq_key can be provided, which tell the Sequence object which fields to use to populate Sequence.id and Sequence.sequence. The default for both id_key and seq_key is None, in which case abutils tries to determine the appropriate key. For id_key, the following keys are tried: ['seq_id', 'sequence_id']. For seq_key, the following keys are tried: ['vdj_nt', 'sequence_nt', 'sequence']. If none of the attempts are successful, the corresponding Sequence.id or Sequence.sequence attribute will be None.
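The fallback order described above can be sketched in plain Python. This is a standalone illustration rather than the abutils implementation: the helper resolve_key and the two constant names are hypothetical, and only the candidate key lists are taken from the text.

```python
# Standalone sketch (abutils is not imported); helper and constant names are hypothetical.
ID_KEY_CANDIDATES = ["seq_id", "sequence_id"]
SEQ_KEY_CANDIDATES = ["vdj_nt", "sequence_nt", "sequence"]


def resolve_key(record, explicit_key, candidates):
    """Return the value for explicit_key if given, else for the first candidate present."""
    if explicit_key is not None:
        return record.get(explicit_key)
    for key in candidates:
        if key in record:
            return record[key]
    # none of the candidate keys were present
    return None


record = {"sequence_id": "my_sequence", "sequence": "ATCG"}
seq_id = resolve_key(record, None, ID_KEY_CANDIDATES)    # falls through to "sequence_id"
seq = resolve_key(record, None, SEQ_KEY_CANDIDATES)      # falls through to "sequence"
missing = resolve_key({"name": "x"}, None, ID_KEY_CANDIDATES)  # no candidate matches
```

In this sketch an explicit key always takes precedence over the candidate list, which mirrors the role that id_key and seq_key play at instantiation.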

Alternately, the __getitem__() interface can be used to obtain a slice from the sequence attribute. An example of the distinction:

d = {'name': 'MySequence', 'sequence': 'ATGC'}
seq = Sequence(d, id_key='name', seq_key='sequence')

seq['name']  # 'MySequence'
seq[:2]  # 'AT'

If the Sequence is instantiated with a dictionary, calls to __contains__() will return True if the supplied item is a key in the dictionary. In non-dict instantiations, __contains__() will look in the Sequence.sequence field directly (essentially a motif search). For example:

dict_seq = Sequence({'seq_id': 'seq1', 'vdj_nt': 'ACGT'})
'seq_id' in dict_seq  # True
'ACG' in dict_seq     # False

str_seq = Sequence('ACGT', id='seq1')
'seq_id' in str_seq  # False
'ACG' in str_seq     # True
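To make the distinction concrete, the indexing and membership behaviour described above can be mimicked with a small stand-in class. MiniSequence is hypothetical and exists only for illustration; it is not abutils.Sequence, and it assumes that only slice objects are routed to the sequence string.

```python
class MiniSequence:
    """Hypothetical stand-in for illustration; not abutils.Sequence."""

    def __init__(self, data, id=None, id_key="seq_id", seq_key="vdj_nt"):
        self._from_dict = isinstance(data, dict)
        if self._from_dict:
            self.annotations = dict(data)
            self.id = data.get(id_key)
            self.sequence = data.get(seq_key)
        else:
            self.annotations = {}
            self.id = id
            self.sequence = data

    def __getitem__(self, key):
        # a slice indexes the sequence string; anything else is an annotation key
        if isinstance(key, slice):
            return self.sequence[key]
        return self.annotations[key]

    def __contains__(self, item):
        # dict-backed instances test annotation keys; otherwise do a motif search
        if self._from_dict:
            return item in self.annotations
        return item in self.sequence


dict_seq = MiniSequence({"seq_id": "seq1", "vdj_nt": "ACGT"})
str_seq = MiniSequence("ACGT", id="seq1")
```

With these two instances, "seq_id" in dict_seq is a key lookup, while "ACG" in str_seq is a substring test on the sequence.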

Note

When comparing Sequence objects, they are considered equal only if their sequences and IDs are identical. This means that two Sequence objects with identical sequences but without user-supplied IDs won’t be equal, because their IDs will have been randomly generated.
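The consequence of randomly generated IDs for equality can be sketched as follows. IdentifiedSequence is a hypothetical stand-in, not the abutils class; it assumes only what the text states, namely that a missing ID is filled with uuid.uuid4() and that equality compares both ID and sequence.

```python
import uuid


class IdentifiedSequence:
    """Hypothetical stand-in for illustration; not abutils.Sequence."""

    def __init__(self, sequence, id=None):
        self.sequence = sequence
        # a missing ID is replaced by a random UUID string
        self.id = id if id is not None else str(uuid.uuid4())

    def __eq__(self, other):
        if not isinstance(other, IdentifiedSequence):
            return NotImplemented
        # equal only when both the ID and the sequence match
        return (self.id, self.sequence) == (other.id, other.sequence)


anonymous_a = IdentifiedSequence("ATCG")
anonymous_b = IdentifiedSequence("ATCG")
named_a = IdentifiedSequence("ATCG", id="my_sequence")
named_b = IdentifiedSequence("ATCG", id="my_sequence")
```

Here anonymous_a and anonymous_b carry the same sequence but different random IDs, so they compare unequal, whereas named_a and named_b compare equal.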

property fasta: str

Returns the sequence, as a FASTA-formatted string.

Note

The FASTA string is built using Sequence.id and Sequence.sequence.

Returns:

Returns the sequence, as a FASTA-formatted string.

Return type:

str

property fastq

Returns the sequence, as a FASTQ-formatted string.

Note

The FASTQ string is built using Sequence.id, Sequence.sequence, and Sequence.qual.

Returns:

Returns the sequence, as a FASTQ-formatted string.

Return type:

str
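For readers unfamiliar with the two formats, the following standalone sketch shows the conventional FASTA and FASTQ record layouts built from an ID, a sequence, and a quality string. The function names to_fasta and to_fastq are hypothetical, and the exact strings produced by the fasta and fastq properties are assumed, not confirmed, to follow this layout.

```python
# Standalone sketch; function names are hypothetical and abutils is not imported.
def to_fasta(seq_id, sequence):
    """Header line starting with '>' followed by the sequence on the next line."""
    return f">{seq_id}\n{sequence}"


def to_fastq(seq_id, sequence, qual):
    """Four-line FASTQ record: '@' header, sequence, '+' separator, quality string."""
    if len(qual) != len(sequence):
        raise ValueError("quality string must be the same length as the sequence")
    return f"@{seq_id}\n{sequence}\n+\n{qual}"


fasta_record = to_fasta("my_sequence", "ATCG")
fastq_record = to_fastq("my_sequence", "ATCG", "IIII")
```

The length check in to_fastq is a design choice of this sketch: a FASTQ record needs one quality character per base, so a mismatch is reported early.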

property reverse_complement: str | Sequence

Returns the reverse complement of Sequence.sequence.

Returns:

Returns a str if in_place is False, otherwise returns an updated Sequence object in which the sequence property has been replaced with the reverse complement.

property annotations

Annotations is a dictionary that contains any additional information about the sequence. This can include sequence annotations, such as VDJ annotations, or any other information that might be useful to store with the sequence (like donor, group, etc).

translate(sequence_key: str | None = None, frame: int = 1) str

Translate a nucleotide sequence.

Parameters:
  • sequence_key (str, default=None) – Name of the annotation field containing the sequence to be translated. If not provided, Sequence.sequence is used.

  • frame (int, default=1) – Reading frame to translate. Default is 1.

Returns:

Translated sequence

Return type:

str

codon_optimize(sequence_key: str = 'sequence_aa', id_key: str = 'sequence_id', frame: int | None = None, as_string: bool = True) str | Sequence

Codon optimize a sequence.

Parameters:
  • sequence_key (str, default="sequence_aa") – Name of the annotation field containing the sequence to be codon optimized. If not provided, Sequence.sequence is used.

  • id_key (str, default="sequence_id") – Name of the annotation field containing the sequence ID. If not provided, Sequence.id is used.

  • frame (int, default=None) – Reading frame to translate.

  • as_string (bool, default=True) – If True, the optimized sequence will be returned as a str. If False, the optimized sequence will be returned as a Sequence object.

Returns:

Codon-optimized sequence as a str (if as_string is True) or Sequence object (if as_string is False).

Return type:

Union[str, Sequence]

as_fasta(name_field: str | None = None, seq_field: str | None = None) str

Returns the sequence, as a FASTA-formatted string.

Parameters:
  • name_field (str, default=None) – Name of the annotation field containing the sequence name. If not provided, Sequence.id is used.

  • seq_field (str, default=None) – Name of the annotation field containing the sequence. If not provided, Sequence.sequence is used.

Returns:

Returns the sequence, as a FASTA-formatted string.

Return type:

str

region(start=0, end=None)

Returns a region of Sequence.sequence, in FASTA format.

If called without kwargs, the entire sequence will be returned.

Parameters:
  • start (int, default=0) – Start position of the region to be returned. Default is 0.

  • end (int, default=None) – End position of the region to be returned. Negative values

Returns:

A region of Sequence.sequence, in FASTA format

Return type:

str
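A minimal sketch of what a FASTA-formatted region could look like, assuming that start and end behave like Python slice bounds (0-based, end-exclusive). That assumption is suggested by the default start=0 but is not confirmed by the text, and the function name fasta_region is hypothetical.

```python
# Standalone sketch; fasta_region is a hypothetical name and abutils is not imported.
def fasta_region(seq_id, sequence, start=0, end=None):
    """Return the slice sequence[start:end] as a FASTA-formatted string."""
    # end=None means "through the end of the sequence", as with a Python slice
    return f">{seq_id}\n{sequence[start:end]}"


whole = fasta_region("my_sequence", "ATCGATCG")
middle = fasta_region("my_sequence", "ATCGATCG", start=2, end=6)
```

Calling the function without start or end returns the entire sequence, which matches the behaviour described for region().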

keys()

Returns the keys of the annotations attribute.

values()

Returns the values of the annotations attribute.

get(key: str | int | float, default: str | int | float | None = None) str | int | float | None

Returns the value of a key in the annotations attribute.

Parameters:
  • key (Union[str, int, float]) – Key for the annotation to be returned.

  • default (Union[str, int, float, None], default=None) – Value to be returned if the key is not in the annotations attribute.

Returns:

Value of the key in the annotations attribute.

Return type:

Union[str, int, float, None]


abutils.core.sequence.reverse_complement(sequence: str | Sequence, in_place: bool = False) str | Sequence

Returns the reverse complement of a nucleotide sequence.

Parameters:
  • sequence (Union[str, Sequence]) – Nucleotide sequence to be reverse complemented.

  • in_place (bool, default=False) – If True, the input sequence will be reverse complemented in place. If False, the reverse complemented sequence will be returned as a str.

Returns:

If in_place is False, the reverse complemented sequence will be returned as a str. If in_place is True, the input sequence will be reverse complemented in place.

Return type:

Union[str, Sequence]
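The operation itself is simple to express in plain Python. The sketch below is a standalone illustration, not the abutils implementation: the name reverse_complement_sketch is hypothetical, and it handles only A, C, G, and T (in either case), passing any other character through unchanged.

```python
# Standalone sketch; does not call abutils. Handles only A/C/G/T in either case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequence):
    """Complement each base, then reverse the string."""
    return sequence.translate(_COMPLEMENT)[::-1]


rc = reverse_complement_sketch("ATCG")
```

Applying the function twice returns the original string, which is a quick sanity check for any reverse-complement routine.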


abutils.core.sequence.translate(sequence: Sequence, sequence_key: str | None = None, frame: int = 1, allow_dots: bool = False) str

Translates a nucleotide sequence.

Parameters:
  • sequence (Sequence) – Sequence object to be translated. Required.

  • sequence_key (str, default=None) – Name of the annotation field containing the sequence to be translated. If not provided, sequence.sequence is used.

  • frame (int, default=1) – Reading frame to translate. Default is 1.

  • allow_dots (bool, default=False) – If True, an all-dot codon ("...") will be translated as a single dot ("."). Useful when translating IMGT-gapped sequences. Default is False.

Returns:

translated – Translated sequence.

Return type:

str
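The roles of frame and allow_dots can be illustrated with a small standalone translator built on the standard genetic code. This is a sketch under stated assumptions, not the abutils implementation: the name translate_sketch is hypothetical, frame is assumed to be 1-based (frame=1 starts at the first nucleotide), trailing nucleotides that do not fill a codon are dropped, and unrecognized codons are rendered as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(sequence, frame=1, allow_dots=False):
    """Translate a nucleotide string (hypothetical helper; assumptions in comments)."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    # assumption: frame is 1-based, so frame=2 skips the first nucleotide
    seq = sequence.upper()[frame - 1:]
    residues = []
    # step through complete codons only; a trailing partial codon is ignored
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if allow_dots and codon == "...":
            # an all-dot codon (an IMGT gap) becomes a single dot
            residues.append(".")
        else:
            # assumption: unrecognized codons are shown as "X"
            residues.append(CODON_TABLE.get(codon, "X"))
    return "".join(residues)


protein = translate_sketch("ATGGCC")
gapped = translate_sketch("ATG...GCC", allow_dots=True)
shifted = translate_sketch("AATGGCC", frame=2)
```

In this sketch, the gapped input keeps its alignment column as "." only when allow_dots is True; with allow_dots left at False, the same codon falls through to the unrecognized-codon branch.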