sequence
Sequence objects are the fundamental building block for much of the abutils package.
Virtually all functions and methods that operate on one or more sequences will accept Sequence objects as input.
Sequence objects can be created from a variety of inputs, including strings, lists, dictionaries, and Biopython SeqRecord
objects. Below are some brief examples of how to create and use Sequence objects.
instantiation
abutils has a number of convenience functions for batch creation of Sequence objects from
common file formats, including FASTA, FASTQ, AIRR-C, and Parquet. Details and examples of these
functions can be found in the sequence I/O section.
Individual Sequence objects can be created from a string:
import abutils
# create a sequence from a string
sequence = abutils.Sequence("ATCG")
Note
If provided a string, the sequence ID will be randomly generated if not specified.
To specify the sequence ID, pass a string to the id argument:
# create a sequence from a string, with a user-specified ID
sequence = abutils.Sequence("ATCG", id="my_sequence")
Sequence objects can also be created from a list, of the form [id, sequence]:
# create a sequence from a list
sequence = abutils.Sequence(["my_sequence", "ATCG"])
Sequence objects can also be created from a dictionary, which provides a means for
including additional annotations beyond just the sequence and ID:
# create a sequence from a dictionary
sequence = abutils.Sequence(
{
"sequence_id": "my_sequence",
"sequence": "ATCG",
"productive": True,
"v_call": "IGHV1-2*02",
"d_call": "IGHD3-3*01",
"j_call": "IGHJ6*02",
}
)
# all annotations can be accessed using dictionary-style indexing
sequence["v_call"]  # "IGHV1-2*02"
Note
Dictionary keys are expected to follow the naming conventions of the
AIRR-C rearrangement schema. The Sequence object will automatically populate the special
properties id and sequence from the provided dictionary if the correct key names ("sequence_id"
and "sequence", respectively) are used.
usage
Sequence objects have several convenient properties and methods for common sequence manipulations:
# reverse complement (listed as a property in the API reference, so no call parentheses)
rc = sequence.reverse_complement
# translate
aa = sequence.translate()
api
- class abutils.core.sequence.Sequence(sequence: str | Iterable | dict | SeqRecord, id: str | None = None, qual: str | None = None, annotations: dict | None = None, id_key: str | None = None, seq_key: str | None = None)
Container for biological (RNA, DNA, or protein) sequences.
seq can be one of several things:
- a raw sequence, as a string
- an iterable, formatted as [seq_id, sequence]
- a dict, containing at least the sequence ID and a sequence. Alternate id_key and seq_key can be provided at instantiation.
- a Biopython SeqRecord object
- an abutils Sequence object
If seq is provided as a string, the sequence ID can optionally be provided via the id keyword argument. If seq is a string and id is not provided, a random sequence ID will be generated with uuid.uuid4().
Quality scores can be supplied with qual or as part of a SeqRecord object. If providing both a SeqRecord object with quality scores and quality scores via qual, the qual scores will override the SeqRecord quality scores.
If seq is a dictionary, typically the result of a MongoDB query, the dictionary can be accessed directly from the Sequence instance (via the annotations property). To retrieve the value for 'junc_aa' in the instantiating dictionary:
s = Sequence(d)  # d is the instantiating dictionary
junc = s['junc_aa']
If seq is a dictionary, an optional id_key and seq_key can be provided, which tell the Sequence object which fields to use to populate Sequence.id and Sequence.sequence. Defaults for both id_key and seq_key are None, which results in abutils trying to determine the appropriate key. For id_key, the following keys are tried: ['seq_id', 'sequence_id']. For seq_key, the following keys are tried: ['vdj_nt', 'sequence_nt', 'sequence']. If none of the attempts are successful, the Sequence.id or Sequence.sequence attributes will be None.
Alternately, the __getitem__() interface can be used to obtain a slice from the sequence attribute. An example of the distinction:
d = {'name': 'MySequence', 'sequence': 'ATGC'}
seq = Sequence(d, id_key='name', seq_key='sequence')
seq['name']  # 'MySequence'
seq[:2]  # 'AT'
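The key fallback and the two indexing modes described above can be illustrated with a small stand-in class. This is a minimal sketch under stated assumptions, not the abutils implementation: `MiniSequence` and `_first_present` are hypothetical names, only dictionary input is modeled, and treating integers like slices is an assumption.

```python
# Hypothetical stand-in for illustration only; not the abutils implementation.
ID_KEYS = ["seq_id", "sequence_id"]
SEQ_KEYS = ["vdj_nt", "sequence_nt", "sequence"]


def _first_present(d, explicit_key, candidates):
    """Return d[explicit_key] if a key was given, else the first candidate found."""
    if explicit_key is not None:
        return d.get(explicit_key)
    for key in candidates:
        if key in d:
            return d[key]
    return None  # no candidate key matched


class MiniSequence:
    def __init__(self, d, id_key=None, seq_key=None):
        self.annotations = dict(d)
        self.id = _first_present(d, id_key, ID_KEYS)
        self.sequence = _first_present(d, seq_key, SEQ_KEYS)

    def __getitem__(self, key):
        # Slices (and, by assumption, integers) index the sequence string;
        # any other key is looked up in the annotations.
        if isinstance(key, (slice, int)):
            return self.sequence[key]
        return self.annotations[key]


seq = MiniSequence({"name": "MySequence", "sequence": "ATGC"}, id_key="name")
print(seq.id)       # MySequence
print(seq["name"])  # MySequence
print(seq[:2])      # AT
```

Here seq_key is left as None, so the sequence is found through the fallback list rather than an explicit key.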
If the Sequence is instantiated with a dictionary, calls to __contains__() will return True if the supplied item is a key in the dictionary. In non-dict instantiations, __contains__() will look in the Sequence.sequence field directly (essentially a motif search). For example:
dict_seq = Sequence({'seq_id': 'seq1', 'vdj_nt': 'ACGT'})
'seq_id' in dict_seq  # True
'ACG' in dict_seq  # False
str_seq = Sequence('ACGT', id='seq1')
'seq_id' in str_seq  # False
'ACG' in str_seq  # True
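One way to picture this dual behavior is a class that remembers how it was instantiated and dispatches on that. The sketch below is hypothetical (`ContainsSketch` is not an abutils name, and the hard-coded 'seq_id' and 'vdj_nt' keys are a simplification); it only mirrors the membership rules stated above.

```python
# Hypothetical stand-in for illustration only; not the abutils implementation.
class ContainsSketch:
    def __init__(self, seq, id=None):
        self._from_dict = isinstance(seq, dict)
        if self._from_dict:
            self.annotations = dict(seq)
            self.id = seq.get("seq_id")        # simplified: fixed key names
            self.sequence = seq.get("vdj_nt")
        else:
            self.annotations = {}
            self.id = id
            self.sequence = seq

    def __contains__(self, item):
        if self._from_dict:
            return item in self.annotations    # dictionary key lookup
        return item in self.sequence           # substring (motif) search


dict_seq = ContainsSketch({"seq_id": "seq1", "vdj_nt": "ACGT"})
str_seq = ContainsSketch("ACGT", id="seq1")
print("seq_id" in dict_seq, "ACG" in dict_seq)  # True False
print("seq_id" in str_seq, "ACG" in str_seq)    # False True
```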
Note
When comparing Sequence objects, they are considered equal only if their sequences and IDs are identical. This means that two Sequence objects with identical sequences but without user-supplied IDs won't be equal, because their IDs will have been randomly generated.
- property fasta: str
Returns the sequence, as a FASTA-formatted string.
Note
The FASTA string is built using Sequence.id and Sequence.sequence.
- Returns:
Returns the sequence, as a FASTA-formatted string.
- Return type:
str
- property fastq
Returns the sequence, as a FASTQ-formatted string.
Note
The FASTQ string is built using Sequence.id, Sequence.sequence, and Sequence.qual.
- Returns:
Returns the sequence, as a FASTQ-formatted string
- Return type:
str
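For reference, a FASTQ record has four lines: an `@`-prefixed ID, the sequence, a `+` separator, and one quality character per base. The helper below is a minimal sketch of that layout; `to_fastq` is a hypothetical name, and the length check and the absence of a trailing newline are assumptions rather than documented abutils behavior.

```python
def to_fastq(seq_id: str, sequence: str, qual: str) -> str:
    """Build a four-line FASTQ record from an ID, a sequence, and quality characters."""
    if len(sequence) != len(qual):
        # assumption: reject records where bases and qualities do not line up
        raise ValueError("sequence and qual must be the same length")
    return f"@{seq_id}\n{sequence}\n+\n{qual}"


record = to_fastq("my_sequence", "ATCG", "IIII")
print(record)
```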
- property reverse_complement: str | Sequence
Returns the reverse complement of Sequence.sequence.
- Returns:
Returns a str if in_place is False, otherwise returns an updated Sequence object in which the sequence property has been replaced with the reverse complement.
- property annotations
Annotations is a dictionary that contains any additional information about the sequence. This can include sequence annotations, such as VDJ annotations, or any other information that might be useful to store with the sequence (like donor, group, etc).
- translate(sequence_key: str | None = None, frame: int = 1) -> str
Translate a nucleotide sequence.
- Parameters:
sequence_key (str, default=None) – Name of the annotation field containing the sequence to be translated. If not provided, Sequence.sequence is used.
frame (int, default=1) – Reading frame to translate. Default is 1.
- Returns:
Translated sequence
- Return type:
str
- codon_optimize(sequence_key: str = 'sequence_aa', id_key: str = 'sequence_id', frame: int | None = None, as_string: bool = True) -> str | Sequence
Codon optimize a sequence.
- Parameters:
sequence_key (str, default="sequence_aa") – Name of the annotation field containing the sequence to be codon optimized. If not provided, Sequence.sequence is used.
id_key (str, default="sequence_id") – Name of the annotation field containing the sequence id. If not provided, Sequence.id is used.
frame (int, default=1) – Reading frame to translate. Default is 1.
as_string (bool, default=True) – If True, the optimized sequence will be returned as a str. If False, the optimized sequence will be returned as a Sequence object.
- Returns:
Codon-optimized sequence as a str (if as_string is True) or Sequence object (if as_string is False).
- Return type:
Union[str, Sequence]
- as_fasta(name_field: str | None = None, seq_field: str | None = None) -> str
Returns the sequence, as a FASTA-formatted string.
- Parameters:
name_field (str, default=None) – Name of the annotation field containing the sequence name. If not provided, Sequence.id is used.
seq_field (str, default=None) – Name of the annotation field containing the sequence. If not provided, Sequence.sequence is used.
- Returns:
Returns the sequence, as a FASTA-formatted string.
- Return type:
str
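The field fallback described for as_fasta can be sketched as a standalone function. This is illustrative only: `fasta_from_fields` is a hypothetical name, it takes the ID, sequence, and annotations as plain arguments instead of reading them from a Sequence object, and the two-line output without a trailing newline is an assumption.

```python
def fasta_from_fields(seq_id, sequence, annotations, name_field=None, seq_field=None):
    """Format a FASTA record, optionally pulling the name and sequence from annotations."""
    # Fall back to the ID and sequence when no annotation field is named.
    name = annotations[name_field] if name_field is not None else seq_id
    seq = annotations[seq_field] if seq_field is not None else sequence
    return f">{name}\n{seq}"


annotations = {"sequence_id": "my_sequence", "sequence": "ATCG", "junction": "TGT"}
default_record = fasta_from_fields("my_sequence", "ATCG", annotations)
junction_record = fasta_from_fields("my_sequence", "ATCG", annotations, seq_field="junction")
print(default_record)
print(junction_record)
```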
- region(start=0, end=None)
Returns a region of Sequence.sequence, in FASTA format.
If called without kwargs, the entire sequence will be returned.
- Parameters:
start (int, default=0) – Start position of the region to be returned. Default is 0.
end (int, default=None) – End position of the region to be returned. Negative values
- Returns:
A region of Sequence.sequence, in FASTA format
- Return type:
str
- keys()
Returns the keys of the annotations attribute.
- values()
Returns the values of the annotations attribute.
- get(key: str | int | float, default: str | int | float | None = None) -> str | int | float | None
Returns the value of a key in the annotations attribute.
- Parameters:
key (Union[str, int, float]) – Key for the annotation to be returned.
default (Union[str, int, float, None], default=None) – Value to be returned if the key is not in the annotations attribute.
- Returns:
Value of the key in the annotations attribute.
- Return type:
Union[str, int, float, None]
- abutils.core.sequence.reverse_complement(sequence: str | Sequence, in_place: bool = False) -> str | Sequence
Returns the reverse complement of a nucleotide sequence.
- Parameters:
sequence (Union[str, Sequence]) – Nucleotide sequence to be reverse complemented.
in_place (bool, default=False) – If True, the input sequence will be reverse complemented in place. If False, the reverse complemented sequence will be returned as a str.
- Returns:
If in_place is False, the reverse complemented sequence will be returned as a str. If in_place is True, the input sequence will be reverse complemented in place.
- Return type:
Union[str, Sequence]
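As a reminder of what reverse complementing does to a string, here is a minimal pure-Python sketch. It is not the abutils implementation: `reverse_complement_sketch` is a hypothetical name, only the string case is covered, and leaving characters outside A/C/G/T (such as N) unchanged is an assumption.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_sketch(sequence: str) -> str:
    """Complement every base, then reverse the result."""
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement_sketch("ATCG"))  # CGAT
print(reverse_complement_sketch("AACN"))  # NGTT
```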
- abutils.core.sequence.translate(sequence: Sequence, sequence_key: str | None = None, frame: int = 1, allow_dots: bool = False) -> str
Translates a nucleotide sequence.
- Parameters:
sequence (Sequence) – Sequence object to be translated. Required.
sequence_key (str, default=None) – Name of the annotation field containing the sequence to be translated. If not provided, sequence.sequence is used.
frame (int, default=1) – Reading frame to translate. Default is 1.
allow_dots (bool, default=False) – If True, an all-dot codon ("...") will be translated as a single dot ("."). Useful when translating IMGT-gapped sequences. Default is False.
- Returns:
translated – Translated sequence.
- Return type:
str
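The effect of frame and allow_dots can be seen in a small standalone translator. This is a sketch under several assumptions, not the abutils implementation: `translate_sketch` is a hypothetical name that works on plain strings, frame is taken to be 1, 2, or 3 on the forward strand, input is assumed to use the DNA alphabet, trailing partial codons are dropped, and unrecognized codons become "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(sequence: str, frame: int = 1, allow_dots: bool = False) -> str:
    """Translate a DNA string, starting at the 1-based reading frame."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2, or 3")
    seq = sequence.upper()[frame - 1:]
    residues = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if allow_dots and codon == "...":
            residues.append(".")  # IMGT-style gap codon
        else:
            residues.append(CODON_TABLE.get(codon, "X"))  # assumption: X for unknowns
    return "".join(residues)


print(translate_sketch("ATGGCC"))                         # MA
print(translate_sketch("ATGGCC", frame=2))                # W
print(translate_sketch("ATG...GCC", allow_dots=True))     # M.A
```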