Question

Why does Seq accept only a string as a sequence?


9.6 years ago

athinkerer • 0

I have an application in which I need to align sequences of words. The application is not related to bioinformatics, but I was hoping to leverage Biopython's support for sequence alignment. However, I ran into a TypeError when I tried to create a sequence of words:

In [1]: from Bio.Seq import Seq

In [2]: seq_1 = Seq(['abc', 'def', 'ghi', 'jkl'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-567b8a62bb5f> in <module>()
----> 1 seq_1 = Seq(['abc', 'def', 'ghi', 'jkl'])

/usr/local/lib/python2.7/dist-packages/Bio/Seq.pyc in __init__(self, data, alphabet)
    104         # Enforce string storage
    105         if not isinstance(data, basestring):
--> 106             raise TypeError("The sequence data given to a Seq object should "
    107                             "be a string (not another Seq object etc)")
    108         self._data = data

TypeError: The sequence data given to a Seq object should be a string (not another Seq object etc)

I was curious whether there is a reason Biopython requires sequences to be supplied as strings, which effectively limits alignment to sequences of characters. Is this because in bioinformatics only RNA, DNA and protein sequences, all of which can be encoded as sequences of characters, ever need to be aligned? Or is there a more subtle reason (possibly performance) behind this choice?

sequence alignment biopython • 3.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by athinkerer • 0


Just out of curiosity, what does "align" mean to you in this context? The various sequence aligners are all built on classic string-matching algorithms, which of course necessitate a string-like datatype.

ADD REPLY • link updated 2.3 years ago by Ram 43k • written 9.6 years ago by Dan D 7.4k

Ram · Answer 1 · 2014-09-14

A biological sequence given as input to an alignment algorithm is either a nucleotide sequence (DNA) or an amino-acid (peptide, protein) sequence: https://en.wikipedia.org/wiki/Sequence_(biology). Such sequences, or more precisely their linear 1D structure, can be represented as strings over 4 nucleotide or 21 amino-acid characters, respectively. Algorithms in bioinformatics are therefore often called algorithms on strings or on sequences. See http://www.math.northwestern.edu/~mlerma/courses/cs310-04w/notes/dm-sequences.pdf for an example or, citing Wikipedia:

Let Σ be a non-empty finite set of symbols (alternatively called characters), called the alphabet. No assumption is made about the nature of the symbols. A string (or word) over Σ is any finite sequence of symbols from Σ.[1]

Sequence algorithms are formally specified as operating on a sequence over an arbitrary alphabet, so they can also be applied in other contexts. In linguistics, for example, the alphabet could consist of the entries of a finite dictionary (i.e. "words"), and an alignment algorithm such as Smith-Waterman (SW) could align natural-language sentences against a large corpus. SW might not be a good choice for natural language processing, though, because it does not handle inversions and natural languages do not have a strict word order.
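To illustrate that the algorithm itself does not care what the symbols are, here is a minimal sketch of Smith-Waterman local alignment written against generic Python sequences rather than strings. The function name `local_align` and the scoring values (match 2, mismatch -1, linear gap -1) are assumptions chosen for illustration; this is not Biopython code and not a tuned implementation.

```python
from typing import Hashable, List, Optional, Sequence, Tuple

Pair = Tuple[Optional[Hashable], Optional[Hashable]]


def local_align(
    a: Sequence[Hashable],
    b: Sequence[Hashable],
    match: int = 2,      # illustrative scores, not taken from any library
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[Pair]]:
    """Smith-Waterman local alignment over arbitrary comparable symbols."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the scoring matrix; cells are clamped at 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    pairs: List[Pair] = []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # gap in b
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # gap in a
            j -= 1
    pairs.reverse()
    return best, pairs


sentence_1 = "the quick brown fox jumps".split()
sentence_2 = "a quick red fox sleeps".split()
best_score, aligned = local_align(sentence_1, sentence_2)
print(best_score, aligned)
```

The symbols only need to support `==`, so lists of words work as well as strings of characters. The price is the one discussed next: a pure-Python matrix of this kind is far slower than implementations specialised for characters.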

However, implementations of these algorithms and data structures in software tend to sacrifice this flexibility for efficiency, a trade-off that the narrow application domain makes possible.
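One possible workaround when a tool only accepts strings is to map every distinct word to a single character first and to decode the result afterwards. The sketch below shows that idea; the helpers `encode_tokens` and `decode_string` are hypothetical names, and whether a given aligner accepts the resulting characters (or scores them sensibly) is an assumption you would need to check for that tool.

```python
from typing import Dict, Iterable, List, Tuple


def encode_tokens(
    sequences: Iterable[List[str]], start: int = ord("A")
) -> Tuple[List[str], Dict[str, str]]:
    """Map each distinct token to one character, shared across all sequences."""
    codebook: Dict[str, str] = {}
    encoded: List[str] = []
    for tokens in sequences:
        chars = []
        for token in tokens:
            if token not in codebook:
                # Assign the next unused code point, counting up from `start`.
                codebook[token] = chr(start + len(codebook))
            chars.append(codebook[token])
        encoded.append("".join(chars))
    return encoded, codebook


def decode_string(encoded: str, codebook: Dict[str, str]) -> List[str]:
    """Invert encode_tokens for a string that contains only codebook characters."""
    reverse = {char: token for token, char in codebook.items()}
    return [reverse[char] for char in encoded]


encoded, codebook = encode_tokens([
    ["abc", "def", "ghi", "jkl"],
    ["abc", "ghi", "xyz"],
])
print(encoded)  # ['ABCD', 'ACE']
print(decode_string(encoded[1], codebook))  # ['abc', 'ghi', 'xyz']
```

Because code points are assigned upwards from `"A"`, they never collide with the `-` commonly used as a gap character, but a large vocabulary will run into punctuation and lowercase letters, so a different `start` may be preferable. Substitution matrices designed for nucleotides or amino acids would also be meaningless for such an encoding.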