Question: Why does Seq accept only a string as a sequence?
Asked 5.2 years ago by athinkerer (United States):

I have an application in which I need to align sequences of words. The application is not related to bioinformatics, but I was hoping to leverage Biopython's support for sequence alignment. However, I ran into a TypeError when I tried to create a sequence of words:

In [1]: from Bio.Seq import Seq

In [2]: seq_1 = Seq(['abc', 'def', 'ghi', 'jkl'])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-567b8a62bb5f> in <module>()
----> 1 seq_1 = Seq(['abc', 'def', 'ghi', 'jkl'])

/usr/local/lib/python2.7/dist-packages/Bio/Seq.pyc in __init__(self, data, alphabet)
    104         # Enforce string storage
    105         if not isinstance(data, basestring):
--> 106             raise TypeError("The sequence data given to a Seq object should "
    107                             "be a string (not another Seq object etc)")
    108         self._data = data

TypeError: The sequence data given to a Seq object should be a string (not another Seq object etc)

I am curious whether there is a reason Biopython requires sequences to be supplied as strings, which essentially limits alignment to sequences of characters. Is this because in bioinformatics only RNA, DNA and protein sequences, all of which can be encoded as sequences of characters, ever need to be aligned? Or is there a more subtle reason (possibly performance) behind this choice?


Just out of curiosity, what does "align" mean to you in this context? The various sequence aligners are all built on classic string-matching algorithms, which of course necessitate a string-like datatype.

Comment written 5.2 years ago by Dan D
Answered 5.2 years ago by Michael Dondrup (Bergen, Norway):

A biological sequence given as input to an alignment algorithm is either a nucleotide sequence (DNA or RNA) or an amino-acid (peptide, protein) sequence: https://en.wikipedia.org/wiki/Sequence_(biology). Such sequences, or more precisely their linear 1D structure, can be represented as strings over an alphabet of 4 nucleotide characters or 21 amino-acid characters, respectively. Algorithms in bioinformatics are therefore often described as algorithms on strings or on sequences. See http://www.math.northwestern.edu/~mlerma/courses/cs310-04w/notes/dm-sequences.pdf for an example or, citing Wikipedia:

Let Σ be a non-empty finite set of symbols (alternatively called characters), called the alphabet. No assumption is made about the nature of the symbols. A string (or word) over Σ is any finite sequence of symbols from Σ.[1]

Sequence algorithms are formally specified as operating on sequences over an arbitrary alphabet, so they can also be applied in other contexts. In linguistics, for example, the alphabet could consist of the entries of a finite dictionary (i.e. "words"), and an alignment algorithm such as Smith-Waterman (SW) could align natural-language sentences against a large corpus. SW might not be a good choice for natural language processing, though, because it does not handle inversions and natural languages do not have a strict word order.
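
To illustrate that nothing in the algorithm itself depends on characters, here is a minimal pure-Python sketch of a global alignment (Needleman-Wunsch, not Smith-Waterman, and not Biopython's implementation) over lists of arbitrary tokens. The function name `align_tokens`, the scoring values and the `"-"` gap symbol are all assumptions made up for this example; the only thing it requires of a token is that it can be compared with `==`.

```python
def align_tokens(a, b, match=1, mismatch=-1, gap=-1, gap_symbol="-"):
    """Globally align two token sequences (Needleman-Wunsch sketch).

    Returns (score, aligned_a, aligned_b). Tokens may be any objects
    that support ==, e.g. whole words instead of single characters.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                aligned_a.append(a[i - 1])
                aligned_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append(gap_symbol)
            i -= 1
        else:
            aligned_a.append(gap_symbol)
            aligned_b.append(b[j - 1])
            j -= 1
    aligned_a.reverse()
    aligned_b.reverse()
    return score[n][m], aligned_a, aligned_b


if __name__ == "__main__":
    s, x, y = align_tokens(["abc", "def", "ghi", "jkl"], ["abc", "ghi", "jkl"])
    print(s)
    print(x)
    print(y)
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths. That is fine for sentences but is part of why production aligners are more specialised.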

However, implementations of these algorithms and data structures in software tend to sacrifice this flexibility for efficiency, which the restricted alphabets of the application domain make possible.
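
One possible workaround when a tool only accepts strings is to encode each distinct word as a single character and decode the result afterwards. The sketch below assumes Python 3 and uses code points from the Unicode Private Use Area starting at U+E000, which caps the vocabulary at 6400 distinct words. The helper names `encode_tokens` and `decode_string` are hypothetical, and whether a particular string-only aligner accepts such characters (many expect a fixed nucleotide or protein alphabet) is an assumption you would need to check.

```python
def encode_tokens(sequences, first_code_point=0xE000):
    """Map every distinct token to one private-use Unicode character.

    Returns (encoded_strings, char_to_token). The same token always maps
    to the same character across all of the given sequences.
    """
    token_to_char = {}
    encoded = []
    for seq in sequences:
        chars = []
        for token in seq:
            if token not in token_to_char:
                # Next unused code point; U+E000..U+F8FF holds 6400 symbols.
                token_to_char[token] = chr(first_code_point + len(token_to_char))
            chars.append(token_to_char[token])
        encoded.append("".join(chars))
    char_to_token = {c: t for t, c in token_to_char.items()}
    return encoded, char_to_token


def decode_string(encoded, char_to_token):
    """Turn an encoded (possibly gapped) string back into a list of tokens.

    Characters without a table entry, such as a gap symbol, pass through as-is.
    """
    return [char_to_token.get(c, c) for c in encoded]


if __name__ == "__main__":
    strings, table = encode_tokens([["abc", "def", "ghi", "jkl"],
                                    ["abc", "ghi", "jkl"]])
    # Each word list is now an ordinary string, one character per word.
    print([len(s) for s in strings])
    print(decode_string(strings[0], table))
```

The encoded strings carry no information about how similar two words are, so any scoring that depends on a substitution matrix for letters would have to be replaced by plain match/mismatch scoring.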
