Question

Checking Validity Of Dna Sequence In Bio* Toolkits?

2

Entering edit mode

11.1 years ago

Hernán ▴ 220

Given a presumed DNA sequence. How to check if it is actually a valid sequence using the Bio* toolkits? (BioRuby, Biopython, BioPerl, Bioconductor, BioJava, etc) For example, given:

'actgtactgatcga'

is a valid sequence, because it contains the . But

'actgtactgzzoootcga'

is not a valid sequence (because of letters zzooo).

For the sake of reference, this is the way I am currently parsing in BioSmalltalk:

#dnaSequence asParser matches: 'gtgacttagcgacttagc'

bioperl biopython biojava bioconductor • 8.8k views

ADD COMMENT • link updated 11.1 years ago by John 13k • written 11.1 years ago by Hernán ▴ 220

0

Entering edit mode

Here are the IUPAC codes for possible letters in DNA sequences:http://www.bioinformatics.org/sms/iupac.html

You can just use a regex to check if any letters other than the IUPAC letters appear in your string.

ADD REPLY • link 11.1 years ago by Damian Kao 16k

0

Entering edit mode

Thanks Damian. Regular expressions are a form of declarative programming which are not appropriate for my current context. Besides regex are really hard to debug, is too easy to make mistakes, you have to compile them, and possible backtracking during matching, etc.

ADD REPLY • link 11.1 years ago by Hernán ▴ 220

score 1 · Answer 1 · 2013-03-28

1

Entering edit mode

11.1 years ago

Istvan Albert 100k

Usually the more advanced toolkits will offer you the choice of an Alphabet that restricts the codes to be between certain limits. For example for BioPython you would use something like

Seq("AGTACACTGGT", IUPAC.unambiguous_dna)

see more at http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc17

That being said when this check is done from BioPython it will probably add a substantial overhead. If you need to do this hundreds of millions of times (say filter high throughput data) you will be better off using tools like grep and its ilk.

ADD COMMENT • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

Thanks Albert. I am aware of the sequence objects in Biopython and similar toolkits, however I want to parse a given DNA, not to build a Sequence object.

ADD REPLY • link 11.1 years ago by Hernán ▴ 220

0

Entering edit mode

what I wrote would still apply, parse with SeqIO then transform each record into another sequence but now with an alphabet.

but then as it turns out in BioPython the the SeqIO.parse method already takes an alphabet

ADD REPLY • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

I have checked the SeqIO methods, but it seems they are intended to read or parse from files with a specific format (e.g. "fasta"). Furthermore, the parse() method creates iterator and additional SeqRecord object which is beyond my needs. It could be even costly for critical requirements. Haven't found yet a way to parse "just a DNA string".

ADD REPLY • link 11.1 years ago by Hernán ▴ 220

0

Entering edit mode

what you say is very contradictory - a statement that you want to parse a sequence but not create a Sequence object makes very little sense. The definition of parsing is to transform the data into something else. What you seem to need is validating a string that it contains just certain characters.

That being said (and this is why actually do this) one can learn surprising things when trying answer other people's questions. As it turns out Biopython will not enforce the Alphabet, it will happily chug along even if you give it an invalid data that does not match the alphabet. Moreover there does not seem to be a recommended way to actually validate the alphabet, I found only the a strangely named, seemingly private module level function called _verify_alphabet so then the full example would be:

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio import Alphabet

s = Seq("AGTACACTGGTZ", IUPAC.unambiguous_dna)

print Alphabet._verify_alphabet(s)

the actual python code to do it is also very simple:

def check(sequence,  code="ATGC"):
    for base in sequence: 
        if base not in code: 
            return False 
    return True

s = "AGTACACTGGTZ"
print check(s)

ADD REPLY • link 11.1 years ago by Istvan Albert 100k

0

Entering edit mode

Let me clarify that there is meaning behind my use case. Biologists in my lab usually copy and paste strings. Sometimes they want to validate a supposedly DNA string as it may contain invalid characters. Building a sequence object seems to be expensive for this case. I imagined the case for guessing alphabets with a matching percentage, but that would be impossible with your example as the sequence is created with a specific IUPAC alphabet. I expect to create a reified parser, and be able to build a parse tree or attach optional semantic actions as you've commented. Thanks for the insight of the Biopython library BTW.

ADD REPLY • link 11.1 years ago by Hernán ▴ 220

0

Entering edit mode

Maybe this is a case of you over-optimizing before you've even written the function. Are you expecting to deploy this function on gigabytes of sequence data with hundreds of users? Or is it just a simple tool for biologists to copy a few kb of sequences and spit out an answer? If it is the latter, I don't think you really need to worry that much about computational expenditure.

ADD REPLY • link 11.1 years ago by Damian Kao 16k

3

Entering edit mode

I think this is a case of abstraction inversion antipattern. It seems the parser interface is not exposed (at least in Biopython), maybe it is a design decision? Why? Even if you cannot anticipate user requirements and assuming occasional execution, decoupling interfaces could even provide better cyclomatic complexity and make software more maintainable. API design is a really difficult practice.

However, there are performance penalities by creating unnecessary sequence objects. I've adapted and measured both implementations using the code provided by Istvan. The Biopython code for creating random sequences from 1000 to 40000 bases by steps of 1000:

import time
start = time.clock()
import random
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio import Alphabet
for y in xrange(1000,40000,1000):
    dna_string = ''.join([random.choice('AGTC') for x in xrange(1,y)])
    s = Seq(dna_string, IUPAC.unambiguous_dna)
    Alphabet._verify_alphabet(s)    
end = time.clock()
print "Execution time:" , end - start

Execution time: 2.25471379742
Execution time: 2.25709713106
Execution time: 2.22140774875

and the code without Biopython:

import time
start = time.clock()
import random
for y in xrange(1000,40000,1000):
    dna_string = ''.join([random.choice('AGTC') for x in xrange(1,y)])
    for base in dna_string: 
        if base not in "ATGC": 
            result = False 
        else: 
            result = True
end = time.clock()
print "Execution time:" , end - start

Execution time: 1.60646611511
Execution time: 1.63292038513
Execution time: 1.59401362464

ADD REPLY • link 11.1 years ago by Hernán ▴ 220

score 1 · Answer 2 · 2013-03-28

if yourString.strip('acgt') == '':
    print "No problems here dawg.."
elif yourString.strip('acgtryswkmbdhvn.-') == '':
    print "Looks like we're dealing with a badass over here!"
else:
    print "Your string is wack fool. Maybe try installing a few packages for no reason because i'm being sarcastic."

score 0 · Answer 3 · 2013-03-28

0

Entering edit mode

11.1 years ago

Damian Kao 16k

If you want to do it without biopython and regex, you can do something like this in python:

seq = 'actgtactgzzoootcga'

seq = set(seq.lower())
IUPAC = ['a','c','g','t','r','y','s','w','k','m','b','d','h','v','n','.','-']

for char in IUPAC:
   if char in seq:
      seq.remove(char)

if len(seq) > 0:
   print 'sequence contains non IUPAC characters'
else:
   print 'sequence passes check'

ADD COMMENT • link 11.1 years ago by Damian Kao 16k

0

Entering edit mode

Thanks for your snippet Damian. Actually I am searching for how it is implemented using the Bio* toolkits. I guess that Biopython would require fewer SLOC.

ADD REPLY • link 11.1 years ago by Hernán ▴ 220