Wrong Alphabet In Biopython.Motifs When Trying To Pssm.Calculate
1
1
Entering edit mode
10.9 years ago

I'm trying to find motif occurences in the upstream:

upstream = Seq(mysequence, IUPAC.unambiguous_dna)
pssm.calculate(upstream)

get error:

ValueError: Wrong alphabet! Use only with DNA motifs

if I check alphabets:

>>>pssm.alphabet
IUPACUnambiguousDNA()
>>>upstream.alphabet
IUPACUnambiguousDNA()

The source code where is fails:

  if self.alphabet!=IUPAC.unambiguous_dna:
        raise ValueError("Wrong alphabet! Use only with DNA motifs")

What could I do wrong? What's really the difference between IUPACUnambiguousDNA() and IUPAC.unambiguous_dna?

biopython motif • 3.9k views
ADD COMMENT
4
Entering edit mode
10.9 years ago

The problem is with biopython - it tests instances for equality instead of testing them for having the same class.

IUPACUnambiguousDNA() and IUPAC.unambiguous_dna may actually represent different instances of the same class you can see that by doing something like:

print id(pssm.alphabet)
print id(upstream.alphabet)

These will not test as being equal when checking with the = operator.

When you create the pssm object make sure to use the the IUPAC.unambiguous_dna in the constructor.

ADD COMMENT
2
Entering edit mode

I agree with your diagnosis and the suggested workaround Istvan. We'll try to fix this... http://lists.open-bio.org/pipermail/biopython-dev/2013-June/010639.html - Thanks!

ADD REPLY
0
Entering edit mode

Indeed

   >>id(upstream.alphabet)
   4414891728
   >>id(pssm.alphabet)
   4428032464

I'm not sure I use alphabet in the constructor properly:

   instances = []
           for line in alignment:
               instances.append(Seq(line, IUPAC.unambiguous_dna))
           m = motifs.create(instances)
           pwm = m.counts.normalize(pseudocounts=PSEUDOCOUNT)
           background = {'A':0.18,'C':0.32,'G':0.32,'T':0.18}
           pssm = pwm.log_odds(background)

What I also tried to do and it didn't work:

   upstreamList = [Seq(u, pssm.alphabet) for u in upstreamList]

previously it was

   upstreamList = [Seq(u, IUPAC.unambiguous_dna) for u in upstreamList]

The likely reason is that I create pssm in one run and cPickle it and then unpickle when I need to apply. It works fine if I create pssm in the same run as I apply, but constructing matrix takes time, so I'd like to figure out how to pickle it if it's possible at all.

ADD REPLY
0
Entering edit mode

Pickling and unpickling will result in a new instance of the alphabet class, and you'll hit the bug again. You'll have to wait for the next Biopython release where this is fixed, or apply the fix locally. See the mailing list discussion linked to above.

ADD REPLY

Login before adding your answer.

Traffic: 1960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6