Question

Can Biopython Test Sequences With Ambiguity Codes For Equality?

0

10.2 years ago

tflutre ▴ 580

I have multiplexed data from a RADseq experiment (digestion with ApeKI). Each sample has a tag. For each read, I want to check if the tag occurs next to what remains of the ApeKI motif after the cut. For instance, individual 1 has tag ATAT and ApeKI cuts at G/CWGC (where W is A or T in the IUPAC code). I therefore should see something like "ATATCAGC..." at the beginning of my read. What would be the best way to do this with Biopython?

My attempt 1: On the command line, I ask the user to provide the information as "ApeKI=G/CWGC", and then I parse it internally. However, as testing for sequence equality is a difficult matter (it's context-dependent), Biopython doesn't provide much support for it:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> x = Seq("CWGC", IUPAC.ambiguous_dna)
>>> y = Seq("CAGC", IUPAC.ambiguous_dna)
>>> x == y
...<warning>...
False
>>> str(x) == str(y)
False

My attempt 2: another way is to create a motif:

>>> from Bio import motifs
>>> m = motifs.create([Seq("CAGC"), Seq("CTGC")])
>>> m.degenerate_consensus
Seq('CWGC', IUPACAmbiguousDNA())
>>> "CAGC" in [str(i) for i in m.instances]
True

However, if the user provides me with "ApeKI=G/CWGC", I need to know that W is A or T in the IUPAC code. But it seems that this information is not in Biopython. I can only get a list of letters:

>>> IUPAC.ambiguous_dna.letters
'GATCRYWSMKHBVDN'

Would it be possible to add a dictionary as attribute, so that IUPAC.ambiguous_dna.dict["W"] would return ["A", "T"] for instance?

(I edited the first code block to reflect Peter's comment.)

sequence biopython • 4.4k views

ADD COMMENT • link 10.2 years ago by tflutre ▴ 580

0

Could you paste the actual code and output? Your imports are inconsistent with the code. Also didn't you get a warning about comparison? Perhaps your Biopython is very old?

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> x = Seq("CWGC", IUPAC.ambiguous_dna)
>>> y = Seq("CAGC", IUPAC.ambiguous_dna)
>>> x == y
/Library/Python/2.7/site-packages/Bio/Seq.py:179: FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception). In the interim please use id(seq1)==id(seq2) or str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
  "and to avoid this warning.", FutureWarning)
False

ADD REPLY • link 10.2 years ago by Peter 6.0k

0

@Peter sorry, you're right, I updated my code block

ADD REPLY • link 10.2 years ago by tflutre ▴ 580

score 1 · Answer 1 · 2014-02-24

1

10.2 years ago

Peter 6.0k

It isn't clear what you are actually asking, but the final question (how to find out that W is A or T in the IUPAC code) can be answered with one of the dictionaries included in Biopython:

>>> from Bio.Data.IUPACData import ambiguous_dna_values
>>> ambiguous_dna_values["W"]
'AT'
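
For the original goal (checking that a read starts with the tag followed by what remains of the cut site), that dictionary can drive a position-by-position comparison. The following is only a minimal sketch: the helper names `parse_enzyme_arg` and `starts_with_motif` and the example reads are hypothetical, not part of Biopython, and it assumes the part of the site after the "/" is what stays on the read, as in the "ATATCAGC..." example from the question. A fallback copy of the IUPAC table is included so the sketch also runs without Biopython installed.

```python
# Sketch only: the helper names below are hypothetical, not part of Biopython.
try:
    # Real Biopython dictionary, e.g. ambiguous_dna_values["W"] == "AT"
    from Bio.Data.IUPACData import ambiguous_dna_values
except ImportError:
    # Fallback copy of the IUPAC DNA codes so the sketch runs without Biopython
    ambiguous_dna_values = {
        "A": "A", "C": "C", "G": "G", "T": "T",
        "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
        "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
    }


def parse_enzyme_arg(arg):
    """Split 'ApeKI=G/CWGC' into (name, remnant expected on the read)."""
    name, site = arg.split("=", 1)
    # Assumption: the part after the cut mark '/' is what remains on the read.
    remnant = site.split("/", 1)[1]
    return name, remnant


def starts_with_motif(read, motif):
    """True if the start of the read is compatible with the ambiguous motif."""
    read = read.upper()
    motif = motif.upper()
    if len(read) < len(motif):
        return False
    # Each read base must be one of the bases allowed by the motif's IUPAC code.
    return all(
        base in ambiguous_dna_values[code] for base, code in zip(read, motif)
    )


name, remnant = parse_enzyme_arg("ApeKI=G/CWGC")
expected_prefix = "ATAT" + remnant  # tag of individual 1 + cut-site remnant

print(starts_with_motif("ATATCAGCTTGA", expected_prefix))  # W matches A
print(starts_with_motif("ATATCTGCTTGA", expected_prefix))  # W matches T
print(starts_with_motif("ATATCGGCTTGA", expected_prefix))  # G is not allowed by W
```

The comparison is deliberately one-directional: the motif may contain ambiguity codes, while each read base has to be listed among the values of the corresponding code, so an "N" in the read will not match "W".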

P.S. Biopython has an entire module for restriction digest enzymes.

ADD COMMENT • link 10.1 years ago by Peter 6.0k

0

@Peter thanks, that's what I was looking for but couldn't find anywhere in the docs. Thanks also for mentioning Bio.Restriction (http://biopython.org/DIST/docs/cookbook/Restriction.html), I'll definitely look at it.

ADD REPLY • link 10.2 years ago by tflutre ▴ 580