I have multiplexed data from a RADseq experiment (digestion with ApeKI). Each sample has a tag. For each read, I want to check if the tag occurs next to what remains of the ApeKI motif after the cut. For instance, individual 1 has tag ATAT and ApeKI cuts at G/CWGC (where W is A or T in the IUPAC code). I therefore should see something like "ATATCAGC..." at the beginning of my read. What would be the best way to do this with Biopython?
My attempt 1: From the cmd-line, I ask the user to provide the following information "ApeKI=G/CWGC" and then I parse it internally. However, as testing for sequence equality is a difficult matter (it's contexte dependent), Biopython doesn't provide much support for it:
>>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> x = Seq("CWGC", IUPAC.ambiguous_dna) >>> y = Seq("CAGC", IUPAC.ambiguous_dna) >>> x == y ...<warning>... False >>> str(x) == str(y) False
My attempt 2: another way is to create a motif:
>>> from Bio import motifs >>> m = motifs.create([Seq("CAGC"), Seq("CTGC")]) >>> m.degenerate_consensus Seq('CWGC', IUPACAmbiguousDNA()) >>> "CAGC" in [str(i) for i in m.instances] True
However, if the user provides me with "ApeKI=G/CWGC", I need to know that W is A or T in the IUPAC code. But it seems that this information is not in Biopython. I can only get a list of letters:
>>> IUPAC.ambiguous_dna.letters 'GATCRYWSMKHBVDN'
Would it be possible to add a dictionary as attribute, so that IUPAC.ambiguous_dna.dict["W"] would return ["A", "T"] for instance?
(I edited the first code block to reflect Peter's comment.)