Question: Biopython pairwise sequence comparison using IUPAC?
0
c.doorenweerd0 wrote:

Hi,

I am trying to count the number of haplotypes in a group of sequences using biopython. My sequences have IUPAC ambiguities, which should not count as differences. E.g.:

ACTYG == ACTCG should be: True

I've tried to compare using == but biopython seems to read the sequences as strings for this comparison and does not incorporate the IUPAC codes?

I've also tried to use

``````hapcount = len(recordlist)
for a, b in itertools.combinations(recordlist, 2):
if calculator._pairwise(a, b) == 0:
print("match")
hapcount -= 1
``````

but the pairwise comparison seems to count IUPAC codes as differences too??

Any ideas on how to do this would be much appreciated. It seems like something simple.

biopython • 130 views
written 8 weeks ago by c.doorenweerd0
1

The answer to your question is not that trivial, because we can't say the two sequences are "equal", but rather "compatible". For example:

ACT and ACA -> False
ACT and ACN -> True
ACY and ACN -> True
ACY and ACR -> False
ACY and ACM -> Maybe

I don't know the format of your sequences, but maybe you can use this:

``````from itertools import product
from Bio.Data.IUPACData import ambiguous_dna_values

def ambiguous_dna_list(seq):
"""return list of all possible sequences given an ambiguous DNA input"""
d = ambiguous_dna_values
return list(map("".join, product(*map(d.get, seq))))

myseq1 = "ACTYG"
myseq2 = "ACTCG"

seq1_list = ambiguous_dna_list(myseq1)
seq2_list = ambiguous_dna_list(myseq2)

for seq in seq1_list:
for seq2 in seq2_list:
if seq == seq2:
print("match")
``````

In my case, it says there is a match.

Thank you very much! You made me rethink what I was asking. I was not necessarily looking for sequences that are compatible, but sequences that are confidently 'unique'; so that ambiguities would not count as differences. I managed to write this function that does what I need:

``````def IUPACdistance_verbose(seq1, seq2):
ignorelist = ["N","?","-","M","R","W","S","Y","K","V","H","D","B"]
unamblengthseq1 = len(seq1.translate(str.maketrans('','','N?-MRWSYKVHDB')))
print("seq1 bp: " + str(unamblengthseq1))
unamblengthseq2 = len(seq2.translate(str.maketrans('','','N?-MRWSYKVHDB')))
print("seq2 bp: " + str(unamblengthseq2))
comparedlength = min(unamblengthseq1, unamblengthseq2)
print("compared length: " + str(comparedlength))
difference = 0
for x, y in zip(seq1.upper(), seq2.upper()):
if x in ignorelist or y in ignorelist:
difference += 0
elif x != y:
difference += 1
print("difference:" + str(difference))
dpairwise = (difference / comparedlength)
return dpairwise

seq1 = 'AARCTGACACT'
seq2 = 'TRCCTGN-ACT'

IUPACdistance_verbose(seq1, seq2)

seq1 bp: 10
seq2 bp: 8
compared length: 8
difference:1
Out:
0.125
``````