Question: Biopython pairwise sequence comparison using IUPAC?
0
gravatar for c.doorenweerd
8 weeks ago by
c.doorenweerd0 wrote:

Hi,

I am trying to count the number of haplotypes in a group of sequences using biopython. My sequences have IUPAC ambiguities, which should not count as differences. E.g.:

ACTYG == ACTCG should be: True

I've tried to compare using == but biopython seems to read the sequences as strings for this comparison and does not incorporate the IUPAC codes?

I've also tried to use

hapcount = len(recordlist)
for a, b in itertools.combinations(recordlist, 2):
    if calculator._pairwise(a, b) == 0:
        print("match")
        hapcount -= 1

but the pairwise comparison seems to count IUPAC codes as differences too??

Any ideas on how to do this would be much appreciated. It seems like something simple.

biopython • 130 views
ADD COMMENTlink written 8 weeks ago by c.doorenweerd0
1

The answer to your question is not that trivial, because we can't say the two sequences are "equal", but rather "compatible". For example:

ACT and ACA -> False
ACT and ACN -> True
ACY and ACN -> True
ACY and ACR -> False
ACY and ACM -> Maybe

I don't know the format of your sequences, but maybe you can use this:

from itertools import product
from Bio.Data.IUPACData import ambiguous_dna_values

def ambiguous_dna_list(seq):
    """return list of all possible sequences given an ambiguous DNA input"""
    d = ambiguous_dna_values
    return list(map("".join, product(*map(d.get, seq))))


myseq1 = "ACTYG"
myseq2 = "ACTCG"

seq1_list = ambiguous_dna_list(myseq1)
seq2_list = ambiguous_dna_list(myseq2)

for seq in seq1_list:
    for seq2 in seq2_list:
        if seq == seq2:
            print("match")

In my case, it says there is a match.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by lumal2980

Thank you very much! You made me rethink what I was asking. I was not necessarily looking for sequences that are compatible, but sequences that are confidently 'unique'; so that ambiguities would not count as differences. I managed to write this function that does what I need:

def IUPACdistance_verbose(seq1, seq2):
    ignorelist = ["N","?","-","M","R","W","S","Y","K","V","H","D","B"]
    unamblengthseq1 = len(seq1.translate(str.maketrans('','','N?-MRWSYKVHDB')))
    print("seq1 bp: " + str(unamblengthseq1))
    unamblengthseq2 = len(seq2.translate(str.maketrans('','','N?-MRWSYKVHDB')))
    print("seq2 bp: " + str(unamblengthseq2))
    comparedlength = min(unamblengthseq1, unamblengthseq2)
    print("compared length: " + str(comparedlength))
    difference = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x in ignorelist or y in ignorelist:
            difference += 0
        elif x != y:
            difference += 1
    print("difference:" + str(difference))
    dpairwise = (difference / comparedlength)
    return dpairwise 

seq1 = 'AARCTGACACT'
seq2 = 'TRCCTGN-ACT'

IUPACdistance_verbose(seq1, seq2)

seq1 bp: 10
seq2 bp: 8
compared length: 8
difference:1
Out[14]:
0.125
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by c.doorenweerd0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1685 users visited in the last hour