Biopython pairwise sequence comparison using IUPAC?
0
1
Entering edit mode
3.9 years ago

Hi,

I am trying to count the number of haplotypes in a group of sequences using biopython. My sequences have IUPAC ambiguities, which should not count as differences. E.g.:

ACTYG == ACTCG should be: True

I've tried to compare using == but biopython seems to read the sequences as strings for this comparison and does not incorporate the IUPAC codes?

I've also tried to use

hapcount = len(recordlist)
for a, b in itertools.combinations(recordlist, 2):
    if calculator._pairwise(a, b) == 0:
        print("match")
        hapcount -= 1

but the pairwise comparison seems to count IUPAC codes as differences too??

Any ideas on how to do this would be much appreciated. It seems like something simple.

Biopython • 1.7k views
ADD COMMENT
1
Entering edit mode

The answer to your question is not that trivial, because we can't say the two sequences are "equal", but rather "compatible". For example:

ACT and ACA -> False
ACT and ACN -> True
ACY and ACN -> True
ACY and ACR -> False
ACY and ACM -> Maybe

I don't know the format of your sequences, but maybe you can use this:

from itertools import product
from Bio.Data.IUPACData import ambiguous_dna_values

def ambiguous_dna_list(seq):
    """return list of all possible sequences given an ambiguous DNA input"""
    d = ambiguous_dna_values
    return list(map("".join, product(*map(d.get, seq))))


myseq1 = "ACTYG"
myseq2 = "ACTCG"

seq1_list = ambiguous_dna_list(myseq1)
seq2_list = ambiguous_dna_list(myseq2)

for seq in seq1_list:
    for seq2 in seq2_list:
        if seq == seq2:
            print("match")

In my case, it says there is a match.

ADD REPLY
0
Entering edit mode

Thank you very much! You made me rethink what I was asking. I was not necessarily looking for sequences that are compatible, but sequences that are confidently 'unique'; so that ambiguities would not count as differences. I managed to write this function that does what I need:

def IUPACdistance_verbose(seq1, seq2):
    ignorelist = ["N","?","-","M","R","W","S","Y","K","V","H","D","B"]
    unamblengthseq1 = len(seq1.translate(str.maketrans('','','N?-MRWSYKVHDB')))
    print("seq1 bp: " + str(unamblengthseq1))
    unamblengthseq2 = len(seq2.translate(str.maketrans('','','N?-MRWSYKVHDB')))
    print("seq2 bp: " + str(unamblengthseq2))
    comparedlength = min(unamblengthseq1, unamblengthseq2)
    print("compared length: " + str(comparedlength))
    difference = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x in ignorelist or y in ignorelist:
            difference += 0
        elif x != y:
            difference += 1
    print("difference:" + str(difference))
    dpairwise = (difference / comparedlength)
    return dpairwise 

seq1 = 'AARCTGACACT'
seq2 = 'TRCCTGN-ACT'

IUPACdistance_verbose(seq1, seq2)

seq1 bp: 10
seq2 bp: 8
compared length: 8
difference:1
Out[14]:
0.125
ADD REPLY

Login before adding your answer.

Traffic: 2438 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6