Approximate Dna Pattern Matching In Python
4
6
Entering edit mode
11.9 years ago
Abhi ★ 1.6k

Hey guys

I am trying to find something already written in python which will allow me to do approximate pattern matching. So I have billions of query sequences which I want to match against just one search sequence or pattern on both strands allowing up to n mismatches.

Anything out there that would help.

Thanks! -Abhi

python • 11k views
0
Entering edit mode

Hi Abhi,

Were you able to implement approximate searching?

Thanks, Jitendra

5
Entering edit mode
11.9 years ago

You should try MOODS: it's a suite of algorithms for matching position weight matrices (PWM) against DNA sequences. I use it on a daily basis, it's a very good piece of software written in C++ with interface for python (a simple import MOODS, and up you go!). The must difficult step is to convert your query sequences in PWM. Here is the function I use:

def primer2pwm(primer):
"""
Write a primer sequence as a position weight matrix.
"""
# Create 4 lists of length equal to primer's length.
matrix = [[0] * len(primer) for i in range(4)]
# List of correspondance IUPAC.
IUPAC = {
"A" : ["A"],
"C" : ["C"],
"G" : ["G"],
"T" : ["T"],
"U" : ["U"],
"R" : ["G", "A"],
"Y" : ["T", "C"],
"K" : ["G", "T"],
"M" : ["A", "C"],
"S" : ["G", "C"],
"W" : ["A", "T"],
"B" : ["C", "G", "T"],
"D" : ["A", "G", "T"],
"H" : ["A", "C", "T"],
"V" : ["A", "C", "G"],
"N" : ["A", "C", "G", "T"]
}
# Position of nucleotides in the PWM.
dico = {"A" : 0,  "C" : 1, "G" : 2, "T" : 3}
# Read each IUPAC letter in the primer.
for index, letter in enumerate(primer):
for nuc in IUPAC.get(letter):
i = dico.get(nuc)
matrix[i][index] = 1
return matrix

0
Entering edit mode

Thanks for a quick reply. Just wondering if PWM model will be right in this case. I dont have any emperical observation score for my pattern ?

0
Entering edit mode

No problem. Your PWM model will be filled with zeros and ones (for instance, if your first nucleotide is a G, your PWM will start with A = 0 ; C = 0 ; G = 1 ; T = 0).

0
Entering edit mode

Ok great and one more question. For reverse compliment matches do I need to construct to matches and subsequently calls moods twice for each query ?

0
Entering edit mode

I think so. Anyway, its easy to reverse-complement a sequence with biopython. See http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc23 for examples.

0
Entering edit mode

Is there any limitation on size of query sequence?

2
Entering edit mode
11.9 years ago

Perhaps it would also be helpful to use the "matching core" of cutadapt?

It's used to strip adapters from sequencing data, which sounds like a problem very much (in spirit) to what you are trying to solve.

0
Entering edit mode

Steve : I think I might just be able to use cutadapt as it. You gussed it right. I am trying to get rid of the linker which is somewhat similar to an adaptor but there are some edge cases where this method might not be directly applicable. I will try talking to authors and better understand the core.

2
Entering edit mode
11.9 years ago

There is a fast Levenshtein edit distance module for Python here:

0
Entering edit mode

Albert : thanks for your comment. I think for this particular comparison using Levenshtein distance might not be appropriate as I dont want to allow indels and more so the length of the pattern and query strings are different so edit distance will always be > 1. This is based on my limited understanding. Feel free to correct me if I am wrong.

1
Entering edit mode
11.9 years ago
Andreas ★ 2.5k

Have a look at Pyhton's difflib. You can use it for approximate pattern matching.

An example taken from there:

s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75


The first argument to SequenceMatcher allows you to ignore certain characters, which might be handy for ambiguity characters like N. If you want to search on the reverse complement as well, then you will have to create it (e.g. using Biopython's reverse_complement())

Matching against billions of sequences will be slow though. Two additional functions (quick_ratio() and real_quick_ratio()) might help a bit, but if calling an external program is an option for you, then have a look at e.g. Mummer or similar programs.

Andreas