Question: Dna Pattern Matching In Python for Analysis
gravatar for drenz
13 months ago by
drenz20 wrote:

Hi Guys, I am new to the forum to come directly with a question / problem.

I want to write a program that counts the patterns like this. I have given such sequences as an example:


A patten is a group of four letters of the four groups. As an example, the first Patten would be "AAGA". Now I wonder how many times that happens in my entire Sequenze. So far I have only found programs that are looking for a given pattern. Does anyone have an idea how to program this or do you already know where this was done?

I'm very thankful for your help!

Best regards

rna-seq sequence • 1.0k views
ADD COMMENTlink modified 13 months ago by Kevin Blighe49k • written 13 months ago by drenz20

Is this a homework assignment? What have you tried?

ADD REPLYlink written 13 months ago by Benn7.8k

No, this is not a homework assignment. I'm getting into bioinformatics and the problem has been with me many times. I have already tried the following approach:

enter link description here

But I did not even get an idea. That's why I'm asking here if anyone has an idea.

ADD REPLYlink written 13 months ago by drenz20

Many thanks for your help! I'm watching this. I still wonder if the kmer-counter is the other way around. A pattern should be a group of four letters. Each of these four letters is from the group A, B, C, D. So the first letter from the first group is A, from the second group A, from the third group G and from the fourth group A. Together, these 4 letters are then a patter. Do you understand what I mean?

ADD REPLYlink written 13 months ago by drenz20

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLYlink written 13 months ago by WouterDeCoster41k

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLYlink written 13 months ago by WouterDeCoster41k
gravatar for sacha
13 months ago by
sacha1.8k wrote:

Something like this ?

Input fasta file : test.fa


Python counter

from Bio import SeqIO
import pandas as pd
from collections import Counter

df = pd.DataFrame()
for record in SeqIO.parse("test.fa", "fasta"):
    df[] = list(str(record.seq))
df = df.transpose()

It returns a dataframe like this :

    0   1   2   3   4   5   6   7   8
A   A   A   A   A   A   A   A   A   A
B   A   A   A   A   A   C   A   A   G
C   G   A   A   A   A   C   G   A   A
D   A   A   A   A   A   A   A   A   T

Then you can count column string occurence with Counter

c = Counter([df[i] for i in df])
# Output : Counter({'AAGA': 2, 'AAAA': 5, 'ACCA': 1, 'AGAT': 1})

Note : It only works if you have few sequence. Loading the dataframe with a huge fasta will fail with low memory. You also may want to see Position Weight Matrix

ADD COMMENTlink modified 13 months ago • written 13 months ago by sacha1.8k

Thank you so much! I think I got it now.

ADD REPLYlink modified 13 months ago • written 13 months ago by drenz20

Is the previous code the expected result? So, if you have a large dataset, you can avoid creation of matrix. This is not efficient, but will work :

patterns = []
seq1 = str(next(SeqIO.parse("test.fa", "fasta")).seq)

for i in range(len(seq1)):
    pattern = list()
    for record in SeqIO.parse("test.fa", "fasta"): 

ADD REPLYlink written 13 months ago by sacha1.8k

Yes, that absolutely meets my expectations! I did not know the position weight matrix before. But I study biomathematics, so that was not a problem now. The advantage of this is also that I can directly spend the probabilities. Thank you, that was really useful and even worked!

ADD REPLYlink written 13 months ago by drenz20
gravatar for Joe
13 months ago by
United Kingdom
Joe14k wrote:

What you are looking for is 'kmer frequency'. There are lots of pre-existing tools to calculate this.

Our very own Alex Reynolds has this github repository for instance:

Kmer frequency will give you the number of occurrences of ALL strings of k letters long. If you're only interested in particular kmer occurences, you can filter the data you get back. You may even be able to specify which kmers you want counted specifically.

The above is not a python solution, but there are plenty out there that are which you should be able to find easily now you have that key terminology.

Here's a tutorial you could follow if you wanted to code something yourself though.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Joe14k
gravatar for Alex Reynolds
13 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You could use a Python regular expression with a lookahead for your kmer of interest:

>>> import re
>>> [x.start() for x in re.finditer('(?=AAA)', a)]
[0, 1, 2, 9]
>>> [x.start() for x in re.finditer('(?=AAZ)', a)]

The len() property of the list of "hit" start positions is the number or frequency of the kmer in question.

Consider also what you might need to do in Python, if you need to search the reverse complement of the query sequence.

Another possible approach would be to write a Python generator of all contiguous substrings of a query sequence (and, possibly, its reverse complement). Using a generator would reduce the memory overhead of making a list from all substrings.

One can iterate or walk through the output of a generator, one element at a time. You might compare the generator output with your kmer of interest and increment a counter, if there's a match.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Alex Reynolds29k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1574 users visited in the last hour