Question: Dna Pattern Matching In Python for Analysis
gravatar for drenz
6 months ago by
drenz20 wrote:

Hi Guys, I am new to the forum to come directly with a question / problem.

I want to write a program that counts the patterns like this. I have given such sequences as an example:


A patten is a group of four letters of the four groups. As an example, the first Patten would be "AAGA". Now I wonder how many times that happens in my entire Sequenze. So far I have only found programs that are looking for a given pattern. Does anyone have an idea how to program this or do you already know where this was done?

I'm very thankful for your help!

Best regards

rna-seq sequence • 522 views
ADD COMMENTlink modified 6 months ago by Kevin Blighe39k • written 6 months ago by drenz20

Is this a homework assignment? What have you tried?

ADD REPLYlink written 6 months ago by b.nota6.3k

No, this is not a homework assignment. I'm getting into bioinformatics and the problem has been with me many times. I have already tried the following approach:

enter link description here

But I did not even get an idea. That's why I'm asking here if anyone has an idea.

ADD REPLYlink written 6 months ago by drenz20

Many thanks for your help! I'm watching this. I still wonder if the kmer-counter is the other way around. A pattern should be a group of four letters. Each of these four letters is from the group A, B, C, D. So the first letter from the first group is A, from the second group A, from the third group G and from the fourth group A. Together, these 4 letters are then a patter. Do you understand what I mean?

ADD REPLYlink written 6 months ago by drenz20

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. I have now moved your reaction but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

ADD REPLYlink written 6 months ago by WouterDeCoster37k

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLYlink written 6 months ago by WouterDeCoster37k
gravatar for sacha
6 months ago by
sacha1.7k wrote:

Something like this ?

Input fasta file : test.fa


Python counter

from Bio import SeqIO
import pandas as pd
from collections import Counter

df = pd.DataFrame()
for record in SeqIO.parse("test.fa", "fasta"):
    df[] = list(str(record.seq))
df = df.transpose()

It returns a dataframe like this :

    0   1   2   3   4   5   6   7   8
A   A   A   A   A   A   A   A   A   A
B   A   A   A   A   A   C   A   A   G
C   G   A   A   A   A   C   G   A   A
D   A   A   A   A   A   A   A   A   T

Then you can count column string occurence with Counter

c = Counter([df[i] for i in df])
# Output : Counter({'AAGA': 2, 'AAAA': 5, 'ACCA': 1, 'AGAT': 1})

Note : It only works if you have few sequence. Loading the dataframe with a huge fasta will fail with low memory. You also may want to see Position Weight Matrix

ADD COMMENTlink modified 6 months ago • written 6 months ago by sacha1.7k

Thank you so much! I think I got it now.

ADD REPLYlink modified 6 months ago • written 6 months ago by drenz20

Is the previous code the expected result? So, if you have a large dataset, you can avoid creation of matrix. This is not efficient, but will work :

patterns = []
seq1 = str(next(SeqIO.parse("test.fa", "fasta")).seq)

for i in range(len(seq1)):
    pattern = list()
    for record in SeqIO.parse("test.fa", "fasta"): 

ADD REPLYlink written 6 months ago by sacha1.7k

Yes, that absolutely meets my expectations! I did not know the position weight matrix before. But I study biomathematics, so that was not a problem now. The advantage of this is also that I can directly spend the probabilities. Thank you, that was really useful and even worked!

ADD REPLYlink written 6 months ago by drenz20
gravatar for jrj.healey
6 months ago by
United Kingdom
jrj.healey11k wrote:

What you are looking for is 'kmer frequency'. There are lots of pre-existing tools to calculate this.

Our very own Alex Reynolds has this github repository for instance:

Kmer frequency will give you the number of occurrences of ALL strings of k letters long. If you're only interested in particular kmer occurences, you can filter the data you get back. You may even be able to specify which kmers you want counted specifically.

The above is not a python solution, but there are plenty out there that are which you should be able to find easily now you have that key terminology.

Here's a tutorial you could follow if you wanted to code something yourself though.

ADD COMMENTlink modified 6 months ago • written 6 months ago by jrj.healey11k
gravatar for Alex Reynolds
6 months ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

You could use a Python regular expression with a lookahead for your kmer of interest:

>>> import re
>>> [x.start() for x in re.finditer('(?=AAA)', a)]
[0, 1, 2, 9]
>>> [x.start() for x in re.finditer('(?=AAZ)', a)]

The len() property of the list of "hit" start positions is the number or frequency of the kmer in question.

Consider also what you might need to do in Python, if you need to search the reverse complement of the query sequence.

Another possible approach would be to write a Python generator of all contiguous substrings of a query sequence (and, possibly, its reverse complement). Using a generator would reduce the memory overhead of making a list from all substrings.

One can iterate or walk through the output of a generator, one element at a time. You might compare the generator output with your kmer of interest and increment a counter, if there's a match.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Alex Reynolds27k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2187 users visited in the last hour