Question: Dna Pattern Matching In Python for Analysis
1
drenz20 wrote:

Hi Guys, I am new to the forum to come directly with a question / problem.

I want to write a program that counts the patterns like this. I have given such sequences as an example:

• A: AAAAAAAAA
• B: AAAAACAAG
• C: GAAAACGAA
• D: AAAAAAAAT

A patten is a group of four letters of the four groups. As an example, the first Patten would be "AAGA". Now I wonder how many times that happens in my entire Sequenze. So far I have only found programs that are looking for a given pattern. Does anyone have an idea how to program this or do you already know where this was done?

I'm very thankful for your help!

Best regards

rna-seq sequence • 1.0k views
modified 13 months ago by Kevin Blighe49k • written 13 months ago by drenz20

Is this a homework assignment? What have you tried?

No, this is not a homework assignment. I'm getting into bioinformatics and the problem has been with me many times. I have already tried the following approach:

enter link description here

But I did not even get an idea. That's why I'm asking here if anyone has an idea.

Many thanks for your help! I'm watching this. I still wonder if the kmer-counter is the other way around. A pattern should be a group of four letters. Each of these four letters is from the group A, B, C, D. So the first letter from the first group is A, from the second group A, from the third group G and from the fourth group A. Together, these 4 letters are then a patter. Do you understand what I mean?

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted. 6
sacha1.8k wrote:

Something like this ?

>A
AAAAAAAAA
>B
AAAAACAAG
>C
GAAAACGAA
>D
AAAAAAAAT

Python counter

from Bio import SeqIO
import pandas as pd
from collections import Counter

df = pd.DataFrame()
for record in SeqIO.parse("test.fa", "fasta"):
df[record.id] = list(str(record.seq))
df = df.transpose()

It returns a dataframe like this :

0   1   2   3   4   5   6   7   8
A   A   A   A   A   A   A   A   A   A
B   A   A   A   A   A   C   A   A   G
C   G   A   A   A   A   C   G   A   A
D   A   A   A   A   A   A   A   A   T

Then you can count column string occurence with Counter

c = Counter([df[i].str.cat() for i in df])
# Output : Counter({'AAGA': 2, 'AAAA': 5, 'ACCA': 1, 'AGAT': 1})

Note : It only works if you have few sequence. Loading the dataframe with a huge fasta will fail with low memory. You also may want to see Position Weight Matrix

Thank you so much! I think I got it now.

ADD REPLYlink modified 13 months ago • written 13 months ago by drenz20
1

Is the previous code the expected result? So, if you have a large dataset, you can avoid creation of matrix. This is not efficient, but will work :

patterns = []
seq1 = str(next(SeqIO.parse("test.fa", "fasta")).seq)

for i in range(len(seq1)):
pattern = list()
for record in SeqIO.parse("test.fa", "fasta"):
pattern.append(record.seq[i])
patterns.append("".join(pattern))

Counter(patterns)
1

Yes, that absolutely meets my expectations! I did not know the position weight matrix before. But I study biomathematics, so that was not a problem now. The advantage of this is also that I can directly spend the probabilities. Thank you, that was really useful and even worked!

4
Joe14k wrote:

What you are looking for is 'kmer frequency'. There are lots of pre-existing tools to calculate this.

Our very own Alex Reynolds has this github repository for instance:

https://github.com/alexpreynolds/kmer-counter

Kmer frequency will give you the number of occurrences of ALL strings of k letters long. If you're only interested in particular kmer occurences, you can filter the data you get back. You may even be able to specify which kmers you want counted specifically.

The above is not a python solution, but there are plenty out there that are which you should be able to find easily now you have that key terminology.

Here's a tutorial you could follow if you wanted to code something yourself though.

4
Alex Reynolds29k wrote:

You could use a Python regular expression with a lookahead for your kmer of interest:

>>> import re
>>> a = 'AAAAACAAGAAA'
>>> [x.start() for x in re.finditer('(?=AAA)', a)]
[0, 1, 2, 9]
>>> [x.start() for x in re.finditer('(?=AAZ)', a)]
[]

The len() property of the list of "hit" start positions is the number or frequency of the kmer in question.

Consider also what you might need to do in Python, if you need to search the reverse complement of the query sequence.

Another possible approach would be to write a Python generator of all contiguous substrings of a query sequence (and, possibly, its reverse complement). Using a generator would reduce the memory overhead of making a list from all substrings.

One can iterate or walk through the output of a generator, one element at a time. You might compare the generator output with your kmer of interest and increment a counter, if there's a match.

ADD COMMENTlink modified 13 months ago • written 13 months ago by Alex Reynolds29k