Question: Most Frequent Nucleotide/String Pattern In Dna Sequence Files Using Python Or Biopython
2
gravatar for Dhillonv10
7.5 years ago by
Dhillonv10100
Dhillonv10100 wrote:

Hi all,

So I'm working on this problem of finding the most frequent 6-nucleotide long patterns from a given DNA file which is a standard FASTA file looking like:

>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides> 

>name of protein
ACGTACTAGACAGAGAGAGAG .... <more nucleotides>

Now I have these sequences in a file and I have a method for counting the most frequent ones, namely taking each sequence and copy-paste it in a script that uses the Counter function from python library:

>>> from collections import Counter
>>> protein = "ACGTACTAGACAGAGAGAGAG"
>>> Counter(protein[i:i+6] for i in range(len(protein)-5))
Counter({'AGAGAG': 3, 'GAGAGA': 2, 'ACGTAC': 1, 'CGTACT': 1, 'ACAGAG': 1, 'AGACAG': 1, 'TACTAG': 1, 'TAGACA': 1, 'CTAGAC': 1, 'CAGAGA': 1, 'GTACTA': 1, 'ACTAGA': 1, 'GACAGA': 1})

I modified the snipper a little so now it asks me for the sequence and I just copy-past it in. Biopython has a FASTA parser: (Source: the biopython cookbook)

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

But using that and then doing Counter on seq_record.seq just makes things ugly. My question is that can anyone tell me how to modify that little snippet of code that I have so that it can read from a file and do that for each sequence present? TO be more precise, how do I change that biopython snipper so that Counter function gives me the same result as the Counter used above when I hand-input the sequence. Thanks!

fasta python biopython • 5.2k views
ADD COMMENTlink written 7.5 years ago by Dhillonv10100
2
gravatar for Damian Kao
7.5 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Seems like you are almost there. You just have to combined the two pieces of code like this:

from Bio import SeqIO
from collections import Counter

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    seq = str(seq_record.seq)
    cnt = Counter(seq[i:i+6] for i in range(len(seq)-5))

    #you can now do what you want with this counter object. 
    #you can turn it in to a dictionary by dict(cnt)
    #cnt.items() to convert to list of pairs (elements,count)
    #
    #for example:
    #
    #print ">" + seq_record.id
    #for subsequence, count in cnt.items():
    #    print subSequence + '\t' + str(count)

You probably do not have to cast seq_record.seq as a string, but sometimes it can be a problem if you don't. Read more about the python counter object here: http://docs.python.org/dev/library/collections.html#counter-objects

ADD COMMENTlink written 7.5 years ago by Damian Kao15k

thanks @DK, I've tried somerhing like that, but here's the problem using COunter on the seq object gives me something like:

Counter({Seq('CCTGTG', SingleLetterAlphabet()): 1, Seq('TCTGTG', SingleLetterAlphabet()): 1, Seq('TGGTGG', SingleLetterAlphabet()): 1, Seq('TTTCAG', SingleLetterAlphabet()): 1, Seq('TTTTGG', SingleLetterAlphabet()): 1, Seq('TTGCCT', SingleLetterAlphabet()): 1, Seq('AAGGTT', SingleLetterAlphabet()): 1, Seq('TAACAA', SingleLetterAlphabet()): 1, Seq('GGGGCA', SingleLetterAlphabet()): 1, Seq('GACCCC',

ADD REPLYlink written 7.5 years ago by Dhillonv10100

try casting seq object as string like in the script I wrote.

ADD REPLYlink written 7.5 years ago by Damian Kao15k

@DK: you ROCK!!!! thanks a lot, wow i've been trying to work this out for a long time and it finally worked out.

ADD REPLYlink written 7.5 years ago by Dhillonv10100
1
gravatar for brentp
7.5 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

If I understand correctly, this should get you close:

from collections import Counter
from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    seq = seq_record.seq
    c = Counter(seq[i:i + 6] for i in range(len(seq) - 5))
    print c
ADD COMMENTlink written 7.5 years ago by brentp23k

thanks @brentp, I have the same problem with this snippet as the one above it, I get this: Counter({Seq('CCTGTG', SingleLetterAlphabet()): 1, Seq('TCTGTG', SingleLetterAlphabet()): 1, Seq('TGGTGG', SingleLetterAlphabet()): 1, Seq('TTTCAG', SingleLetterAlphabet()): 1, Seq('TTTTGG', SingleLetterAlphabet()): 1, Seq('TTGCCT', SingleLetterAlphabet()): 1, Seq('AAGGTT', SingleLetterAlphabet()): 1, Seq('TAACAA', SingleLetterAlphabet()): 1, Seq('GGGGCA', SingleLetterAlphabet()): 1, Seq('GACCCC',

ADD REPLYlink written 7.5 years ago by Dhillonv10100

add seq = str(seq_record.seq) as DK suggets

ADD REPLYlink written 7.5 years ago by brentp23k

@brentp, it turns out if you cast a string at seq_record.seq, it works out perfectly. So if you could edit your answer here: seq = (str) seq_record.seq

ADD REPLYlink written 7.5 years ago by Dhillonv10100

yup it worked as needed.

ADD REPLYlink written 7.5 years ago by Dhillonv10100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1104 users visited in the last hour