What's the most efficient way to load k-mers into a dict in Python?
3.0 years ago
Ray • 0

Currently, I have hundreds of bacterial genomes (FASTA files), and I want to parse these files and load their k-mers into a dict. For example:

A.fasta

>1 
ATAATA 
>2 
TTTAAA
.....

B.fasta

>1 
ATAAGA 
>2 
TTTAGA 
......

Then the dict I want to get looks like this (suppose k=4):

d = {'ATAA': {'A': '', 'B': '', ...}, 'TAAA': {'A': '', ...}, ...}  # 'A' refers to A.fasta, 'B' refers to B.fasta

However, my own code (see below) is not efficient enough... Is there a more efficient way to achieve this goal?

My Python script:

    import os
    import sys
    from collections import defaultdict
    from Bio import SeqIO

    sys.path.append('..')
    from library import seqpy  # local helper module providing revcomp()

    def build_kmer_dict(idir, k):
        """Map every k-mer (and its reverse complement) to the genomes it occurs in."""
        print('Loading k-mers into dict...')
        dlabel = defaultdict(dict)  # k-mer -> {genome label: ''}
        label_match = {}            # genome label -> filename
        for c, filename in enumerate(os.listdir(idir), start=1):
            ff = os.path.join(idir, filename)
            for rec in SeqIO.parse(ff, "fasta"):
                seq = str(rec.seq)
                for i in range(len(seq) - k + 1):
                    kmer = seq[i:i + k]
                    dlabel[kmer][c] = ''
                    dlabel[seqpy.revcomp(kmer)][c] = ''
            label_match[c] = filename
        return dlabel, label_match

    # All genome FASTA files are in the folder "../Fasta_File_Dir"
    d, lm = build_kmer_dict('../Fasta_File_Dir', 31)
Asaf

What are you trying to achieve? Having meaningless values like '' in a dictionary is not the most efficient data structure; you might want to keep the k-mers in a set instead.
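
For illustration, a minimal sketch of the set-based idea (assuming plain ACGT sequences; the translate-based revcomp here is just a stand-in for whatever helper you use):

    from Bio import SeqIO

    _COMP = str.maketrans('ACGT', 'TGCA')

    def kmer_set(fasta, k):
        # one set per genome: a set deduplicates for free and gives O(1) membership tests
        kmers = set()
        for rec in SeqIO.parse(fasta, "fasta"):
            seq = str(rec.seq).upper()
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                kmers.add(kmer)
                kmers.add(kmer.translate(_COMP)[::-1])  # reverse complement
        return kmers

You can then test membership with 'ATAA' in kmers, without storing dummy values.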

Ray

Hi, Asaf,

I want to load k-mers from different genomes into a 2D dict where the first key is the k-mer and the second key is the genome's file prefix. Since I need to query it later, I think a dict is my best choice. What I'm looking for is the most efficient Python code to achieve this.

Asaf

Still, I'm not sure what your final goal is, and I think you chose the wrong data structure to represent your data. Anyway, a much better solution would be to use EMBOSS compseq: make sure you specify -reverse and -nozerocount, and set -word to the desired k-mer size. You will get a list of the k-mers that appear in each input sequence (and their frequencies, but you said you don't need those).
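
For example, something along these lines (a sketch only; check compseq -help for the exact qualifiers on your EMBOSS version):

    compseq -sequence A.fasta -word 4 -reverse -nozerocount -outfile A.comp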

3.0 years ago
Michael 54k

I am not sure about your goal here. Do you need to simply count the k-mers, or also record their positions? Anyway, improving the speed of k-mer counting by parallelization is a little tricky and not achievable in Python (or Perl). Maybe you can use Jellyfish and then parse its output.
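
For example (a sketch, assuming Jellyfish 2 is on the PATH; file names are placeholders):

    # count canonical 31-mers with 8 threads, then dump them as plain text
    jellyfish count -m 31 -s 100M -t 8 -C -o mers.jf genome.fasta
    jellyfish dump -c mers.jf > mers.txt  # one "KMER COUNT" pair per line

Loading the dump back into Python is then a one-liner:

    with open('mers.txt') as fh:
        kmers = {line.split()[0] for line in fh}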

Ray

Hi, Michael,

My goal is not to count the k-mers. I just want to record the k-mers and their genome labels (i.e., whether they exist in genome A or B), so I tried to build the 2D dict from these FASTA files, as my code does. My final goal is to find a more efficient way to do this... I have tried Jellyfish and then parsed its output into a Python dict; however, the whole process was even slower than my own code (loading the output files into a Python dict takes a long time, I guess)...

Michael

Well, I guess there is a reason why Jellyfish is written in C++; to see why, you can look at the Jellyfish paper. I agree, the standard dictionary in Python is quite slow, and you could achieve higher rates of k-mers/second in, e.g., Perl (up to ~2-fold, if I remember our attempts correctly; we did a little comparison in class).

If you need to stick with Python, you could remove the revcomp step: in my opinion it is unnecessary to do this while building the dict. It can either be done at query time or, better, once after the initial dictionary is ready. Another, minor way to gain speed could be to pre-allocate an oversized hash table (e.g. 10n slots, which uses 10x as much RAM but avoids collisions) or to use a faster hashing function (not sure whether either is even possible with Python's built-in dict). Unfortunately, parallelization is not an option.
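
A minimal sketch of the query-time variant (the translate-based revcomp is illustrative; use whatever helper you have):

    _COMP = str.maketrans('ACGT', 'TGCA')

    def lookup(dlabel, kmer):
        # build the dict with forward k-mers only; check both strands at query time
        hits = dict(dlabel.get(kmer, {}))
        hits.update(dlabel.get(kmer.translate(_COMP)[::-1], {}))
        return hits

This roughly halves both the number of dict insertions and the number of keys stored at build time.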

Ray

Thanks a lot for your prompt reply. I guess the best way is to rebuild this function in C++... I'll try to find some related material on this!

Thank you very much! :)
