Currently, I have hundreds of bacterial genomes (FASTA files), and I want to parse these files and load their k-mers into a dict. For example,
A.fasta
>1
ATAATA
>2
TTTAAA
.....
B.fasta
>1
ATAAGA
>2
TTTAGA
......
Then, the dict I want to get looks like this (suppose k=4):
d={'ATAA':{'A':'','B':'',...},'TAAA':{'A':'',...},...}  # A refers to "A.fasta", B refers to "B.fasta"
However, I find my own code (see below) is not efficient enough. Is there a more efficient way to achieve this goal?
My Python script:
import os
from Bio import SeqIO
from collections import defaultdict
import sys
sys.path.append('..')
from library import seqpy

def build_kmer_dict(idir, k):
    print('Load k-mer to dict...')
    dlabel = defaultdict(lambda: {})
    c = 1
    label_match = {}
    for filename in os.listdir(idir):
        ff = os.path.join(idir, filename)
        seq_dict = {rec.id: rec.seq for rec in SeqIO.parse(ff, "fasta")}
        for cl in seq_dict:
            seq = str(seq_dict[cl])
            # record every k-mer and its reverse complement under genome index c
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                rev_kmer = seqpy.revcomp(kmer)
                dlabel[kmer][c] = ''
                dlabel[rev_kmer][c] = ''
        label_match[c] = filename
        c += 1
    return dlabel, label_match

# All genome FASTA files are in the folder "../Fasta_File_Dir"
d, lm = build_kmer_dict('../Fasta_File_Dir', 31)
What are you trying to achieve? Having meaningless values like '' in a dictionary is not the most efficient data structure; you might want to keep the k-mers in a set.
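For instance, a minimal sketch of that idea: map each k-mer to a set of genome labels instead of a dict with empty-string values. (Here `revcomp` is a plain-Python stand-in for `seqpy.revcomp`, which is not shown in the thread, so the snippet is self-contained.)

```python
from collections import defaultdict

# Plain-Python reverse complement, standing in for seqpy.revcomp
_COMP = str.maketrans('ACGT', 'TGCA')

def revcomp(s):
    return s.translate(_COMP)[::-1]

def add_kmers(dlabel, seq, label, k):
    """Record every k-mer of seq (and its reverse complement) under label."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        dlabel[kmer].add(label)
        dlabel[revcomp(kmer)].add(label)

d = defaultdict(set)
add_kmers(d, 'ATAATA', 'A', 4)
add_kmers(d, 'ATAAGA', 'B', 4)
# d['ATAA'] is now {'A', 'B'}
```

Membership queries still work the same way ('ATAA' in d, 'A' in d['ATAA']), but sets avoid storing a value per entry and make the per-genome membership test explicit.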
Hi, Asaf,
I want to load k-mers from different genomes into a 2D dict whose first key is the k-mer and whose second key is the prefix of the genome file name. Since I need to query this structure, I think a dict is my best choice. What I want is the most efficient Python code to achieve this goal.
Still, I'm not sure what your final goal is, and I think you chose the wrong data structure to represent your data. Anyway, a much better solution would be to use EMBOSS compseq: make sure you specify -reverse and -nozerocount, and set -word to the desired k-mer size. You will get a list of the k-mers that appear in each input sequence (and their frequencies, but you said you don't need this data).
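If you go the compseq route, you would still need to fold its per-genome word lists back into the single mapping described above. A minimal sketch, assuming you have already reduced each genome's compseq output to a plain text file with one k-mer per line (the exact compseq output format, with its header and count columns, is not shown in this thread, so that parsing step is omitted):

```python
import os
from collections import defaultdict

def load_kmer_lists(idir):
    """Merge per-genome k-mer list files (one k-mer per line) into a
    dict mapping k-mer -> set of genome labels (file name sans extension)."""
    dlabel = defaultdict(set)
    for filename in os.listdir(idir):
        label = os.path.splitext(filename)[0]
        with open(os.path.join(idir, filename)) as fh:
            for line in fh:
                kmer = line.strip()
                if kmer:
                    dlabel[kmer].add(label)
    return dlabel
```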