Question: Python code: too slow to compare two dicts
0
gravatar for Caroline
5 weeks ago by
Caroline0
Caroline0 wrote:

Hi everyone! I'm a freshman in Python. I was trying to extract the alternative splicing sites that overlapped with SNP sites. I wrote a code to do this, and it works, but too slow and costs much memory. I post it here and wonder anyone could give me some advices to improve it. Thanks.

import re

path = "D:/work_dir/"
fin1 = open(path + "AS_event_list.txt",'r')  
fin2 = open(path + "SNP_site.txt",'r') 
fot =  open(path + "SNP_AS.txt",'w')

dic1={}
with fin1 as file:
    for line in file: 
    #e.g.,G19254_scaffold_5_-_4353064_4354635_4353064_4354330_4354479_4354635
        _line = line.strip().split('_')
        chr1 = '_'.join(_line[1:3])
        _line1 = _line[0] + '\t' + chr1 + '\t'+ '\t'.join(_line[3:10])
        dic1[_line1] = chr1

dic2={}
with fin2 as file2:
    for line2 in file2:
   #e.g., scaffold_1    1876    .   G   C   .   .   EFF=intergenic_region (seperated by tab)
        _line2 = line2.strip().split('\t')
        chr2 = _line2[0]
        pos = _line2
        dic2[line2] = chr2

for key in dic2:
    newpos = key.split('\t')
    _chr0 = newpos[0]
    pos2 = newpos[1]

    for k in dic1:
        newpos3 = k.split('\t')
        chr0 = newpos3[1]
        poss = '\t'.join(newpos3[3:9])
        pos0 = poss.split('\t')

        for _pos in pos0:
            if(str(pos2) == str(_pos) and chr0 == _chr0):
                fot.write( key + '\t' + k)  
            else:
                continue

fin1.close()
fin2.close()
fot.close()
print("Done.")
rna-seq python • 123 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Caroline0
2

It is unfortunate to only show a block of code. Please illustrate the problem by showing input and expected outputs. The answer is probably anyway "use bedtools intersect".

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by ATpoint36k

^This exactly. Don't re-invent the wheel. Writing code to solve problems that have already been solved efficiently is not worth the effort. No amount of optimization a beginner could do will yield a better result than an expert's program written in an optimized language.

ADD REPLYlink written 5 weeks ago by RamRS28k
1

Ok, while I agree that you should probably use bedtools intersect, I am always happy to see that people like to learn how things work. So, here's what comes to mind in your script:

  1. You're storing everything from file2 in a dictionary, which you only iterate over once. You could easily merge blocks 2 and 3, reading each line from file2 (your SNPs) and then do your comparisons against the alternative splicing data in file1. Looking at your code fragment, there is no need to actually store the SNP data. (-> this will already save a huge amount of memory, depending how many SNPs you got)

  2. In your comparison of AS vs SNPs (block 3), you're redoing the extraction of chr0 and pos0 over and over again, resulting in extra runtime. It would make more sense to preprocess that and store the data differently (e.g. in a list of tuples (chr, pos, full_line)).

  3. In the last for loop you're comparing each position before checking the seq (scaffold/contig/chromosome). It would make more sense to compare the seq first, and then only iterate over the positions if the seqs match. In addition, you're using str() casts, which should not be necessary, since you're never casting to int() before. (also adds up in terms of runtime)

  4. If I read this correctly, you're storing all data in dictionaries (with using the whole data line as key, which is strange), but never actually use the values. Dictionaries in Python are (or used to be?) huge memory hogs, so for this case you might be better off using lists (s. 1.)

ADD REPLYlink written 5 weeks ago by cschu1812.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1563 users visited in the last hour