Python code: too slow to compare two dicts
Entering edit mode
4.0 years ago
Caroline • 0

Hi everyone! I'm a freshman in Python. I was trying to extract the alternative splicing sites that overlapped with SNP sites. I wrote a code to do this, and it works, but too slow and costs much memory. I post it here and wonder anyone could give me some advices to improve it. Thanks.

import re

path = "D:/work_dir/"
fin1 = open(path + "AS_event_list.txt",'r')  
fin2 = open(path + "SNP_site.txt",'r') 
fot =  open(path + "SNP_AS.txt",'w')

with fin1 as file:
    for line in file: 
        _line = line.strip().split('_')
        chr1 = '_'.join(_line[1:3])
        _line1 = _line[0] + '\t' + chr1 + '\t'+ '\t'.join(_line[3:10])
        dic1[_line1] = chr1

with fin2 as file2:
    for line2 in file2:
   #e.g., scaffold_1    1876    .   G   C   .   .   EFF=intergenic_region (seperated by tab)
        _line2 = line2.strip().split('\t')
        chr2 = _line2[0]
        pos = _line2
        dic2[line2] = chr2

for key in dic2:
    newpos = key.split('\t')
    _chr0 = newpos[0]
    pos2 = newpos[1]

    for k in dic1:
        newpos3 = k.split('\t')
        chr0 = newpos3[1]
        poss = '\t'.join(newpos3[3:9])
        pos0 = poss.split('\t')

        for _pos in pos0:
            if(str(pos2) == str(_pos) and chr0 == _chr0):
                fot.write( key + '\t' + k)  

RNA-Seq python • 757 views
Entering edit mode

It is unfortunate to only show a block of code. Please illustrate the problem by showing input and expected outputs. The answer is probably anyway "use bedtools intersect".

Entering edit mode

^This exactly. Don't re-invent the wheel. Writing code to solve problems that have already been solved efficiently is not worth the effort. No amount of optimization a beginner could do will yield a better result than an expert's program written in an optimized language.

Entering edit mode

Ok, while I agree that you should probably use bedtools intersect, I am always happy to see that people like to learn how things work. So, here's what comes to mind in your script:

  1. You're storing everything from file2 in a dictionary, which you only iterate over once. You could easily merge blocks 2 and 3, reading each line from file2 (your SNPs) and then do your comparisons against the alternative splicing data in file1. Looking at your code fragment, there is no need to actually store the SNP data. (-> this will already save a huge amount of memory, depending how many SNPs you got)

  2. In your comparison of AS vs SNPs (block 3), you're redoing the extraction of chr0 and pos0 over and over again, resulting in extra runtime. It would make more sense to preprocess that and store the data differently (e.g. in a list of tuples (chr, pos, full_line)).

  3. In the last for loop you're comparing each position before checking the seq (scaffold/contig/chromosome). It would make more sense to compare the seq first, and then only iterate over the positions if the seqs match. In addition, you're using str() casts, which should not be necessary, since you're never casting to int() before. (also adds up in terms of runtime)

  4. If I read this correctly, you're storing all data in dictionaries (with using the whole data line as key, which is strange), but never actually use the values. Dictionaries in Python are (or used to be?) huge memory hogs, so for this case you might be better off using lists (s. 1.)


Login before adding your answer.

Traffic: 3202 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6