Question: Improving efficiency of a script for genotype scoring that enumerates all allele combinations
Volka wrote, 11 months ago:

I have written a script in order to compute a score, as well as a frequency for a list of genetic profiles.

A genetic profile here consists of a combination of SNPs. Each SNP has two alleles. The input file is something like below:

  (SNP1 SNP2 SNP3) # example header not in file
    AA   CC   TT
    AT   CC   TT
    TT   CC   TT
    AA   CG   TT
    AT   CG   TT
    TT   CG   TT
    AA   GG   TT
    AT   GG   TT
    TT   GG   TT
    AA   CC   TA
    AT   CC   TA
    TT   CC   TA
    AA   CG   TA
    AT   CG   TA
    TT   CG   TA
    AA   GG   TA
    AT   GG   TA
    TT   GG   TA
    AA   CC   AA
    AT   CC   AA
    TT   CC   AA
    AA   CG   AA
    AT   CG   AA
    TT   CG   AA
    AA   GG   AA
    AT   GG   AA
    TT   GG   AA

I then have the following code, which can take in the input file above and a table containing weights and frequencies, such as below:

(SNP        RiskAll  RefAll         OR            log(OR)    RiskAllFreq) # example header not in file
SNP1             A       T       1.25    0.223143551314     0.97273 
SNP2             C       G       1.07    0.0676586484738    0.3     
SNP3             T       A       1.08    0.0769610411361    0.1136

It then computes a score, based on the sum of the log odds ratios for each risk allele carried at each SNP in a genetic profile, as well as a frequency, obtained by multiplying the genotype frequencies, assuming Hardy-Weinberg equilibrium.
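To make the scoring rule concrete, here is a small worked example for one profile (a sketch with the three SNPs from the table above hard-coded): under Hardy-Weinberg equilibrium, a SNP with risk-allele frequency p has genotype frequencies (1-p)^2, 2p(1-p) and p^2 for 0, 1 and 2 risk alleles, and each copy of the risk allele adds log(OR) to the score.

```python
# Worked example: score and frequency for the profile "AT CC TT",
# using the values from the example table in this question.

# per SNP: (risk allele, log(OR), risk allele frequency)
table = {"SNP1": ("A", 0.223143551314, 0.97273),
         "SNP2": ("C", 0.0676586484738, 0.3),
         "SNP3": ("T", 0.0769610411361, 0.1136)}

profile = {"SNP1": "AT", "SNP2": "CC", "SNP3": "TT"}

score, freq = 0.0, 1.0
for s, genotype in profile.items():
    risk, logor, p = table[s]
    n = genotype.count(risk)       # 0, 1 or 2 copies of the risk allele
    score += n * logor             # sum of log odds ratios
    # Hardy-Weinberg genotype frequency: (1-p)^2, 2p(1-p) or p^2
    freq *= [(1 - p) ** 2, 2 * p * (1 - p), p * p][n]

print(score, freq)
```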

import sys

table = sys.argv[1]             # SNP table: ID, risk allele, ref allele, OR, log(OR), risk allele freq
geneticprofiles = sys.argv[2]   # one genetic profile (genotype combination) per line
threshold = float(sys.argv[3])  # frequency threshold to stop scoring a profile

snp = []
riskall = {}
weights = {}
popfreqs = {}   # frequency of the risk (effect) allele

# read in the table of SNP IDs, risk alleles, log(OR) and allele frequencies
with open(table, 'r') as f:
    for line in f:
        fields = line.split()
        snp.append(fields[0])
        riskall[fields[0]] = fields[1]
        weights[fields[0]] = float(fields[4])    # log(OR)
        popfreqs[fields[0]] = float(fields[5])   # risk allele frequency

# compute a score and a frequency for each genetic profile
with open(geneticprofiles, 'r') as f:
    for line in f:
        genotypes = line.split()
        score = 0.0
        freq = 1.0
        for j, genotype in enumerate(genotypes):
            risk = riskall[snp[j]]
            p = popfreqs[snp[j]]
            allele1, allele2 = genotype[0], genotype[1]
            if allele1 != risk and allele2 != risk:    # homozygous for reference allele
                freq *= (1 - p) * (1 - p)
            elif allele1 == risk and allele2 == risk:  # homozygous for risk allele
                score += 2 * weights[snp[j]]
                freq *= p * p
            else:                                      # heterozygous, either allele order
                score += weights[snp[j]]
                freq *= 2 * (1 - p) * p

            if freq < threshold:   # frequency threshold to break the loop
                break

        print(','.join(genotypes) + "\t" + str(score) + "\t" + str(freq))

So far, I have set a variable where I can specify a threshold to break the loop when the frequency gets extremely low, for instance at approximately 1e-10. I am looking to ramp this up to include at least 20 SNPs. What improvements can be made to speed up the script? Right now I am using a threshold of 1e-4, but six days in, the script is still running.

written 11 months ago by Volka, modified 10 months ago by Michael Dondrup

6 days sounds weird for a script with one significant loop. Please check the following:

  • How large are your files in reality, how many lines?
  • Did you test that your script really does what it is supposed to do? (I suspect it doesn't.)
  • If not, make a small input file of 10 lines with head -n10 and test the output and run times.

Some hints as to what might be broken:

  • you don't seem to skip headers in the files, treating the first line like any other
  • you don't skip row numbers in your genotype file, treating '01' etc. as the first allele!
written 11 months ago by Michael Dondrup

Hi, thanks for the reply! I was able to test the script out with a small input file, up to 15 SNPs. The input file containing the combinations is about 630MB with 14,348,907 lines (3^15), while the output is about 1GB. This takes about 10 to 15 minutes. The output looks to be right; I have manually calculated some of the frequencies and scores.

When ramping up to 22 SNPs, however, the input file with the SNP combinations grows to about 2TB. The increase in lines is dramatic here, to 3,486,784,401 (3^20).

EDIT: Also, sorry about that! I added the headers and row numbers to be more informative; they're not actually in the input files.

written 11 months ago by Volka

Please first do the test on a small file with 10 lines of genotypes (genetic profiles; the number of SNPs doesn't matter here), and check each line for correctness. Your script doesn't match your example formats!

written 11 months ago by Michael Dondrup

When ramping up to 22 SNPs however, the input file with the SNP combinations goes up to about 2TB. The increase in lines is dramatic here, to 3,486,784,401 (3^20).

Ah, I see! Please provide more background: are these genotypes 'real', or do you just generate them to calculate 'everything', as in all possible combinations? My first impulse would be to say that if you generate a file so big you cannot handle it, you might as well not do it at all. Then I calculated the running time: if ~0.5GB of input takes 10 minutes, then 1TB takes roughly 2048 x 10 minutes, and 2048*10/60/24 is about 14 days! So your observed run time until now is less than half of what it takes to finish.

written 11 months ago by Michael Dondrup

That's right, we are indeed trying to get every single possible combination, whether real or not. That's where I'm hoping the frequency threshold will come into play: we would not be interested in combinations at very low frequencies, and skipping the loop for those would (hopefully) save some time.

Your calculation of the run times does make sense, though. Is there any way I could improve my script to speed up the process?

Just in case, I have also uploaded the input and output when running with 10 SNPs.

Input table: https://pastebin.com/dZ5MH3g0

Combinations: https://pastebin.com/zVnYFpWa (first 10 lines)

Output: https://pastebin.com/YgFWEXxM (first 10 lines)

written 11 months ago by Volka

I guess that for such big files as you are describing here, Python isn't the right choice. Python is known to be very slow at loops. You should think about using a compiled language like Java or C++.

Also, thinking about ways to parallelize the calculation will be necessary.

fin swimmer

written 11 months ago by finswimmer

I guess a Perl script would work just as well; in this case it's mainly IO that is the limiting factor. There is no principal problem with parsing a 1TB file in Perl line by line.

written 11 months ago by Michael Dondrup

I still suspect your problem is ill-posed. Why do you need to calculate all the combinations of all 22 SNPs? Consider the following combinatorial problem: each SNP has two alleles (0,1), which gives 3 genotypes (0,1),(0,0),(1,1), ignoring (1,0) as indistinguishable from (0,1) unless genotypes are phased. 3^22 = 31,381,059,609 (31 billion combinations, or ~4 times the earth's population)

  • most of these combinations can never be observed in any population because of sample size
  • most of these combinations will never occur in anyone because the alleles are in linkage with each other.

Now you seem to be trying to brute-force your way through the calculation. In informatics, brute force on a combinatorial problem is always the last resort.

written 11 months ago by Michael Dondrup

One option (if you stick with Python) is to use the heavily optimised itertools module.

It has many functions for dealing with combinations and permutations of data. You may find some incremental gains purely from switching to that, which could be significant for your real data.
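As a sketch of that idea (the values are hard-coded from the question's example table, and names like snps and genos are just for illustration), the combinations could be generated and scored in one pass with itertools.product, never writing the giant intermediate file to disk:

```python
import itertools

# per SNP: (log(OR), risk-allele frequency), from the question's example table
snps = [(0.223143551314, 0.97273),
        (0.0676586484738, 0.3),
        (0.0769610411361, 0.1136)]

# per SNP: the three genotypes as (risk-allele count, HWE genotype frequency)
genos = [[(0, (1 - p) ** 2), (1, 2 * p * (1 - p)), (2, p * p)]
         for _, p in snps]

threshold = 1e-10
results = []
for combo in itertools.product(*genos):   # one tuple per genetic profile
    score = sum(n * logor for (n, _), (logor, _) in zip(combo, snps))
    freq = 1.0
    for _, gf in combo:
        freq *= gf
    if freq >= threshold:                 # keep only sufficiently common profiles
        results.append((score, freq))

print(len(results))
```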

Python is also faster if you wrap the code with the construct:

def main():
    # Do stuff

if __name__ == "__main__":
    main()
written 11 months ago by Joe

Thanks for the reply! I actually use itertools to compute all the combinations before the actual calculation. I shall have a look and see if I can incorporate the calculation into that part.

By wrapping the code, do you mean I could simply put my code into a function and use "if __name__ == '__main__':" to speed up the code?

written 11 months ago by Volka

Yep, exactly. I can’t find the link where I first discovered this was a thing for Python. If I remember correctly, it has something to do with the efficiency of allocating namespaces; it's faster inside a function, or something like that. Admittedly these gains will mostly come during setup and teardown rather than while the loop is running, but it might help the bytecode in some way I can’t predict.

So yes, put all of your code* inside a function called main(), and then add that if statement to the end of the file (this is also how you build code that can be put together in modules). Here’s an example of how you might structure some code using this idiom: https://github.com/jrjhealey/bioinfo-tools/blob/master/Mutate.py

*by ‘all of your code’ I mean the main logic chain. It’s ok to have separate functions defined outside of the main function, so long as they are called from within it.

Some other general pointers (though I don’t know if they can be usefully applied to this particular code):

  • list comprehensions are faster than naked loops
  • functions like map and reduce, and lambda functions, are fast.
  • Python code inside functions runs faster than its equivalent 'naked' code due to how it's bytecode compiled.

Edit: found the original link I was looking for:

http://mycode.doesnot.run/2018/04/11/pivot/

written 11 months ago by Joe
Michael Dondrup (Bergen, Norway) wrote, 10 months ago:

This hasn't been answered previously, but I think it deserves a proper answer.

The proper way to solve this combinatorial problem is not a more efficient programming language or minor code changes, but to revise the problem formulation and then to seek a solution that does not rely on brute force.

The current implementation is a brute-force approach with minor pruning (jumping out of the score calculation, which on average has little or no effect on complexity) that tries to enumerate the complete search space of 3^22 = 31,381,059,609 possible genotype combinations. This solution is inadequate, primarily because it is not relevant in 'real genetics': most likely, the majority of these genotypes are never observed or present in any population. Therefore, I do not see that the need for the majority of these calculations is well established; if there is such a requirement anyway, you need to provide better arguments for it.

To avoid unnecessary computation and thereby improve the efficiency of the script, one would either:

  • restrict the scoring to alleles that really occur, e.g. score all patients with their observed genotypes
  • if the problem is somehow related to optimization of a risk profile, use proper optimization techniques, e.g. dynamic programming to avoid full enumeration of the search space
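As one hypothetical sketch of such pruning (not a full dynamic program; the helper names are made up, and the per-SNP values are the ones from the question's example table): since a profile's frequency is a product of per-SNP factors that are all at most 1, a prefix whose frequency is already below the threshold can never recover, so the entire 3^(remaining SNPs) subtree below it can be skipped during generation rather than filtered afterwards:

```python
def enumerate_profiles(per_snp, threshold, score=0.0, freq=1.0, depth=0):
    """per_snp[d] lists (score contribution, genotype frequency) per genotype."""
    if freq < threshold:
        # every remaining factor is <= 1, so no extension of this prefix
        # can get back above the threshold: prune the whole subtree
        return
    if depth == len(per_snp):
        yield score, freq
        return
    for ds, gf in per_snp[depth]:
        yield from enumerate_profiles(per_snp, threshold,
                                      score + ds, freq * gf, depth + 1)

def hwe(logor, p):
    # genotypes with 0, 1 and 2 risk alleles under Hardy-Weinberg equilibrium
    return [(0.0, (1 - p) ** 2), (logor, 2 * p * (1 - p)), (2 * logor, p * p)]

# the three SNPs from the question's example table
per_snp = [hwe(0.223143551314, 0.97273),
           hwe(0.0676586484738, 0.3),
           hwe(0.0769610411361, 0.1136)]

for score, freq in enumerate_profiles(per_snp, 1e-4):
    print(score, freq)
```

The pruning is exact: any completion of a below-threshold prefix is also below the threshold, so no above-threshold profile is ever lost.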

While the number of iterations is excessive, it is not intractable. As a proof, the current implementation relies on an R script that generates all possible combinations (see Method to get all possible combinations of genotypes for a group of SNPs ). If the R script is capable of enumerating all possible combinations and writing them out as strings, the little arithmetic required for scoring the combinations would not add to the complexity of that R script. Therefore, ignoring that it is a brute-force approach, an improved way of dealing with the scoring is to calculate the scores in R already, which would be a single additional vector arithmetic operation per genotype.


Other minor improvements would include proper encoding of the alleles: if there are always only two alleles per SNP, they can be encoded as binary vectors, which will reduce memory requirements and lookup time.
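One way this precomputation could look (a dict-based sketch of the idea rather than literal binary vectors; the function name is made up, and the values come from the question's example table): build, per SNP, a lookup from the two-character genotype string to its (score contribution, HWE frequency factor), so the inner loop is one dictionary lookup per SNP instead of repeated allele comparisons and float() conversions:

```python
def make_lookup(risk, ref, logor, p):
    # map every genotype spelling (both allele orders) to its precomputed
    # score contribution and Hardy-Weinberg genotype frequency
    het = (logor, 2 * p * (1 - p))
    return {ref + ref:   (0.0, (1 - p) ** 2),
            ref + risk:  het,
            risk + ref:  het,
            risk + risk: (2 * logor, p * p)}

# the three SNPs from the question's example table
lookups = [make_lookup("A", "T", 0.223143551314, 0.97273),
           make_lookup("C", "G", 0.0676586484738, 0.3),
           make_lookup("T", "A", 0.0769610411361, 0.1136)]

score, freq = 0.0, 1.0
for lookup, genotype in zip(lookups, "AT CC TT".split()):
    ds, gf = lookup[genotype]
    score += ds
    freq *= gf

print(score, freq)
```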

written 10 months ago by Michael Dondrup