4.0 years ago

curious

I have to perform what is essentially a correlation calculation on the dosages in every row of the equivalent of a 300M variant x 30K sample VCF.

One thing I am wondering is whether it would be faster to write a C plugin and work with BCFs, or to use Python, read the data in chunks, and convert each chunk to a NumPy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, but I don't know if C would be any faster. Does anyone have suggestions for how to approach this with performance in mind? I would greatly appreciate any tips. Thank you.

Can you give a representative example of what you actually aim to do?

For each row, I am trying to compute:

`Var(HDS)/(p(1-p)) where p=mean(HDS)`

So:

`Var([0.021, 0, 0.080, 0.006, 0.008, 0.021]) / (0.023 * (1 - 0.023))`
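
For anyone who wants to try the formula above on a chunk of rows, here is a minimal NumPy sketch. It assumes the chunk is already a 2-D array with one variant per row and one dosage value per column, and it assumes population variance (`ddof=0`), since the formula does not say which variance is meant. The name `dosage_ratio` is made up for illustration.

```python
import numpy as np


def dosage_ratio(chunk):
    """Per-row Var(HDS) / (p * (1 - p)), where p is the row mean."""
    chunk = np.asarray(chunk, dtype=np.float64)
    p = chunk.mean(axis=1)      # mean dosage per variant
    var = chunk.var(axis=1)     # population variance (ddof=0), an assumption
    denom = p * (1.0 - p)
    # Rows with p == 0 or p == 1 have a zero denominator; leave them as NaN.
    out = np.full(p.shape, np.nan)
    np.divide(var, denom, out=out, where=denom > 0)
    return out


# The single example row from this post, treated as a one-row chunk.
example = np.array([[0.021, 0.0, 0.080, 0.006, 0.008, 0.021]])
print(dosage_ratio(example))
```

Because everything is vectorized along `axis=1`, the per-chunk arithmetic runs in compiled code, so the cost of this approach is likely to be dominated by reading and decoding the file rather than by the calculation itself.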

If you would like to use Python, have a look at pysam, which is a wrapper around the htslib C API.
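
A rough sketch of what that could look like, not a tuned implementation. It assumes the file has a FORMAT field called `HDS` (pass a different `field` name if yours differs), that missing values can simply be skipped, and that population variance is what is wanted. The function names `hds_ratio` and `iter_hds_ratios` are made up for illustration.

```python
import numpy as np


def hds_ratio(hds_values):
    """Var(HDS) / (p * (1 - p)) for one variant, with p = mean(HDS)."""
    hds = np.asarray(hds_values, dtype=np.float64)
    if hds.size == 0:
        return float("nan")
    p = hds.mean()
    denom = p * (1.0 - p)
    # Rows with p == 0 or p == 1 have no defined ratio.
    return float(hds.var() / denom) if denom > 0 else float("nan")


def iter_hds_ratios(path, field="HDS"):
    """Yield (chrom, pos, ratio) for each record of a VCF/BCF file."""
    import pysam  # imported here so hds_ratio can be used without pysam

    with pysam.VariantFile(path) as vcf:
        for rec in vcf:
            values = []
            for sample in rec.samples.values():
                v = sample[field]
                # pysam returns a scalar for single-valued fields and a
                # tuple otherwise; normalise to a tuple.
                if not isinstance(v, tuple):
                    v = (v,)
                values.extend(x for x in v if x is not None)
            yield rec.chrom, rec.pos, hds_ratio(values)
```

Looping over 30K samples per record in Python will probably be the slow part here, so treat this as a starting point for checking correctness on a small region before deciding whether a C plugin is worth the effort.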

Thanks, I think pysam is going to change my work dramatically. I was parsing VCFs manually before.