Question: Doing an operation on one huge VCF, use Python or would C be faster (somewhat niave bcftools/computer science question)
0
curious320 wrote:

I have to perform what amounts to basically a correlation calculation on dosages from every row of what is equivalent to a 300M varaint X 30K sample VCF.

One thing I am wondering is if this would be faster to write a C plugin and work with BCFs or to Use Python and read in chunks and convert to a numpy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, butI don't know if C would be any faster. Does anyone have any suggestions of how to approach this with performance in mind. I would greatly appreciate any tips. Thank you.

bcftools C python vcf • 131 views
modified 5 weeks ago • written 5 weeks ago by curious320

Can you give a representative example what you actually aim to do?

``````##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
chr5    11893   chr5:11893:T:C  T       C       .       PASS       GT:DS:HDS:GP       0|0:0.021:0.021,0:0.965,0.031,0 0|0:0.080:0.080,0.006:0.910,0.080,0.001       0|0:0.030:0.008,0.021:0.965,0.031,0
``````

I am trying to apply for each row:

`Var(HDS)/(p(1-p)) where p=mean(HDS)`

So:

`Var([0.021,0,0.080,0.006,0.008,0.021]) / .023(1-0.023)`

If you like to use python, have a look at pysam, which is a wrapper around the htslib C-API.

``````from pysam import VariantFile
from itertools import chain
from statistics import variance, mean

def calc_rsq(row):
row_variance = variance(row)
row_mean = mean(row)
rsq = row_variance / (row_mean * (1-row_mean))
return rsq

vcf_in = VariantFile("my_fav_vcf.vcf.gz")  # auto-detect input format

for rec in vcf_in.fetch('chr1', 100000, 101000):
haploid_dosages = [s['HDS'] for s in rec.samples.values()]
haploid_dosages_list = list(chain(*haploid_dosages))
rsq = calc_rsq(haploid_dosages_list)
rec.info["R2"] = rsq
vcf_out.write(rec)
``````

Thanks I think pysam is going to change my work dramatically, I was parsing vcfs manually before