Question

Doing an operation on one huge VCF, use Python or would C be faster (somewhat niave bcftools/computer science question)

0

Entering edit mode

4.2 years ago

curious ▴ 780

I have to perform what amounts to basically a correlation calculation on dosages from every row of what is equivalent to a 300M varaint X 30K sample VCF.

One thing I am wondering is if this would be faster to write a C plugin and work with BCFs or to Use Python and read in chunks and convert to a numpy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, butI don't know if C would be any faster. Does anyone have any suggestions of how to approach this with performance in mind. I would greatly appreciate any tips. Thank you.

python bcftools c vcf • 1.7k views

ADD COMMENT • link 4.2 years ago by curious ▴ 780

0

Entering edit mode

Can you give a representative example what you actually aim to do?

ADD REPLY • link 4.2 years ago by ATpoint 84k

0

Entering edit mode

##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
chr5    11893   chr5:11893:T:C  T       C       .       PASS       GT:DS:HDS:GP       0|0:0.021:0.021,0:0.965,0.031,0 0|0:0.080:0.080,0.006:0.910,0.080,0.001       0|0:0.030:0.008,0.021:0.965,0.031,0

I am trying to apply for each row:

Var(HDS)/(p(1-p)) where p=mean(HDS)

So:

Var([0.021,0,0.080,0.006,0.008,0.021]) / .023(1-0.023)

ADD REPLY • link 4.2 years ago by curious ▴ 780

0

Entering edit mode

If you like to use python, have a look at pysam, which is a wrapper around the htslib C-API.

ADD REPLY • link 4.2 years ago by finswimmer 16k

1

Entering edit mode

from pysam import VariantFile
from itertools import chain
from statistics import variance, mean

def calc_rsq(row):
    row_variance = variance(row)
    row_mean = mean(row)
    rsq = row_variance / (row_mean * (1-row_mean))
    return rsq

vcf_in = VariantFile("my_fav_vcf.vcf.gz")  # auto-detect input format
vcf_out = VariantFile('test.vcf', 'w', header=vcf_in.header)


for rec in vcf_in.fetch('chr1', 100000, 101000):
    haploid_dosages = [s['HDS'] for s in rec.samples.values()]
    haploid_dosages_list = list(chain(*haploid_dosages))
    rsq = calc_rsq(haploid_dosages_list)
    rec.info["R2"] = rsq
    vcf_out.write(rec)

Thanks I think pysam is going to change my work dramatically, I was parsing vcfs manually before

ADD REPLY • link 4.2 years ago by curious ▴ 780