Question: Doing an operation on one huge VCF, use Python or would C be faster (somewhat niave bcftools/computer science question)
gravatar for curious
5 weeks ago by
curious320 wrote:

I have to perform what amounts to basically a correlation calculation on dosages from every row of what is equivalent to a 300M varaint X 30K sample VCF.

One thing I am wondering is if this would be faster to write a C plugin and work with BCFs or to Use Python and read in chunks and convert to a numpy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, butI don't know if C would be any faster. Does anyone have any suggestions of how to approach this with performance in mind. I would greatly appreciate any tips. Thank you.

bcftools C python vcf • 131 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by curious320

Can you give a representative example what you actually aim to do?

ADD REPLYlink written 5 weeks ago by ATpoint36k
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
chr5    11893   chr5:11893:T:C  T       C       .       PASS       GT:DS:HDS:GP       0|0:0.021:0.021,0:0.965,0.031,0 0|0:0.080:0.080,0.006:0.910,0.080,0.001       0|0:0.030:0.008,0.021:0.965,0.031,0

I am trying to apply for each row:

Var(HDS)/(p(1-p)) where p=mean(HDS)


Var([0.021,0,0.080,0.006,0.008,0.021]) / .023(1-0.023)

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by curious320

If you like to use python, have a look at pysam, which is a wrapper around the htslib C-API.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by finswimmer13k
from pysam import VariantFile
from itertools import chain
from statistics import variance, mean

def calc_rsq(row):
    row_variance = variance(row)
    row_mean = mean(row)
    rsq = row_variance / (row_mean * (1-row_mean))
    return rsq

vcf_in = VariantFile("my_fav_vcf.vcf.gz")  # auto-detect input format
vcf_out = VariantFile('test.vcf', 'w', header=vcf_in.header)

for rec in vcf_in.fetch('chr1', 100000, 101000):
    haploid_dosages = [s['HDS'] for s in rec.samples.values()]
    haploid_dosages_list = list(chain(*haploid_dosages))
    rsq = calc_rsq(haploid_dosages_list)["R2"] = rsq

Thanks I think pysam is going to change my work dramatically, I was parsing vcfs manually before

ADD REPLYlink written 5 weeks ago by curious320
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 978 users visited in the last hour