Question: Doing an operation on one huge VCF, use Python or would C be faster (somewhat niave bcftools/computer science question)
0
gravatar for curious
5 weeks ago by
curious320
curious320 wrote:

I have to perform what amounts to basically a correlation calculation on dosages from every row of what is equivalent to a 300M varaint X 30K sample VCF.

One thing I am wondering is if this would be faster to write a C plugin and work with BCFs or to Use Python and read in chunks and convert to a numpy matrix before performing my calculation. I am fairly sure the Python approach is going to take a really long time, butI don't know if C would be any faster. Does anyone have any suggestions of how to approach this with performance in mind. I would greatly appreciate any tips. Thank you.

bcftools C python vcf • 131 views
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by curious320

Can you give a representative example what you actually aim to do?

ADD REPLYlink written 5 weeks ago by ATpoint36k
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">
##FORMAT=<ID=HDS,Number=2,Type=Float,Description="Estimated Haploid Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1 SAMPLE2 SAMPLE3
chr5    11893   chr5:11893:T:C  T       C       .       PASS       GT:DS:HDS:GP       0|0:0.021:0.021,0:0.965,0.031,0 0|0:0.080:0.080,0.006:0.910,0.080,0.001       0|0:0.030:0.008,0.021:0.965,0.031,0

I am trying to apply for each row:

Var(HDS)/(p(1-p)) where p=mean(HDS)

So:

Var([0.021,0,0.080,0.006,0.008,0.021]) / .023(1-0.023)

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by curious320

If you like to use python, have a look at pysam, which is a wrapper around the htslib C-API.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by finswimmer13k
from pysam import VariantFile
from itertools import chain
from statistics import variance, mean

def calc_rsq(row):
    row_variance = variance(row)
    row_mean = mean(row)
    rsq = row_variance / (row_mean * (1-row_mean))
    return rsq

vcf_in = VariantFile("my_fav_vcf.vcf.gz")  # auto-detect input format
vcf_out = VariantFile('test.vcf', 'w', header=vcf_in.header)


for rec in vcf_in.fetch('chr1', 100000, 101000):
    haploid_dosages = [s['HDS'] for s in rec.samples.values()]
    haploid_dosages_list = list(chain(*haploid_dosages))
    rsq = calc_rsq(haploid_dosages_list)
    rec.info["R2"] = rsq
    vcf_out.write(rec)

Thanks I think pysam is going to change my work dramatically, I was parsing vcfs manually before

ADD REPLYlink written 5 weeks ago by curious320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 978 users visited in the last hour