Calculate mean DP4 in a multisample vcf
6 months ago
avelarbio46 ▴ 30

Hello everyone!

I'm trying to condense the per-sample FORMAT fields in my VCF file into per-variant summary statistics. To do this, I'm using:

MYVCF=my_multisample_vcf_path
paste \
  <(bcftools view "$MYVCF" | awk -F"\t" 'BEGIN {print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"} !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9}') \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' "$MYVCF" | awk 'BEGIN {OFS="\t"; print "nHomAlt\tnHet\tnHomRef"} {nHet=gsub(/0\|1|1\|0|0\/1|1\/0/, ""); nHomAlt=gsub(/1\|1|1\/1/, ""); nHomRef=gsub(/0\|0|0\/0/, ""); print nHomAlt,nHet,nHomRef}') \
  | sed 's/,\t/\t/g' | sed 's/,$//g' >> out_put.vcf


This generates 3 columns with the number of samples that are HomAlt, Het and HomRef for each variant.

I want to do the same thing for DP4, but instead of summarising genotypes, print the mean across all samples for each variant.

##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="ref forward, ref reverse, alt forward, alt reverse">


Obviously, DP4 is a somewhat more complex field than GT, since it holds four comma-separated values per sample.

Is there any way to do this with AWK or any other tool?

So, basically, I want to add 4 columns to the VCF:

DP4_ref_forward_mean            DP4_ref_reverse_mean            DP4_alt_forward_mean            DP4_alt_reverse_mean
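To make the goal concrete, here is a rough sketch of what I have in mind, not a tested solution. It assumes that `bcftools query -f '[\t%DP4]\n'` prints one line per variant with each sample's DP4 as `rf,rr,af,ar`, and `.` where a sample has no value. The function name `dp4_means` is just something I made up, and samples with missing DP4 are left out of the mean.

```shell
# dp4_means (hypothetical helper): reads one line per variant from stdin,
# with tab-separated per-sample DP4 values ("rf,rr,af,ar"), and prints the
# mean of each of the four counts across samples that have a value.
dp4_means() {
  awk -F'\t' '
    BEGIN {
      OFS = "\t"
      print "DP4_ref_forward_mean", "DP4_ref_reverse_mean", "DP4_alt_forward_mean", "DP4_alt_reverse_mean"
    }
    {
      n = 0
      for (k = 1; k <= 4; k++) sum[k] = 0
      for (i = 1; i <= NF; i++) {
        # Skip the empty field left by the leading tab and missing values (".").
        if (split($i, d, ",") != 4 || d[1] == ".") continue
        n++
        for (k = 1; k <= 4; k++) sum[k] += d[k]
      }
      # No sample had a usable DP4 for this variant.
      if (n == 0) { print ".", ".", ".", "."; next }
      printf "%.2f\t%.2f\t%.2f\t%.2f\n", sum[1]/n, sum[2]/n, sum[3]/n, sum[4]/n
    }'
}

# Intended use (assumes bcftools is installed and MYVCF is set as above):
#   bcftools query -f '[\t%DP4]\n' "$MYVCF" | dp4_means
```

The output has a header line plus one line per variant, so it could be given to `paste` as another `<( ... )` argument next to the genotype-count columns. Whether missing samples should be skipped or counted as zeros is a choice I'm not sure about.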
