Question

calculate Per variant Heterozygosity from VCF file

4

Entering edit mode

6.3 years ago

SOHAIL ▴ 400

Hi everybody,

Is there any way to calculate the per variant heterozygosity (i.e. number of 1/0 or 0/1 genotypes observed at given variant site for set of individuals in VCF file) from VCF file?

I knew per individual heterozygosity can be calculated by --het tag from VCFtools

Thanks!

-sohail

ngs vcf • 9.7k views

ADD COMMENT • link updated 6.3 years ago by Kevin Blighe 87k • written 6.3 years ago by SOHAIL ▴ 400

score 14 · Answer 1 · 2017-12-28

14

Entering edit mode

6.3 years ago

Kevin Blighe 87k

NB - this does NOT support multi-allelic records. If you have these, split them via bcftools norm -m-any

----------------------

Assuming that you're confident that your VCF is normalised and that you genuinely just want to count occurrences of heterozygous calls per line, then the following will work for either phased (0|1 or 1|0) or unphased (0/1 or 1/0) genotypes, or a mixture of these.

Here, I'm actually accessing a BCF and it's the imputed 1000 Genomes Phase III data. So, the genotypes are phased.

The gsub function in AWK conveniently returns the number of matched patterns.

paste <(bcftools view 1000Genomes.Norm.bcf |\
    awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
      !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' 1000Genomes.Norm.bcf |\
    awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}')

CHR POS ID  REF ALT nHet
1   10177   rs367896724 A   AC  1490
1   10235   rs540431307 T   TA  6
1   10352   rs555500075 T   TA  2025
1   10505   rs548419688 A   T   1
1   10506   rs568405545 C   G   1
1   10511   rs534229142 G   A   1
1   10539   rs537182016 C   A   3
1   10542   rs572818783 C   T   1
1   10579   rs538322974 C   A   1
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   35
1   10642   rs558604819 G   A   21
1   11008   rs575272151 C   G   403
1   11012   rs544419019 C   G   403
1   11063   rs561109771 T   G   15
1   13011   rs574746232 T   G   3
1   13110   rs540538026 G   A   134
1   13116   rs62635286  T   G   414
1   13118   rs200579949 A   G   414

ADD COMMENT • link 4.2 years ago by Kevin Blighe 87k

1

Entering edit mode

It worked perfectly fine.. thanks for the help.. @Kevin

ADD REPLY • link 6.3 years ago by SOHAIL ▴ 400

0

Entering edit mode

You're welcome, Sohail

ADD REPLY • link 6.3 years ago by Kevin Blighe 87k

0

Entering edit mode

This is almost perfect. The problem are sites like this:

0/1:1,1:2:26:0|1:27975_T_C:39,0,26

This site gives me two heterozygous counts instead of 1, I am trying to solve it, adding to gsub something like this but is not working gsub(/^0\|1|^1\|0|^0\/1|^1\/0/,"")

Any Idea?

ADD REPLY • link 5.4 years ago by shinken123 ▴ 150

0

Entering edit mode

Hey, that will require an extra if statement. The code above is designed for multi-allelic records that are encoded as 1/2, 2/2, 1/0, etc. I do not exactly have time to edit this, so, my recommendation would be to just split your multiallelic records via bcftools norm -m-any and then run the above AWK code.

How did you produce your file?

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

I made the SNP calling with GATK.

ADD REPLY • link 5.4 years ago by shinken123 ▴ 150

0

Entering edit mode

I am not surprised (seriously) - this is an issue that re-occurs every now and then. GATK's programs can produce VCFs in a format that is difficult to use with other scripts/programs.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

1

Entering edit mode

Yep, thats True, I this case, because my file is not phased, the solution was to use your script only with the nonphases counts. like this gsub(/0\/1|1\/0/,"").

Thank you for your script It help me a lot.

ADD REPLY • link 5.4 years ago by shinken123 ▴ 150

0

Entering edit mode

Hey! Thanks. It works fine. Just a small update to use this command with a VCF file including multiallelic sites (up to four).

paste <(awk -F"\t" 'BEGIN {print "CHR\tPOS\tREF\tALT"} !/^#/ {print $1"\t"$2"\t"$4"\t"$5}' file.vcf) \
<(awk 'BEGIN {print "nHet"} !/^#/ {print gsub(/0\|1|1\|0|0\/1|1\/0|0\|2|2\|0|0\/2|2\/0|0\|3|3\|0|0\/3|3\/0|0\|4|4\|0|0\/4|4\/0/, "")}'  file.vcf)