Question: calculate Per variant Heterozygosity from VCF file
2
gravatar for SOHAIL
2.8 years ago by
SOHAIL310
Beijing Institute of Genomics, CAS.
SOHAIL310 wrote:

Hi everybody,

Is there any way to calculate the per variant heterozygosity (i.e. number of 1/0 or 0/1 genotypes observed at given variant site for set of individuals in VCF file) from VCF file?

I knew per individual heterozygosity can be calculated by --het tag from VCFtools

Thanks!

-sohail

ngs vcf • 4.6k views
ADD COMMENTlink modified 2.8 years ago by Kevin Blighe65k • written 2.8 years ago by SOHAIL310
11
gravatar for Kevin Blighe
2.8 years ago by
Kevin Blighe65k
Kevin Blighe65k wrote:

NB - this does NOT support multi-allelic records. If you have these, split them via bcftools norm -m-any

----------------------

Assuming that you're confident that your VCF is normalised and that you genuinely just want to count occurrences of heterozygous calls per line, then the following will work for either phased (0|1 or 1|0) or unphased (0/1 or 1/0) genotypes, or a mixture of these.

Here, I'm actually accessing a BCF and it's the imputed 1000 Genomes Phase III data. So, the genotypes are phased.

The gsub function in AWK conveniently returns the number of matched patterns.

paste <(bcftools view 1000Genomes.Norm.bcf |\
    awk -F"\t" 'BEGIN {print "CHR\tPOS\tID\tREF\tALT"} \
      !/^#/ {print $1"\t"$2"\t"$3"\t"$4"\t"$5}') \
    \
  <(bcftools query -f '[\t%SAMPLE=%GT]\n' 1000Genomes.Norm.bcf |\
    awk 'BEGIN {print "nHet"} {print gsub(/0\|1|1\|0|0\/1|1\/0/, "")}')

CHR POS ID  REF ALT nHet
1   10177   rs367896724 A   AC  1490
1   10235   rs540431307 T   TA  6
1   10352   rs555500075 T   TA  2025
1   10505   rs548419688 A   T   1
1   10506   rs568405545 C   G   1
1   10511   rs534229142 G   A   1
1   10539   rs537182016 C   A   3
1   10542   rs572818783 C   T   1
1   10579   rs538322974 C   A   1
1   10616   rs376342519 CCGCCGTTGCAAAGGCGCGCCG  C   35
1   10642   rs558604819 G   A   21
1   11008   rs575272151 C   G   403
1   11012   rs544419019 C   G   403
1   11063   rs561109771 T   G   15
1   13011   rs574746232 T   G   3
1   13110   rs540538026 G   A   134
1   13116   rs62635286  T   G   414
1   13118   rs200579949 A   G   414
ADD COMMENTlink modified 7 months ago • written 2.8 years ago by Kevin Blighe65k
1

It worked perfectly fine.. thanks for the help.. @Kevin

ADD REPLYlink written 2.7 years ago by SOHAIL310

You're welcome, Sohail

ADD REPLYlink written 2.7 years ago by Kevin Blighe65k

This is almost perfect. The problem are sites like this:

0/1:1,1:2:26:0|1:27975_T_C:39,0,26

This site gives me two heterozygous counts instead of 1, I am trying to solve it, adding to gsub something like this but is not working gsub(/^0\|1|^1\|0|^0\/1|^1\/0/,"")

Any Idea?

ADD REPLYlink written 22 months ago by shinken123100

Hey, that will require an extra if statement. The code above is designed for multi-allelic records that are encoded as 1/2, 2/2, 1/0, etc. I do not exactly have time to edit this, so, my recommendation would be to just split your multiallelic records via bcftools norm -m-any and then run the above AWK code.

How did you produce your file?

ADD REPLYlink written 22 months ago by Kevin Blighe65k

I made the SNP calling with GATK.

ADD REPLYlink written 22 months ago by shinken123100

I am not surprised (seriously) - this is an issue that re-occurs every now and then. GATK's programs can produce VCFs in a format that is difficult to use with other scripts/programs.

ADD REPLYlink written 22 months ago by Kevin Blighe65k
1

Yep, thats True, I this case, because my file is not phased, the solution was to use your script only with the nonphases counts. like this gsub(/0\/1|1\/0/,"").

Thank you for your script It help me a lot.

ADD REPLYlink written 22 months ago by shinken123100

Hey! Thanks. It works fine. Just a small update to use this command with a VCF file including multiallelic sites (up to four).

paste <(awk -F"\t" 'BEGIN {print "CHR\tPOS\tREF\tALT"} !/^#/ {print $1"\t"$2"\t"$4"\t"$5}' file.vcf) \
<(awk 'BEGIN {print "nHet"} !/^#/ {print gsub(/0\|1|1\|0|0\/1|1\/0|0\|2|2\|0|0\/2|2\/0|0\|3|3\|0|0\/3|3\/0|0\|4|4\|0|0\/4|4\/0/, "")}'  file.vcf)
ADD REPLYlink modified 7 months ago • written 7 months ago by gsilvaarias0

Thank you. Can you show some sample output just to help users?

ADD REPLYlink written 7 months ago by Kevin Blighe65k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2186 users visited in the last hour