Count the number of homozygous and heterozygous variants in VCF file
2
2
Entering edit mode
5.4 years ago

I would like to find the count of homozygous major(0|0),homozygous minor(1|1) and heterozygous(1|0,0|1) alleles for each position of a chromosome.I would like to form a table like this:

POS 00 1 0/0 1 11

Pos1 n1 n2 n3

Pos2 n4 n5 n6

where n1...n6 are the counts .

See the example file here 1000Genomes

Any suggestions to try to get this information would be much appreciated

Thanks in Advance

1000Genomes awk count allele • 6.1k views
ADD COMMENT
5
Entering edit mode
5.4 years ago

An awk solution:

$ zcat input.vcf.gz|awk -v OFS="\t" '$0 !~ "^#" {hom_ref = 0; hom_alt = 0; het = 0; for(i=10;i<=NF;i++) { if($i ~ /0\|0/) hom_ref++; else if($i ~ /1\|1/) hom_alt++;  else het++; } print $1, $2, hom_ref, hom_alt, het}'

fin swimmer

ADD COMMENT
0
Entering edit mode

Thanks a lot finswimmer .It works.

ADD REPLY
0
Entering edit mode

I am sorry this seems to give me a wrong solution

ADD REPLY
0
Entering edit mode

Why do you think so?

ADD REPLY
0
Entering edit mode

I get the the count of 0|0 right. But when i manually calculated it gives me the wrong count for 0|1,1|0 and 1|1

ADD REPLY
0
Entering edit mode

Could you please post a small example of your file and the output you get and which you would expect?

ADD REPLY
0
Entering edit mode

Sure. Here is the example :

Position, Count_0|0, Count_0|1, Count_1|0 ,Count_1|1

16050115, 404, 0 , 0 , 0

16050213, 403, 0 , 1 , 0

16050607 400 , 2 , 2 , 0

ADD REPLY
0
Entering edit mode

Hello aadhirareddy1323 ,

what I meant was an example of your input file.

fin swimmer

ADD REPLY
0
Entering edit mode
#CHROM    POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 
22      16050115        .       G       A       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050213        .       C       T       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050607        .       G       A       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050739        .       TA      T       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050783        .       A       G       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050840        .       C       G       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050847        .       T       C       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     
22      16050922        rs367963583     T       G       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0    0|0
22      16050984        rs188945759     C       G       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
22      16051075        .       G       A       .       PASS    .       GT      0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0
22      16051249        rs62224609      T       C       .       PASS    .       GT      0|0     1|0     1|0     0|0     0|0     1|0     0|0     0|0
22      16051453        rs143503259     A       C,G     .       PASS    .       GT      0|0     1|0     0|0     0|0     0|0     1|0     0|0    0|0
ADD REPLY
0
Entering edit mode

OK, thanks. Mistake found. There was in elsemissing here: if($i ~ /1\|1/) hom_alt++;. It must be else if. I corrected it in my answer.

fin swimmer

ADD REPLY
0
Entering edit mode

Thank you fin swimmer . I will try this

ADD REPLY
0
Entering edit mode

@fin swimmer ..This gives me wrong count again

ADD REPLY
0
Entering edit mode

I get the count of home_ref and alt right. The problem with het is I want to only sum the ones with 0|1 and 1|0. but it gives me the sum of 0|1,1|0,0|2,etc.. I am sorry I havent mentioned it before .How do i proceed ? I am new to linux

ADD REPLY
0
Entering edit mode

That wasn't clear to me, that you want to differ the heterozygous. You just have to modify my code in that way, that you define a new variable for each genotype you like to count, check with in if-statement if it exist, count and output all the variables.

So for checking the genotype 0|1 and 1|0 it looks like this:

$ zcat input.vcf.gz|awk -v OFS="\t" '$0 !~ "^#" {hom_ref = 0; hom_alt = 0; het_01 = 0; het_10=0; for(i=10;i<=NF;i++) { if($i ~ /0\|0/) hom_ref++; else if($i ~ /1\|1/) hom_alt++;  else if($i ~ /0\|1/) het_01++; else if($i ~ /1\|0/) het_10++; } print $1, $2, hom_ref, hom_alt, het_01, het_10}'

Why do want to make a difference between 0|1 and 1|0?

fin swimmer

ADD REPLY
0
Entering edit mode

@finswimmer,I dont want to make the difference , but want to exclude 0|2,0|3 etc and keep only count(1|0 and 0|1) for heterozygous.

ADD REPLY
0
Entering edit mode
5.4 years ago
cat vcffile \
| perl -ane '
/^#/ and next;
%c = ();
foreach (@F[9..$#F]) { /^([^:]+)/ and $c{$1}++ }
print "$F[0]\t$F[1]";
foreach $gt (sort keys %c) { print "\t$gt:$c{$gt}" }
print "\n"

I found this code gives me the right count

ADD COMMENT
0
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY

Login before adding your answer.

Traffic: 2574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6