Genotype count in a VCF file
4
0
Entering edit mode
7.9 years ago
akang ▴ 110

From a VCF file how can i count how many samples have 0/0, 0/1, 1/1 genotypes on a given locus? Ill appreciate any help!

SNP vcf • 7.6k views
2
Entering edit mode
7.9 years ago
sacha ★ 2.4k

The following command counts occurence of genotype in a VCF file.

  cat test.vcf|grep -oE "([0-2]/[0-2])"|sort | uniq -c


You can filter for a specific locus before :

  cat test.vcf|awk '$1==CHROM &&$2==POS {print $0}' | grep -oE "([0-2]/[0-2])"|sort | uniq -c  ADD COMMENT 0 Entering edit mode I think the solution provided would tell total 0/1 or 1/0 count per vcf file. He wants how many samples have 0/0, 0/1, 1/1 genotypes on a given locus. So suppose there are 10 samples in a vcf file, then there would be 10 genotypes at each locus corresponding to 10 sample. So 0/0, 0/1, 1/1 distribution has to be wrt 10 samples. ADD REPLY 0 Entering edit mode It does work (it summarizes) but I assume OP wanted counts per line/variant. ADD REPLY 0 Entering edit mode could it be possible to print the sample names as well? ADD REPLY 2 Entering edit mode 7.9 years ago here is a simple perl alternative: zcat ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \ | perl -ane ' /^#/ and next; %c = (); foreach (@F[9..$#F]) { /^([^:]+)/ and $c{$1}++ }
print "$F[0]\t$F[1]";
foreach $gt (sort keys %c) { print "\t$gt:$c{$gt}" }
print "\n"
'

1
Entering edit mode
7.9 years ago

I have not tried. But probably it will work. Let me know if it does not work.

grep -v '^#' temp.vcf | perl -lne '@x=split /\s+/,$_;$var11=grep{/1\/1/}@x[9..$#x];$var01=grep{/0\/1/}@x[9..$#x];$var10=grep{/1\/0/}@x[9..$#x];print "$x[0]\t$x[1]\t$var01\t$var10\t$var11\n"'


Thanks

Priyabrata

Persistent LABS

0
Entering edit mode

Although I haven't checked whether it's correct, it does work ;)

1
Entering edit mode
7.9 years ago

using bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

cat ILLUMINA.wex.broad_phase2_baseline.20111114.both.exome.genotypes.1000.vcf |
java -jar src/jvarkit/dist/bioalcidae.jar -F VCF -e 'while(iter.hasNext()) {var ctx = iter.next();m={};for(var i=0;i< ctx.getNSamples();++i) {var g=ctx.getGenotype(i);var t=g.getType().name();if(t in m) {m[t]++;} else {m[t]=1;};} out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)); }'

1 69453 {"NO_CALL":748,"HOM_REF":426,"HOM_VAR":3,"HET":5}
1 69486 {"NO_CALL":747,"HOM_REF":434,"HET":1}
1 69496 {"NO_CALL":749,"HOM_REF":429,"HET":4}
1 69511 {"NO_CALL":693,"HOM_VAR":392,"HET":72,"HOM_REF":25}
1 69534 {"NO_CALL":755,"HOM_REF":426,"HET":1}
1 69541 {"NO_CALL":761,"HOM_REF":420,"HET":1}
1 69552 {"NO_CALL":757,"HOM_REF":419,"HET":6}
1 69590 {"NO_CALL":776,"HOM_REF":404,"HET":2}
1 69610 {"NO_CALL":783,"HOM_REF":392,"HET":6,"HOM_VAR":1}
1 69635 {"NO_CALL":789,"HOM_REF":391,"HET":2}
1 69761 {"NO_CALL":886,"HOM_REF":276,"HET":13,"HOM_VAR":7}
1 69768 {"NO_CALL":911,"HOM_REF":270,"HET":1}
1 69897 {"NO_CALL":999,"HOM_VAR":131,"HET":24,"HOM_REF":28}
1 69968 {"NO_CALL":1152,"HOM_REF":29,"HET":1}
1 861275 {"HOM_REF":1181,"HET":1}
1 861292 {"HOM_REF":1179,"HET":3}

0
Entering edit mode

Hi Pierre, could we print the sample names also in separate columns corresponding to the counts? BTW, i could not find the jar in the git, is it moved to a different location or name?

0
Entering edit mode

, i could not find the jar in the git,

you need to compile the program.

could we print the sample names also in separate columns corresponding to the counts

yes , just loop over the m object. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/keys

0
Entering edit mode

I tried getSampleName() in the print statement. It won't do what i need, it is just printing one name for all the variants. Seems there is a little more complicated solution that needs some intuition. Could you help.

out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)+" "+g.getSampleName())