Question: Genotype count in a VCF file
0
gravatar for akang
2.6 years ago by
akang90
akang90 wrote:

From a VCF file how can i count how many samples have 0/0, 0/1, 1/1 genotypes on a given locus? Ill appreciate any help!

snp vcf • 2.5k views
ADD COMMENTlink modified 2.6 years ago by Jorge Amigo11k • written 2.6 years ago by akang90
2
gravatar for Jorge Amigo
2.6 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

here is a simple perl alternative:

zcat ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
| perl -ane '
/^#/ and next;
%c = ();
foreach (@F[9..$#F]) { /^([^:]+)/ and $c{$1}++ }
print "$F[0]\t$F[1]";
foreach $gt (sort keys %c) { print "\t$gt:$c{$gt}" }
print "\n"
'
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Jorge Amigo11k
1
gravatar for Persistent LABS
2.6 years ago by
Pune
Persistent LABS740 wrote:

I have not tried. But probably it will work. Let me know if it does not work.

grep -v '^#' temp.vcf | perl -lne '@x=split /\s+/,$_;$var11=grep{/1\/1/}@x[9..$#x]; $var01=grep{/0\/1/}@x[9..$#x];$var10=grep{/1\/0/}@x[9..$#x];print "$x[0]\t$x[1]\t$var01\t$var10\t$var11\n"'

Thanks

Priyabrata

Persistent LABS

ADD COMMENTlink modified 2.6 years ago by Jorge Amigo11k • written 2.6 years ago by Persistent LABS740

Although I haven't checked whether it's correct, it does work ;)

ADD REPLYlink written 2.6 years ago by WouterDeCoster39k
1
gravatar for sacha
2.6 years ago by
sacha1.7k
France
sacha1.7k wrote:

The following command counts occurence of genotype in a VCF file.

  cat test.vcf|grep -oE "([0-2]/[0-2])"|sort | uniq -c

You can filter for a specific locus before :

  cat test.vcf|awk '$1==CHROM && $2==POS {print $0}' | grep -oE "([0-2]/[0-2])"|sort | uniq -c
ADD COMMENTlink written 2.6 years ago by sacha1.7k

I think the solution provided would tell total 0/1 or 1/0 count per vcf file. He wants how many samples have 0/0, 0/1, 1/1 genotypes on a given locus. So suppose there are 10 samples in a vcf file, then there would be 10 genotypes at each locus corresponding to 10 sample. So 0/0, 0/1, 1/1 distribution has to be wrt 10 samples.

ADD REPLYlink written 2.6 years ago by Persistent LABS740

It does work (it summarizes) but I assume OP wanted counts per line/variant.

ADD REPLYlink written 2.6 years ago by WouterDeCoster39k

could it be possible to print the sample names as well?

ADD REPLYlink written 13 months ago by hellbio370
1
gravatar for Pierre Lindenbaum
2.6 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum120k wrote:

using bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

cat ILLUMINA.wex.broad_phase2_baseline.20111114.both.exome.genotypes.1000.vcf | 
java -jar src/jvarkit/dist/bioalcidae.jar -F VCF -e 'while(iter.hasNext()) {var ctx = iter.next();m={};for(var i=0;i< ctx.getNSamples();++i) {var g=ctx.getGenotype(i);var t=g.getType().name();if(t in m) {m[t]++;} else {m[t]=1;};} out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)); }'

1 69453 {"NO_CALL":748,"HOM_REF":426,"HOM_VAR":3,"HET":5}
1 69486 {"NO_CALL":747,"HOM_REF":434,"HET":1}
1 69496 {"NO_CALL":749,"HOM_REF":429,"HET":4}
1 69511 {"NO_CALL":693,"HOM_VAR":392,"HET":72,"HOM_REF":25}
1 69534 {"NO_CALL":755,"HOM_REF":426,"HET":1}
1 69541 {"NO_CALL":761,"HOM_REF":420,"HET":1}
1 69552 {"NO_CALL":757,"HOM_REF":419,"HET":6}
1 69590 {"NO_CALL":776,"HOM_REF":404,"HET":2}
1 69610 {"NO_CALL":783,"HOM_REF":392,"HET":6,"HOM_VAR":1}
1 69635 {"NO_CALL":789,"HOM_REF":391,"HET":2}
1 69761 {"NO_CALL":886,"HOM_REF":276,"HET":13,"HOM_VAR":7}
1 69768 {"NO_CALL":911,"HOM_REF":270,"HET":1}
1 69897 {"NO_CALL":999,"HOM_VAR":131,"HET":24,"HOM_REF":28}
1 69968 {"NO_CALL":1152,"HOM_REF":29,"HET":1}
1 861275 {"HOM_REF":1181,"HET":1}
1 861292 {"HOM_REF":1179,"HET":3}
ADD COMMENTlink written 2.6 years ago by Pierre Lindenbaum120k

Hi Pierre, could we print the sample names also in separate columns corresponding to the counts? BTW, i could not find the jar in the git, is it moved to a different location or name?

ADD REPLYlink modified 13 months ago • written 13 months ago by hellbio370

, i could not find the jar in the git,

you need to compile the program.

could we print the sample names also in separate columns corresponding to the counts

yes , just loop over the m object. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/keys

ADD REPLYlink written 13 months ago by Pierre Lindenbaum120k

I tried getSampleName() in the print statement. It won't do what i need, it is just printing one name for all the variants. Seems there is a little more complicated solution that needs some intuition. Could you help.

out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)+" "+g.getSampleName())
ADD REPLYlink modified 13 months ago • written 13 months ago by hellbio370
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 945 users visited in the last hour