Question: Genotype count in a VCF file
0
gravatar for akang
3.8 years ago by
akang90
akang90 wrote:

From a VCF file how can i count how many samples have 0/0, 0/1, 1/1 genotypes on a given locus? Ill appreciate any help!

snp vcf • 3.5k views
ADD COMMENTlink modified 3.8 years ago by Jorge Amigo11k • written 3.8 years ago by akang90
2
gravatar for Jorge Amigo
3.8 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

here is a simple perl alternative:

zcat ALL.chr7.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
| perl -ane '
/^#/ and next;
%c = ();
foreach (@F[9..$#F]) { /^([^:]+)/ and $c{$1}++ }
print "$F[0]\t$F[1]";
foreach $gt (sort keys %c) { print "\t$gt:$c{$gt}" }
print "\n"
'
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Jorge Amigo11k
1
gravatar for Persistent LABS
3.8 years ago by
Pune
Persistent LABS740 wrote:

I have not tried. But probably it will work. Let me know if it does not work.

grep -v '^#' temp.vcf | perl -lne '@x=split /\s+/,$_;$var11=grep{/1\/1/}@x[9..$#x]; $var01=grep{/0\/1/}@x[9..$#x];$var10=grep{/1\/0/}@x[9..$#x];print "$x[0]\t$x[1]\t$var01\t$var10\t$var11\n"'

Thanks

Priyabrata

Persistent LABS

ADD COMMENTlink modified 3.8 years ago by Jorge Amigo11k • written 3.8 years ago by Persistent LABS740

Although I haven't checked whether it's correct, it does work ;)

ADD REPLYlink written 3.8 years ago by WouterDeCoster44k
1
gravatar for sacha
3.8 years ago by
sacha1.9k
France
sacha1.9k wrote:

The following command counts occurence of genotype in a VCF file.

  cat test.vcf|grep -oE "([0-2]/[0-2])"|sort | uniq -c

You can filter for a specific locus before :

  cat test.vcf|awk '$1==CHROM && $2==POS {print $0}' | grep -oE "([0-2]/[0-2])"|sort | uniq -c
ADD COMMENTlink written 3.8 years ago by sacha1.9k

I think the solution provided would tell total 0/1 or 1/0 count per vcf file. He wants how many samples have 0/0, 0/1, 1/1 genotypes on a given locus. So suppose there are 10 samples in a vcf file, then there would be 10 genotypes at each locus corresponding to 10 sample. So 0/0, 0/1, 1/1 distribution has to be wrt 10 samples.

ADD REPLYlink written 3.8 years ago by Persistent LABS740

It does work (it summarizes) but I assume OP wanted counts per line/variant.

ADD REPLYlink written 3.8 years ago by WouterDeCoster44k

could it be possible to print the sample names as well?

ADD REPLYlink written 2.2 years ago by hellbio420
1
gravatar for Pierre Lindenbaum
3.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

using bioalcidae: https://github.com/lindenb/jvarkit/wiki/BioAlcidae

cat ILLUMINA.wex.broad_phase2_baseline.20111114.both.exome.genotypes.1000.vcf | 
java -jar src/jvarkit/dist/bioalcidae.jar -F VCF -e 'while(iter.hasNext()) {var ctx = iter.next();m={};for(var i=0;i< ctx.getNSamples();++i) {var g=ctx.getGenotype(i);var t=g.getType().name();if(t in m) {m[t]++;} else {m[t]=1;};} out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)); }'

1 69453 {"NO_CALL":748,"HOM_REF":426,"HOM_VAR":3,"HET":5}
1 69486 {"NO_CALL":747,"HOM_REF":434,"HET":1}
1 69496 {"NO_CALL":749,"HOM_REF":429,"HET":4}
1 69511 {"NO_CALL":693,"HOM_VAR":392,"HET":72,"HOM_REF":25}
1 69534 {"NO_CALL":755,"HOM_REF":426,"HET":1}
1 69541 {"NO_CALL":761,"HOM_REF":420,"HET":1}
1 69552 {"NO_CALL":757,"HOM_REF":419,"HET":6}
1 69590 {"NO_CALL":776,"HOM_REF":404,"HET":2}
1 69610 {"NO_CALL":783,"HOM_REF":392,"HET":6,"HOM_VAR":1}
1 69635 {"NO_CALL":789,"HOM_REF":391,"HET":2}
1 69761 {"NO_CALL":886,"HOM_REF":276,"HET":13,"HOM_VAR":7}
1 69768 {"NO_CALL":911,"HOM_REF":270,"HET":1}
1 69897 {"NO_CALL":999,"HOM_VAR":131,"HET":24,"HOM_REF":28}
1 69968 {"NO_CALL":1152,"HOM_REF":29,"HET":1}
1 861275 {"HOM_REF":1181,"HET":1}
1 861292 {"HOM_REF":1179,"HET":3}
ADD COMMENTlink written 3.8 years ago by Pierre Lindenbaum129k

Hi Pierre, could we print the sample names also in separate columns corresponding to the counts? BTW, i could not find the jar in the git, is it moved to a different location or name?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by hellbio420

, i could not find the jar in the git,

you need to compile the program.

could we print the sample names also in separate columns corresponding to the counts

yes , just loop over the m object. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/keys

ADD REPLYlink written 2.2 years ago by Pierre Lindenbaum129k

I tried getSampleName() in the print statement. It won't do what i need, it is just printing one name for all the variants. Seems there is a little more complicated solution that needs some intuition. Could you help.

out.println(ctx.getContig()+" "+ctx.getStart()+" "+JSON.stringify(m)+" "+g.getSampleName())
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by hellbio420
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 738 users visited in the last hour