How to check which samples have more uncalled genotypes in a multi-sample VCF
3.8 years ago
BAGeno ▴ 180

Hi,

I have a multi-sample VCF in which many sites have uncalled (missing) genotypes. Is there a way to check which sample has the greatest number of uncalled genotypes, so that I can exclude that sample from further analysis?

genotype missing sample • 1.3k views

Hello BAGeno,

See my answer in this thread. You just have to adapt the genotype in the awk script or, if it's a small file and speed doesn't matter, use this easier one.

fin swimmer

3.8 years ago

A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().flatMap(G->G.getGenotypes().stream()).filter(G->!G.isCalled()).map(G->G.getSampleName()).collect(Collectors.groupingBy(Function.identity(), Collectors.counting())).forEach((K,V)->println(K+"\t"+V));' src/test/resources/test_vcf01.vcf  | sort -t $'\t' -k2,2n



S3  8
S4  9
S5  14
S6  18
S2  23
S1  73
  • stream(): get a stream of variants
  • flatMap(G->G.getGenotypes().stream()): map to a stream of genotypes
  • filter(G->!G.isCalled()): keep the uncalled genotypes
  • map(G->G.getSampleName()): map to the sample name
  • collect(Collectors.groupingBy(Function.identity(), Collectors.counting())): convert to an associative array of sample/count
  • forEach((K,V)->println(K+"\t"+V)): print the results
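
If you'd rather not depend on jvarkit, the same steps (iterate over variants, look at each genotype, keep the uncalled ones, count per sample) can be sketched in plain Python. This is only a rough sketch under a few assumptions of mine: the VCF is uncompressed and tab-delimited, GT is the first FORMAT key, and a genotype counts as "uncalled" only when every allele is "." (so a half-missing call like ./1 is treated as called). The names is_uncalled, count_uncalled and DEMO_VCF are made up for this example, and the demo records are invented, not taken from the test file above.

```python
from collections import Counter


def is_uncalled(sample_field):
    """Return True when every allele in the GT subfield is '.' (e.g. './.', '.|.', '.')."""
    gt = sample_field.split(":", 1)[0]          # GT is assumed to be the first subfield
    alleles = gt.replace("|", "/").split("/")   # treat phased and unphased calls alike
    return all(allele == "." for allele in alleles)


def count_uncalled(lines):
    """Count uncalled genotypes per sample from an iterable of VCF text lines."""
    samples = []
    counts = Counter()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue                             # skip blank lines and meta headers
        fields = line.split("\t")
        if fields[0] == "#CHROM":
            samples = fields[9:]                 # sample names follow the FORMAT column
            continue
        if len(fields) < 10 or fields[8].split(":")[0] != "GT":
            continue                             # no genotype columns or no GT key
        for sample, value in zip(samples, fields[9:]):
            if is_uncalled(value):
                counts[sample] += 1
    return counts


# Tiny invented example, just to show the input and output shapes.
DEMO_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t./.:0\t0/1:12\t1/1:9",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t./.:0\t.|.:0\t0/0:7",
    "1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t./1\t0/1\t./.",
])

# Print sample/count pairs, smallest count first (like `sort -k2,2n`).
for sample, n in sorted(count_uncalled(DEMO_VCF.splitlines()).items(),
                        key=lambda kv: (kv[1], kv[0])):
    print(f"{sample}\t{n}")
```

To run it on a real file, pass an open file handle instead of the demo lines, e.g. count_uncalled(open("my.vcf")). As with the one-liner, samples with no uncalled genotypes simply don't appear in the output.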