How to count variants per gene per sample from a multi-sample VCF
2
1
Entering edit mode
4.5 years ago
O.ibra ▴ 10

I have an annotated vcf of 24 samples, and would like to get a table of the number of variants per gene per sample. Is there a way to do that which has been tried before?

(java -Xmx4g -jar snpEff/snpEff.jar count -v hg19 all_multi.vcf > test_count.txt) only produces counts and not per sample information.

While:

bcftools query -i'GT="1"' -f'[%SAMPLE %GT \n]' all_multi.vcf.gz | awk '{print $1}' | sort | uniq -c > sample.counts

Doesn't produce anything.

Any ideas/suggestions would be much appreciated. Thanks in advance!

VCF Count Genes • 2.5k views
ADD COMMENT
2
Entering edit mode
4.5 years ago

using Bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

and the following snippet:

final Counter<String> counts=new Counter<>();
stream().forEach(V->{
    for(final String ann: V.getAttributeAsStringList("ANN","")) {
        String tokens[]=ann.split("[\\|]");
        if(tokens.length <= 6) continue;
        final String gene=tokens[3]+"\t"+tokens[6];
        for(Genotype g:V.getGenotypes()) {
            if(!(g.isHet() || g.isHomVar())) continue;
            counts.incr(gene+"\t"+g.getSampleName());   
            }
        }
    });

for(final String key: counts.keySet()) {
println(key+"\t"+counts.count(key));
}

invoke:

java -jar dist/bioalcidaejdk.jar -f snppet.code src/test/resources/test_vcf01.vcf
(...)
C1orf159-RP11-465B22.5  ENSG00000131591-ENSG00000223823 S6  1
C1orf159    ENST00000379325 S1  1
RP11-54O7.17    ENST00000606034 S6  3
C1orf159    ENST00000379325 S2  14
C1orf159    ENST00000379325 S3  13
RP11-54O7.17    ENST00000606034 S4  3
C1orf159    ENST00000379325 S4  14
RP11-54O7.17    ENST00000606034 S5  3
C1orf159    ENST00000379325 S5  13
RP11-54O7.17    ENST00000606034 S2  3
ADD COMMENT
0
Entering edit mode

Thank you, that seems to work but I seem to have a problem with duplicates, which is causing errors in any downstream analysis or data wrangling. Please see the example below

GAK ENST00000511980 S1  2
GAK ENST00000511345 S1  1
ZNF517  ENST00000526178 S2  1
PTPN13  ENST00000427191 S3  1
IL15RA  ENST00000379977 S4  1

Appreciating your suggestions in advance!

ADD REPLY
0
Entering edit mode

OK course, there is more than one transcript per gene.

ADD REPLY
0
Entering edit mode

Oh of course, completely missed that bit! Thanks again for your help.

ADD REPLY
0
Entering edit mode
4.5 years ago

You could split the VCF by sample via bcftools view with the --samples option, then use your snpEFF command to get counts for each sample and merge them together by the gene names in R/python/perl to generate your final table.

ADD COMMENT

Login before adding your answer.

Traffic: 2263 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6