Question

How to count variants per gene per sample from a multi-sample VCF

1

Entering edit mode

4.5 years ago

O.ibra ▴ 10

I have an annotated vcf of 24 samples, and would like to get a table of the number of variants per gene per sample. Is there a way to do that which has been tried before?

(java -Xmx4g -jar snpEff/snpEff.jar count -v hg19 all_multi.vcf > test_count.txt) only produces counts and not per sample information.

While:

bcftools query -i'GT="1"' -f'[%SAMPLE %GT \n]' all_multi.vcf.gz | awk '{print $1}' | sort | uniq -c > sample.counts

Doesn't produce anything.

Any ideas/suggestions would be much appreciated. Thanks in advance!

VCF Count Genes • 2.5k views

ADD COMMENT • link 4.5 years ago by O.ibra ▴ 10

Pierre Lindenbaum · Answer 1 · 2019-10-25

2

Entering edit mode

4.5 years ago

Pierre Lindenbaum 161k

using Bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

and the following snippet:

final Counter<String> counts=new Counter<>();
stream().forEach(V->{
    for(final String ann: V.getAttributeAsStringList("ANN","")) {
        String tokens[]=ann.split("[\\|]");
        if(tokens.length <= 6) continue;
        final String gene=tokens[3]+"\t"+tokens[6];
        for(Genotype g:V.getGenotypes()) {
            if(!(g.isHet() || g.isHomVar())) continue;
            counts.incr(gene+"\t"+g.getSampleName());   
            }
        }
    });

for(final String key: counts.keySet()) {
println(key+"\t"+counts.count(key));
}

invoke:

java -jar dist/bioalcidaejdk.jar -f snppet.code src/test/resources/test_vcf01.vcf
(...)
C1orf159-RP11-465B22.5  ENSG00000131591-ENSG00000223823 S6  1
C1orf159    ENST00000379325 S1  1
RP11-54O7.17    ENST00000606034 S6  3
C1orf159    ENST00000379325 S2  14
C1orf159    ENST00000379325 S3  13
RP11-54O7.17    ENST00000606034 S4  3
C1orf159    ENST00000379325 S4  14
RP11-54O7.17    ENST00000606034 S5  3
C1orf159    ENST00000379325 S5  13
RP11-54O7.17    ENST00000606034 S2  3

ADD COMMENT • link 4.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you, that seems to work but I seem to have a problem with duplicates, which is causing errors in any downstream analysis or data wrangling. Please see the example below

GAK ENST00000511980 S1  2
GAK ENST00000511345 S1  1
ZNF517  ENST00000526178 S2  1
PTPN13  ENST00000427191 S3  1
IL15RA  ENST00000379977 S4  1

Appreciating your suggestions in advance!

ADD REPLY • link updated 4.5 years ago by Pierre Lindenbaum 161k • written 4.5 years ago by O.ibra ▴ 10

0

Entering edit mode

OK course, there is more than one transcript per gene.

ADD REPLY • link 4.5 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Oh of course, completely missed that bit! Thanks again for your help.

ADD REPLY • link 4.5 years ago by O.ibra ▴ 10

score 0 · Answer 2 · 2019-10-24

0

Entering edit mode

4.5 years ago

jared.andrews07 ★ 16k

You could split the VCF by sample via bcftools view with the --samples option, then use your snpEFF command to get counts for each sample and merge them together by the gene names in R/python/perl to generate your final table.

ADD COMMENT • link 4.5 years ago by jared.andrews07 ★ 16k