Question

How to count variants per sample per chromosome in a VCF file?

1

6.8 years ago

nagarsaggi ▴ 40

I want to count the number of variants called on each chromosome for each sample in a multi-sample VCF file. Any help would be really appreciated. Thanks Ram

snp • 17k views

ADD COMMENT • link updated 4.2 years ago by Ram 45k • written 6.8 years ago by nagarsaggi ▴ 40

1

Use the vcfstats command from RTG Tools to get per-sample stats.

ADD REPLY • link 6.8 years ago by cpad0112 21k

4

6.8 years ago

JC 13k

You can use a bash command like:

gunzip -c myfile.vcf.gz | grep -v "^#" | cut -f1 | uniq -c

Explanation:

gunzip -c myfile.vcf.gz -> decompresses the VCF to standard output
grep -v "^#" -> removes all header lines (lines starting with "#")
cut -f1 -> keeps only the first column (the chromosome ID)
uniq -c -> counts consecutive identical lines, which gives one count per chromosome when the VCF is sorted by chromosome

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 6.8 years ago by JC 13k

0

Thanks, JC, for the prompt reply, but with this command I get the same number of variants on each chromosome for every sample. The command counts all the sites on each chromosome where a variant is called in any sample, which is the same for all samples. What I want is the number of variants called on each chromosome in each individual sample.

ADD REPLY • link 6.8 years ago by nagarsaggi ▴ 40

0

You can extract each individual sample with VCFtools, then use the same strategy to count called variants.
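
Note that extracting a single sample does not, by default, drop the sites where that sample is homozygous reference or missing, so counting lines can still give the same number for every sample. Below is a minimal sketch of a count that looks at each genotype instead. It assumes an uncompressed VCF on standard input with GT as the first FORMAT subfield; the function name count_alt_per_sample is made up for illustration and is not part of any tool.

```shell
#!/usr/bin/env bash
# Count non-reference genotype calls per chromosome and per sample.
# Assumption: uncompressed VCF on stdin, GT is the first FORMAT subfield.
# count_alt_per_sample is a hypothetical helper name.
count_alt_per_sample() {
  awk -F'\t' '
    /^##/ { next }                          # skip meta-information lines
    /^#CHROM/ {                             # remember sample names (column 10 onwards)
      for (i = 10; i <= NF; i++) name[i] = $i
      next
    }
    {
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                   # f[1] holds the GT string, e.g. 0/1
        # any non-zero digit means at least one ALT allele;
        # 0/0 and ./. contain none and are skipped
        if (f[1] ~ /[1-9]/) n[$1 "\t" name[i]]++
      }
    }
    END { for (k in n) print k "\t" n[k] }  # chromosome, sample, count
  ' | LC_ALL=C sort
}

# Intended use with the file name from the answer above:
#   gunzip -c myfile.vcf.gz | count_alt_per_sample
```

The output has one line per chromosome and sample pair with at least one non-reference call; pairs with zero calls are simply absent.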

ADD REPLY • link 6.8 years ago by JC 13k

0

Yes, I tried this, but I still get the same number of variants per site for all individual samples.

ADD REPLY • link 6.8 years ago by nagarsaggi ▴ 40

0

"across the samples."

This was not specified in your original question. How are we supposed to know it?

I updated the question.

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 166k

2

5.7 years ago

chrchang523 11k

With plink2:

plink2 --vcf <VCF filename> --out converted
plink2 --pfile converted --sample-counts --chr 1 --out chr1_results
plink2 --pfile converted --sample-counts --chr 2 --out chr2_results
...

The --sample-counts flag provides essentially the same information as bcftools stats -s, except the implementation is ~1000x as efficient.

ADD COMMENT • link 5.7 years ago by chrchang523 11k

0

Thank you for the wonderful response above about using plink2.

I tried your code above, but I got the error that there is no such file converted.psam? How do I get psam file from the first command line plink2 --vcf <VCF filename> --out converted?

ADD REPLY • link updated 4.2 years ago by Ram 45k • written 5.4 years ago by sookwah.yee • 0

0

What happens if you add --make-pgen to the first command line?

ADD REPLY • link updated 4.2 years ago by Ram 45k • written 5.4 years ago by chrchang523 11k

0

This is by far the best approach to this problem. SOOO FAST!

By the way, to know the number of variants, we just have to sum columns "HOM_ALT_SNP_CT" and "HET_SNP_CT", right?
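
For anyone who wants to try that sum, here is a rough sketch that adds the two columns named above for each sample. It assumes the results file is tab-delimited with a header row and the sample ID in the first column; the function name sum_snp_genotypes is made up for illustration. If the VCF also contains indels or other non-SNP variants, these two columns alone may not cover them, so check the header of your own output file.

```shell
#!/usr/bin/env bash
# Sum HOM_ALT_SNP_CT and HET_SNP_CT per sample from a tab-delimited table.
# Assumptions: header row present, sample ID in column 1.
# sum_snp_genotypes is a hypothetical helper name.
sum_snp_genotypes() {
  awk -F'\t' '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i   # map header names to column positions
      if (!("HOM_ALT_SNP_CT" in col) || !("HET_SNP_CT" in col)) exit 1
      a = col["HOM_ALT_SNP_CT"]
      b = col["HET_SNP_CT"]
      next
    }
    { print $1 "\t" ($a + $b) }               # sample ID, summed count
  '
}

# Intended use (the .scount extension is an assumption about the output name):
#   sum_snp_genotypes < chr1_results.scount
```

Looking the columns up by name rather than by position keeps the sketch working if the column order differs between versions.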

ADD REPLY • link 4.2 years ago by Leandro Lima ▴ 970

1

6.8 years ago

Pierre Lindenbaum 166k

A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar bioalcidaejdk.jar  -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));'  otavirus_rf.vcf.gz

RF07    S4:1
RF07    S3:2
RF05    S2:4
RF07    S2:2
RF11    S4:1
RF03    S5:2
RF09    S1:1
RF09    S2:1
RF01    S5:1
RF09    S3:1
RF03    S2:4
RF03    S1:1
RF09    S5:1
RF05    S3:4
RF03    S4:2
RF07    S5:1
RF05    S4:2
RF03    S3:4
RF06    S3:1
RF06    S2:1
RF04    S1:3
RF06    S1:1
RF10    S4:2
RF10    S1:1
RF02    S4:1

details:

stream(). /* get a stream of variants */
    flatMap(
        V->V.getGenotypes(). /* get the genotypes */
            stream(). /* convert to a stream of genotypes */
            filter(G->G.isCalled() && !G.isHomRef()). /* discard NO_CALL and HOM_REF */
            map(G->V.getContig()+"\t"+G.getSampleName()) /* convert to key=contig+sample */
        ).
    collect(
        Collectors.groupingBy(Function.identity(), Collectors.counting()) /* count each key */
        ).forEach((K,C)->println(K+":"+C)); /* print */

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 166k

1

5.7 years ago

morgan6.claire ▴ 10

Was searching for the same answer and sorta figured out an approach that works for me

Using a combo of bcftools and UNIX tools, you can get the count of alternative positions per individual:

bcftools query -i'GT="alt"' -f'[%SAMPLE %GT \n]' 1365_filtered.01.vcf.gz | awk '{print $1}' | sort | uniq -c > sample.counts

If you want to get this by chromosome, you could include the "-r" parameter when running bcftools, or add '%CHROM' to the -f parameter and use UNIX tools to filter post hoc.

The nice thing about this approach is that you can explicitly count the number of het/alt/hom/miss genotypes by changing the "GT=" option.
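
A sketch of the %CHROM variant, in case it helps. It assumes bcftools prints one "CHROM<TAB>SAMPLE" line per matching genotype when the format string is '[%CHROM\t%SAMPLE\n]', and that sample names contain no whitespace; the function name tally_chrom_sample is made up for illustration.

```shell
#!/usr/bin/env bash
# Tally "CHROM<TAB>SAMPLE" lines into per-chromosome, per-sample counts.
# Assumption: sample names contain no whitespace.
# tally_chrom_sample is a hypothetical helper name.
tally_chrom_sample() {
  LC_ALL=C sort |                           # group identical chromosome/sample pairs
    uniq -c |                               # prefix each distinct pair with its count
    awk '{ print $2 "\t" $3 "\t" $1 }'      # reorder to: chromosome, sample, count
}

# Intended use (requires bcftools; not run here):
#   bcftools query -i'GT="alt"' -f'[%CHROM\t%SAMPLE\n]' 1365_filtered.01.vcf.gz | tally_chrom_sample > sample.chrom.counts
```

Sorting before uniq -c matters here because uniq only merges adjacent identical lines.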