I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram
I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram
Split your vcf file by sample and count how many times chromosom appear in each file .
FILE=yourfile.vcf
for sample in `bcftools query -l $FILE`
do 
      bcftools view -c1 -H -s $sample -o ${sample}.vcf $FILE
      cat ${sample}.vcf |cut -f1|uniq -c > ${sample}.count
done
You can use a bash command like:
gunzip -c myfile.vcf.gz | grep -v "#" | cut -f1 | uniq -c
Explanation:
gunzip -c myfile.vcf.gz -> decompress the VCFgrep -v "#" -> removes all header linescut -f1 -> keeps only the first column (which is the chromosome id)uniq -c -> counts unique elements in a listThank JC for the prompt reply, but with this commend, I get same number of the variants on each chromosome across the samples. The commend counts all the chromosome sites where a variants is called across the samples which would be same for all the samples. What I want is the number of the variants called on each chromosome in a individual sample.
With plink2:
plink2 --vcf <VCF filename> --out converted
plink2 --pfile converted --sample-counts --chr 1 --out chr1_results
plink2 --pfile converted --sample-counts --chr 2 --out chr2_results
...
The --sample-counts flag provides essentially the same information as bcftools stats -s, except the implementation is ~1000x as efficient.
A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html
$ java -jar bioalcidaejdk.jar  -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));'  otavirus_rf.vcf.gz
RF07    S4:1
RF07    S3:2
RF05    S2:4
RF07    S2:2
RF11    S4:1
RF03    S5:2
RF09    S1:1
RF09    S2:1
RF01    S5:1
RF09    S3:1
RF03    S2:4
RF03    S1:1
RF09    S5:1
RF05    S3:4
RF03    S4:2
RF07    S5:1
RF05    S4:2
RF03    S3:4
RF06    S3:1
RF06    S2:1
RF04    S1:3
RF06    S1:1
RF10    S4:2
RF10    S1:1
RF02    S4:1
details:
stream(). /* get a stream of variant */
    flatMap( 
        V->V.getGenotypes(). /* get the genotypes */
            stream(). /* convert to a stream of genotypes */
            filter(G->G.isCalled() && !G.isHomRef()). /* discard NO_CALL && HOM_REF */
            map(G->V.getContig()+"\t"+G.getSampleName()) /* convert to key=contig+sample */
        ).
    collect( 
        Collectors.groupingBy(Function.identity(), Collectors.counting()) /* count each key */
        ).forEach((K,C)->println(K+" : "+C)); /* print */
Was searching for the same answer and sorta figured out an approach that works for me
Using a combo of bcftools and UNIX you can get the count of alternative positions per indvidual
bcftools query -i'GT="alt"' -f'[%SAMPLE %GT \n]' 1365_filtered.01.vcf.gz | awk '{print $1}' | sort | uniq -c > sample.counts
If you want to get this by chromosome you could include the "-r" parameter when running bcftools, or add '%CHROM\' to the -f parameter and use UNIX to fitler posthoc.
The nice thing about this approach is s you can explictly count the number of het/alt/hom/miss by chaing the "GT=" option.
More on bcftools here.
Please remember that bcftools can be unpredictable with multi-allelic sites. For each bcftools command, one would need to understand how that command treats multi-allelic sites to ensure their results are accurate.
Run bcftools norm (or even better, vt-norm/vt-decompose) before using bcftools.
BCFtools stats with any one of the below two options is what you need..
 -s, --samples <list>               list of samples for sample stats, "-" to include all samples
 -S, --samples-file <file>          file of samples to include
Example:
bcftools stats -s - <multisample VCF file>
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Use vcfstats function from Rtgtools per sample stats