I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram
I want to count number of the variants called on each chromosome for each sample from a multi sample vcf file. Any help would be really appropriated. Thanks Ram
Split your vcf file by sample and count how many times chromosom appear in each file .
FILE=yourfile.vcf
for sample in `bcftools query -l $FILE`
do
bcftools view -c1 -H -s $sample -o ${sample}.vcf $FILE
cat ${sample}.vcf |cut -f1|uniq -c > ${sample}.count
done
You can use a bash command like:
gunzip -c myfile.vcf.gz | grep -v "#" | cut -f1 | uniq -c
Explanation:
gunzip -c myfile.vcf.gz
-> decompress the VCFgrep -v "#"
-> removes all header linescut -f1
-> keeps only the first column (which is the chromosome id)uniq -c
-> counts unique elements in a listThank JC for the prompt reply, but with this commend, I get same number of the variants on each chromosome across the samples. The commend counts all the chromosome sites where a variants is called across the samples which would be same for all the samples. What I want is the number of the variants called on each chromosome in a individual sample.
With plink2:
plink2 --vcf <VCF filename> --out converted
plink2 --pfile converted --sample-counts --chr 1 --out chr1_results
plink2 --pfile converted --sample-counts --chr 2 --out chr2_results
...
The --sample-counts flag provides essentially the same information as bcftools stats -s
, except the implementation is ~1000x as efficient.
A one liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html
$ java -jar bioalcidaejdk.jar -e 'stream().flatMap(V->V.getGenotypes().stream().filter(G->G.isCalled()&&!G.isHomRef()).map(G->V.getContig()+"\t"+G.getSampleName())).collect(Collectors.groupingBy(Function.identity(),Collectors.counting())).forEach((K,C)->println(K+":"+C));' otavirus_rf.vcf.gz
RF07 S4:1
RF07 S3:2
RF05 S2:4
RF07 S2:2
RF11 S4:1
RF03 S5:2
RF09 S1:1
RF09 S2:1
RF01 S5:1
RF09 S3:1
RF03 S2:4
RF03 S1:1
RF09 S5:1
RF05 S3:4
RF03 S4:2
RF07 S5:1
RF05 S4:2
RF03 S3:4
RF06 S3:1
RF06 S2:1
RF04 S1:3
RF06 S1:1
RF10 S4:2
RF10 S1:1
RF02 S4:1
details:
stream(). /* get a stream of variant */
flatMap(
V->V.getGenotypes(). /* get the genotypes */
stream(). /* convert to a stream of genotypes */
filter(G->G.isCalled() && !G.isHomRef()). /* discard NO_CALL && HOM_REF */
map(G->V.getContig()+"\t"+G.getSampleName()) /* convert to key=contig+sample */
).
collect(
Collectors.groupingBy(Function.identity(), Collectors.counting()) /* count each key */
).forEach((K,C)->println(K+" : "+C)); /* print */
Was searching for the same answer and sorta figured out an approach that works for me
Using a combo of bcftools and UNIX you can get the count of alternative positions per indvidual
bcftools query -i'GT="alt"' -f'[%SAMPLE %GT \n]' 1365_filtered.01.vcf.gz | awk '{print $1}' | sort | uniq -c > sample.counts
If you want to get this by chromosome you could include the "-r" parameter when running bcftools, or add '%CHROM\' to the -f parameter and use UNIX to fitler posthoc.
The nice thing about this approach is s you can explictly count the number of het/alt/hom/miss by chaing the "GT=" option.
More on bcftools here.
Please remember that bcftools can be unpredictable with multi-allelic sites. For each bcftools command, one would need to understand how that command treats multi-allelic sites to ensure their results are accurate.
Run bcftools norm
(or even better, vt-norm
/vt-decompose
) before using bcftools.
BCFtools stats with any one of the below two options is what you need..
-s, --samples <list> list of samples for sample stats, "-" to include all samples
-S, --samples-file <file> file of samples to include
Example:
bcftools stats -s - <multisample VCF file>
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Use vcfstats function from Rtgtools per sample stats