Hi,
I've got a massive bgzipped VCF which is tabix indexed. I just want to know the number of variations in the file. Surely tabix can do this?
Solutions to counting lines that involve gunzip are not useful ;-)
The current version of bcftools (v1.3) stores record-count metadata in the index file when you run bcftools index, and can then report either the total variant count or the number of variants per contig just by reading the index.
The bcftools documentation has the details:
-n, --nrecords
print the number of records based on the CSI or TBI index files
-s, --stats
Print per contig stats based on the CSI or TBI index files. Output format is three tab-delimited columns listing the contig name, contig length (. if unknown) and number of records for the contig. Contigs with zero records are not printed.
e.g.
bcftools index --tbi some.vcf.gz # create the tbi (without --tbi, bcftools index writes a CSI index instead)
bcftools index --nrecords some.vcf.gz # get total variant count
bcftools index --stats some.vcf.gz # get variant count per chromosome
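If the per-contig output of --stats needs to be consumed from a script, a minimal Python sketch could look like the following. It only assumes the three tab-delimited columns described in the documentation excerpt above (contig name, contig length or ".", record count); the function name parse_index_stats and the sample text are made up for illustration.

```python
def parse_index_stats(text):
    """Parse `bcftools index --stats` output into {contig: record_count}.

    Assumes three tab-delimited columns per line:
    contig name, contig length ("." if unknown), number of records.
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate a trailing blank line
        contig, _length, n_records = line.split("\t")
        counts[contig] = int(n_records)
    return counts


# Made-up sample output, not taken from a real file.
example = "chr1\t248956422\t1200\nchr2\t.\t800\n"
per_contig = parse_index_stats(example)
print(per_contig)                 # per-contig counts
print(sum(per_contig.values()))   # total across contigs
```

Summing the per-contig values should agree with what --nrecords reports, since both are read from the same index metadata.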
Heng Li says no, and he should know:
Dear Heng,
We are wondering if there's a fast approach to use tabix or any other hack to get the total number of variants that a VCF has without actually reading the whole vcf file and counting the lines. I assume that the total number of rows is somehow stored in the tbi file.
No, this is not stored in the traditional tabix index. The htslib implementation of tabix should have this information in dummy bins, I think.
Someone else needs to confirm, though.
Heng
Are simple methods like bcftools stats (https://samtools.github.io/bcftools/bcftools.html#stats) or vcf-stats (http://vcftools.sourceforge.net/perl_module.html#vcf-stats) no good?
I need to get this information from Python! Any ideas? I guess this value should be stored somewhere in the file.
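One way to get the number from Python, without parsing the index format yourself, is to shell out to bcftools and read back what --nrecords prints. This is only a sketch: it assumes bcftools is on the PATH and that the index was written by a tool that stores the counts. The function name count_records and the run parameter (there so the call can be swapped out when testing) are hypothetical.

```python
import subprocess


def count_records(vcf_path, run=subprocess.run):
    """Return the record count printed by `bcftools index --nrecords`.

    Assumes bcftools is installed and the CSI/TBI index next to
    `vcf_path` carries the count metadata. With check=True, a non-zero
    exit from bcftools raises subprocess.CalledProcessError.
    """
    completed = run(
        ["bcftools", "index", "--nrecords", vcf_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return int(completed.stdout.strip())


# Usage (needs bcftools and an indexed file, so it is not run here):
# total = count_records("some.vcf.gz")
```

If the index comes from an older tabix that did not store the counts, expect the command to fail rather than return a number; re-indexing with bcftools index should fix that.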
Out of curiosity, how big is this file so that
gunzip -c | wc -l
will be too slow? (But I see your point, one would expect the index to have this information somewhere...)
I have a kind of weird approach: I just convert the VCF file to a plink .bed file, and that gives me the number of variants in the VCF file. I do need to analyze most of my VCF files with plink eventually, so it works for me :)
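For the brute-force route (gunzip -c | wc -l), keep in mind that wc -l also counts the header lines starting with "#", so the raw line count overstates the number of variants. A minimal Python sketch of the same full scan that skips headers is below. It relies on bgzip output being readable as ordinary multi-member gzip, and it still decompresses the entire file, so it will not be faster than the index-based count. The function name count_vcf_records is made up for illustration.

```python
import gzip


def count_vcf_records(path):
    """Count data lines in a (b)gzipped VCF by reading it end to end.

    Header and meta lines start with '#' and are skipped; blank lines
    are ignored. This decompresses the whole file.
    """
    n_records = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                n_records += 1
    return n_records


# Usage (the path is a placeholder):
# print(count_vcf_records("some.vcf.gz"))
```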