Get number of variations in a huge VCF using Tabix?
8.4 years ago
Dan ▴ 530

Hi,

I've got a massive bgzipped VCF which is tabix indexed. I just want to know the number of variations in the file. Surely tabix can do this?

Solutions to counting lines that involve gunzip are not useful ;-)

vcf tabix big-data variation • 11k views

Out of curiosity, how big is this file that gunzip -c | wc -l would be too slow? (But I see your point: one would expect the index to have this information somewhere...)


I have a kind of weird approach: I just convert the VCF file to a plink .bed file, and plink reports the number of variants in the VCF. I need to analyze most of my VCF files with plink eventually anyway, so it works for me :)

7.6 years ago

The current version of bcftools (v1.3) stores metadata in the index file when you run bcftools index. You can then get either the total variant count or the number of variants per contig just by reading the index, without scanning the VCF itself.

The bcftools documentation has the details:

-n, --nrecords
print the number of records based on the CSI or TBI index files
-s, --stats
Print per contig stats based on the CSI or TBI index files. Output format is three tab-delimited columns listing the contig name, contig length (. if unknown) and number of records for the contig. Contigs with zero records are not printed.

e.g.

bcftools index some.vcf.gz # create the tbi
bcftools index --nrecords some.vcf.gz # get total variant count
bcftools index --stats some.vcf.gz # get variant count per chromosome
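
If you want the per-contig numbers in a script, the --stats output described above (three tab-delimited columns: contig name, contig length or ".", record count) is easy to parse. Here is a minimal Python sketch. The function name parse_index_stats and the sample text are made up for illustration; the sample is not output from a real file.

```python
from typing import Dict


def parse_index_stats(text: str) -> Dict[str, int]:
    """Map contig name -> record count from `bcftools index --stats` output."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Columns: contig name, contig length ("." if unknown), record count
        contig, _length, n_records = line.split("\t")
        counts[contig] = int(n_records)
    return counts


# Illustrative input only (hypothetical contigs and counts)
example = "chr1\t248956422\t1200\nchr2\t.\t800\n"
per_contig = parse_index_stats(example)
total = sum(per_contig.values())
```

Summing the per-contig values should give the same total as --nrecords, since contigs with zero records are simply not listed.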

This is what I am looking for. Thanks :)


This works only if you want to count ALL the records in your VCF. If you want to count records that pass a certain filter, you either have to extract them to a separate VCF and index that, or use a different method (which I don't claim to have, given the constraints of the OP).
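
For completeness, the "different method" for filtered counts is usually just a streaming pass over the file. This does decompress everything, so it does not meet the OP's constraint, but it needs no intermediate VCF. A minimal Python sketch, assuming a standard tab-delimited VCF where column 7 is FILTER; count_matching_records and is_pass are hypothetical helper names:

```python
import gzip
from typing import Callable, List


def count_matching_records(path: str, keep: Callable[[List[str]], bool]) -> int:
    """Count VCF data lines whose tab-split fields satisfy `keep`."""
    n = 0
    # bgzip output is a series of gzip members, which the gzip module can read
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # meta-information and column header lines
            if keep(line.rstrip("\n").split("\t")):
                n += 1
    return n


def is_pass(fields: List[str]) -> bool:
    # Column 7 (index 6) of a VCF data line is FILTER
    return len(fields) > 6 and fields[6] == "PASS"
```

Usage would be something like count_matching_records("some.vcf.gz", is_pass). Any other predicate on the split fields works the same way.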

8.4 years ago

Heng Li says no, and he should know:

Dear Heng,

We are wondering if there's a fast approach to use tabix or any other hack to get the total number of variants that a VCF has without actually reading the whole vcf file and counting the lines. I assume that the total number of rows is somehow stored in the tbi file.


No, this is not stored in the traditional tabix index. The htslib implementation of tabix should have this information in dummy bins, I think.
Someone else needs to confirm, though.

Heng


What is meant by the dummy bin? I do not see it in the tabix spec. Maybe I should rethink the problem and use something like indexcov, but I thought I'd ask just for curiosity's sake.
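
A possible reading of "dummy bin", offered as an assumption rather than something confirmed in this thread: the BAI section of the SAM specification describes a pseudo-bin numbered 37450 whose second "chunk" holds two counts (n_mapped, n_unmapped) instead of file offsets. If htslib writes the same pseudo-bin into .tbi files, the first of those counts would be the number of records on that contig. The Python sketch below walks the TBI layout from the tabix spec and pulls out that value. The bin number, the meaning of the second chunk, and the function name tbi_record_counts are all assumptions here, so cross-check the result against bcftools index --nrecords before trusting it.

```python
import gzip
import struct
from typing import BinaryIO, Dict, Optional

META_BIN = 37450  # assumed pseudo-bin number (BAI convention in the SAM spec)


def _read(handle: BinaryIO, fmt: str):
    """Read and unpack one little-endian struct from the handle."""
    size = struct.calcsize(fmt)
    data = handle.read(size)
    if len(data) != size:
        raise ValueError("truncated index")
    return struct.unpack(fmt, data)


def tbi_record_counts(path: str) -> Dict[str, Optional[int]]:
    """Return contig -> record count, or None where no pseudo-bin is found."""
    with gzip.open(path, "rb") as fh:  # a .tbi file is BGZF (gzip) compressed
        magic, n_ref = _read(fh, "<4si")
        if magic != b"TBI\x01":
            raise ValueError("not a tabix index")
        # Header fields: format, col_seq, col_beg, col_end, meta, skip, l_nm
        *_, l_nm = _read(fh, "<7i")
        names = fh.read(l_nm).split(b"\x00")[:n_ref]
        counts: Dict[str, Optional[int]] = {}
        for name in names:
            count = None
            (n_bin,) = _read(fh, "<i")
            for _ in range(n_bin):
                bin_id, n_chunk = _read(fh, "<Ii")
                chunks = [_read(fh, "<QQ") for _ in range(n_chunk)]
                if bin_id == META_BIN and n_chunk == 2:
                    # Assumed layout: second chunk is (n_mapped, n_unmapped)
                    count = chunks[1][0]
            (n_intv,) = _read(fh, "<i")
            fh.read(8 * n_intv)  # skip the linear index
            counts[name.decode()] = count
    return counts
```

An index written by the original tabix program may have no such bin, in which case every value comes back as None. That would fit Heng's remark that the traditional index does not store the count.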

8.4 years ago
mkulecka ▴ 360

Are simple methods like bcftools stats (https://samtools.github.io/bcftools/bcftools.html#stats) or vcf-stats (http://vcftools.sourceforge.net/perl_module.html#vcf-stats) no good?

3.4 years ago
sacha ★ 2.4k

I need to get this information from Python! Any ideas? I guess this value should be somewhere in the file!
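
One low-tech option from Python is to call the bcftools index --nrecords command mentioned earlier in this thread and parse its output. A minimal sketch, assuming bcftools is on the PATH, the file is already indexed, and the command prints a single integer; count_records and _run_bcftools are hypothetical names, and the run parameter exists only so the command can be swapped out (for example in tests):

```python
import subprocess
from typing import Callable, List


def _run_bcftools(args: List[str]) -> str:
    """Run bcftools with the given arguments and return its stdout."""
    result = subprocess.run(
        ["bcftools", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def count_records(
    vcf_path: str, run: Callable[[List[str]], str] = _run_bcftools
) -> int:
    """Total record count as reported by `bcftools index --nrecords`."""
    output = run(["index", "--nrecords", vcf_path])
    # Assumes the command prints one integer followed by a newline
    return int(output.strip())
```

Calling count_records("some.vcf.gz") raises subprocess.CalledProcessError if bcftools exits with an error, for instance when the index is missing.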
