So, I figured this out and it is actually pretty easy, but I thought I would put it here just in case it is useful to anyone.
I realise that the original question did not cover why I wanted to do this. Essentially the tabix index is a (quick) window into the sort order of the vcf, whilst you have the contigs
section, this could have the incorrect sort order - and to generate the tabix index the file would have had to been iterated through, so on the assumption that the tabix index is current it should have the correct sort order.
Anyway, to test, create a dummy vcf file called test.vcf
, this has a different sort order to the contig
section, and also has contig
entries in that section that are not represented in the body of the vcf.
##fileformat=VCFv4.2
##contig=<ID=1,length=248956422>
##contig=<ID=10,length=133797422>
##contig=<ID=11,length=135086622>
##contig=<ID=12,length=133275309>
##contig=<ID=13,length=114364328>
##contig=<ID=14,length=107043718>
##contig=<ID=15,length=101991189>
##contig=<ID=16,length=90338345>
##contig=<ID=17,length=83257441>
##contig=<ID=18,length=80373285>
##contig=<ID=19,length=58617616>
##contig=<ID=2,length=242193529>
##contig=<ID=20,length=64444167>
##contig=<ID=21,length=46709983>
##contig=<ID=22,length=50818468>
##contig=<ID=3,length=198295559>
##contig=<ID=4,length=190214555>
##contig=<ID=5,length=181538259>
##contig=<ID=6,length=170805979>
##contig=<ID=7,length=159345973>
##contig=<ID=8,length=145138636>
##contig=<ID=9,length=138394717>
##contig=<ID=X,length=156040895>
##contig=<ID=Y,length=57227415>
#CHROM POS ID REF ALT QUAL FILTER INFO
1 1000000 . T A . . .
2 2000000 . AAACAACAACAACAACAACAAC A . . .
9 90000000 . CAGAA C . . .
10 10000000 . C CAAA . . .
Now con the command line, compress and index:
# adds the .gz extension
bgzip test.vcf
tabix -p"vcf" test.vcf.gz
See what tabix -l
gives.
tabix -l test.vcf.gz
1
2
9
10
In Python
import pysam
f = pysam.VariantFile("test.vcf.gz")
# This is the equivalent of tabix -l test.vcf.gz
# I am guessing this requires ordered dictionary keys
# which should not be an issue in modern python implementations
list(f.index.keys())
['1', '2', '9', '10']
# Now compare to the way the contigs are represented in the file
list(f.header.contigs.keys())
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'X', 'Y']
This was tested on pysam
v.0.17.0 and Python 3.8.10. Also tabix
version was 1.10.2-3
There are python libraries to handle vcfs. Is pysam necessary for downstream work? otherwise, you can use vcf libraries such as pyvcf.