tabix --list-chroms
1
0
Entering edit mode
2.1 years ago
finster ▴ 90

Hi,

Does anyone know of a way of doing the equivalent of tabix -l <VCF_FILE> with pysam? I could not find any info in the documentation - only index creation and querying indexed files. If this is not possible with pysam are there any other libraries that could do it?

I have looked at tabixpy but could not get it to work.

Thanks

pysam tabix • 778 views
ADD COMMENT
1
Entering edit mode
2.1 years ago
finster ▴ 90

So, I figured this out and it is actually pretty easy, but I thought I would put it here just in case it is useful to anyone.

I realise that the original question did not cover why I wanted to do this. Essentially the tabix index is a (quick) window into the sort order of the vcf, whilst you have the contigs section, this could have the incorrect sort order - and to generate the tabix index the file would have had to been iterated through, so on the assumption that the tabix index is current it should have the correct sort order.

Anyway, to test, create a dummy vcf file called test.vcf, this has a different sort order to the contig section, and also has contig entries in that section that are not represented in the body of the vcf.

##fileformat=VCFv4.2
##contig=<ID=1,length=248956422>
##contig=<ID=10,length=133797422>
##contig=<ID=11,length=135086622>
##contig=<ID=12,length=133275309>
##contig=<ID=13,length=114364328>
##contig=<ID=14,length=107043718>
##contig=<ID=15,length=101991189>
##contig=<ID=16,length=90338345>
##contig=<ID=17,length=83257441>
##contig=<ID=18,length=80373285>
##contig=<ID=19,length=58617616>
##contig=<ID=2,length=242193529>
##contig=<ID=20,length=64444167>
##contig=<ID=21,length=46709983>
##contig=<ID=22,length=50818468>
##contig=<ID=3,length=198295559>
##contig=<ID=4,length=190214555>
##contig=<ID=5,length=181538259>
##contig=<ID=6,length=170805979>
##contig=<ID=7,length=159345973>
##contig=<ID=8,length=145138636>
##contig=<ID=9,length=138394717>
##contig=<ID=X,length=156040895>
##contig=<ID=Y,length=57227415>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   1000000 .   T   A   .   .   .   
2   2000000 .   AAACAACAACAACAACAACAAC  A   .   .   .
9   90000000    .   CAGAA   C   .   .   .
10  10000000    .   C   CAAA    .   .   .

Now con the command line, compress and index:

# adds the .gz extension
bgzip test.vcf
tabix -p"vcf" test.vcf.gz

See what tabix -l gives.

tabix -l test.vcf.gz
1
2
9
10

In Python

import pysam

f = pysam.VariantFile("test.vcf.gz")

# This is the equivalent of tabix -l test.vcf.gz
# I am guessing this requires ordered dictionary keys
# which should not be an issue in modern python implementations
list(f.index.keys())
['1', '2', '9', '10']

# Now compare to the way the contigs are represented in the file
list(f.header.contigs.keys())
['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9', 'X', 'Y']

This was tested on pysam v.0.17.0 and Python 3.8.10. Also tabix version was 1.10.2-3

ADD COMMENT
0
Entering edit mode

There are python libraries to handle vcfs. Is pysam necessary for downstream work? otherwise, you can use vcf libraries such as pyvcf.

ADD REPLY

Login before adding your answer.

Traffic: 1487 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6