fast way to get last position in a large indexed VCF?
1
2
Entering edit mode
20 months ago
curious ▴ 750

I want to get first and last position in a VCF for chunking purposes

First position is easy to get:

bcftools query -f '%CHROM\t%POS\n' my_vcf  | head -n 1

I realize I could stream the whole vcf to find last position, but I was hoping there was some faster way to get it.

The only thing I can think of is some kind of binary search like approach, but there has to be a better way

bcftools • 2.2k views
ADD COMMENT
0
Entering edit mode

Maybe I didn't understand the question, but if the VCF is sorted, why not just use tail?

ADD REPLY
1
Entering edit mode

getting to the tail of a 10GB compressed file will take quite a bit and means unpacking the entire file.

As for the OP, perhaps reading the index file with a custom program might be able to tell you the last coordinate then you access that directly.

ADD REPLY
0
Entering edit mode

Do you any idea how to do this, I looked for this approach even before positing this. I cant cat or zcat the index and get anything readable and I cant find any helpful info with 'bcftools index' or 'tabix' options

ADD REPLY
0
Entering edit mode

The descriptions for most formats can be found here:

https://github.com/samtools/hts-specs

For example, the default indexing for VCF is CSI, I believe.

ADD REPLY
0
Entering edit mode

when poking around this way I found this, which is already doing my chunking to boot:

from sgkit.io.vcf import partition_into_regions
partition_into_regions("CEUTrio.20.21.gatk3.4.g.vcf.bgz", num_parts=10)

['20:1-10108928', '20:10108929-10207232', '20:10207233-', '21:1-10027008', '21:10027009-10043392', '21:10043393-10108928', '21:10108929-10141696', '21:10141697-10174464', '21:10174465-10190848', '21:10190849-10207232', '21:10207233-']

https://pystatgen.github.io/sgkit/latest/vcf.html#partitioning

ADD REPLY
0
Entering edit mode

I have been playing with this problem a bit in Python, and I concluded that you would need a bgzip aware solution for it to work. It does not seem like sgkit is such a tool though.

When using the builtin gzip stream Python will proceed to uncompress the entire stream thus, even if you figured out the correct offset it won't be able to jump to that location without unzipping everything to reach the end.

Alas, I was unable to locate a bgzip library in Python that also allows seeking a file to an offset.

ADD REPLY
0
Entering edit mode

That's an interesting question but I'm not sure a tabix index is relevant to what position happens to be at the end of the file. Unless it's position sorted, the vcf could end with chr1:123, for instance, instead of chrM:16569. The tabix index doesn't really care.

ADD REPLY
1
Entering edit mode

for tabix to work, the file has to be position sorted, and grouped by chromosome (usually we sort by both):

(grep "^#" in.gff; grep -v "^#" in.gff | sort -t"`printf '\t'`" -k1,1 -k4,4n) | bgzip > sorted.gff.gz;

tabix -p gff sorted.gff.gz;

tabix sorted.gff.gz chr1:10,000,000-20,000,000;

as seen in: http://www.htslib.org/doc/tabix.html

ADD REPLY
1
Entering edit mode
19 months ago

If the file is uncompressed (or if you find a bgzip compatible stream.seek for your programming language) you can jump near the end of the file quite efficiently:

import os

fname = "subset.vcf"
stream = open(fname)
size = os.path.getsize(fname)

# Move back a few kb
shift = 10 * 1024
stream.seek(size-shift)

for line in stream:
    pass

print(line.split())

it will be pretty fast:

 time python wind.py
['NC_000001.10', '25202832', 'rs1640276932', 'TGGGGTGGGGCGGGGCGGGGAGGGGC', 'T', '.', '.', 'RS=1640276932;SSR=0;VC=INDEL;GNO;FREQ=dbGaP_PopFreq:1,0']

real    0m0.043s
user    0m0.016s
sys     0m0.016s

versus streaming the whole file:

time cat subset.vcf| tail -1
NC_000001.10    25202832        rs1640276932    TGGGGTGGGGCGGGGCGGGGAGGGGC      T       .       .       RS=1640276932;SSR=0;VC=INDEL;GNO;FREQ=dbGaP_PopFreq:1,0

real    0m1.492s
user    0m0.313s
sys     0m1.578s
ADD COMMENT
0
Entering edit mode

How does running bctools query -f'%CHROM %POS\n' | tail -n1 compare?

ADD REPLY
0
Entering edit mode
time bcftools query -f'%CHROM %POS\n' subset.vcf | tail -n1
NC_000001.10 25202832

real    0m29.221s
user    0m6.281s
sys     0m23.922s

for the record, the file has 9 million variants

cat subset.vcf | wc -l
9199371
ADD REPLY

Login before adding your answer.

Traffic: 2748 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6