Question

How to parse a large VCF file?

1

9.9 years ago

Luyi Tian ▴ 120

I have several datasets in VCF format (about 1 GB per file) and I want to extract SNP information for each sample. I usually use Python, so I tried pyvcf, but it is slow on datasets this large. Are there better ways to do this, especially in Python? Thanks in advance.

python vcf • 11k views

updated 9.9 years ago by lh3 33k • written 9.9 years ago by Luyi Tian ▴ 120

4

9.9 years ago

lh3 33k

I don't know whether the following applies to pyvcf, but in general, writing your own task-specific parser is much faster than using a library for a complex format such as VCF. Libraries tend to parse every field, while you frequently need only a few, so much of their processing time is wasted. For collecting basic statistics from a VCF, I can write a script that is faster than the htslib parser in C (I wrote the initial version). Another example: samblaster is several times faster than the samtools/htslib parser because it parses only a few fields in SAM. It is possible to write libraries that use lazy evaluation, but doing so is non-trivial.
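
To illustrate the task-specific approach, here is a minimal sketch in Python, assuming the only goal is to list SNP sites. The name `iter_snps` and the working definition of a SNP (single-base REF, single-base A/C/G/T ALT allele) are assumptions made for this example, not part of any library. It splits off only the first five columns, so the INFO and per-sample columns are never parsed.

```python
def iter_snps(lines):
    """Yield (chrom, pos, ref, alt) for single-base substitutions only.

    Hypothetical helper: `lines` is any iterable of VCF text lines.
    """
    for line in lines:
        # Skip blank lines, meta-information lines and the column header.
        if not line.strip() or line.startswith("#"):
            continue
        # Split off only the first five columns; everything from QUAL
        # onwards (INFO, FORMAT, samples) stays unparsed in `_rest`.
        chrom, pos, _id, ref, alt, _rest = line.split("\t", 5)
        if len(ref) != 1:
            continue  # deletion or multi-base REF, not a SNP
        for allele in alt.split(","):
            # Assumption: a SNP allele is exactly one of A, C, G, T.
            if len(allele) == 1 and allele in "ACGT":
                yield chrom, int(pos), ref, allele
```

It could be fed any iterable of lines, for example an open file handle or `sys.stdin`.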

A second approach, as others said, is to index with tabix and process each chromosome separately. You don't need pyvcf, nor do you need to split by chromosome manually. Just make your script read data from stdin:

tabix -h your.vcf.gz chr2 | your-script.py > chr2.out
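
For illustration, a script on the receiving end of that pipe might look like the sketch below. `count_alt_calls` and `report` are hypothetical names, and counting non-reference genotype calls per sample is only a stand-in for whatever per-sample extraction you actually need. It relies on the `#CHROM` header line that `tabix -h` passes through to learn the sample names.

```python
import sys


def count_alt_calls(stream):
    """Return {sample: number of records with a non-reference genotype}."""
    samples = []
    counts = {}
    for line in stream:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            counts = dict.fromkeys(samples, 0)
            continue
        if len(fields) < 10:
            continue  # no genotype columns on this line
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        for name, call in zip(samples, fields[9:]):
            genotype = call.split(":")[gt_idx]
            alleles = genotype.replace("|", "/").split("/")
            # "0" is the reference allele and "." a missing call.
            if any(a not in ("0", ".") for a in alleles):
                counts[name] += 1
    return counts


def report(stream, out):
    """Write one 'sample<TAB>count' line per sample."""
    for name, n in count_alt_calls(stream).items():
        out.write(f"{name}\t{n}\n")
```

Ending the file with `report(sys.stdin, sys.stdout)` would turn it into the kind of filter used in the command above.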

updated 4.3 years ago by Ram 43k • written 9.9 years ago by lh3 33k

1

9.9 years ago

Alex Reynolds 35k

Perhaps split your VCF file by chromosome and run the extraction tasks in parallel, one per chromosome.
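
One way to sketch this in Python, assuming the file is already bgzip-compressed and tabix-indexed so that `tabix` can do the per-chromosome split. `run_per_chromosome` and `tabix_pipeline` are hypothetical helpers, and `your-script.py` stands for any script that reads VCF from stdin. Threads are enough for the fan-out here because the real work happens in the child processes.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def tabix_pipeline(vcf_gz, script, chrom):
    """Run `tabix -h vcf_gz chrom | python script > chrom.out`.

    Returns the exit code of the script. Assumes tabix is on PATH.
    """
    with open(f"{chrom}.out", "wb") as out:
        tabix = subprocess.Popen(
            ["tabix", "-h", vcf_gz, chrom], stdout=subprocess.PIPE
        )
        worker = subprocess.Popen(
            [sys.executable, script], stdin=tabix.stdout, stdout=out
        )
        # Close our copy so tabix sees a broken pipe if the script exits early.
        tabix.stdout.close()
        worker.wait()
        tabix.wait()
    return worker.returncode


def run_per_chromosome(task, chroms, workers=4):
    """Call task(chrom) for each chromosome concurrently; return {chrom: result}."""
    chroms = list(chroms)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(chroms, pool.map(task, chroms)))


# Example call (needs tabix on PATH and an indexed your.vcf.gz):
#   from functools import partial
#   run_per_chromosome(
#       partial(tabix_pipeline, "your.vcf.gz", "your-script.py"),
#       ["chr1", "chr2"],
#   )
```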

Ram · Accepted Answer · 2014-05-23

A trick is to compress the VCF file with bgzip, and index it with tabix. Both these tools are downloadable from the tabix home page.

See also this page for more documentation. Follow the instructions there:

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
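
As a side note, bgzip output is still readable as ordinary gzip, so Python's standard `gzip` module can stream the compressed file without unpacking it to disk. A minimal sketch, where `open_vcf` and `count_records` are hypothetical helpers; a sequential scan like this does not use the tabix index.

```python
import gzip


def open_vcf(path):
    """Open a plain or bgzip/gzip-compressed VCF as a text stream."""
    if path.endswith(".gz"):
        # "rt" decompresses on the fly and decodes bytes to text.
        return gzip.open(path, "rt")
    return open(path, "r")


def count_records(path):
    """Count data lines, skipping meta-information and header lines."""
    with open_vcf(path) as handle:
        return sum(1 for line in handle if not line.startswith("#"))
```

For example, `count_records("my_file.vcf.gz")` would scan the file produced by the commands above.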

I am not sure if pyvcf supports compressed and indexed files. According to the source code, it seems so.

EDIT: yes, it seems that pyvcf supports compressed and indexed VCF files. See the example in pyvcf documentation:

>>> import vcf
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> # fetch all records on chromosome 20 from base 1110696 through 1230237
>>> for record in vcf_reader.fetch('20', 1110695, 1230237):
...     print(record)
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])