I have several datasets in VCF format (about 1 GB per file) and I want to extract SNP information for each sample. I usually use Python, so I tried PyVCF, but it is slow on such large datasets. Are there better ways to do this, especially in Python? Thanks in advance
A trick is to compress the VCF file with bgzip and index it with tabix. Both tools are available from the tabix home page.
See also this page for more documentation. Follow the instructions there:
bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz
I am not sure whether PyVCF supports compressed and indexed files; according to the source code, it seems so.
EDIT: yes, PyVCF does support compressed and indexed VCF files. See this example from the PyVCF documentation:
>>> import vcf
>>> vcf_reader = vcf.Reader(filename='vcf/test/tb.vcf.gz')
>>> # fetch all records on chromosome 20 from base 1110696 through 1230237
>>> for record in vcf_reader.fetch('20', 1110695, 1230237):
...     print(record)
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
I don't know whether the following applies to PyVCF, but in general, writing your own task-specific parser is much faster than using a library for a complex format such as VCF. The reason is that libraries tend to parse everything, while frequently you only need a few fields, so they waste processing time. For collecting basic statistics from a VCF, I can write a script faster than the htslib parser in C (I wrote its initial version). As another example, samblaster is several times faster than the samtools/htslib parser because it only parses a few fields in SAM. It is possible to write libraries that avoid this overhead through lazy evaluation, but it is non-trivial.
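As a rough sketch of this idea (plain Python, no PyVCF; the function name and the SNP filter are my own choices, not from the original post): split each line on tabs, keep only the first few columns, and leave the rest of the record unparsed.

```python
import gzip

def iter_snps(path):
    """Yield (chrom, pos, ref, alt) for SNP records only.

    Skips header lines and never parses INFO, FORMAT, or
    per-sample columns, which is where most of the time goes.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            # Split off only the first 5 columns; the remainder
            # of the line stays as one unparsed string.
            chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
            # Keep SNPs: single-base REF and single-base ALT alleles.
            if len(ref) == 1 and all(len(a) == 1 for a in alt.split(",")):
                yield chrom, int(pos), ref, alt
```

Usage would look like `for chrom, pos, ref, alt in iter_snps("my_file.vcf.gz"): ...`. Note that bgzip output is readable by `gzip.open`, so the same function works on the compressed file.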
A second approach, as others have said, is to index the file with tabix and process each chromosome separately. You don't need PyVCF or to split the file by chromosome manually; just make your script read data from stdin:
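A minimal sketch of such a script (the script name and the SNP-counting task are illustrative, not from the original answer): it consumes whatever VCF records arrive on stdin, so tabix does the region extraction and you can run one process per chromosome in parallel.

```python
import sys

def count_snps(stream):
    """Count SNP records read from a stream (e.g. piped from tabix)."""
    n = 0
    for line in stream:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t", 5)
        ref, alt = fields[3], fields[4]
        # Count only single-base REF/ALT records.
        if len(ref) == 1 and all(len(a) == 1 for a in alt.split(",")):
            n += 1
    return n

if __name__ == "__main__":
    print(count_snps(sys.stdin))
```

You would then run, for example, `tabix my_file.vcf.gz 20 | python count_snps.py`, once per chromosome.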