Quickly build a dict of variant records from a VCF with pysam
8 months ago
octpus616 ▴ 100

I am trying to match a (very long and unsorted) list of chrom/pos pairs against the records in a VCF file. Because the input position list is not sorted, I am trying to use pysam.VariantFile.fetch() to handle it.

Normally each fetch() call involves a file operation, which looks slow. My data has some clusters: within a cluster, nearby sites usually appear in ascending order (although it cannot be assumed that they are strictly sorted), e.g.:


chr1 100000
chr1 100001
.....
chr1 100100
chr1 10000
chr1 10001
chr1 10003
......
chr1 10010

So I am trying to fetch() a whole buffer of bufferSize positions at once, so that if the next position falls inside the buffer, a file I/O operation can be avoided. To get records out of the buffer easily, I create a dict to store them:

snp = vcf_input.fetch(chrom, pos - 1, pos + bufferSize)
vcf_inbuffer = {snp_record.pos: snp_record for snp_record in snp}
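For context, the buffering idea described above could be wrapped up roughly as follows. This is only a minimal sketch: the class name WindowCache, its attributes, and the n_fetches counter are hypothetical, and fetch is assumed to be any callable with the same (chrom, start, stop) behaviour as vcf_input.fetch, i.e. 0-based half-open coordinates returning records whose .pos is 1-based.

```python
class WindowCache:
    """Keep the records of the most recently fetched window, keyed by position."""

    def __init__(self, fetch, buffer_size=1000):
        self.fetch = fetch            # e.g. vcf_input.fetch (assumption)
        self.buffer_size = buffer_size
        self.chrom = None
        self.lo = 0                   # first 1-based position covered
        self.hi = -1                  # last 1-based position covered
        self.records = {}
        self.n_fetches = 0            # counts how often the file was hit

    def get(self, chrom, pos):
        """Return the record at (chrom, pos), or None if there is none."""
        outside = chrom != self.chrom or not (self.lo <= pos <= self.hi)
        if outside:
            # Cache miss: refill the window starting at the requested position.
            self.n_fetches += 1
            self.chrom, self.lo, self.hi = chrom, pos, pos + self.buffer_size
            # start is 0-based and stop is exclusive, so this covers
            # 1-based positions pos .. pos + buffer_size.
            snp = self.fetch(chrom, pos - 1, self.hi)
            # As in the original comprehension, a later record at the same
            # position replaces an earlier one.
            self.records = {rec.pos: rec for rec in snp}
        return self.records.get(pos)
```

With a real file this would be used as cache = WindowCache(vcf_input.fetch, bufferSize) followed by cache.get(chrom, pos) for each query. Positions inside the current window that have no record return None without another fetch().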

However, I noticed that building the dict, vcf_inbuffer = {snp_record.pos: snp_record for snp_record in snp}, is very slow when bufferSize is set to a large value.

My question is:

1) Is there a faster way to do this?
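One direction that might be worth trying, sketched here under assumptions and not benchmarked: instead of fetching a fixed-size buffer, first group the query list (in its order of occurrence) into runs that stay within a limited span, then fetch only the exact span of each run and keep only the records at queried positions. The names group_runs, lookup and max_span are hypothetical, and fetch is again assumed to behave like vcf_input.fetch (0-based half-open start/stop, records with a 1-based .pos).

```python
def group_runs(queries, max_span=10000):
    """Group consecutive (chrom, pos) queries whose positions stay within max_span.

    Yields (chrom, lo, hi, positions) with lo/hi as 1-based inclusive bounds.
    """
    chrom, lo, hi, positions = None, 0, 0, []
    for q_chrom, q_pos in queries:
        new_lo, new_hi = min(lo, q_pos), max(hi, q_pos)
        if positions and q_chrom == chrom and new_hi - new_lo <= max_span:
            # Query still fits in the current run: widen the run if needed.
            lo, hi = new_lo, new_hi
            positions.append(q_pos)
        else:
            # Different chromosome or too far away: close the run, start anew.
            if positions:
                yield chrom, lo, hi, positions
            chrom, lo, hi, positions = q_chrom, q_pos, q_pos, [q_pos]
    if positions:
        yield chrom, lo, hi, positions


def lookup(fetch, queries, max_span=10000):
    """Yield (chrom, pos, record_or_None) in the original query order."""
    for chrom, lo, hi, positions in group_runs(queries, max_span):
        wanted = set(positions)
        found = {}
        # One fetch per run, covering 1-based positions lo .. hi only.
        for rec in fetch(chrom, lo - 1, hi):
            if rec.pos in wanted:   # skip records nobody asked for
                found[rec.pos] = rec
        for pos in positions:
            yield chrom, pos, found.get(pos)
```

Called as lookup(vcf_input.fetch, pos_list), this issues one fetch() per run and never reads past the last queried position of that run, so a tight cluster does not pay for a large fixed buffer. Whether it is actually faster will depend on how the clusters are distributed in the data.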

pysam python vcf • 326 views