most efficient way to read from large vcf files
0
2
Entering edit mode
4.0 years ago
lait ▴ 160

Hi,

if I have a large VCF file (for example, from the 1000 genomes project), what could be the most 'efficient' way (R libraries?) to extract variants from this file that lie in certain genomic regions ?

The way I used to do it with small vcf files is to load them in memory and start digging, but with a 700MB vcf file, what could be a better way?

vcf • 3.2k views
1
Entering edit mode

As you ask for R I added my answer just as a comment. The most common way is to use tabix.

0
Entering edit mode

See if this is helpful: https://samtools.github.io/bcftools/howtos/query.html You will need to convert VCF to BCF.

0
Entering edit mode

There is kind of a sister project to this that removes the need for that conversion: https://vcftools.github.io/index.html

Tabix is also a good option. Actually, bedtools can do this as well.

0
Entering edit mode

Tabix is also a good option. Actually, bedtools can do this as well.

I read this from time to time, but never managed to it with bedtools that fast how tabix can do it. What's the correct way?

fin swimmer

0
Entering edit mode

The easiest way is to provide the region(s) you want to extract variants from in a BED file and then use bedtools intersect:

bedtools intersect -a myvariants.vcf -b myregions.bed -header -wa > output.vcf should do the trick. I have no idea if it's as fast as tabix, but it should be pretty quick.

0
Entering edit mode

Hello jared.andrews07,

yes I know I can subselect variants with bedtools intersect. But on large files this is slow. tabix build a position based index of the bgzipped vcf and is able to have random access to the positions I need. AFAIK bedtools cannot do this.

A little comparison on the 1000Genomes vcf file:

$cat test.bed 2 1000000 1000050$ time (tabix ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -R test.bed)
0,03s user 0,00s system 99% cpu 0,030 total

$time (bedtools intersect -a ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -b test.bed -wa) 91,69s user 0,57s system 99% cpu 1:33,14 total fin swimmer ADD REPLY 0 Entering edit mode$ vcf2bed < foo.vcf | bedops -e 1 - regions.bed > answer.bed