most efficient way to read from large vcf files
0
2
Entering edit mode
4.0 years ago
lait ▴ 160

Hi,

if I have a large VCF file (for example, from the 1000 genomes project), what could be the most 'efficient' way (R libraries?) to extract variants from this file that lie in certain genomic regions ?

The way I used to do it with small vcf files is to load them in memory and start digging, but with a 700MB vcf file, what could be a better way?

vcf • 3.2k views
ADD COMMENT
1
Entering edit mode

As you ask for R I added my answer just as a comment. The most common way is to use tabix.

ADD REPLY
0
Entering edit mode

See if this is helpful: https://samtools.github.io/bcftools/howtos/query.html You will need to convert VCF to BCF.

ADD REPLY
0
Entering edit mode

There is kind of a sister project to this that removes the need for that conversion: https://vcftools.github.io/index.html

Tabix is also a good option. Actually, bedtools can do this as well.

ADD REPLY
0
Entering edit mode

Tabix is also a good option. Actually, bedtools can do this as well.

I read this from time to time, but never managed to it with bedtools that fast how tabix can do it. What's the correct way?

fin swimmer

ADD REPLY
0
Entering edit mode

The easiest way is to provide the region(s) you want to extract variants from in a BED file and then use bedtools intersect:

bedtools intersect -a myvariants.vcf -b myregions.bed -header -wa > output.vcf should do the trick. I have no idea if it's as fast as tabix, but it should be pretty quick.

ADD REPLY
0
Entering edit mode

Hello jared.andrews07,

yes I know I can subselect variants with bedtools intersect. But on large files this is slow. tabix build a position based index of the bgzipped vcf and is able to have random access to the positions I need. AFAIK bedtools cannot do this.

A little comparison on the 1000Genomes vcf file:

$ cat test.bed
2   1000000 1000050

$ time (tabix ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -R test.bed)
0,03s user 0,00s system 99% cpu 0,030 total

$ time (bedtools intersect -a ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz -b test.bed -wa)
91,69s user 0,57s system 99% cpu 1:33,14 total

fin swimmer

ADD REPLY
0
Entering edit mode
$ vcf2bed < foo.vcf | bedops -e 1 - regions.bed > answer.bed
ADD REPLY

Login before adding your answer.

Traffic: 1898 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6