Question: Quickly subsetting VCF while being memory efficient
Brice Sarver wrote:

Hi all,

As one of the last steps of a large variant analysis, I would like to identify nocalls and pass them to GATK via --snpmask so that sites which cannot be confidently called are masked. Not terribly tricky.
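For concreteness, the final masking step would look roughly like this (a sketch with GATK 3-era syntax and placeholder file names; as I understand it, --snpmask is an option on FastaAlternateReferenceMaker):

$ java -jar GenomeAnalysisTK.jar -T FastaAlternateReferenceMaker \
      -R reference.fasta -V calls.vcf \
      --snpmask nocall_sites.vcf -o masked_sequences.fasta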

I decided the best way to go about this was to generate calls at all sites using EMIT_ALL_SITES in GATK. I then use a short Perl one-liner to identify sites with nocalls (./.), subset the VCF so that it includes only those sites, and pass the result back to GATK one last time with --snpmask. The whole approach from raw data to sequences, including a number of other steps I won't go into here, is currently implemented in Python.
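The nocall-detection step is essentially the following (a sketch rather than my exact one-liner; all_sites.vcf and nocall_positions.txt are placeholders):

# skip header lines; print CHROM and POS whenever any sample's GT field starts with ./.
$ perl -lane 'next if /^#/; print "$F[0]\t$F[1]" if grep { m{^\./\.} } @F[9..$#F]' all_sites.vcf > nocall_positions.txt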

I was filtering the VCF with VCFtools by passing it the positions file generated above. However, the VCFtools call takes over 100 GB of RAM, and I would like to process multiple samples simultaneously. The CPU time is also substantial; this is one of the most resource-intensive steps, even outpacing indel realignment and variant calling.

My question: is there an alternative way to subset the VCF given these constraints? Should I try creating a BED file from my list of positions (it would be something like chr1 0 1; chr1 1 2 for many of the sites) and letting tabix take a crack at it? Are there other options I ought to consider? It's fine if the approach takes a bit longer, as long as it has a smaller memory footprint.
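For reference, the conversion itself would be the easy part (a sketch; file names are placeholders, and the POS-1 accounts for BED being 0-based and half-open):

$ awk 'BEGIN{OFS="\t"} {print $1, $2-1, $2}' nocall_positions.txt > nocall_mask.bed

Adjacent single-base intervals could then be merged (e.g., with bedops --merge) to keep the mask small.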

Tags: bed, R, python, vcf
RamRS wrote:

If you can create that BED file, tabix has a BED-filter option that is hands down the fastest I have seen: it takes maybe seconds to get things done where VCFtools runs out of memory.

There's a catch, though: a plain tabix region query does not output the VCF header.

Check out this post: Extract Sub-Set Of Regions From Vcf File
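In outline, the workflow looks like this (a sketch assuming an htslib-era tabix, which takes a regions file via -R; older builds used -B instead, and file names are placeholders):

$ bgzip all_sites.vcf                 # tabix requires bgzip compression, not plain gzip
$ tabix -p vcf all_sites.vcf.gz       # build the .tbi index
$ tabix -h -R nocall_mask.bed all_sites.vcf.gz > nocall_sites.vcf

The -h flag prints the header along with the matching records, which sidesteps the header catch mentioned above.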

David W replied:

I recently ran a task similar to the OP's, with the mask saved as BED files. The first time I ran it, I'd forgotten to tabix the files: ~7 hours to run. Once I'd tabixed them, ~1 minute :)

RamRS replied:

Exactly. bgzip, then build the .tbi index, and just like that, bada boom!

Brice Sarver replied:

Thanks! tabix was one of the first things I thought of; I just wasn't sure how it scaled up to whole-genome calls.
David W wrote:

I don't know if it can be used in your particular case, but Genotype Query Tools (GQT; https://github.com/ryanlayer/gqt) was written specifically to handle very large genotype files, with a low memory footprint and speedy queries.

EDIT

Some of GQT's tricks are specifically for multi-sample genotype files, though, so you might not get as much out of it for this application?

Alex Reynolds wrote:

The following BEDOPS call is very memory-efficient and fast:

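# vcf2bed converts the VCF to sorted BED; bedops -e 1 keeps variants
# that overlap the regions of interest by at least 1 bp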
$ vcf2bed < variants.vcf | bedops -e 1 - regions_of_interest.bed > answer.bed
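One caveat: BEDOPS tools expect inputs in sort-bed order. vcf2bed output should already be sorted, but a mask of uncertain provenance can be sorted first (file name is a placeholder):

$ sort-bed unsorted_regions.bed > regions_of_interest.bed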