bedops vcf2bed eats up all allocated memory
3
0
Entering edit mode
3.6 years ago
Ram 34k

Hello,

I'm trying to run bedops vcf2bed on a huge (27G) VCF file. I'd like to extract a BED of the deletions in the VCF.

My usage:

vcf2bed --max-mem=4G --deletions <file.vcf >file.bed

I'm running it on a cluster and I tried giving the job 8G and 16G of RAM, but it maxes out all available RAM within 100-150 seconds and the job quits. I know splitting the VCF into per-chromosome chunks might solve this, but is there any other option I could use?

I am open to using other tools to extract the BED as well, given that they ensure end-start equals length(REF).


EDIT:

I combined a couple of ideas from the recommendations below to get to my solution - YMMV, and not all of these modifications may be necessary. I piped the file in using GNU parallel's --pipe and also used --do-not-sort to get an unsorted BED file.

cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort >huge_file.unsorted.bed
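As a sanity check on the requirement that end - start equals length(REF), a quick awk pass can count violations. This assumes vcf2bed's default output layout (chrom, start, end, id, qual, ref, ...), i.e. REF in column 6:

```shell
# Print and count BED records where end - start != length(REF).
# Treating column 6 as REF is an assumption based on vcf2bed's
# default output order; adjust if your columns differ.
awk -F'\t' 'length($6) != $3 - $2' huge_file.unsorted.bed | wc -l
```

A result of 0 means every record satisfies the constraint.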

--
Ram

bedops vcf2bed hpc • 1.6k views
ADD COMMENT
0
Entering edit mode

Sounds like giving it 72G of RAM might work, Ram.

...Sorry, couldn't pass it up.

Seriously though, the vcf2bed and convert2bed commands don't have a ton of options. Alex could probably give you a solution though, dude's a wizard. Alternatively, you could filter on the variant type in the VCF to create a file of only the INDELs, which would probably cut your file down to only a few gigs. vcf2bed should be able to handle that file easily.
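For what it's worth, a minimal sketch of that pre-filtering idea in plain awk - keep the header lines plus any record whose REF and ALT lengths differ (a crude indel test; multi-allelic sites would need splitting first, e.g. with bcftools norm):

```shell
# Keep VCF header lines, plus records whose REF and ALT lengths differ
# (a rough indel filter; bcftools view -v indels is a more robust option).
awk -F'\t' '/^#/ || length($4) != length($5)' file.vcf > file.indels.vcf
```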

ADD REPLY
4
Entering edit mode
3.6 years ago

Maybe you can try GNU parallel's --pipe option. The pipe option sends chunks of the file to each job instead of the entire file. Please read the manual before you try.

ADD COMMENT
0
Entering edit mode

Hmmm. That is kind of what I was looking for. I'll dig into GNU parallel and see how it can help.

ADD REPLY
0
Entering edit mode

This helped. I used

cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort --insertions >huge_file.insertions.bed

and it worked like a charm, consuming only ~600M of memory and taking ~25 minutes. Amazing!

Maybe Alex Reynolds can add this as a tip in the manual?

ADD REPLY
1
Entering edit mode

Good to hear this worked. Parallel's default block size is 1 MB (as I understand it), and this block, i.e. the number of VCF lines per chunk (somewhere between one and ten million), could have been bigger:

$ parallel --pipepart -a huge.vcf --block -10 convert2bed -i vcf > huge.bed

or:

$ parallel --pipepart -a huge.vcf --block 10M convert2bed -i vcf > huge.bed

ADD REPLY
3
Entering edit mode
3.6 years ago

I would suggest using --do-not-sort instead of --max-mem. This skips sorting, which otherwise uses system memory (though it shouldn't here, because of the use of --max-mem).

If skipping sorting works, please let me know as I'd like to investigate further, if there's a problem.

ADD COMMENT
0
Entering edit mode

It unfortunately did not. I ran it on a 67G VCF file, but it ran out of memory in 53 seconds (I gave it 8G of RAM). The exact command:

vcf2bed --snvs --do-not-sort <file.vcf >file.snvs.bed

The VCF is v4.2 and I'm using BEDOPS v2.4.20.

ADD REPLY
0
Entering edit mode

You should definitely update your BEDOPS installation to v2.4.29.

There have been fixes to convert2bed between what you're running and the current version. There were VCF-specific fixes in v2.4.21 and v2.4.24, for example: http://bedops.readthedocs.io/en/latest/content/revision-history.html#v2-4-21

ADD REPLY
0
Entering edit mode

I downloaded the latest binaries - the file name is bedops_linux_x86_64-v2.4.29.tar.bz2, so I'm guessing it's 2.4.29 - but when I run the binaries (using their absolute path, just to be sure), the version number shown is 2.4.20. What's going on there?

ADD REPLY
0
Entering edit mode

Not for me.

 ./vcf2bed -v
Error: No input is specified; please redirect or pipe in formatted data
convert2bed
  version:  2.4.29 (typical)
  author:   Alex Reynolds
ADD REPLY
0
Entering edit mode

OK, there's some inconsistency. I downloaded using Homebrew on my MBP, and it's 2.4.29 (typical). However, the GitHub download link gives me a 2.4.20 version on unpacking.

Also, the version flag mentioned in the docs is -w, not -v, although the latter does show usage (with the version) when it errors out. Plus, -w only works as expected with convert2bed, not with bedops or vcf2bed; -v just errors out and shows usage (which also includes the version number). -v always exits with a non-zero exit code, while convert2bed -w exits with 0.

ADD REPLY
0
Entering edit mode

I'll investigate the github download. That should provide 2.4.29 binaries, not 2.4.20.

ADD REPLY
0
Entering edit mode

I think it does. Maybe tweak the vcf2bed script to use $(dirname $0)/convert2bed so it doesn't pick up convert2bed from the PATH? Or something that prioritizes it over the one in $PATH?

ADD REPLY
0
Entering edit mode

I wasn't able to reproduce what you saw:

$ cd /tmp
$ wget https://github.com/bedops/bedops/releases/download/v2.4.29/bedops_linux_x86_64-v2.4.29.tar.bz2
...
$ tar xvf bedops_linux_x86_64-v2.4.29.tar.bz2
...
$ ./convert2bed --version
convert2bed
  version:  2.4.29 (typical)
  author:   Alex Reynolds

The idea is to do things in the "Unix way" as much as possible, and relying on the PATH is one reliable and consistent way to find binaries. I can't guarantee I'll use it, but I'll look at your approach and see if it can work for us.

ADD REPLY
0
Entering edit mode

vcf2bed is a wrapper that calls convert2bed. So even if you use the absolute path to vcf2bed, it will call the old convert2bed found in your environment's PATH. Just install the new BEDOPS wherever you have 2.4.20 installed.

ADD REPLY
0
Entering edit mode

OK, that makes sense. Thank you!

ADD REPLY
2
Entering edit mode
ADD COMMENT
0
Entering edit mode

OK, I'm going to try --do-not-sort and see if that works. If it does, it saves me a huge headache.

ADD REPLY
0
Entering edit mode

Or send temporary files to another directory, as described in the manual: --sort-tmpdir=

ADD REPLY
0
Entering edit mode

This is also a good idea. If --max-mem is used, the /tmp directory can still fill up with temporary files used for a merge sort.

Using --sort-tmpdir with an alternative directory that can hold ~27G avoids this problem.

As VCF files can be much larger than /tmp, I may change defaults in vcf2bed to require specifying an alternative sort directory, or the use of --do-not-sort to skip sorting.
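A hedged example of that workaround as things stand today (the scratch path is illustrative; any filesystem with ~27G free would do):

```shell
# Illustrative: point the merge sort's scratch space at a filesystem
# with enough free room for the intermediate files.
SORT_TMP=/scratch/$USER/sort-tmp
mkdir -p "$SORT_TMP"
vcf2bed --max-mem=4G --sort-tmpdir="$SORT_TMP" --deletions \
    < file.vcf > file.deletions.bed
```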

ADD REPLY
