Question: bedops vcf2bed eats up all allocated memory
0
5 months ago by
Ram14k
New York
Ram14k wrote:

Hello,

I'm trying to run bedops vcf2bed on a huge (27G) VCF file. I'd like to extract a BED of the deletions in the VCF.

My usage:

vcf2bed --max-mem=4G --deletions <file.vcf >file.bed

I'm running it on a cluster and I tried giving the job 8G and 16G RAM, but it maxes out all available RAM in a matter of seconds (100-150 seconds) and the job quits. I know splitting the VCF into per-chromosome chunks might solve this, but is there any other option I could use?

I am open to using other tools to extract the BED as well, given that they ensure end-start equals length(REF).
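For the "other tools" route, a streaming awk filter is one possibility. The sketch below is only an illustration under stated assumptions: `vcf_deletions_to_bed` is a hypothetical helper name (not part of BEDOPS), a record counts as a deletion when any plain-sequence ALT allele is shorter than REF, symbolic alleles such as <DEL> are skipped, and the BED interval is [POS-1, POS-1+length(REF)) so that end minus start equals length(REF). Because awk reads one line at a time, memory use should stay flat regardless of file size.

```shell
# Hypothetical helper (not part of BEDOPS): read VCF on stdin, write deletions as BED.
# Output columns: chrom, start, end, ID, REF, ALT.
vcf_deletions_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                         # skip meta and header lines
       {
         n = split($5, alt, ",")             # ALT may list several alleles
         for (i = 1; i <= n; i++) {
           # only plain sequence alleles; symbolic ones like <DEL> are ignored
           if (alt[i] ~ /^[ACGTNacgtn]+$/ && length(alt[i]) < length($4)) {
             start = $2 - 1                  # VCF POS is 1-based, BED start is 0-based
             print $1, start, start + length($4), $3, $4, $5
             break                           # report each record once
           }
         }
       }'
}
```

Usage would look like `vcf_deletions_to_bed <file.vcf >file.deletions.bed`. The output keeps the input order, so it is only sorted if the VCF was.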


EDIT:

I combined a couple of ideas from the recommendations below to get to my solution - YMMV and not all of these mods might be necessary. I piped in the file using GNU parallel's --pipe and also used --do-not-sort to get unsorted bed files.

cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort >huge_file.unsorted.bed

--
Ram

vcf2bed bedops hpc • 450 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by Ram14k

Sounds like giving it 72G of RAM might work, Ram.

...Sorry, couldn't pass it up.

Seriously though, the vcf2bed and convert2bed commands don't have a ton of options. Alex could probably give you a solution though, dude's a wizard. Alternatively, you could filter based on the variant type in the VCF to create a file of only the INDELs, which I'm sure would cut down your file to probably only a few gigs. vcf2bed should be able to easily handle that file then.
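A minimal sketch of that pre-filtering idea, assuming an indel is any record where a plain-sequence ALT allele differs in length from REF. `vcf_keep_indels` is a hypothetical helper name, and symbolic alleles are ignored here. Header lines are passed through so the result is still a VCF that vcf2bed can read.

```shell
# Hypothetical helper: read VCF on stdin, keep header lines and indel records only.
vcf_keep_indels() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { print; next }                  # keep meta and header lines
       {
         n = split($5, alt, ",")             # ALT may list several alleles
         for (i = 1; i <= n; i++) {
           if (alt[i] ~ /^[ACGTNacgtn]+$/ && length(alt[i]) != length($4)) {
             print                           # length differs from REF: treat as indel
             break
           }
         }
       }'
}
```

For example, `vcf_keep_indels <file.vcf >file.indels.vcf` would produce the smaller file to feed to vcf2bed.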

ADD REPLYlink written 5 months ago by jared.andrews07680
4
5 months ago by
cpad01124.4k
cpad01124.4k wrote:

Maybe you can try GNU parallel's --pipe option. It sends chunks of the file to the command instead of the entire file. Please read the manual before you try.

ADD COMMENTlink written 5 months ago by cpad01124.4k

Hmmm. That is kind of what I was looking for. I'll dig into GNU parallel and see how it can help.

ADD REPLYlink written 5 months ago by Ram14k

This helped. I used

cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort --insertions >huge_file.insertions.bed

and it worked like a charm, consuming only ~600M of memory and taking ~25 minutes. Amazing!

Maybe Alex Reynolds can add this as a tip in the manual?

ADD REPLYlink written 5 months ago by Ram14k
1

Good that this worked. Parallel's default block size is 1 MB (as I understand it), and the block size, or the number of VCF lines per chunk (somewhere between 1 million and 10 million), could have been bigger.

$ parallel --pipepart -a huge.vcf --block -10 convert2bed -i vcf > huge.bed

or

$ parallel --pipepart -a huge.vcf --block 10M convert2bed -i vcf > huge.bed

ADD REPLYlink modified 5 months ago • written 5 months ago by cpad01124.4k
3
5 months ago by
Alex Reynolds23k
Seattle, WA USA
Alex Reynolds23k wrote:

I would suggest using --do-not-sort instead of --max-mem. This skips sorting, which otherwise uses system memory (though it shouldn't here, because of the use of --max-mem).

If skipping sorting works, please let me know as I'd like to investigate further, if there's a problem.

ADD COMMENTlink written 5 months ago by Alex Reynolds23k

Unfortunately, it did not. I ran it on a 67G VCF file, but it ran out of memory in 53 seconds (I gave it 8G RAM). The exact command:

vcf2bed --snvs --do-not-sort <file.vcf >file.snvs.bed

The VCF is in v4.2 and I'm using bedops v2.4.20

ADD REPLYlink written 5 months ago by Ram14k

You should definitely update your BEDOPS installation to v2.4.29.

There have been fixes to convert2bed between what you're running and the current version. There were VCF-specific fixes in v2.4.21 and v2.4.24, for example: http://bedops.readthedocs.io/en/latest/content/revision-history.html#v2-4-21

ADD REPLYlink modified 5 months ago • written 5 months ago by Alex Reynolds23k

I downloaded the latest binaries. The file name is bedops_linux_x86_64-v2.4.29.tar.bz2, so I'm guessing it's 2.4.29, but when I run the binaries (using their absolute path, just to be sure), the version number shown is 2.4.20. What's going on there?

ADD REPLYlink written 5 months ago by Ram14k

Not for me.

 ./vcf2bed -v
Error: No input is specified; please redirect or pipe in formatted data
convert2bed
  version:  2.4.29 (typical)
  author:   Alex Reynolds
ADD REPLYlink written 5 months ago by genomax46k

OK, there's some inconsistency. I installed using Homebrew on my MBP, and it's 2.4.29 (typical). However, the GitHub download link gives me a 2.4.20 version on unpacking.

Also, the version flag mentioned in the docs is -w, not -v, although the latter does show usage with version when it errors out. Plus, the -w only works as expected with convert2bed, not with bedops or vcf2bed. -v just errors out and shows usage (that also has version number in it). -v always exits with a non-0 exit code, while convert2bed -w exits with a 0.

ADD REPLYlink written 5 months ago by Ram14k

I'll investigate the github download. That should provide 2.4.29 binaries, not 2.4.20.

ADD REPLYlink written 5 months ago by Alex Reynolds23k

I think it does. Maybe tweak the vcf2bed script to use a $(dirname $0)/convert2bed so it doesn't pick up from the PATH? Or even something to prioritize it over the one in $PATH?
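A rough sketch of what I mean, not the actual vcf2bed script: `pick_convert2bed` is a hypothetical helper that prefers a convert2bed sitting next to the wrapper and only falls back to whatever is on the PATH when no sibling binary exists.

```shell
# Hypothetical helper: given the wrapper's own path ($0), print the convert2bed to run.
pick_convert2bed() {
  # absolute directory that contains the wrapper script
  script_dir=$(cd "$(dirname "$1")" && pwd)
  if [ -f "$script_dir/convert2bed" ] && [ -x "$script_dir/convert2bed" ]; then
    printf '%s\n' "$script_dir/convert2bed"   # sibling binary wins
  else
    printf '%s\n' convert2bed                 # fall back to a PATH lookup
  fi
}

# Inside a wrapper, one might then write (shown as a comment, not executed here):
#   exec "$(pick_convert2bed "$0")" -i vcf "$@"
```

That way, running a freshly unpacked vcf2bed by absolute path would use the convert2bed from the same tarball.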

ADD REPLYlink written 5 months ago by Ram14k

I wasn't able to reproduce what you saw:

$ cd /tmp
$ wget https://github.com/bedops/bedops/releases/download/v2.4.29/bedops_linux_x86_64-v2.4.29.tar.bz2
...
$ tar xvf bedops_linux_x86_64-v2.4.29.tar.bz2
...
$ ./convert2bed --version
convert2bed
  version:  2.4.29 (typical)
  author:   Alex Reynolds

The idea is to do things in the "Unix way" as much as possible, and relying on the PATH is one reliable and consistent way to find binaries. I can't guarantee I'll use it, but I'll look at your approach and see if it can work for us.

ADD REPLYlink modified 5 months ago • written 5 months ago by Alex Reynolds23k

vcf2bed is a wrapper that calls convert2bed. So even if you use the absolute path to vcf2bed, it will call the old convert2bed found in your environment PATH. Just install BEDOPS wherever you have 2.4.20 installed.
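To spot a stale copy like this, it can help to list every convert2bed the shell could find, in lookup order. The sketch below is an illustration only: `list_on_path` is a hypothetical helper, and its optional second argument (a colon-separated directory list) exists just so it can be tried against something other than the real PATH.

```shell
# Hypothetical helper: print every executable called NAME in PATH order.
# Usage: list_on_path NAME [COLON_SEPARATED_DIRS]
list_on_path() {
  name=$1
  path_list=${2:-$PATH}
  old_ifs=$IFS
  IFS=:                                  # split the directory list on colons
  for dir in $path_list; do
    candidate="${dir:-.}/$name"          # an empty PATH entry means the current directory
    if [ -f "$candidate" ] && [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
    fi
  done
  IFS=$old_ifs                           # restore normal word splitting
}
```

The first line printed by `list_on_path convert2bed` is the binary the wrapper would run, and any later lines are shadowed copies.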

ADD REPLYlink written 5 months ago by Alex Reynolds23k

OK, that makes sense. Thank you!

ADD REPLYlink written 5 months ago by Ram14k
2
5 months ago by
genomax46k
United States
genomax46k wrote:

See if this helps: vcf2bed argument in bedops partially process vcf file

ADD COMMENTlink written 5 months ago by genomax46k

OK, I'm going to try --do-not-sort and see if that works. If it does, it saves me a huge headache.

ADD REPLYlink written 5 months ago by Ram14k

Or point it to a different temp directory with --sort-tmpdir=, as described in the manual.

ADD REPLYlink written 5 months ago by cpad01124.4k

This is also a good idea. If --max-mem is used, the /tmp directory can still fill up with temporary files used for a merge sort.

Using --sort-tmpdir with an alternative directory that can hold ~27G avoids this problem.
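Before picking such a directory, a quick free-space check may save a failed run. This is a rough sketch under assumptions: `dir_can_hold` is a hypothetical helper, it treats the size of the input VCF as the amount of space needed (the real temporary files may be smaller or larger), and /scratch in the usage comment is a placeholder path.

```shell
# Hypothetical helper: succeed if DIR has at least as much free space as FILE is large.
# Usage: dir_can_hold FILE DIR
dir_can_hold() {
  file=$1
  dir=$2
  # file size in KB, rounded up
  need_kb=$(( ($(wc -c < "$file") + 1023) / 1024 ))
  # "Available" column (in KB) from POSIX-format df output
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge "$need_kb" ]
}

# Example use (not executed here; /scratch is a placeholder):
#   dir_can_hold file.vcf /scratch &&
#     vcf2bed --max-mem=4G --sort-tmpdir=/scratch <file.vcf >file.bed
```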

As VCF files can be much larger than /tmp, I may change defaults in vcf2bed to require specifying an alternative sort directory, or the use of --do-not-sort to skip sorting.

ADD REPLYlink written 5 months ago by Alex Reynolds23k