Question: bedops vcf2bed eats up all allocated memory
Ram (New York) wrote 10 days ago:

Hello,

I'm trying to run bedops vcf2bed on a huge (27G) VCF file. I'd like to extract a BED of the deletions in the VCF.

My usage:

vcf2bed --max-mem=4G --deletions <file.vcf >file.bed

I'm running it on a cluster and I tried giving the job 8G and 16G RAM, but it maxes out all available RAM within 100-150 seconds and the job quits. I know splitting the VCF into per-chromosome chunks might work around this, but is there any other option I could use?

I am open to using other tools to extract the BED as well, as long as they ensure end - start equals length(REF).


EDIT:

I combined a couple of ideas from the recommendations below to get to my solution - YMMV, and not all of these modifications may be necessary. I piped the file in using GNU parallel's --pipe and also used --do-not-sort to get unsorted BED output.

cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort >huge_file.unsorted.bed

--
Ram

vcf2bed bedops hpc • 228 views
— modified 9 days ago • written 10 days ago by Ram

Sounds like giving it 72G of RAM might work, Ram.

...Sorry, couldn't pass it up.

Seriously though, the vcf2bed and convert2bed commands don't have a ton of options. Alex could probably give you a solution though, dude's a wizard. Alternatively, you could filter based on the variant type in the VCF to create a file of only the INDELs, which I'm sure would cut down your file to probably only a few gigs. vcf2bed should be able to easily handle that file then.
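That pre-filtering idea can be sketched with a one-line awk pass (not from the thread; it assumes simple biallelic records, and multi-allelic ALT fields would need extra handling). It keeps header lines plus any record where REF and ALT differ in length, i.e. indels:

```shell
# Build a toy tab-separated VCF to demonstrate; in real usage, point awk
# at the actual 27G file instead.
printf '##fileformat=VCFv4.2\nchr1\t100\t.\tA\tT\nchr1\t200\t.\tAT\tA\nchr1\t300\t.\tG\tGCC\n' > example.vcf

# Keep header lines plus records whose REF and ALT lengths differ
# (indels); VCF columns 4 and 5 are REF and ALT.
awk -F'\t' '/^#/ || length($4) != length($5)' example.vcf > example.indels.vcf

cat example.indels.vcf
```

On the toy input this keeps the header plus the two indel records (the SNV at position 100 is dropped), and the same filter would shrink a whole-genome VCF before it ever reaches vcf2bed.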

— written 10 days ago by jared.andrews07
4 votes
cpad0112 wrote 10 days ago:

Maybe you can try GNU parallel's --pipe option. The --pipe option sends chunks of the file to each job instead of the entire file. Please read the manual before you try.

— written 10 days ago by cpad0112

Hmmm. That is kind of what I was looking for. I'll dig into GNU parallel and see how it can help.

— written 10 days ago by Ram

This helped. I used

cat huge_file.vcf | parallel --pipe vcf2bed --do-not-sort --insertions >huge_file.insertions.bed

and it worked like a charm, consuming only ~600M of memory and taking ~25 minutes. Amazing!

Maybe Alex Reynolds can add this as a tip in the manual?

— written 9 days ago by Ram
1 vote

Good that this worked. Parallel's default block size is 1 MB (as I understand it), so the block could have been made bigger - somewhere between a million and 10 million VCF lines. For example:

$ parallel --pipepart -a huge.vcf --block -10 convert2bed -i vcf > huge.bed

or

$ parallel --pipepart -a huge.vcf --block 10M convert2bed -i vcf > huge.bed

— modified 9 days ago • written 9 days ago by cpad0112
3 votes
Alex Reynolds (Seattle, WA USA) wrote 10 days ago:

I would suggest using --do-not-sort instead of --max-mem. This skips sorting, which otherwise uses system memory (though it shouldn't here, because of the use of --max-mem).

If skipping sorting works, please let me know as I'd like to investigate further, if there's a problem.
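If sorted output is still needed, the sort can be deferred to its own pass after conversion succeeds. A sketch of that idea (not from the thread; BEDOPS's own sort-bed would be the usual tool, but GNU sort stands in here since the ordering for simple three-column BED - chromosome lexicographic, then start/end numeric - is the same):

```shell
# Simulate unsorted converter output (a stand-in for what
# `vcf2bed --do-not-sort` would emit).
printf 'chr2\t50\t60\nchr1\t300\t301\nchr1\t100\t102\n' > unsorted.bed

# Sort afterwards in a separate pass: chrom lexicographic,
# then start and end numeric.
sort -k1,1 -k2,2n -k3,3n unsorted.bed > sorted.bed

cat sorted.bed
```

This keeps the conversion step itself memory-flat, at the cost of a second pass over the (much smaller) BED.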

— written 10 days ago by Alex Reynolds

It unfortunately did not. I ran it on a 67G VCF file, but it ran out of memory in 53 seconds (I gave it 8G RAM). The exact command:

vcf2bed --snvs --do-not-sort <file.vcf >file.snvs.bed

The VCF is v4.2 and I'm using BEDOPS v2.4.20.

— written 10 days ago by Ram

You should definitely update your BEDOPS installation to v2.4.29.

There have been fixes to convert2bed between what you're running and the current version. There were VCF-specific fixes in v2.4.21 and v2.4.24, for example: http://bedops.readthedocs.io/en/latest/content/revision-history.html#v2-4-21

— modified 10 days ago • written 10 days ago by Alex Reynolds

I downloaded the latest binaries - the file name is bedops_linux_x86_64-v2.4.29.tar.bz2, so I'm guessing it's 2.4.29 - but when I run the binaries (using their absolute path, just to be sure), the version number shown is 2.4.20. What's going on there?

— written 10 days ago by Ram

Not for me.

 ./vcf2bed -v
Error: No input is specified; please redirect or pipe in formatted data
convert2bed
  version:  2.4.29 (typical)
  author:   Alex Reynolds
— written 10 days ago by genomax

OK, there's some inconsistency. I downloaded using Homebrew on my MBP, and it's 2.4.29 (typical). However, the GitHub download link gives me a 2.4.20 version on unpacking.

Also, the version flag mentioned in the docs is -w, not -v, although the latter does show usage (including the version) when it errors out. Plus, -w only works as expected with convert2bed, not with bedops or vcf2bed; there, -v just errors out and shows usage (which also includes the version number). -v always exits with a non-zero exit code, while convert2bed -w exits with 0.

— written 9 days ago by Ram

I'll investigate the github download. That should provide 2.4.29 binaries, not 2.4.20.

— written 9 days ago by Alex Reynolds

I think it does. Maybe tweak the vcf2bed script to use $(dirname $0)/convert2bed so it doesn't pick convert2bed up from the PATH? Or even something that prioritizes the local copy over the one in $PATH?
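A runnable sketch of that lookup order (hypothetical and simplified - the real wrapper passes more through than just --input=vcf; the stub script here only stands in for the unpacked convert2bed binary):

```shell
# Demonstration: a vcf2bed-style wrapper that prefers the convert2bed
# sitting next to it over whatever PATH resolves.
tmp=$(mktemp -d)

# Stand-in convert2bed next to the wrapper (real life: the unpacked binary).
printf '#!/bin/sh\necho "local convert2bed: $@"\n' > "$tmp/convert2bed"
chmod +x "$tmp/convert2bed"

# The wrapper: try $(dirname $0)/convert2bed first, fall back to PATH.
cat > "$tmp/vcf2bed" <<'EOF'
#!/bin/sh
here=$(dirname "$0")
if [ -x "$here/convert2bed" ]; then
    exec "$here/convert2bed" --input=vcf "$@"
else
    exec convert2bed --input=vcf "$@"
fi
EOF
chmod +x "$tmp/vcf2bed"

"$tmp/vcf2bed" --do-not-sort
```

The last line prints the local stub's output rather than invoking anything on PATH, which is the behavior being proposed.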

— written 9 days ago by Ram

I wasn't able to reproduce what you saw:

$ cd /tmp
$ wget https://github.com/bedops/bedops/releases/download/v2.4.29/bedops_linux_x86_64-v2.4.29.tar.bz2
...
$ tar xvf bedops_linux_x86_64-v2.4.29.tar.bz2
...
$ ./convert2bed --version
convert2bed
  version:  2.4.29 (typical)
  author:   Alex Reynolds

The idea is to do things in the "Unix way" as much as possible, and relying on the PATH is one reliable and consistent way to find binaries. I can't guarantee I'll use it, but I'll look at your approach and see if it can work for us.

— modified 9 days ago • written 9 days ago by Alex Reynolds

vcf2bed is a wrapper that calls convert2bed. So even if you use the absolute path to the new vcf2bed, it will still call the old convert2bed found in your environment PATH. Just install the new BEDOPS wherever you have 2.4.20 installed.

— written 9 days ago by Alex Reynolds

OK, that makes sense. Thank you!

— written 9 days ago by Ram
2 votes
genomax (United States) wrote 10 days ago:

See if this helps: vcf2bed argument in bedops partially process vcf file

— written 10 days ago by genomax

OK, I'm going to try --do-not-sort and see if that works. If it does, it saves me a huge headache.

— written 10 days ago by Ram

Or point it at a different temp directory, as in the manual: --sort-tmpdir=

— written 10 days ago by cpad0112

This is also a good idea. If --max-mem is used, the /tmp directory can still fill up with temporary files used for a merge sort.

Using --sort-tmpdir with an alternative directory that can hold ~27G avoids this problem.

As VCF files can be much larger than /tmp, I may change defaults in vcf2bed to require specifying an alternative sort directory, or the use of --do-not-sort to skip sorting.
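The same knob exists in GNU sort as -T, which makes the effect easy to demonstrate without a 27G input; a sketch (the mktemp directory here merely stands in for a real, roomy scratch filesystem on a cluster):

```shell
# GNU sort's -T does for sort what --sort-tmpdir does for vcf2bed:
# scratch files for the merge sort go to the named directory, not /tmp.
scratch=$(mktemp -d)   # stand-in for a scratch path with enough free space
printf 'chr1\t300\t301\nchr1\t100\t102\n' \
    | sort -T "$scratch" -k1,1 -k2,2n > out.bed
cat out.bed
```

With vcf2bed, the equivalent is passing --sort-tmpdir= pointed at a directory that can hold roughly the size of the input VCF.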

— written 10 days ago by Alex Reynolds
Powered by Biostar version 2.3.0