Using GNU parallel to speed up merging VCFs with bcftools
10 weeks ago
mfshiller ▴ 10

I have a bunch of VCF files to merge. bcftools can't handle all of them at once, so I have to merge in batches. I would like to use GNU parallel for this because I'm working on an Amazon EC2 instance through PuTTY, which sometimes crashes and leaves the process unfinished. How could I do this?

Edit: This is what I ended up doing, in case this is useful to anyone down the line:

ls *vcf.gz > vcf.list
parallel --max-args 30 bcftools merge {} -Oz -o batch_merge{#}.vcf.gz :::: vcf.list


This merges batches of 30 files in parallel. I had almost 900 VCFs to merge, so it went pretty quickly.
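Note that bcftools merge expects bgzipped, indexed inputs, so the batch outputs above still need one more pass to become a single file. A sketch of that second pass (my addition, untested; assumes tabix is installed and the batch files fit on one command line):

```shell
# Index each batch output, then combine all batches in one final merge.
parallel tabix -p vcf {} ::: batch_merge*.vcf.gz
bcftools merge batch_merge*.vcf.gz -Oz -o all_merged.vcf.gz
```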

vcf parallel gwas bcftools bash

You might be interested in some of the comments here: https://shicheng-guo.github.io/bioinformatics/1923/02/28/bcftools-merge

10 weeks ago
ole.tange ★ 4.0k

First: Learn tmux. This way you can let PuTTY crash, and you can reconnect and continue where you left off. It is also excellent for starting a job at work, and then reconnecting when you get home to see how it is doing.

I use: tmux, CTRL-b CTRL-c, CTRL-b CTRL-n, CTRL-b CTRL-p, CTRL-b CTRL-d, and tmux at.
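If you have never used tmux, a minimal workflow looks like this (the session name "merge" is arbitrary):

```shell
tmux new -s merge      # start a named session; run the long job inside it
# CTRL-b d detaches; the job keeps running even if PuTTY crashes
tmux attach -t merge   # reattach from a new connection later
tmux ls                # list running sessions
```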

Then look at GNU Parallel. Read chapters 1+2 of https://zenodo.org/record/1146014. It should take you no more than 15 minutes.

Then write out the complete commands you want to run. When you have written out 3, you can see there is a pattern. Try to replicate that pattern with GNU Parallel - use --dryrun to see if you made it do the right thing.
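For example, with placeholder filenames (--dryrun only prints the generated commands, it does not run them):

```shell
# Preview how GNU Parallel will batch the arguments before running anything.
parallel --dryrun --max-args 2 bcftools merge {} -Oz -o out{#}.vcf.gz ::: a.vcf.gz b.vcf.gz c.vcf.gz
```

Each printed line is one command Parallel would run; here the first should contain a.vcf.gz and b.vcf.gz together.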

GNU Parallel is not magic: If you do not know how to run the commands by hand in serial, it is unlikely you will be able to make GNU Parallel do it for you.


I recently had to do the same thing, and also accomplished it with GNU parallel. Code here. But basically:

find inputs_dir -type f | parallel --jobs 1 --xargs bedops -m {} ">" merged.{#}.bed
find . -maxdepth 1 ! -path 'inputs_dir*' -type f -name "merged.*.bed" | parallel bedops -m {} ">" merged.bed


I am using BEDOPS with .bed files; replace that with your bcftools commands and .vcf files. I took a naive "two pass" approach, assuming that the number of output files from the first command will be small enough to combine in a single command with the second. But if you really do have massive numbers of files, you might need to wrap this in a for or while loop that keeps merging until there are no files left un-merged. The important part is that parallel --xargs batches the input files into groups small enough to fit on the command line, leaving you with multiple intermediary merge products that you can clean up with one more merge command.
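A sketch of what that loop could look like (untested; filenames like pass1.*.bed and to_merge.txt are illustrative, and the BEDOPS command is the same as above):

```shell
# Keep merging in batches until only one merged file remains.
pass=1
find inputs_dir -type f > to_merge.txt
while [ "$(wc -l < to_merge.txt)" -gt 1 ]; do
    # --xargs splits the list into command-line-sized batches per pass
    parallel --jobs 1 --xargs bedops -m {} ">" pass${pass}.{#}.bed :::: to_merge.txt
    ls pass${pass}.*.bed > to_merge.txt
    pass=$((pass + 1))
done
```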


Thanks for your suggestion! I ended up doing something pretty similar (see edited OP), using --max-args instead of --xargs.


Thanks. I ended up finding a solution with Parallel that worked very well for me (see edited OP).

10 weeks ago
ATpoint 48k

A: How to parallelize bcftools mpileup with GNU parallel?

A: Samtools mpileup taking a long time to finish

The idea can be generalized, though. You could also define the batches up front and then use SLURM (or whatever scheduler you use) arrays to submit a job for each batch.
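For example, a hypothetical SLURM array script, assuming you prepared list files batch_1.list through batch_30.list up front (one VCF path per line; bcftools merge accepts a file list via -l):

```shell
#!/bin/bash
#SBATCH --array=1-30
# Each array task merges one pre-defined batch of VCFs.
bcftools merge -l batch_${SLURM_ARRAY_TASK_ID}.list -Oz -o batch_${SLURM_ARRAY_TASK_ID}.vcf.gz
```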