Fastest way to merge 2 vcfs and get a bcf
3.0 years ago by boxate1618 ▴ 60

I have 2 VCFs with millions of records and thousands of samples, and I need to merge them and get a BCF as output. It seems that converting to BCF first and then merging is significantly faster (I can run the conversions in parallel first), taking about 75 seconds in my test, while outputting BCF during the merge takes about 125 seconds. Is there anything else that could speed this up?

# try merging while converting to bcf simultaneously
#real    2m5.219s
#user    5m42.089s
#sys     0m3.121s
time bcftools merge --threads 24 $vcf_path1 $vcf_path2 -Ob > $convert_during_path

# convert each to bcf first
#real    0m46.101s
#user    2m33.987s
#sys     0m1.645s
time bcftools view --threads 24 $vcf_path1 -Ob > $bcf_path1

#real    0m44.881s
#user    2m32.189s
#sys     0m1.533s
time bcftools view --threads 24 $vcf_path2 -Ob > $bcf_path2

# merge bcfs
#real    0m29.010s
#user    3m47.727s
#sys     0m1.569s
time bcftools merge --threads 24 $bcf_path1 $bcf_path2 -Ob > $convert_before_path

The time difference is negligible, no? Are these reduced VCFs / BCFs as part of a trial run for the ultimate merge?

It makes sense that it is quicker via BCF.

"Trying to output bcf during the merge"

Why would you do that?

"Are these reduced VCFs / BCFs"

Yes, these "test" VCFs/BCFs have several thousand records and samples, to get an idea of the benchmarking. The real ones have millions of records and tens of thousands of samples.

"Why would you do that?"

I have to do 2 successive merges and then filter across all sites. My understanding is that operations across all sites will be 10-20x faster on BCF than on VCF.
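
For the filtering step I have something like this in mind (just a sketch, untested; the INFO/R2 tag, the 0.3 cutoff and the file names are placeholders, since the actual imputation-quality field depends on the imputation tool and should be checked in the header first):

# keep only sites passing the imputation-quality cutoff, writing compressed BCF
bcftools view -i 'INFO/R2 >= 0.3' --threads 24 -Ob -o merged.filtered.bcf merged.bcf
bcftools index merged.filtered.bcf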

Ah yes, makes sense now! I would definitely use BCF for any large operation, and ensure that both files are normalised:

# split multiallelic sites and left-align/normalise indels against the reference;
# --check-ref w only warns (rather than exits) on REF mismatches
bcftools norm -m-any \
  --check-ref w \
  -f hg38.fasta \
  -Ob var.bcf > var.norm.bcf
bcftools index var.norm.bcf

Other than that, I would just start with the merge and ensure that you have considerable memory available...
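
And for the merge itself, something along these lines (an untested sketch; the normalised file names simply continue the example above and are placeholders):

# merge the two normalised, indexed BCFs and write compressed BCF directly
bcftools merge --threads 24 -Ob -o merged.bcf var1.norm.bcf var2.norm.bcf
bcftools index merged.bcf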

I chunked a big cohort "by sample" and ran genotype imputation on the chunks; now I need to merge them and filter by imputation quality. So the chunks should already be consistent with respect to strand and have the same order of records.

After some googling:

I might be able to save some time by piping uncompressed BCF between steps, or by writing less-compressed BCF for the steps where I do have to write to disk. In addition, it seems the newest version of bcftools (1.12) lets you merge between pipes using the --no-index option. I would imagine this is a little more dangerous though; I am not sure if I am going to try that one. I have > 1 TB RAM, so hopefully I can play with some of these. Even a 2x speed-up is days for me.

http://www.htslib.org/doc/bcftools.html#merge
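
Roughly what I have in mind (untested sketch; the chunk names are placeholders, and --no-index needs bcftools >= 1.12 plus inputs whose records are in the same order):

# the inner merge streams uncompressed BCF (-Ou) into the outer merge via process
# substitution, so the intermediate result is never compressed or written to disk
bcftools merge --no-index --threads 24 -Ob -o merged.bcf \
  <(bcftools merge --no-index -Ou chunkA.bcf chunkB.bcf) \
  chunkC.bcf
bcftools index merged.bcf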

Wait, you did the imputation with 1 sample against the reference dataset, and then repeated that across all samples separately? Is that the correct procedure? Our chunks would normally be chromosomal regions. For example, I did 2 large imputations last year in 5 megabase chunks across the genome. One can instruct the algorithms to impute a certain number of bp beyond each chunk so that neighbouring chunks overlap.
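
For illustration, pulling one such regional chunk (with a flank so that neighbouring chunks overlap) out of an indexed BCF would look something like this; the region and the 250 kb flank are made-up values:

# extract a 5 Mb chunk plus a 250 kb flank from an indexed cohort BCF
bcftools view -r chr1:1-5250000 -Ob -o chunk_chr1_1_5Mb.bcf cohort.bcf
bcftools index chunk_chr1_1_5Mb.bcf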

Regarding the actual speed testing, BCFtools is already well optimised and, ironically, if users are awaiting the results, it may be better to just commence the process.
