How to efficiently/quickly merge ~500k vcfs?
11 months ago
Lei ▴ 20

I have a frankly ludicrous number of single-sample vcf.gz files (each with its tabix index) that I want to merge into one big file. I've previously used bcftools merge with 48 threads to merge 1,000 of them and it took 15+ minutes. I'm pretty sure the runtime won't scale linearly once I increase the number of samples to 500k+. Any suggestions? Should I merge groups of samples at a time, going up a tree? Should I use a different tool?
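The tree-style merge mentioned above can be planned ahead of time in plain Python. This is only a sketch: it assumes `bcftools merge` is on the PATH, that intermediate files fit on disk, and the batch size, thread count, and file names are illustrative, not tuned recommendations.

```python
def plan_tree_merge(inputs, batch_size=1000):
    """Group input VCF paths into batches and emit one `bcftools merge`
    command per batch, repeating round by round until one file remains.

    Returns the full list of shell command strings; each round's commands
    are independent of one another, so they can be dispatched as separate
    cluster jobs for real parallelism.
    """
    commands = []
    round_id = 0
    while len(inputs) > 1:
        next_round = []
        for i, start in enumerate(range(0, len(inputs), batch_size)):
            batch = inputs[start:start + batch_size]
            out = f"round{round_id}_batch{i}.vcf.gz"  # illustrative naming
            # -Oz writes bgzipped VCF; --threads parallelizes compression.
            commands.append(
                f"bcftools merge --threads 4 -Oz -o {out} " + " ".join(batch)
            )
            next_round.append(out)
        inputs = next_round
        round_id += 1
    return commands

# 500k single-sample VCFs merged 1,000 at a time: 500 intermediate
# files after round 0, then a single final merge in round 1.
cmds = plan_tree_merge([f"s{i}.vcf.gz" for i in range(500_000)])
```

Note that intermediate outputs would still need `tabix`/`bcftools index` before the next round, and total work grows with tree depth, so wider batches generally mean fewer rounds.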

bcftools vcf variant-calling • 1.3k views
Just saw there is a 'virtual codeathon' on "scaling VCF to millions of samples" coming up; you can sign up here: https://ncbiinsights.ncbi.nlm.nih.gov/event/vcf-for-population-genomics-codeathon/

11 months ago

I would suggest TileDB-VCF, which enables downstream analysis (and export) without the need to fire up a Spark cluster. (Disclaimer: I work for TileDB)


Interesting! I looked into TileDB-VCF a couple of months back, and it looks like the tutorial has improved a lot since then. I'll give it a try as well.


Feel free to reach out to me directly. I can walk you through some notebooks and/or provide some free credits to get you started.

11 months ago
DBScan ▴ 300

Do you have plain VCFs or gVCFs? For gVCFs you could also use Hail (https://hail.is/) or GLnexus (https://github.com/dnanexus-rnd/GLnexus), which are built for joint-calling at this scale.
