Question

Split bam into chunks but keep reads mapped to the same region together

4.0 years ago

olikidrod • 0

I need to split enormous bam files into smaller pieces, to parallelize haplotype calling. It's widely recommended in other posts to split these by chromosome. However, my files are so large (whole genomes at high coverage) that calling haplotypes on a single chromosome can still take more than a week.

I'd therefore like to split these further, e.g. split the chr1 bam file into 8 smaller 'chunks'. However, it's essential that the reads covering a particular region of a chromosome are distributed only in the same 'chunk', otherwise it would be impossible to accurately call the haplotype of that region. Can anyone suggest how I might split a bam file and ensure this?

I've tried using the 'split' function in alntools, which is intended to split bams "such that all the alignment of a same read appear only in a single chunk." However, the project is no longer actively maintained and there are dependency conflicts that mean it no longer works. Any suggestions would be much appreciated.

alignment BAM split • 2.1k views

ADD COMMENT • link updated 4.0 years ago by d-cameron ★ 2.9k • written 4.0 years ago by olikidrod • 0

Do you mean you want to make sure that the chunks' reads do not overlap (that is, each alignment occurs in only one chunk)? Are there any other criteria?

ADD REPLY • link 4.0 years ago by Jianyu ▴ 580

score 0 · Answer 1 · 2020-04-06

I need to split enormous bam files into smaller pieces, to parallelize haplotype calling

Indexed BAM supports random access, so there is no technical reason why you have to split the BAM at all. Many good callers can multi-thread the calling themselves. Other callers have a parameter to specify a region of interest and will only make calls in that region. In the latter case, you can just split your genome into X non-overlapping 'chunks' and start X instances of your caller, each configured with a different region of interest.
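As a rough illustration of the 'X non-overlapping chunks' idea, here is a minimal Python sketch that divides one contig into contiguous regions. The function name `chunk_regions` is made up for this example, and the chr1 length used below is an assumption (the GRCh38 value); take the real contig lengths from your own BAM header or reference index. The `chrom:start-end` strings are 1-based and inclusive, which is the region syntax samtools uses; your caller may expect a different format.

```python
def chunk_regions(chrom, length, n_chunks):
    """Split a contig into n_chunks contiguous, non-overlapping regions.

    Returns 1-based inclusive region strings such as 'chr1:1-1000'.
    (Hypothetical helper, not part of any library.)
    """
    if n_chunks < 1 or length < n_chunks:
        raise ValueError("need 1 <= n_chunks <= length")
    # Spread the remainder over the first `extra` chunks so sizes differ by at most 1.
    base, extra = divmod(length, n_chunks)
    regions = []
    start = 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        regions.append(f"{chrom}:{start}-{end}")
        start = end + 1
    return regions


# Assumed chr1 length (GRCh38); replace with the value from your own reference.
for region in chunk_regions("chr1", 248_956_422, 8):
    print(region)
```

Each printed region would then be passed to a separate instance of the caller through whatever region-of-interest parameter it provides.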

It's only if your caller supports neither of those scenarios that you will need to split your BAM file, and even then, you could just use samtools to split it if you know the intervals. Splitting by chromosome is a very crude approach to multi-threading, but there are still a number of callers out there that take this approach.
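If you do end up splitting with samtools, something along these lines might work. It is only a sketch: the file names and the two intervals are placeholders, the helper `samtools_view_command` is made up for this example, and the script just prints the commands rather than running them. It assumes the input BAM is coordinate-sorted and indexed, since `samtools view` needs the index to fetch a region.

```python
import shlex


def samtools_view_command(bam_path, region, out_path):
    """Build a 'samtools view' command that writes one region to its own BAM.

    -b selects BAM output and -o names the output file.
    (Hypothetical helper; paths and regions are supplied by the caller.)
    """
    return ["samtools", "view", "-b", "-o", out_path, bam_path, region]


# Placeholder intervals; substitute the ones you actually want.
regions = ["chr1:1-125000000", "chr1:125000001-248956422"]

for i, region in enumerate(regions, start=1):
    cmd = samtools_view_command("sample.bam", region, f"sample.chr1.chunk{i}.bam")
    # Print a shell-safe command line instead of executing it.
    print(shlex.join(cmd))
```

Note that a region query returns every alignment overlapping the interval, so a read that spans a boundary will appear in both neighbouring chunk files. For region-based calling that is usually what you want, because each chunk then contains all the reads covering its own interval.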