Question: Split bam into chunks but keep reads mapped to the same region together
olikidrod0 wrote:

I need to split enormous bam files into smaller pieces, to parallelize haplotype calling. It's widely recommended in other posts to split these by chromosome. However, my files are so large (whole genomes at high coverage) that calling haplotypes on a single chromosome can still take more than a week.

I'd therefore like to split these further, e.g. split the chr1 bam file into 8 smaller 'chunks'. However, it's essential that all reads covering a particular region of a chromosome end up in the same 'chunk'; otherwise it would be impossible to accurately call the haplotype of that region. Can anyone suggest how I might split a bam file and ensure this?

I've tried using the 'split' function in alntools, which is intended to split bams "such that all the alignment of a same read appear only in a single chunk." However, the project is no longer actively maintained and there are dependency conflicts that mean it no longer works. Any suggestions would be much appreciated.

bam split alignment • 297 views
modified 11 months ago by d-cameron2.3k • written 11 months ago by olikidrod0

You mean you want to make sure the chunks' reads do not overlap (i.e. each alignment occurs in only one chunk)? Are there any other criteria?

written 11 months ago by yztxwd380
d-cameron2.3k wrote:

"I need to split enormous bam files into smaller pieces, to parallelize haplotype calling"

Indexed BAM supports random access, so there is no technical reason why you have to split the BAM at all. Many good callers can multi-thread the calling themselves. Other callers have a parameter to specify a region of interest and will only make calls in that region. In the latter case, you can just split your genome into X non-overlapping 'chunks' and start X instances of your caller, each configured with a different region of interest.
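To make the 'X non-overlapping chunks' idea concrete, here is a minimal Python sketch that cuts one contig into contiguous, non-overlapping regions written as 1-based inclusive `contig:start-end` strings, the form many tools accept for a region of interest. The function name `chunk_regions` and the toy contig length are assumptions for illustration only; in practice the length would come from your BAM header or reference index, and how your caller takes a region is something to check in its own documentation.

```python
def chunk_regions(contig, length, n_chunks):
    """Split a contig of `length` bases into `n_chunks` contiguous,
    non-overlapping regions, returned as 1-based inclusive
    "contig:start-end" strings."""
    if n_chunks < 1 or length < n_chunks:
        raise ValueError("need 1 <= n_chunks <= length")

    # Spread the remainder over the first `extra` chunks so that
    # chunk sizes differ by at most one base.
    base, extra = divmod(length, n_chunks)

    regions = []
    start = 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        regions.append(f"{contig}:{start}-{end}")
        start = end + 1  # next chunk begins right after this one
    return regions


if __name__ == "__main__":
    # Toy example: a hypothetical 100 bp contig cut into 8 chunks.
    for region in chunk_regions("chr1", 100, 8):
        print(region)
```

Each region string would then be handed to one caller instance. The regions themselves do not overlap, but a caller reading from the indexed BAM will normally still see reads that span a boundary, so how calls right at the chunk edges behave depends on the caller.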

It's only if your caller supports neither of those scenarios that you will need to split your BAM file, and even then, you could just use samtools to split your BAM if you know the intervals. Splitting by chromosome is a very crude approach to multi-threading, but there are still a number of callers out there that do take this approach.
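If you do end up physically splitting the file, one way to do it with samtools is `samtools view -b` with a region argument on a coordinate-sorted, indexed BAM. The sketch below only builds the command lines; the helper names, the input path and the output naming scheme are assumptions for illustration. Note that `samtools view` returns every alignment overlapping the requested region, so a read spanning a boundary ends up in both neighbouring chunks rather than being lost from either.

```python
import subprocess


def samtools_split_commands(bam_path, regions, out_prefix):
    """Build one `samtools view` command per region.

    `regions` are "contig:start-end" strings; `bam_path` must be
    coordinate-sorted and indexed for region queries to work.
    Output file names are a made-up convention for this sketch.
    """
    commands = []
    for region in regions:
        # Turn "chr1:1-1000" into a file-name-safe tag "chr1_1_1000".
        tag = region.replace(":", "_").replace("-", "_")
        out_path = f"{out_prefix}.{tag}.bam"
        # -b: write BAM output, -o: output file name.
        commands.append(
            ["samtools", "view", "-b", "-o", out_path, bam_path, region]
        )
    return commands


def run_all(commands):
    """Run the commands one after another (requires samtools on PATH)."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical input file and intervals; only prints the commands.
    cmds = samtools_split_commands(
        "sample.bam", ["chr1:1-1000", "chr1:1001-2000"], "sample.chunk"
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

Each output BAM would need its own index (`samtools index`) before a caller can use it with random access.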

modified 11 months ago • written 11 months ago by d-cameron2.3k