Hi,
I have some C++ code I've written that uses the BamTools API to count the number of reads that overlap a set of intervals in hg19. I'm trying to run this on ~400 BAM files using SGE (24 threads per node), but the I/O burden is wreaking havoc with the cluster I'm using. Currently, my code does something along the lines of:
BamReader reader;
reader . Open(bamName);
while(reader . GetNextAlignment(al))
{
do stuff...
}
Which is the most straightforward use of the API to read BAM files I can imagine.
My guess is that BamTools provides random access to the file, and all I'm looking for is to read the BAM line by line. Is there any way to do this with the API, or am I better off opening a pipe to samtools view?
For a number of reasons, I cannot use GATK DepthOfCoverage or SamTools pileup since I need the read names for my coverage analysis.
Thanks!
EDIT: I am using these BAM files to compare interval coverage from another sequencing pipeline to one I am developing.
The BAM files I am looking at come from a different project that has alignment data for families. There is one BAM file per family, with each member identifiable by a set of read group IDs corresponding to different flow cells. These files already come to me sorted and indexed.
I'm looking at coverage for about 6M short intervals throughout hg19, ranging from ~10 bp to ~200 bp.
I specifically need the read IDs because I have noticed that some read fragments are small enough that the same interval is present in both read pairs. If I didn't have read IDs, I wouldn't be able to filter reads that cover the same region twice.
I can't quite tell from your rough example, but are you using the BamTools API to seek to the positions you're interesting in? This will require a sorted and indexed bam file. If you're not, and you're checking for interval overlaps in the inner loop ("do stuff"), you could be wasting a lot of work and IO (depending on how sparse the intervals are).
I'm not seeking within the BAM file, just streaming it in line by line. You make a great point that I could easily be using intervals instead.
The following code in bedtools (for the multibamcov tool) uses a MultiReader to seek to regions identified by specific BED entries and count coverage. This may help. https://github.com/arq5x/bedtools/blob/master/src/multiBamCov/multiBamCov.cpp#L106-L215
If I end up using intervals, this is definitely the way to do it. I'll give this a shot unless I or someone else come up with a better idea. Thanks!
"... SamTools pileup since I need the read names for my coverage analysis." but could you use samtools view -L my.bed my.bam ?
Thanks Pierre for the idea. Unfortunately, I didn't initially mention that I need the read IDs to filter overlapping read pairs. But I'm glad to learn about another useful samtools feature :)
bug you'll get the read IDs with "samtools view"
Whoops...misunderstood your comment, Pierre. I'm looking at my local documentation for samtools and the documentation and sourceforge and don't see anything about a -L flag with bed coordinates. Can you point me to the samtools version that has this option?
$ samtools-0.1.18/samtools view
Thanks again for the tip...I just timed samtools view with -L and it worked really well and fairly quickly
After trying using BamTools with SetRegion, the time to get the coverage at all 6 million loci went from less than half an hour to almost 6 hours.
Thanks to Pierre though, I looked into the SamTools C API and Heng Li's code for the latest version of SamTools (which has the -L flag, almost exactly like what I'm looking for) and was able to use that code to get coverage at these intervals according to my specifications while limiting how much time the code takes. It now runs in about 17 minutes.
If anyone is interested in the code, let me know, but it's really just a derivation of sam_view from samtools-0.1.18 by Heng Li.
Thanks again for everyone's help!
Can you split your set of intervals an run your program for each sub-set on a different node ?
I can split by interval, but since I'm already running 400+ BAM files in parallel on a 16 node cluster, I'd rather not increase the number of jobs I'm submitting by an order of magnitude.
Are you not actually using the BamMultiReader? And based on your use case, it sounds like you are seeking to specific intervals. This would likely lead to a bunch of I/O ops, especially if you are interrogating many intervals. Worst case, you will do 400 seeks for each new interval.
Since it's only one file per person, I don't need BamMultiReader. However, you are right, seeking to intervals is probably the way to go. I guess I'm still a little concerned that it could be a huge IO draw while it's running, but it can just run in a much shorter period of time. If I limit the number of jobs that can be submitted at once, it should be fine though.