Question

Extending the BAM index to include mate info

0

Entering edit mode

6.9 years ago

novice ★ 1.1k

I have had this problem for quite sometime now: I want to extract reads if they or their mates are mapped to a region. The default behavior of samtools when given a region is to extract reads based only on their own RefID and Position, disregarding the mate's RefID and Position. Since the mapping info of each read's mate is included on the same line in the SAM format, it's easy to write a script to do that, but samtools remains exponentially faster, I think mainly because of random-access facilitated by the BAM index rather than line-by-line processing of the SAM file. So I have decided to take some time to try and optimize read-extraction (according to my criteria) because I plan to do that a lot.

After spending some time trying to understand samtools code, I realize the issue is that the BAM index doesn't contain the mate information (which makes sense because it is not used by samtools). However, I must admit that I'm still intimidated by the code. I thought maybe someone who's worked on samtools development before could point me in the right direction. I would like to add two fields to the BAM index: mate_tid, and mate_position. Then the next step would be to extend the cases for including a read when given a region to account for the mate info, but I think that part would be easy.

PS: please also let me know if I'm completely off in my understanding of the BAM index!

samtools index • 1.2k views

ADD COMMENT • link 6.9 years ago by novice ★ 1.1k

0

Entering edit mode

Before you reengineer half of htslib, have you tried instead:

Hashing the names of alignments in your ROI (removing those where both mates are in the ROI).
Storing the mate position
Grouping the mate positions together in reasonable size ranges
Fetching those ranges and checking each read name against the hash from (1) (removing once it's found).

That will run faster than actually doing what you propose and be much much much easier to implement (the hashing stuff is already available in the htslib headers).

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Entering edit mode

Are you saying that I do the hashing and so on for each region query? If so, I don't see how that's faster. I want to hash/index once then do gazillion region queries.

ADD REPLY • link 6.9 years ago by novice ★ 1.1k

0

Entering edit mode

That's the trick, you don't have to do the gazillion region queries that you would have otherwise done. Since the mates will mostly cluster together, you'll do a couple region queries. This will be vastly faster than what you're proposing in your original question.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k