I have a set of regions in the genome (about 4 million regions, parametrized by start and end genomic cooridnates) that I want to align BAM reads into. I want to get all reads that overlap that set of coordinates.
The naive way to do this is very slow and I'd like to use interval trees. However, to use interval trees, I'd have to first index all the BAM reads into a tree, and then, for each coordinate in my set of regions, look for all the reads within the tree.
My concern is that this requires loading all of the BAM reads into memory since I have to index them into the interval tree data structure.
Is there a way around this, or a Python library that does this already? Any help would be appreciated.
p.s. I'd prefer to do this within Python and not use BEDTools unless there's an easy way to integrate Python with BEDTools that someone can recommend?