I'm using pysam's bam fetch() method to pull reads out of a few random regions of the genome. Since fetch returns an iterator, I assumed it would be approximately as quick as the file seek operation, but it appears to be performing a substantial amount of work to pull out the first read in regions of high coverage:
In : %timeit bam.fetch("chr1", 121485288, 121485289).next() 1 loops, best of 3: 3.4 s per loop In : bam.count("chr1", 121485288, 121485289) Out: 849385
In cases like this with high coverage, I'd like to only pull out the first thousand reads or so in order to speed up the operation. Is there any way to run this query so that it doesn't spend almost all the time (3.5 out of 5 seconds) accessing the first read? Is this a limitation of the bam format?