I possess numerous sorted BAM files; however, for my project, I am required to randomly select a subset of reads (1e5) from them. I have explored the option of converting a pysam object to a list, but encountered issues with substantial memory usage and slow processing. Similarly, the downsampling APIs of samtools and picard present similar challenges. Is there any efficiency may?
But bear in mind that sam/bam are alignment-centric rather than read-centric formats, so records are not necessarily the same as reads. Reformat is designed to be very fast and low-memory, and as such, operates on records (lines) so it ignores situations where you have a split alignment such that half is at the beginning of the bam and half is at the end of the bam, since keeping those together would require slow random-access and/or a lot of memory (which is what you are observing). Ultimately, it's just not an efficient format to work with for this kind of problem; it works better to subsample fastqs.