I possess numerous sorted BAM files; however, for my project, I am required to randomly select a subset of reads (1e5) from them. I have explored the option of converting a pysam object to a list, but encountered issues with substantial memory usage and slow processing. Similarly, the downsampling APIs of samtools and picard present similar challenges. Is there any efficiency may?
--subsample FLOAT Keep only FLOAT fraction of templates/read pairs
--subsample-seed INT Influence WHICH reads are kept in subsampling [0]
-s INT.FRAC Same as --subsample 0.FRAC --subsample-seed INT
But bear in mind that sam/bam are alignment-centric rather than read-centric formats, so records are not necessarily the same as reads. Reformat is designed to be very fast and low-memory, and as such, operates on records (lines) so it ignores situations where you have a split alignment such that half is at the beginning of the bam and half is at the end of the bam, since keeping those together would require slow random-access and/or a lot of memory (which is what you are observing). Ultimately, it's just not an efficient format to work with for this kind of problem; it works better to subsample fastqs.
I have a feeling that that is what OP means by
If the data contains paired-end reads then name sorting the BAM would also be a requirement in addition to perhaps getting 1e5 reads from each end.