Question

subset a bam file

19 months ago

Ming • 0

I have many sorted BAM files, but for my project I need to randomly select a subset of reads (1e5) from them. I tried converting a pysam object to a list, but that used a lot of memory and was slow. The downsampling APIs of samtools and picard present similar challenges. Is there a more efficient way?

bam pysam NGS samtools • 1.4k views

updated 19 months ago by GenoMax 152k • written 19 months ago by Ming • 0

score 0 · Answer 1 · 2023-11-25

19 months ago

Pierre Lindenbaum 166k

How about the "-s" option of samtools view? It keeps a fraction of the templates rather than a fixed number of reads:

      --subsample FLOAT      Keep only FLOAT fraction of templates/read pairs
      --subsample-seed INT   Influence WHICH reads are kept in subsampling [0]
  -s INT.FRAC                Same as --subsample 0.FRAC --subsample-seed INT
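Since these options take a fraction rather than a read count, the fraction has to be derived from the size of the file. Below is a minimal sketch of that arithmetic in Python. The function name, file names, and seed are hypothetical, and the total record count is assumed to have been obtained beforehand (for example with samtools view -c). The result is only approximately the target size, because each template is kept or dropped independently.

```python
def subsample_command(in_bam, out_bam, target, total, seed=42):
    """Build a samtools view command that keeps roughly `target` of `total` records.

    Hypothetical helper: `total` must be counted beforehand; samtools
    subsampling is probabilistic, so the output size is approximate.
    """
    if target <= 0 or total <= 0:
        raise ValueError("target and total must be positive")
    # Fraction of records to keep, capped at 1.0 (keep everything).
    fraction = min(1.0, target / total)
    return [
        "samtools", "view", "-b",
        "--subsample", f"{fraction:.6f}",
        "--subsample-seed", str(seed),
        "-o", out_bam,
        in_bam,
    ]


if __name__ == "__main__":
    # Example: aim for about 1e5 records from a file holding 8 million.
    cmd = subsample_command("reads.bam", "subsampled.bam", 100_000, 8_000_000)
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where samtools is installed.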


score 0 · Answer 2 · 2023-11-25

If you need exactly 1e5 records, you can do that with BBTools:

reformat.sh in=reads.bam out=subsampled.bam srt=100000 primaryonly

But bear in mind that SAM/BAM are alignment-centric rather than read-centric formats, so records are not necessarily the same as reads. Reformat is designed to be very fast and low-memory, and as such it operates on records (lines). That means it ignores situations where you have a split alignment with one half at the beginning of the BAM and the other half at the end, since keeping those together would require slow random access and/or a lot of memory (which is what you are observing). Ultimately, it's just not an efficient format to work with for this kind of problem; it works better to subsample the FASTQs.
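For anyone who wants to stay in Python, one way to draw exactly N records in a single pass, without loading the whole file into a list, is reservoir sampling. The sketch below is an illustration of that general technique, not a description of how Reformat works internally. The function name is hypothetical, and it treats every record independently, so it has the same records-versus-reads caveat described above.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return k records chosen uniformly at random from an iterable.

    Uses Algorithm R: one pass over the input, memory proportional to k.
    The sample is returned in the original input order, so records from
    a coordinate-sorted file stay sorted.
    """
    rng = random.Random(seed)
    reservoir = []  # holds (input_index, record) pairs
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append((i, rec))
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = (i, rec)
    # Restore input order before returning.
    reservoir.sort(key=lambda pair: pair[0])
    return [rec for _, rec in reservoir]


if __name__ == "__main__":
    # Stand-in for a stream of alignment records.
    print(reservoir_sample(range(1_000_000), 5, seed=1))
```

With pysam, the iterable could be an open pysam.AlignmentFile("reads.bam", "rb") (iterated with fetch(until_eof=True)), and the sampled records written to a second AlignmentFile opened with mode "wb" and template set to the input file. That usage is an assumption about your setup and is not tested here.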