subset a bam file
2
0
Entering edit mode
11 months ago
Ming • 0

I possess numerous sorted BAM files; however, for my project, I am required to randomly select a subset of reads (1e5) from them. I have explored the option of converting a pysam object to a list, but encountered issues with substantial memory usage and slow processing. Similarly, the downsampling APIs of samtools and picard present similar challenges. Is there any efficiency may?

bam pysam NGS samtools • 913 views
ADD COMMENT
0
Entering edit mode
11 months ago

How about the option "-s" of samtools view ?

      --subsample FLOAT      Keep only FLOAT fraction of templates/read pairs
      --subsample-seed INT   Influence WHICH reads are kept in subsampling [0]
  -s INT.FRAC                Same as --subsample 0.FRAC --subsample-seed INT
ADD COMMENT
0
Entering edit mode
11 months ago

If you need exactly 1e5 records, you can do that with BBTools:

reformat.sh in=reads.bam out=subsampled.bam srt=100000 primaryonly

But bear in mind that sam/bam are alignment-centric rather than read-centric formats, so records are not necessarily the same as reads. Reformat is designed to be very fast and low-memory, and as such, operates on records (lines) so it ignores situations where you have a split alignment such that half is at the beginning of the bam and half is at the end of the bam, since keeping those together would require slow random-access and/or a lot of memory (which is what you are observing). Ultimately, it's just not an efficient format to work with for this kind of problem; it works better to subsample fastqs.

ADD COMMENT
0
Entering edit mode

it works better to subsample fastqs.

I have a feeling that that is what OP means by

I am required to randomly select a subset of reads (1e5) from them.

If the data contains paired-end reads then name sorting the BAM would also be a requirement in addition to perhaps getting 1e5 reads from each end.

ADD REPLY

Login before adding your answer.

Traffic: 790 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6