5.9 years ago by
Santiago de Compostela, Spain
I think the idea behind this question is to generate a reduced BAM file based on coverage, keeping only the reads and regions that would actually be useful for whatever downstream analysis Abhi may want to perform. I can foresee only 2 major sets of applications for this idea, which in my opinion shouldn't be addressed this way: a) you want to simplify a later coverage calculation, or b) you want to work only on regions with coverage above a certain threshold. In either case, the usual procedure is to work directly with the entire BAM and set filters/thresholds in the analysis itself. The tools designed to deal with BAM files are indeed optimized to apply such filters/thresholds when needed.
I could only understand doing such extra work if you intend to run some intensive processing on that BAM file later, where a significantly smaller file would save disk space and therefore reduce run times. If you still want to go for such a filtering process, the easiest thing I can think of would be a first pass that generates a BED file with the regions whose coverage is above the desired threshold (bedtools' coverageBed should do the job, and it'll also be very fast), and then a second pass that filters the BAM file by those regions (samtools should do the job, and again that should be very fast).
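To make the first pass a bit more concrete, here is a minimal sketch of the thresholding step in Python. It is only an illustration under some assumptions: it expects coverage already exported as sorted bedGraph lines (chrom, start, end, depth), which you could get from something like `bedtools genomecov -ibam input.bam -bg` (note that this is genomecov rather than the coverageBed I mentioned, since it reports depth along the genome without needing predefined intervals). The function name `regions_above_threshold` and the file names are made up for the example.

```python
def regions_above_threshold(bedgraph_lines, min_depth):
    """Yield merged (chrom, start, end) regions with depth >= min_depth.

    Assumes the input is sorted by chromosome and start position,
    with whitespace-separated columns: chrom, start, end, depth.
    """
    current = None  # region being extended, as (chrom, start, end)
    for line in bedgraph_lines:
        line = line.strip()
        # Skip blank lines and optional bedGraph header/comment lines.
        if not line or line.startswith(("track", "#")):
            continue
        chrom, start, end, depth = line.split()[:4]
        start, end, depth = int(start), int(end), float(depth)
        if depth < min_depth:
            continue
        # Extend the current region if this interval touches or overlaps it.
        if current and current[0] == chrom and start <= current[2]:
            current = (chrom, current[1], max(current[2], end))
        else:
            if current:
                yield current
            current = (chrom, start, end)
    if current:
        yield current


def write_bed(regions, path):
    """Write regions as a 3-column, tab-separated BED file."""
    with open(path, "w") as handle:
        for chrom, start, end in regions:
            handle.write(f"{chrom}\t{start}\t{end}\n")
```

Once the regions are written with `write_bed(...)` to, say, `regions.bed`, the second pass would be something like `samtools view -b -L regions.bed input.bam > filtered.bam`. Keep in mind that this keeps every read overlapping a region, so reads hanging over the edges are kept as well.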