Does anyone know how to downsample a high coverage bam file such that it has a ceiling on number of reads at each position? Most of the downsampling tools (Picard, GATK walkers) downsamples the original bam file to a certain fraction of coverage, but that's not suited to my particular purposes. I found the --number option in the GATKCommandLine but for some strange reason it always stops working after about a hundred bases. If anyone has used the --number option in GATK successfully please let me know so I can investigate that more, if not please does anyone have any suggestion? Thanks!
Seems like there ought to be a better way to do it, but the work flow I can come up with off the top of my head is:
1) calculate depth in windows across the genome. (lots of info on doing this in other biostar questions)
samtools view -R goodregions.bed file.bam >newfile.sam
3) for each window that exceeded your threshold, calculate the fraction of the reads you want to keep, then downsample those regions accordingly:
samtools view file.bam 1:1-10000 | sed -n '0~100p' >>newfile.sam