Downsample BAM for targeted NGS panels
2.7 years ago
Hello

I am downsampling BAMs from control samples down to 10% to determine the minimum read depth at which we can still detect a given set of SNPs.

What approach do people usually adopt for this?

1) Downsample the same BAM 3 times with different seeds and then take the average of the read depth?

sambamba view -h -f bam -t 10 --subsampling-seed=3 -s 0.1 "$BAM" -o downsample_0.10_seed3.bam
sambamba view -h -f bam -t 10 --subsampling-seed=2 -s 0.1 "$BAM" -o downsample_0.10_seed2.bam
sambamba view -h -f bam -t 10 --subsampling-seed=1 -s 0.1 "$BAM" -o downsample_0.10_seed1.bam


2) Do it just once?

sambamba view -h -f bam -t 10 --subsampling-seed=34223 -s 0.1 "$BAM" -o downsample_0.10.bam


Is --subsampling-seed relevant for reproducibility, or is it just an arbitrary number?

downsample targeted-NGS sambamba

I would expect you to get the same set of reads each time if you use the same seed.

I've edited my question a bit. So does this help with reproducibility as well?

You can easily test it by sampling a small number of reads.
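One way to run that test is to subsample the same small BAM twice with the same seed and compare the read names in the two outputs. Here is a minimal sketch, assuming samtools is on the PATH; the helper name same_reads and the READER override are hypothetical, and the whole name list is held in memory, so it only suits small files.

```shell
# Compare the read names found in two alignment files.
# Prints "identical" or "different"; returns 2 if either input yields no reads.
# READER is a hypothetical override (default: samtools view) so the helper
# can also be tried on plain tab-separated text.
same_reads() {
  reader="${READER:-samtools view}"
  names_a=$($reader "$1" | cut -f1 | sort)
  names_b=$($reader "$2" | cut -f1 | sort)
  if [ -z "$names_a" ] || [ -z "$names_b" ]; then
    echo "no reads found in one of the inputs" >&2
    return 2
  fi
  if [ "$names_a" = "$names_b" ]; then
    echo identical
  else
    echo different
  fi
}
```

For example, `same_reads run1.bam run2.bam` on two outputs produced with the same seed should print "identical" if the seed makes the subsampling reproducible (the file names are placeholders).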

You can also try reformat.sh from BBMap to do the subsampling. I think it should work with a BAM file, and it gives you a rich set of sampling options.

2.7 years ago

At Sheffield Children's NHS Foundation Trust, we already did this in 2013/4 and found that a total position read depth of 18 was the minimum at which one should be reporting [edit: germline] variants.

The general workflow was:

1. obtain a few dozen patient samples that had matched NGS and Sanger data over our regions of interest
2. downsample the aligned BAMs using Picard's DownsampleSam - I believe we chose 75%, 50%, and 25% random reads
3. check the lowest position read depth at which all Sanger-confirmed variants were still called
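Step 3 could be tallied along these lines. This is only a sketch, not the script we used: it assumes a hypothetical tab-separated table with one row per variant per downsampled BAM (variant ID, position read depth, and 1 or 0 for called or missed), and the function name is made up for illustration.

```shell
# Report the lowest observed depth above which no Sanger-confirmed variant was missed.
# Input (hypothetical layout): variant_id <TAB> depth <TAB> called (1 = called, 0 = missed)
min_reliable_depth() {
  awk -F'\t' '
    # Track the deepest position at which a variant was still missed.
    $3 == 0 && $2 + 0 > max_missed { max_missed = $2 + 0 }
    # Remember every depth at which a variant was called.
    $3 == 1 { called[n++] = $2 + 0 }
    END {
      best = -1
      for (i = 0; i < n; i++)
        if (called[i] > max_missed && (best < 0 || called[i] < best))
          best = called[i]
      if (best < 0) print "NA"; else print best
    }
  ' "$1"
}
```

Calls made at depths at or below the deepest miss are ignored, so a lucky call at very low depth does not pull the threshold down.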

That was it. To obtain better precision, one could generate even more downsampled BAMs. Had we had time to publish, my plan was to downsample in 5% decrements, from 100% to 5%.
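The 5% decrement series could be generated with a small loop. The sketch below only prints the DownsampleSam commands so they can be inspected before running; the function name and the output file naming are assumptions, and picard.jar is assumed to be in the working directory.

```shell
# Print (do not run) one DownsampleSam command per fraction, from 100% down to 5%.
# Arguments: input BAM, random seed. Output names are a hypothetical convention.
print_downsample_commands() {
  in_bam=$1
  seed=$2
  pct=100
  while [ "$pct" -ge 5 ]; do
    # Turn the percentage into a probability with two decimals, e.g. 75 -> 0.75
    prob=$(printf '%d.%02d' $((pct / 100)) $((pct % 100)))
    echo "java -jar picard.jar DownsampleSam INPUT=$in_bam OUTPUT=${in_bam%.bam}_${pct}pcReads.bam RANDOM_SEED=$seed PROBABILITY=$prob VALIDATION_STRINGENCY=SILENT"
    pct=$((pct - 5))
  done
}
```

Piping the output to `sh` would run the commands; the 100% line is kept only so that the series matches the range described above.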

It was through this process that we also inadvertently 'recovered' the missed GATK variants, i.e., we would frequently encounter Sanger-confirmed variants, not in the original BAM, but in one of the downsampled BAMs.

java -jar "${Picard_root}"picard.jar DownsampleSam \
  INPUT=Aligned_Sorted_PCRDuped_FiltMAPQ.bam \
  OUTPUT=Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam \
  RANDOM_SEED=50 PROBABILITY=0.75 \
  VALIDATION_STRINGENCY=SILENT ;

"${SAMtools_root}"samtools index Aligned_Sorted_PCRDuped_FiltMAPQ_75pcReads.bam ;


Kevin

I guess adding a fixed RANDOM_SEED will help to reproduce the results.

That's pretty much the plan: to downsample from 50% to 10% and then check the read depth at each step. I was opting for my second option of doing it just once, but it was suggested that I downsample the same BAM three times and then take the average read depth.
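Averaging over the three replicates could look something like this. It is a rough sketch that assumes each replicate's depth has already been written out in the three-column format of samtools depth (chromosome, position, depth), for instance with `samtools depth -a -b panel.bed`; the function name and the BED file name are placeholders.

```shell
# Print the mean depth of each replicate, then the mean across replicates.
# Each argument is a samtools depth style file: chrom <TAB> pos <TAB> depth
mean_depth_across_replicates() {
  for f in "$@"; do
    # Mean depth over all reported positions in this replicate
    awk -F'\t' -v name="$f" \
      '{ sum += $3 } END { if (NR) printf "%s\t%.2f\n", name, sum / NR }' "$f"
  done |
    # Echo the per-replicate lines and append the mean of the replicate means
    awk -F'\t' '{ print; total += $2 } END { if (NR) printf "mean\t%.2f\n", total / NR }'
}
```

Note that without -a, samtools depth skips positions with zero coverage, which would inflate the mean for heavily downsampled BAMs.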

So why did you use a random seed of 50 here ?

Spun a coin? Not sure; that is another parameter to test. I believe that, in a standard routine run, it should be left as null, so that the seed changes between runs.