Hello Bio-wizards,
I'm currently aligning (ultra-)deep whole-exome sequencing data with a target depth of coverage of 500x on bulk (FFPE) samples. I'm following the alignment steps from the GATK Best Practices. However, I'm wondering about the 'MarkDuplicates' step after alignment. If I understand it correctly, this step marks identical reads (same sequence and orientation) under the assumption that these duplicates were generated during library preparation and are not actual pieces of DNA from different cells in the bulk sample that happen to be identical. This reasoning makes sense at a standard ~10x depth of coverage. For my experiment (500x), however, you would expect on average about 2.5 read pairs to start at each base in the target region (for 100 bp paired-end sequencing), not counting any library artefacts (and assuming no mutations occur). This may not be a significant problem at 500x, but if you increase read depth beyond 1000x, we would start to see more than 5 actual biological duplicate reads being collapsed to 1.
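To put a rough number on this, here is a back-of-the-envelope sketch (my own simplified model, not part of any GATK tooling) that treats the number of read pairs starting at each base as Poisson-distributed and asks what fraction of pairs would survive if duplicates were collapsed by start coordinate alone. Note this is a worst case: real MarkDuplicates keys on the 5' coordinates of *both* mates, so fragment-length diversity makes the true rate of falsely marked biological duplicates considerably lower.

```python
import math

def kept_fraction(depth, read_len=100):
    """Fraction of read pairs kept if duplicates were collapsed by
    start coordinate alone (single-coordinate worst case).

    Pairs starting at a given base ~ Poisson(lam), with
    lam = depth / (2 * read_len). Of the lam expected pairs per base,
    at most one survives, i.e. P(at least one pair starts there).
    """
    lam = depth / (2 * read_len)
    return (1 - math.exp(-lam)) / lam

for depth in (10, 500, 1000):
    lam = depth / 200
    print(f"{depth:>5}x: ~{lam:.2f} pairs start per base, "
          f"~{100 * (1 - kept_fraction(depth)):.0f}% collapsed in this model")
```

Under this crude model, 10x loses almost nothing while 500x and 1000x lose the majority of pairs, which is exactly the worry raised above; the counterargument is that duplicate marking also uses the mate's position, so two biological fragments are only confused when both ends coincide.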
So I'm wondering: if you really think about it, is the 'best practice' of marking duplicates even a good idea in ultra-deep sequencing experiments? I'm questioning this because in most samples I only see a median depth of coverage of about 300x, which is quite far off the targeted 500x, and I'm looking for an explanation.
Nothing is removed by default. You are simply marking potential duplicates. Unless you use
--REMOVE_DUPLICATES
no reads will be removed.

Correct, but when, for example, calculating coverage statistics or doing variant calling, the marked duplicate reads are not taken into account, as far as I know. So it's effectively the same as them not being in the file.
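Right — marked reads carry the SAM "PCR or optical duplicate" FLAG bit (0x400, decimal 1024), and most downstream tools skip them by default (`samtools view -F 1024` drops them explicitly; GATK's callers ignore them via a read filter). As a small illustration of that filtering (the FLAG values below are made-up examples, not from a real BAM):

```python
DUPLICATE = 0x400  # SAM FLAG bit 1024: "PCR or optical duplicate"

# Hypothetical FLAG values: 99/147 is a properly paired read pair;
# adding 1024 to each marks the second copy of that pair as a duplicate.
flags = [99, 147, 99 + 1024, 147 + 1024]

kept = [f for f in flags if not (f & DUPLICATE)]
print(kept)  # only the unmarked pair remains
```

So for depth-of-coverage numbers it is indeed as if the marked reads were removed, even though they are physically still in the BAM and can be recovered by clearing or ignoring the flag.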