I have 50% duplicates on WGS tumour samples, and while I was expecting the coverage to drop from 30x to roughly 15x, it actually falls below 8x. So I am trying to figure out why I have less coverage than expected.
I can see that I have a warning/failure on the per-tile sequence quality.
I suspect this affects the number of usable reads, but could it affect the coverage as well?
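Some rough arithmetic may help here: duplicates alone only halve the coverage, but downstream filters stack multiplicatively on top of that. The loss fractions below are purely illustrative assumptions, not values from any specific run.

```python
# Back-of-the-envelope: why 50% duplicates can leave well under half
# the raw coverage. All loss fractions are illustrative assumptions.

raw_coverage = 30.0

dup_rate = 0.50                      # fraction of reads marked as duplicates
after_dups = raw_coverage * (1 - dup_rate)
print(f"after duplicate removal: {after_dups:.1f}x")   # 15.0x

# Further losses that coverage tools typically apply on top (assumed):
unmapped_or_low_mq = 0.10   # unmapped reads / reads below the MQ cutoff
low_base_quality   = 0.15   # bases below the BQ cutoff
overlapping_pairs  = 0.10   # overlapping mate bases counted only once

effective = after_dups
for loss in (unmapped_or_low_mq, low_base_quality, overlapping_pairs):
    effective *= (1 - loss)

print(f"effective coverage: {effective:.1f}x")   # ~10.3x
```

With these made-up fractions the effective coverage already lands around 10x; with a harsher quality profile (e.g. a per-tile quality failure removing more bases) it is easy to end up below 8x even though the duplicate rate alone only predicts 15x.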
It will take a while because the analysis is exhaustive, but the resulting metrics will give you a thorough picture, including the percentage of optical duplicates, marked duplicates, and overlapping bases, among other things.
So I had calculated the duplicate rate from the FastQC report; however, that figure is only an estimate based on the first 100,000 sequences, so it is not entirely accurate. According to CollectWgsMetrics and the MarkDuplicates metrics, the duplicate rate is much higher, which fits the mean coverage calculated with GATK DepthOfCoverage.
However, the mean coverage calculated by CollectWgsMetrics is smaller than the one I calculated from the DepthOfCoverage average column. I suppose this happens because CollectWgsMetrics calculates the mean coverage over the genome territory (the non-N bases of the genome), after all filters are applied.
Is that right?
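To make the comparison concrete, here is a small sketch of the two styles of mean-coverage calculation with made-up numbers (all values below are assumptions for illustration only): one divides total aligned bases by the full reference length, the other divides *filtered* bases by the non-N territory.

```python
# Illustrative comparison of two mean-coverage definitions.
# All numbers are assumed for the sake of the example.

genome_length    = 3_100_000_000   # full reference length (assumed)
genome_territory = 2_900_000_000   # non-N bases only (assumed)

aligned_bases  = 60_000_000_000    # all aligned bases (assumed)
filtered_bases = 45_000_000_000    # after MQ/BQ/duplicate/overlap filters (assumed)

# DepthOfCoverage-style: unfiltered bases over the full reference
mean_doc = aligned_bases / genome_length
# CollectWgsMetrics-style: filtered bases over the non-N territory
mean_wgs = filtered_bases / genome_territory

print(f"unfiltered / full length:   {mean_doc:.1f}x")
print(f"filtered / non-N territory: {mean_wgs:.1f}x")
```

Note that the smaller denominator (territory) by itself would *raise* the number; the CollectWgsMetrics value comes out lower only when the filters remove proportionally more from the numerator, which is consistent with what you are seeing.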
Also, I am not sure which filters are applied by CollectWgsMetrics. Does anyone know?
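For reference, a sketch of the invocation with the filter-related options written out explicitly. The values shown are the documented defaults in recent Picard versions (check your own version's help output to confirm); the file names are placeholders.

```shell
# CollectWgsMetrics filter options, spelled out with their defaults
# (version-dependent; file names are placeholders):
#   MINIMUM_MAPPING_QUALITY=20  -> reads below MQ 20 are ignored
#   MINIMUM_BASE_QUALITY=20     -> bases below BQ 20 are ignored
#   COVERAGE_CAP=250            -> per-locus depth is capped at 250
#   COUNT_UNPAIRED=false        -> unpaired reads are not counted
java -jar picard.jar CollectWgsMetrics \
    I=sample.bam \
    O=wgs_metrics.txt \
    R=reference.fasta \
    MINIMUM_MAPPING_QUALITY=20 \
    MINIMUM_BASE_QUALITY=20 \
    COVERAGE_CAP=250 \
    COUNT_UNPAIRED=false
# In addition, reads flagged as duplicates, secondary alignments and
# unmapped reads are excluded, and bases in overlapping mate pairs are
# counted only once.
```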