I am trying to understand in more depth how duplicates arise and how to deal with them in NGS analysis. First of all, I wanted to understand the FastQC read duplication report, for which Istvan Albert's tutorial is really good (Revisiting the FastQC read duplication report).
My FASTQ file produced this report:
The title shows the proportion of duplicated reads, which (as far as I understand) is very high. I have run rmdup and MarkDuplicates on this file, and the proportion of duplicated reads detected and removed/marked is only around 15%.
So my first question is: do duplicate-removal algorithms not remove all duplicated reads?
My second question: for the simple simulation in Istvan Albert's post, I can understand what the red and blue lines are telling me. But what do the red and blue lines tell me in a more realistic scenario like this one (e.g. why is there a peak between 9 and >10)?
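To make the second question concrete, here is a minimal sketch of how I understand the two curves to be computed. It is my own reading, not FastQC's actual code: I assume the blue line is the share of all reads at each duplication level, the red line is the share of distinct sequences at each level, and the last bin collects every sequence seen 10 or more times. The function name `duplication_profile` and the toy reads are made up for illustration.

```python
from collections import Counter


def duplication_profile(reads, max_level=10):
    """Return (pct_of_total, pct_of_distinct) keyed by duplication level.

    Levels 1..max_level-1 are exact copy counts; the last level is an
    open-ended bin holding every sequence seen max_level or more times
    (an assumption about how the ">10" bin is built).
    """
    seq_counts = Counter(reads)              # sequence -> number of copies
    total_reads = sum(seq_counts.values())
    distinct = len(seq_counts)

    reads_per_level = Counter()              # reads contributed by each level
    distinct_per_level = Counter()           # distinct sequences in each level
    for copies in seq_counts.values():
        level = min(copies, max_level)       # fold high counts into last bin
        reads_per_level[level] += copies
        distinct_per_level[level] += 1

    levels = range(1, max_level + 1)
    pct_total = {lv: 100.0 * reads_per_level[lv] / total_reads for lv in levels}
    pct_distinct = {lv: 100.0 * distinct_per_level[lv] / distinct for lv in levels}
    return pct_total, pct_distinct


# Toy input: one sequence seen 12 times, one seen twice, two seen once.
reads = ["ACGT"] * 12 + ["TTGA"] * 2 + ["GGCC", "CATG"]
blue, red = duplication_profile(reads)
print(blue[10], red[10])  # 75.0 25.0
```

In this toy case a single heavily duplicated sequence puts 75% of all reads into the last bin while accounting for only 25% of the distinct sequences. If the real report bins counts in a similar open-ended way, that might be part of why my plot rises at the last bins, but I am not sure.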