I am using Qualimap for quality control (QC) of my bowtie2 alignments, and to compare the QC reports before and after marking duplicates in the BAM file with samtools markdup. While comparing these reports, I noticed that Qualimap reports different insert sizes for the same sample. For example:
One of my samples had a mean insert size of 181 before marking duplicates (i.e. bowtie2 mapping → SAM output → convert SAM to the position-sorted BAM that Qualimap requires), and gave the following Qualimap graph:
Then I marked duplicates in this sample using samtools markdup. However, markdup depends on samtools fixmate having been run first, and fixmate needs the reads of each pair to be grouped together by read name. So I did the following: bowtie2 mapping → SAM output → convert SAM to BAM → samtools collate to group reads by name → samtools fixmate → samtools sort to sort by position, because that is what samtools markdup requires → samtools markdup. This produced a BAM file in which the duplicate reads are marked, and I ran Qualimap on it. This time the mean insert size was 213, with the following graph:
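For reference, the second workflow can be written out roughly as below. This is only a minimal sketch of the steps described above, not my exact commands: the file names and the helper names `markdup_commands` and `run_pipeline` are hypothetical, and the `-m` flag on fixmate is an assumption (it adds the mate score tags that markdup uses). The bowtie2 step is left out and its SAM output is taken as the input.

```python
import subprocess


def markdup_commands(sam, prefix):
    """Return the samtools commands for the described workflow, in order.

    `sam` is the bowtie2 SAM output; `prefix` is a hypothetical base name
    used for all intermediate and final files.
    """
    return [
        # SAM -> unsorted BAM
        ["samtools", "view", "-b", "-o", f"{prefix}.bam", sam],
        # group the reads of each pair together by name
        ["samtools", "collate", "-o", f"{prefix}.collated.bam", f"{prefix}.bam"],
        # fill in mate coordinates and insert size fields (-m is assumed here)
        ["samtools", "fixmate", "-m", f"{prefix}.collated.bam", f"{prefix}.fixmate.bam"],
        # position sort, as required by markdup and Qualimap
        ["samtools", "sort", "-o", f"{prefix}.sorted.bam", f"{prefix}.fixmate.bam"],
        # mark duplicates
        ["samtools", "markdup", f"{prefix}.sorted.bam", f"{prefix}.markdup.bam"],
    ]


def run_pipeline(sam, prefix):
    """Run each step in turn and stop at the first failure."""
    for cmd in markdup_commands(sam, prefix):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("sample.sam", "sample")` would need samtools on the PATH. The intermediate `sample.sorted.bam` is the file in which the changed insert sizes can already be checked, since it exists before markdup runs.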
This graph is completely different from the one without duplicate marking, even though the underlying data are exactly the same (apart from the duplicate flags themselves). The samtools documentation describes fixmate as follows: "fills in mate coordinates and insert size fields", so it does appear to modify the insert size fields, but I would not expect that to change the insert sizes themselves. Does anyone have an explanation for this?
By the way, this difference in insert size already appears before the actual duplicate marking with samtools markdup: it is also visible in the position-sorted BAM produced by samtools sort after samtools fixmate.
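One way to check whether the insert size field itself changed would be to compare column 9 (TLEN) of the SAM records before and after fixmate. The following is a minimal sketch, assuming both BAM files have been dumped to SAM text (for example with samtools view) and are read line by line. The function names are hypothetical, and the averaging rule used here (positive TLEN only, one value per pair) is an assumption and may not match how Qualimap computes its mean insert size.

```python
import statistics


def tlen_table(sam_lines):
    """Map (read name, mate number) -> TLEN for primary alignments."""
    table = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800)
        mate = 1 if flag & 0x40 else 2  # 0x40 = first read in pair
        table[(fields[0], mate)] = int(fields[8])
    return table


def mean_positive_tlen(sam_lines):
    """Mean of the positive TLEN values (one per pair), or None if there are none."""
    values = [t for t in tlen_table(sam_lines).values() if t > 0]
    return statistics.mean(values) if values else None


def changed_tlens(before_lines, after_lines):
    """Return {(name, mate): (tlen_before, tlen_after)} for records whose TLEN differs."""
    before = tlen_table(before_lines)
    after = tlen_table(after_lines)
    return {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
```

If `changed_tlens` returns an empty dictionary for the two files, the TLEN fields are identical and the difference would have to come from somewhere else. If not, the listed read names show which records fixmate rewrote.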