I am using FastQC and the EstimateLibraryComplexity utility from Picard Tools to estimate the complexity of my paired-end RNA-seq libraries. I have some questions about the two tools.
First, according to the documentation of EstimateLibraryComplexity:
Attempts to estimate library complexity from sequence of read pairs alone. Does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
So it groups reads whose first 5 bp are identical and then uses a maximum difference rate of 0.03 to decide whether they are duplicates. Does that mean that, if my read length is 100 bp and MAX_DIFF_RATE is 0.03, reads will be called duplicates unless they differ by more than 3 bp?
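To make my reading of the documentation concrete, here is a minimal Python sketch of the rule as I understand it. It is only my interpretation of the quoted text, not Picard's actual code: the function names and the `with_mismatches` helper are made up for illustration, and I am assuming the mismatch rate is computed over a single 100 bp read, whereas Picard may well compute it differently for read pairs.

```python
def mismatch_rate(read_a: str, read_b: str) -> float:
    """Fraction of positions that differ in an ungapped comparison."""
    if len(read_a) != len(read_b):
        raise ValueError("ungapped comparison needs reads of equal length")
    mismatches = sum(a != b for a, b in zip(read_a, read_b))
    return mismatches / len(read_a)


def looks_like_duplicate(read_a: str, read_b: str,
                         min_identical_bases: int = 5,
                         max_diff_rate: float = 0.03) -> bool:
    """My interpretation of the documented rule (an assumption, not Picard's code)."""
    # Step 1: reads are only compared if their first N bases are identical.
    if read_a[:min_identical_bases] != read_b[:min_identical_bases]:
        return False
    # Step 2: duplicates if the mismatch rate is <= max_diff_rate.
    return mismatch_rate(read_a, read_b) <= max_diff_rate


def with_mismatches(read: str, positions: list[int]) -> str:
    """Hypothetical helper: return a copy of `read` with substitutions at `positions`."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    bases = list(read)
    for i in positions:
        bases[i] = swap[bases[i]]
    return "".join(bases)


ref = "ACGT" * 25  # a 100 bp toy read
print(looks_like_duplicate(ref, with_mismatches(ref, [10, 20, 30])))      # 3 mismatches
print(looks_like_duplicate(ref, with_mismatches(ref, [10, 20, 30, 40])))  # 4 mismatches
```

Under this interpretation, two 100 bp reads with 3 mismatches (rate 0.03) would still count as duplicates, and 4 mismatches (rate 0.04) would not. Is that what the tool actually does?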
Second, how many bp does FastQC take into consideration when determining duplicates? Is there an option in FastQC to change the minimum number of bases that are compared between reads for them to be considered duplicates?