FastQC and EstimateLibraryComplexity (Picard): Library Complexity and/or Duplicated Reads
8.3 years ago
komal.rathi ★ 4.0k

Hi,

I am using FastQC and the EstimateLibraryComplexity utility from Picard Tools to estimate the complexity of my paired-end RNA-seq libraries. I have a couple of questions about the two tools.

First, according to the documentation of EstimateLibraryComplexity,

Attempts to estimate library complexity from sequence of read pairs alone. Does so by sorting all reads by the first N bases (5 by default) of each read and then comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).

So it groups reads that share their first 5 bp and then uses a maximum difference rate of 0.03 to decide whether they are duplicates. Does that mean that if my read length is 100bp and the MAX_DIFF_RATE is 0.03, then it will assume reads to be duplicated unless they are different by more than 3 bp?

Second, how many bp does FastQC take into consideration when determining duplicates? Is there an option in FastQC where you can change the minimum number of bases to be compared between reads in order for them to be considered duplicates?

Thanks!

EstimateLibrarySize FastQC PicardTools • 5.1k views

"3bp or less" should be "more than 3bp"

Thanks I edited my question!

8.3 years ago
Dan D 7.3k

Does that mean that if my read length is 100bp and the MAX_DIFF_RATE is 0.03, then it will assume reads to be duplicated unless they are different by more than 3 bp?

Yes.
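
To make the arithmetic concrete, here is a minimal Python sketch of the mismatch-rate test only. It is not Picard's code: `is_duplicate` and `mutate` are hypothetical helper names, and the sketch ignores read pairing, the first-N-bases grouping, and anything else the real tool does. It just shows that for a 100 bp read and a rate of 0.03, three mismatches still count as a duplicate and four do not.

```python
def is_duplicate(read_a: str, read_b: str, max_diff_rate: float = 0.03) -> bool:
    """Ungapped comparison: duplicate if mismatches / length <= max_diff_rate.

    Assumption: only equal-length, non-empty reads are compared.
    """
    if not read_a or len(read_a) != len(read_b):
        return False
    mismatches = sum(a != b for a, b in zip(read_a, read_b))
    return mismatches / len(read_a) <= max_diff_rate


def mutate(seq: str, n: int) -> str:
    """Return seq with its last n bases each swapped for a different base."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    cut = len(seq) - n
    return seq[:cut] + "".join(swap[b] for b in seq[cut:])


read = "ACGT" * 25                          # a 100 bp read
print(is_duplicate(read, mutate(read, 3)))  # True: 3/100 = 0.03, within the limit
print(is_duplicate(read, mutate(read, 4)))  # False: 4/100 = 0.04, over the limit
```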

Second, how many bp does FastQC take into consideration when determining duplicates? Is there an option in FastQC where you can change the minimum number of bases to be compared between reads in order for them to be considered duplicates?

From the manual:

To cut down on the memory requirements for this module only sequences which first appear in the first 200,000 sequences in each file are analysed, but this should be enough to get a good impression for the duplication levels in the whole file. Each sequence is tracked to the end of the file to give a representative count of the overall duplication level. To cut down on the amount of information in the final plot any sequences with more than 10 duplicates are placed into grouped bins to give a clear impression of the overall duplication level without having to show each individual duplication value.

Because the duplication detection requires an exact sequence match over the whole length of the sequence, any reads over 75bp in length are truncated to 50bp for the purposes of this analysis. Even so, longer reads are more likely to contain sequencing errors which will artificially increase the observed diversity and will tend to underrepresent highly duplicated sequences.
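
The procedure described in the manual excerpt can be sketched roughly as follows. This is an illustration under assumptions, not FastQC's implementation: the function names are made up, the 200,000 / 75 / 50 values are taken from the text quoted above, and `percent_duplicated` is a simplified summary of my own rather than the figure FastQC actually reports.

```python
from collections import Counter
from typing import Iterable


def duplication_counts(reads: Iterable[str],
                       track_limit: int = 200_000,
                       long_read: int = 75,
                       truncate_to: int = 50) -> Counter:
    """Count exact-match duplicates, FastQC-style (simplified sketch).

    Reads longer than long_read are cut to truncate_to bases. Only sequences
    first seen within the first track_limit reads are tracked, but tracked
    sequences keep being counted to the end of the input.
    """
    counts: Counter = Counter()
    for index, read in enumerate(reads):
        key = read[:truncate_to] if len(read) > long_read else read
        if key in counts:
            counts[key] += 1          # already tracked: count every occurrence
        elif index < track_limit:
            counts[key] = 1           # new sequence inside the tracking window
    return counts


def percent_duplicated(counts: Counter) -> float:
    """Share of tracked reads that are extra copies (hypothetical summary)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - len(counts)) / total


reads = ["A" * 100, "A" * 50 + "C" * 50, "G" * 60, "G" * 60, "T" * 60]
counts = duplication_counts(reads)
print(percent_duplicated(counts))  # 40.0
```

Note that the first two 100 bp reads differ in their second half but collapse to the same 50 bp key, so they are counted as duplicates of each other.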

Thanks Dan D !

So if my BAM file has 100 bp reads, each read will be truncated to 50 bp and the program will determine duplication based on 100% identity over those 50 bp, which in essence is 50% (50/100) identity. So whereas EstimateLibraryComplexity uses 97% identity, FastQC uses 50% identity to determine duplication. Right? Also, has it been established yet whether one of these tools is better than the other?

I wouldn't characterize FastQC as using 50% identity. It's using 100% identity, just on a restricted range that's less likely to be affected by sequencing errors. Picard allows some dissimilarity to combat the 3' quality issue, but that allows for informative high-quality 5' differences to be missed. An ideal program would use some sort of quality-weighted edit distance as a metric to determine duplication, but implementing that is likely more hassle than it's worth, since the output of these programs is meant to be more qualitative than exactly quantitative.
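
As a rough idea of what a quality-weighted metric could look like, here is a hypothetical Python sketch. Neither tool does this as far as the thread establishes, and for simplicity it is a weighted ungapped (Hamming-style) distance rather than a true edit distance. Each mismatch is weighted by the probability that both base calls are correct, derived from the standard Phred relation p = 10^(-Q/10), so a confident 5' difference counts for almost a full mismatch while a low-quality 3' difference counts for very little.

```python
from typing import Sequence


def phred_to_error_prob(q: int) -> float:
    """Standard Phred definition: error probability = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def weighted_mismatch(seq_a: str, qual_a: Sequence[int],
                      seq_b: str, qual_b: Sequence[int]) -> float:
    """Sum over mismatching positions of P(both base calls are correct).

    Hypothetical metric for illustration only; assumes equal-length reads
    and integer Phred scores.
    """
    score = 0.0
    for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b):
        if a != b:
            score += (1 - phred_to_error_prob(qa)) * (1 - phred_to_error_prob(qb))
    return score


good = [40] * 8                 # high quality throughout
tail = [40] * 7 + [2]           # quality collapses at the 3' end

# Confident difference at the 5' end: weight close to 1.
print(weighted_mismatch("ACGTACGT", good, "CCGTACGT", good))
# Difference at a low-quality 3' base: weight much closer to 0.
print(weighted_mismatch("ACGTACGT", tail, "ACGTACGA", tail))
```

A duplicate call would then compare this score against some threshold, which would need tuning and is deliberately left out here.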

What Devon Ryan said. I'll also add that there's a tradeoff between the two approaches. Picard's EstimateLibraryComplexity is much more thorough, but it is also slower and its command line is clunky to construct. FastQC is much faster and easier to kick off (especially for multiple FASTQ files), but its result is more of a rough estimate than what EstimateLibraryComplexity gives you.