How RNAseq PCR amplification results in duplicates
12 months ago
lsy9 ▴ 20

I've just started studying RNAseq and I'm confused about how PCR amplification bias leads to duplicates.

I first thought that

  • A cDNA is PCR amplified to form a cluster and each cluster results in one read.
  • It is possible to PCR amplify the cDNA before forming the cluster if the initial amount is too small, but as it results in bias it is not recommended and is not a routine procedure.
  • So duplicate reads are from the cluster formation step. If a cDNA is amplified too much, the cluster can get too big and be identified as two clusters, resulting in a duplicate read.

But I searched more and found that these are called optical duplicates and are different from PCR duplicates. I also read somewhere that PCR duplicates come from the PCR amplification step before cluster formation.

My questions are:

  1. Is PCR amplification before cluster formation a routine procedure in RNAseq? How about in single cell RNAseq?
  2. If all cDNA is amplified before cluster formation, then shouldn't all reads have duplicates? (Because each amplified cDNA, with an identical sequence, will form a separate cluster and be counted as one read?)
12 months ago

PCR amplification is common in both RNA-seq and scRNA-seq. The prevalence of duplicates in your sequenced library will be a function of library complexity (quantity of unique cDNAs before amplification) and sequencing depth. In general, the more complex the library, the deeper the library can be sequenced before you hit a wall of diminishing returns. This is because after a certain number of reads, you've sequenced most of the unique fragments in the library, and are now just sequencing PCR duplicates of those unique fragments.
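To make the complexity/depth relationship concrete, here is a rough sketch in Python. It assumes every unique fragment is amplified equally and that sequencing is random sampling with replacement from the amplified pool; real libraries have amplification bias, so treat the numbers as illustrative only. The function names and the library sizes are made up for the example.

```python
import math
import random


def expected_duplicate_fraction(n_unique, n_reads):
    """Expected fraction of reads that are duplicates.

    Assumption: all n_unique fragments are equally likely to be sampled,
    so the expected number of distinct fragments seen after n_reads is
    approximately n_unique * (1 - exp(-n_reads / n_unique)).
    """
    expected_distinct = n_unique * (1 - math.exp(-n_reads / n_unique))
    return 1 - expected_distinct / n_reads


def simulated_duplicate_fraction(n_unique, n_reads, seed=0):
    """Same quantity, estimated by actually drawing reads at random."""
    rng = random.Random(seed)
    # Each read is a random draw of one of the unique fragments.
    seen = {rng.randrange(n_unique) for _ in range(n_reads)}
    return 1 - len(seen) / n_reads


if __name__ == "__main__":
    # Hypothetical library sizes: a low-complexity and a high-complexity library.
    for n_unique in (10_000, 1_000_000):
        for n_reads in (10_000, 100_000):
            est = expected_duplicate_fraction(n_unique, n_reads)
            sim = simulated_duplicate_fraction(n_unique, n_reads)
            print(f"unique={n_unique:>9,} reads={n_reads:>7,} "
                  f"expected={est:.3f} simulated={sim:.3f}")
```

Under these assumptions, the low-complexity library is mostly duplicates once the read count exceeds the number of unique fragments, while the high-complexity library shows very few duplicates at the same depth.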

RNA-seq libraries are generally complex enough that the number of sequenced PCR duplicates is negligible for analysis. scRNA-seq libraries tend to have low complexity and therefore many PCR duplicates, so each cDNA molecule is tagged with a random sequence called a unique molecular identifier (UMI) before PCR. This allows duplicate detection after sequencing: two independent molecules are unlikely to carry the same UMI and map to the same position/gene, so reads that share both are treated as PCR copies of one original molecule.
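The UMI idea can be sketched in a few lines of Python. This is a simplified illustration, not how any particular pipeline does it: it assumes each read has already been assigned to a gene and has its UMI extracted, and it ignores sequencing errors in the UMI itself, which dedicated tools such as UMI-tools account for. The function name and the toy reads are hypothetical.

```python
from collections import defaultdict


def count_molecules(reads):
    """Collapse PCR duplicates by counting distinct UMIs per gene.

    reads: iterable of (gene, umi) pairs, one pair per sequenced read.
    Returns a dict mapping gene -> number of distinct UMIs observed,
    i.e. an estimate of the number of original molecules.
    """
    umis_per_gene = defaultdict(set)
    for gene, umi in reads:
        # Reads sharing gene and UMI are treated as copies of one molecule.
        umis_per_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


# Toy data: three reads of GeneA share a UMI, so they count once.
reads = [
    ("GeneA", "ACGT"),
    ("GeneA", "ACGT"),
    ("GeneA", "ACGT"),
    ("GeneA", "TTGC"),
    ("GeneB", "ACGT"),
]
counts = count_molecules(reads)
print(counts)
```

Here five reads collapse to three molecules: two for GeneA and one for GeneB. The same UMI appearing in GeneB is not a duplicate of the GeneA reads because it maps to a different gene.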

Thank you for the reply! I guess I was confused because I thought all cDNAs are read, while actually only part of them are read (random sampling), the amount depending on the chosen sequencing depth.


