Question

How RNAseq PCR amplification results in duplicates

3.2 years ago

lsy9 ▴ 20

I've just started studying RNAseq and I'm confused about how PCR amplification bias leads to duplicates.

I first thought that:

A cDNA is PCR amplified to form a cluster, and each cluster results in one read.
It is possible to PCR amplify the cDNA before forming the cluster if the initial amount is too small, but because this introduces bias, it is not recommended and is not a routine procedure.
So duplicate reads come from the cluster formation step. If a cDNA is amplified too much, the cluster can get too big and be identified as two clusters, resulting in a duplicate read.

But I searched more and found that these are called optical duplicates and are different from PCR duplicates. I also read somewhere that PCR duplicates come from the PCR amplification step before cluster formation.

My questions are:

Is PCR amplification before cluster formation a routine procedure in RNAseq? How about in single cell RNAseq?
If all cDNA is amplified before cluster formation, then shouldn't all reads have duplicates? (Because each amplified cDNA, with an identical sequence, will form a separate cluster and be counted as one read?)

RNA-Seq sequencing • 1.2k views

updated 3.2 years ago by rpolicastro 13k • written 3.2 years ago by lsy9 ▴ 20

score 3 · Answer 1 · 2021-01-26

PCR amplification is common in both RNA-seq and scRNA-seq. The prevalence of duplicates in your sequenced library will be a function of library complexity (quantity of unique cDNAs before amplification) and sequencing depth. In general, the more complex the library, the deeper the library can be sequenced before you hit a wall of diminishing returns. This is because after a certain number of reads, you've sequenced most of the unique fragments in the library, and are now just sequencing PCR duplicates of those unique fragments.
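To make the complexity/depth relationship concrete, here is a minimal sketch of the expected duplicate fraction. It rests on a simplifying assumption that is not part of the answer above: every unique molecule is amplified equally, so each read is a uniform random draw (with replacement) from the unique molecules. Real libraries are not amplified uniformly, so treat the numbers as illustrative only. The function name `expected_duplicate_fraction` is made up for this example.

```python
import math


def expected_duplicate_fraction(n_unique_molecules: int, n_reads: int) -> float:
    """Expected fraction of reads that are duplicates of an already-seen molecule.

    Assumption (for illustration only): reads are drawn uniformly at random,
    with replacement, from `n_unique_molecules` distinct molecules.
    """
    if n_reads == 0:
        return 0.0
    # Poisson approximation: expected number of distinct molecules observed
    # after n_reads draws is C * (1 - exp(-N / C)).
    expected_distinct = n_unique_molecules * -math.expm1(-n_reads / n_unique_molecules)
    return 1.0 - expected_distinct / n_reads


if __name__ == "__main__":
    depth = 1_000_000  # number of reads sequenced
    # Compare a low-complexity and a high-complexity library at the same depth.
    for complexity in (10_000, 1_000_000, 100_000_000):
        frac = expected_duplicate_fraction(complexity, depth)
        print(f"unique molecules={complexity:>11,}  duplicate fraction={frac:.3f}")
```

Under this model, duplicates only become common once the number of reads approaches the number of unique molecules. That is also why amplifying every cDNA does not mean every read has a duplicate: only a small sample of the amplified library is actually sequenced.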

RNA-seq libraries are generally complex enough that the number of sequenced PCR duplicates is negligible for analysis. scRNA-seq libraries tend to have low complexity and many PCR duplicates as a consequence, so reads are tagged with a random sequence called a unique molecular identifier (UMI) before PCR. This allows duplicate detection after sequencing: two independent molecules are very unlikely to carry the same UMI at the same position/gene, so reads that share both are treated as PCR copies of one original molecule.
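The UMI idea can be sketched in a few lines. This is a toy example, not how any particular pipeline implements it: the `(gene, umi)` tuple input and the function name `count_molecules` are hypothetical, and it ignores sequencing errors in the UMI, which real tools usually have to correct for.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def count_molecules(reads: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count unique molecules per gene by collapsing reads that share a UMI.

    `reads` is an iterable of (gene, umi) pairs (hypothetical input format).
    Reads with the same gene and the same UMI are treated as PCR duplicates.
    """
    umis_per_gene: Dict[str, Set[str]] = defaultdict(set)
    for gene, umi in reads:
        umis_per_gene[gene].add(umi)  # a set keeps each UMI only once
    return {gene: len(umis) for gene, umis in umis_per_gene.items()}


if __name__ == "__main__":
    reads = [
        ("GeneA", "ACGT"),
        ("GeneA", "ACGT"),  # same gene and UMI: PCR duplicate of the read above
        ("GeneA", "TTAG"),  # same gene, different UMI: a distinct molecule
        ("GeneB", "ACGT"),  # same UMI, different gene: a distinct molecule
    ]
    print(count_molecules(reads))
```

Four reads collapse to three molecules here: two for GeneA and one for GeneB. The same UMI appearing in a different gene is not a duplicate, because the UMI is only compared within the same position/gene.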