Question

Duplicate reads in RNA-seq

7.8 years ago

mmrcksn ▴ 50

Hi everyone,

I have some paired-end RNA-seq samples with very high duplication levels (in some samples, only 6% of reads remain after de-duplication). I think this is due to the low amount of input RNA (~1 ng) and the smaller subset of genes being expressed (the RNA is from a specific cell type isolated from brain). Even after poly-A selection, the most highly expressed gene in my samples was a ribosomal RNA transcript.

I used Picard's MarkDuplicates to remove duplicate reads from my samples and looked at how that affected counting. I was happy to see that the counts for the rRNA gene were greatly reduced, but the counts for almost every other gene were reduced as well. I had thought that only highly expressed genes would have duplicate reads. I also ran a correlation analysis between the original samples and the de-duplicated samples and saw excellent correlation between them, so now I'm just confused.

If basically every gene has duplicates, what does that mean? Should I only use the de-duplicated samples for further analysis? I know there are lots of other threads on this issue, but my duplication seems more severe.
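The before/after comparison described above can be sketched roughly as follows. This is only an illustration: the gene names and counts are made up (not taken from the actual samples), and `compare_counts` is a hypothetical helper, not part of Picard or any counting tool. It reports the fraction of counts each gene keeps after de-duplication and the Pearson correlation of log-transformed counts.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def compare_counts(raw, dedup):
    """Return (fraction of counts retained per gene, correlation of log2 counts)."""
    genes = sorted(set(raw) & set(dedup))
    retained = {g: dedup[g] / raw[g] for g in genes if raw[g] > 0}
    log_raw = [math.log2(raw[g] + 1) for g in genes]
    log_dedup = [math.log2(dedup[g] + 1) for g in genes]
    return retained, pearson(log_raw, log_dedup)

# Hypothetical per-gene counts before and after duplicate removal.
raw = {"rRNA_gene": 500000, "geneA": 4000, "geneB": 300, "geneC": 25}
dedup = {"rRNA_gene": 9000, "geneA": 700, "geneB": 120, "geneC": 18}

retained, r = compare_counts(raw, dedup)
for gene, frac in retained.items():
    print(f"{gene}: {frac:.1%} of counts retained")
print(f"Pearson r on log2(count + 1): {r:.3f}")
```

In this toy data every gene loses counts, yet the log-scale correlation stays high because the ranking of genes is preserved. That is the same pattern as described above: reduced counts everywhere, but excellent correlation between the two versions.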

RNA-Seq duplicate reads picard • 3.8k views

updated 7.8 years ago by igor 13k • written 7.8 years ago by mmrcksn ▴ 50

Someone with better experimental chops will need to confirm, but perhaps extra cycles of amplification caused this problem?

If you feel that the experiment did not work as intended, then perhaps it is time to consider redoing it (at least the library part). That is easy for someone like me to say, so apologies in advance if this is an irreplaceable sample or a difficult experiment.

7.8 years ago by GenoMax 142k

score 1 · Answer 1 · 2016-07-22

You definitely have more duplicates than usual. If you started with little RNA, then you must have amplified a lot, so it makes sense that you have a lot of duplicates. They would be found in all genes, since you are amplifying all genes. Thus, all genes would have fewer counts after duplicate removal.
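As a rough illustration of that argument, here is a toy simulation. It assumes (and this is a simplification, not a model of any real protocol) that amplification is uniform across molecules, so sequencing can be approximated as drawing reads with replacement from a small pool of starting molecules. The gene names, molecule numbers, and read depth are all hypothetical.

```python
import random
from collections import Counter

def simulate(molecules_per_gene, n_reads, seed=0):
    """Draw reads with replacement from a small pool of starting molecules.

    Assumption: uniform amplification, so every starting molecule is
    equally likely to be the source of any given read.
    """
    rng = random.Random(seed)
    pool = [(gene, i) for gene, n in molecules_per_gene.items() for i in range(n)]
    reads = [rng.choice(pool) for _ in range(n_reads)]
    total = Counter(gene for gene, _ in reads)        # counts before de-duplication
    unique = Counter(gene for gene, _ in set(reads))  # counts after de-duplication
    return total, unique

# Hypothetical low-input library: few starting molecules, sequenced deeply.
molecules = {"high_gene": 500, "mid_gene": 50, "low_gene": 5}
total, unique = simulate(molecules, n_reads=20000)

for gene in molecules:
    dup_frac = 1 - unique[gene] / total[gene]
    print(f"{gene}: {total[gene]} reads, {unique[gene]} unique, {dup_frac:.1%} duplicates")
```

When the number of reads far exceeds the number of starting molecules, duplicates show up in the low-expression gene as well as the high-expression one, so counts drop for every gene after duplicate removal.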

See previous extensive discussion on the topic here: How detrimental are duplicate reads in RNAseq experiments?