I have some paired-end RNA-seq samples with very high levels of duplication (in the worst cases, only 6% of reads remain after de-duplication). I think this is due to the low amount of input RNA (~1 ng) and to a smaller subset of genes being expressed (the RNA comes from a specific cell type isolated from brain). Even after poly-A selection, the most highly expressed gene in my samples was a ribosomal RNA transcript.
I used Picard's MarkDuplicates to remove duplicate reads from my samples and looked at how that affected the gene counts. I was happy to see that the counts for the rRNA gene were greatly reduced, but the counts for almost every other gene were reduced as well. I had expected that only highly expressed genes would have duplicate reads. I also ran a correlation analysis between the original and the de-duplicated samples and saw excellent correlation between them, which leaves me confused.
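A comparison like this can be sketched as follows. This is only a minimal illustration, assuming gene-level counts are already available as plain dictionaries; the gene names, the counts, and the helper functions `retained_fraction` and `log_count_correlation` are all hypothetical and are not taken from the actual data or from any library.

```python
import math


def retained_fraction(raw, dedup):
    """Per-gene fraction of counts that remain after de-duplication."""
    return {g: dedup.get(g, 0) / raw[g] for g in raw if raw[g] > 0}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def log_count_correlation(raw, dedup):
    """Pearson correlation of log2(count + 1) before vs. after de-duplication."""
    genes = sorted(raw)
    xs = [math.log2(raw[g] + 1) for g in genes]
    ys = [math.log2(dedup.get(g, 0) + 1) for g in genes]
    return pearson(xs, ys)


# Made-up counts for illustration only (not real data).
raw = {"rRNA_gene": 100000, "geneA": 2000, "geneB": 300, "geneC": 40}
dedup = {"rRNA_gene": 3000, "geneA": 400, "geneB": 120, "geneC": 30}

frac = retained_fraction(raw, dedup)
r = log_count_correlation(raw, dedup)

for gene in sorted(frac, key=raw.get, reverse=True):
    print(f"{gene}: {frac[gene]:.2%} of counts retained")
print(f"Pearson r on log2(count + 1): {r:.3f}")
```

In this toy example every gene loses counts, the most abundant gene loses the largest share, and the log-scale correlation between the two count sets is still high, which is the same pattern described above.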
If basically every gene has duplicates, what does that mean? Should I use only the de-duplicated samples for further analysis? I know there are lots of other threads on this issue, but my duplication seems to be more severe than in those cases.