2.5 years ago
backpackbio
I have collected a lot of RNA-seq (cancer) data from different sources to use for standardising a differential expression analysis pipeline. Many samples (>50) show high duplication levels (80-90%), and the total number of reads is also very high (around 150-250 million). Is there a set cut-off for duplication levels in RNA-seq? I have searched the literature but haven't found much that helps. It would be a huge help if anyone could suggest papers or other sources where I can find answers. Thanks in advance!
No, there is none. Unless you can identify optical/PCR duplicates (which requires UMIs), you can't decide whether a duplicated read is a real copy from a distinct molecule or a sequencing duplicate. There is a study that says most of the duplicates in RNA-seq data are real (LINK).
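To illustrate why UMIs matter here, below is a minimal Python sketch, not a replacement for a real deduplication tool. It assumes each read has already been reduced to a hypothetical `(chrom, pos, strand, umi)` tuple; the function name `duplication_rate` and the toy reads are made up for illustration. Without UMIs, every read sharing a mapping position counts as a duplicate; with UMIs, only reads sharing both position and UMI do.

```python
def duplication_rate(reads, use_umi=False):
    """Fraction of reads that duplicate an earlier read.

    reads: list of (chrom, pos, strand, umi) tuples (hypothetical layout).
    use_umi: if True, two reads are duplicates only when position AND UMI match.
    """
    if use_umi:
        keys = [(chrom, pos, strand, umi) for chrom, pos, strand, umi in reads]
    else:
        keys = [(chrom, pos, strand) for chrom, pos, strand, _umi in reads]
    if not keys:
        return 0.0
    # Every read beyond the first with a given key is counted as a duplicate.
    return (len(keys) - len(set(keys))) / len(keys)


# Toy data: four reads at the same position of a highly expressed gene.
reads = [
    ("chr1", 100, "+", "AAAT"),
    ("chr1", 100, "+", "AAAT"),  # same position and UMI: likely PCR duplicate
    ("chr1", 100, "+", "CGTA"),  # same position, different UMI: distinct molecule
    ("chr1", 100, "+", "GGCT"),  # same position, different UMI: distinct molecule
    ("chr2", 500, "-", "TTAG"),
]

print("position only:", duplication_rate(reads))
print("position + UMI:", duplication_rate(reads, use_umi=True))
```

In this toy case the position-only estimate is much higher than the UMI-aware one, which is the situation highly expressed genes can produce in real RNA-seq data.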
If you are collecting data from diverse sources, there are likely to be a lot of batch effects. You should be mindful of that possibility if you are using such data for any standardization.
Hi GenoMax, thanks for your reply and the linked article. I am aware of the batch effects (they're a pain) and we are working on resolving them. Thank you for your kind suggestions. :)