Hi,
I am doing a DE analysis on an RNA-seq dataset and have a question about PCR duplicates. The organism is a bacterium with a small genome and the libraries were over-sequenced, resulting in duplication levels >90% (from Picard; sequencing was paired-end). I know the general consensus is not to remove PCR duplicates for DE analysis. I have read a lot of posts about this, but can't really find comparable cases where the duplication rate is this high. I have concerns about the validity of the analysis if almost all the data come from duplicates. If I remove duplicates I would still have plenty of data to do DE analysis. I am hoping to get some feedback on whether to proceed with removing duplicates since the rate is so high, or whether it would still be better to leave them in. Thank you!
What kind of coverage do you have now? Is it in the hundreds of fold? If you over-sequenced the libraries then the better option may be to randomly downsample the data so you end up with 25-30x coverage. Perhaps do two or three independent subsamples to avoid any kind of bias. Then see what you end up with.
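As a rough sketch of the arithmetic: the fraction of read pairs to keep is just the target coverage divided by the current coverage, and each independent subsample gets its own random seed. The 300x starting coverage and the seed values below are assumptions for illustration only (the actual coverage is not given in this thread), and sampling_fraction is a hypothetical helper name.

```python
def sampling_fraction(current_cov, target_cov):
    """Fraction of read pairs to keep to go from current to target coverage."""
    if current_cov <= 0 or target_cov <= 0:
        raise ValueError("coverage values must be positive")
    # Never ask for more reads than are available.
    return min(1.0, target_cov / current_cov)


# Assumed numbers for illustration: roughly 300x now, 30x wanted.
fraction = sampling_fraction(300, 30)

# One seed per independent subsample, so each run can be reproduced.
seeds = [11, 22, 33]
for seed in seeds:
    print(f"subsample seed={seed} keep fraction={fraction:.3f}")
```

The resulting fraction and seeds can then be passed to whichever subsampling tool you use.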
You can use reformat.sh from the BBMap suite or any other software of your choice.
^ Yes, I second this actually.
Thanks so much for your suggestions. I don't have the data in front of me but coverage is definitely very high (certainly in the hundreds). Regarding downsampling, I did think about that as an option. However, wouldn't this lead to the same problem? If I randomly sampled from a set of reads that are 90% duplicates, wouldn't the resulting sample also be expected to be 90% duplicates? On the other hand, if I removed duplicates this would selectively downsample, so I wouldn't have that problem. I will also run the analysis both ways, but welcome any other thoughts or suggestions. Thanks again.
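I don't think the problem of 90% read dups would go away completely, but there is some hope that the percentage will drop if you try random sampling: a read only counts as a duplicate when another copy of the same fragment is also in the sample, and that becomes less likely as you keep fewer reads.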
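A toy simulation can make this concrete. It is only a sketch under strong assumptions: every original fragment is present in exactly 10 copies (so the starting duplicate rate is 90%), reads are kept independently with probability 0.1, and a duplicate is defined as any extra copy of a fragment already seen. Real libraries have uneven copy numbers, and in a small, highly expressed transcriptome independent fragments can share coordinates, so treat the numbers as illustrative. The function names are hypothetical.

```python
import random


def duplicate_rate(fragment_ids):
    """Fraction of reads that are extra copies of an already-seen fragment."""
    if not fragment_ids:
        return 0.0
    return 1.0 - len(set(fragment_ids)) / len(fragment_ids)


def simulate(n_fragments=20_000, copies=10, keep=0.1, seed=0):
    """Return (duplicate rate before, duplicate rate after) random subsampling."""
    rng = random.Random(seed)
    # Each fragment id appears `copies` times: one original plus duplicates.
    reads = [f for f in range(n_fragments) for _ in range(copies)]
    # Keep each read independently with probability `keep`.
    sampled = [r for r in reads if rng.random() < keep]
    return duplicate_rate(reads), duplicate_rate(sampled)


before, after = simulate()
print(f"duplicate rate before: {before:.2f}, after subsampling: {after:.2f}")
```

Under these assumptions the rate after subsampling comes out well below the starting 90%, but it does not reach zero, which matches the expectation that sampling helps without making the duplicates go away.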
I assume you don't have UMIs, so you can't really say with certainty that these are all PCR dups. If the problem is a result of over-amplification of low-input RNA, then no amount of informatic wrangling is going to fix the issue.
At this point you will only lose time. So perhaps try sampling and de-duplicating like @dsull recommended.