I am running STARsolo on SMART-seq2 data. In one manual (https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md), the Exact mode was recommended for plate-based sequencing, which concerns deduplication. However, I am still confused about whether to deduplicate the reads or not, since many discussions say deduplication is unnecessary for RNA-seq data because coordinate-based methods cannot distinguish PCR duplicates from biological duplicates.
I have compared them and found that the resulting count matrices are quite different. Would you recommend analyzing both the Exact and NoDedup modes downstream to see whether there is any difference?
Try both, see whether there's any difference in your downstream analysis, and check the correlation between the two approaches.
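As a starting point for that comparison, here is a minimal sketch of computing a per-cell correlation between the two count vectors. The counts below are made-up placeholders standing in for one cell's column from each STARsolo matrix; in practice you would load the actual `matrix.mtx` outputs from the Exact and NoDedup runs (paths and numbers here are assumptions, not real data):

```python
import math

def pearson(x, y):
    """Plain Pearson correlation between two equal-length count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-gene counts for one cell from each mode
# (stand-ins for a column of each STARsolo count matrix):
exact_counts   = [10, 0, 53, 7, 120, 2]
nodedup_counts = [14, 0, 80, 9, 210, 3]

r = pearson(exact_counts, nodedup_counts)
print(f"Pearson r between Exact and NoDedup: {r:.3f}")
```

A very high correlation across cells would suggest the choice of mode mostly rescales counts rather than changing their structure; large per-gene deviations would flag genes where deduplication matters most.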
It's impossible to do anything about the PCR duplicates from the pre-amplification step, but the location-based deduplication might help (or, alternatively, be overly aggressive) for the post-tagmentation PCR.
My personal view: I think people should always try different approaches if computationally feasible (and if they have the time), and report their findings in the supplementary materials of the papers they write (e.g. "we tried X, Y, and Z and observed a very slight difference (Supp. Data 1), and we ultimately decided to proceed with X for the remainder of our analysis"). It might not be relevant to the conclusions of the paper (especially if it's a biology paper and RNAseq is only one method you're using as evidence of your findings) but it would help the bioinformatics field.
Thank you, sir! I will try both methods and observe any differences and correlations. I will carefully read the paper you cited. Before delving into it, I have one more question: How do you determine which approach is correct and more suitable (removing or retaining duplicate reads) if you observe differences in downstream analysis (e.g., differentially expressed genes are different)? I'm not sure if there are different considerations for single-cell and bulk data. I know it's too early to discuss since I haven't obtained any results, but I still wish to seek your recommendations.
In the paper you cited, they concluded that solely removing duplicated reads based on their mapping coordinates would introduce substantial bias. Does this mean that not deduplicating reads would be better in cases where UMI is not available, such as SMART-seq2 data?
The "gold standard" for determining whether an approach is correct is to see if it can recapitulate known biology (e.g. you know, from previous findings and numerous other assays, that a certain cell type exists but it's only found via approach X but not approach Y).
Otherwise, the best thing you can do is to determine whether and to what extent there are differences between the two approaches.
In the paper, they basically conclude that there may be improvements when deduplicating based on mapping coordinates (if that's the only option) but bias will be introduced because it's a flawed way to deduplicate.
What is "better"? I don't have an answer to that question (nor do I think an answer exists). Anyway, I noticed you posted on the STAR github repo -- someone there (or here) may provide an additional opinion.
Thank you for your answer, Sullivan!