Hi,
I have two datasets from different sources. Unfortunately, one group did unstranded RNA-seq while the other did stranded. When I run a PCA on the normalized counts using DESeq2, the two datasets cluster far from each other. Now I am unsure whether this is an artefact of the unstranded reads from the first group or whether the difference is real. Could anyone enlighten me on whether it is appropriate to compare these two datasets for differential gene expression, or will I get wrong information for transcripts on the reverse strand?
Thanks.
But doesn't DESeq2 take the difference in library depth into account? What else could be contributing to the variation?
Batch effect is far more than library depth. The same samples prepped in different hands will have slightly different gene expression values. That's just life in experimental science.
But my main question is: does the fact that one library is stranded and the other is unstranded make them incomparable? I understand that differences between people and machines are also involved.
It depends on how you want to do the analysis. If you are looking for DE genes between the two datasets, it will be difficult to tell whether a gene differs because of the library prep protocol or because of the biology. If the two datasets can be combined before the analysis, you are more likely to come up with a decent DE gene list. That is only possible if both datasets address the same biological question, e.g., both sequenced lung cancer and normal lung. Mixing the samples would then reduce the noise from the sample prep. Hope that helps.
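To make the confounding point concrete, here is a minimal sketch in plain Python. It assumes an additive model with one protocol (batch) term and one condition term, which is what a DESeq2 design such as `~ batch + condition` would express. The sample layouts and the helper names (`design_matrix`, `matrix_rank`, `condition_is_separable`) are hypothetical and only for illustration; nothing in the thread says which tissue came from which group. The check asks whether the design matrix is full rank, i.e. whether the condition effect can be estimated separately from the protocol effect.

```python
from fractions import Fraction


def design_matrix(samples):
    """Build an intercept + batch dummies + condition dummies matrix.

    `samples` is a list of (batch, condition) pairs. The first level of
    each factor (in sorted order) is used as the reference level.
    """
    batches = sorted({b for b, _ in samples})
    conditions = sorted({c for _, c in samples})
    rows = []
    for batch, condition in samples:
        row = [Fraction(1)]  # intercept
        row += [Fraction(int(batch == b)) for b in batches[1:]]
        row += [Fraction(int(condition == c)) for c in conditions[1:]]
        rows.append(row)
    return rows


def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination with exact fractions."""
    m = [list(r) for r in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def condition_is_separable(samples):
    """True if batch and condition effects can be estimated separately."""
    rows = design_matrix(samples)
    return matrix_rank(rows) == len(rows[0])


# Hypothetical layout 1: each tissue comes from only one protocol.
confounded = ([("unstranded", "cerebellum")] * 3
              + [("stranded", "medulloblastoma")] * 3)

# Hypothetical layout 2: both protocols contain both tissues.
mixed = [
    ("unstranded", "cerebellum"), ("unstranded", "medulloblastoma"),
    ("stranded", "cerebellum"), ("stranded", "medulloblastoma"),
]

print(condition_is_separable(confounded))
print(condition_is_separable(mixed))
```

In the first layout the protocol column and the condition column carry the same information, so no model can tell them apart. In the second, the protocol term can absorb the prep differences and the condition term is left for the biology.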
Ok. Thanks. I am comparing cerebellum to medulloblastoma (cancer of the cerebellum). The only thing that bothers me is whether anti-sense transcripts that overlap an mRNA would be improperly quantified.
Ideally, you wouldn't mix the datasets, and everything would be prepared exactly the same way. However, there is also potential for insight if the analysis is done right, since it could help show whether the anti-sense transcripts carry important information.
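As a rough illustration of the anti-sense concern, here is a toy sketch in plain Python of how read assignment differs between a stranded and an unstranded library when two genes overlap on opposite strands. The gene names, coordinates, and reads are invented, and `count_reads` is a hypothetical helper, not how any particular counting tool is implemented. For simplicity each read is labelled with the strand of the transcript it came from, and reads that match more than one gene are set aside as ambiguous, which is the kind of behaviour counting tools commonly apply to such reads.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] overlap."""
    return a_start <= b_end and b_start <= a_end


def count_reads(reads, genes, stranded):
    """Assign each read to a gene; reads hitting several genes are ambiguous.

    reads: list of (start, end, strand_of_origin)
    genes: dict name -> (start, end, strand)
    stranded: if False, the read's strand is ignored, as in an unstranded library
    """
    counts = {name: 0 for name in genes}
    ambiguous = 0
    for start, end, strand in reads:
        hits = [
            name
            for name, (g_start, g_end, g_strand) in genes.items()
            if overlaps(start, end, g_start, g_end)
            and (not stranded or strand == g_strand)
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1
    return counts, ambiguous


# Two hypothetical genes on opposite strands, overlapping at 400-500.
genes = {
    "GENE_FWD": (100, 500, "+"),
    "GENE_REV": (400, 800, "-"),
}

# Invented reads: three from each gene, three of them inside the overlap.
reads = [
    (120, 170, "+"),
    (420, 470, "+"),
    (430, 480, "+"),
    (450, 500, "-"),
    (600, 650, "-"),
    (700, 750, "-"),
]

stranded_counts, stranded_ambiguous = count_reads(reads, genes, stranded=True)
unstranded_counts, unstranded_ambiguous = count_reads(reads, genes, stranded=False)

print(stranded_counts, stranded_ambiguous)
print(unstranded_counts, unstranded_ambiguous)
```

With strand information every read in the overlap goes to the gene it came from. Without it, the same reads cannot be attributed, so genes with an anti-sense partner are the ones where the two library types are expected to disagree most. Genes with no overlapping feature on the opposite strand are unaffected in this toy model.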