I have several different databases of RNA-seq data, in a mix of TPM, RPKM, and raw counts. I plan to convert all to TPM and then to use Voom in Limma to normalize the samples and identify DE genes. The problem is that one of the databases is paired end while the rest are single end. Is it possible to make comparisons between the different chemistries? One thing that worries me is that the PE database is from a different tissue than the rest, so I would expect the PE samples to cluster separately regardless of sequencing chemistry.
I found this post, "differential expression analysis-- paired-end and single-end", which has given me some hope (my PE counts data is also from HTSeq). It is my understanding that Voom will remove the batch effects of the SE vs PE chemistry. Am I correct in thinking that I can just throw everything into Voom after calculating normalization factors and it will be OK? Or am I trying to do something impossible?
Any advice is appreciated, Thanks
Edit: I should have mentioned, my goal is to identify potential therapeutic targets. I only need to find the largest expression differences between tissues. If I miss or misassign fine-scale variation, it is of no concern.
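On the conversion step mentioned in the question: RPKM and TPM are both length-normalized, so within a single sample RPKM can be rescaled to TPM by dividing each value by the sample's RPKM total and multiplying by one million. Below is a minimal sketch of that rescaling. The function name and the toy gene values are made up for illustration, and it assumes the RPKM table holds the full gene set for the sample. Raw counts would also need gene lengths before they could be converted, which is not shown here.

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million.

    rpkm: dict mapping gene name -> RPKM value for a single sample.
    """
    total = sum(rpkm.values())
    if total <= 0:
        raise ValueError("sample has no expression signal")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


# Hypothetical toy sample, not real data.
sample = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
tpm = rpkm_to_tpm(sample)
```

The rescaling is applied per sample, so each sample's TPM values sum to the same total regardless of which database it came from.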
Thank you for the response,
I am attempting to replicate what was done in this paper
Where they state:
In the above paper the normal tissues and the neuroblastoma samples come from different databases. This is essentially what I did (or tried to do) in the older post of mine you linked (both my datasets were PE and raw counts). I guess I do not understand how what I did differs from what was described in the above paper.
My plan then was to expand what I did with more datasets, but I guess I need to put that on hold if it is invalid!
I cannot comment on this paper since I am not familiar with it and will not invest the time to dig deeply into the methods section. Above I gave my opinion on what I consider the typical obstacle when combining independent datasets, especially if tumors and normals are fully confounded, meaning that all tumors come from one database and all normals come from one or more other databases.
Still, the methods say
I read this as saying that they compared the tumors independently to each of the tissues rather than pooling them, but this is speculation. If that is true, and if they then took the strict intersection (genes that were differential in every comparison), then they found some more or less reliable genes, but again this is speculation. I do not judge any of what the authors did. In your case I would be careful about blindly merging datasets. The results are probably not reliable, but this is just my opinion.
I agree with ATpoint's assessment, especially as they plot minimum logFC (so the minimum of the logFCs obtained among the 31 comparisons) and the maximum adjusted p-value.
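To make that concrete, here is a rough sketch of what "minimum logFC and maximum adjusted p-value across comparisons" could look like once each tumor-vs-tissue comparison has been run separately. This is not the paper's code. The function name, the cutoffs (logFC of at least 1, adjusted p of at most 0.05), and the toy numbers are all assumptions for illustration, and it only looks for genes that are higher in the tumor.

```python
def consistent_hits(comparisons, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Keep genes that pass both cutoffs in every tumor-vs-tissue comparison.

    comparisons: dict mapping tissue name -> dict of gene -> (logFC, adj_p)
    Returns: dict of gene -> (minimum logFC, maximum adj_p) across comparisons.
    """
    tables = list(comparisons.values())
    if not tables:
        return {}

    # Only genes tested in every comparison can be in the strict intersection.
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)

    hits = {}
    for gene in sorted(shared):
        min_lfc = min(table[gene][0] for table in tables)    # weakest effect
        max_padj = max(table[gene][1] for table in tables)   # weakest evidence
        if min_lfc >= lfc_cutoff and max_padj <= padj_cutoff:
            hits[gene] = (min_lfc, max_padj)
    return hits


# Hypothetical toy results: tissue -> gene -> (logFC, adjusted p-value).
comparisons = {
    "tissue_1": {"GENE_A": (3.2, 0.001), "GENE_B": (2.5, 0.010), "GENE_C": (4.0, 0.0001)},
    "tissue_2": {"GENE_A": (2.1, 0.020), "GENE_B": (0.4, 0.300), "GENE_C": (3.5, 0.0005)},
    "tissue_3": {"GENE_A": (4.0, 0.0005), "GENE_B": (2.8, 0.004)},
}
hits = consistent_hits(comparisons)
```

In this toy example only GENE_A survives: GENE_B fails the cutoffs against tissue_2, and GENE_C was not tested against tissue_3. Summarising by the weakest comparison is conservative, which fits the point that the approach trades statistical power for candidates that hold up against every tissue.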
Looking at the paper, they did a rough analysis and probably lost statistical power to detect more candidates, but that was not the goal of the paper. They wanted to prioritize a gene that could be a promising novel therapeutic target, and they succeeded in doing so. Their goal was not to create a highly reliable candidate gene list, report accurate gene expression differences, find rigorous and reproducible patterns in the data, or derive a robust gene signature.
Technical confounders are most definitely present but it would have definitely been unlucky (and pretty unlikely) for the one gene they thought was most promising to be entirely due to a technical artifact.
My opinion: When you work with large human cancer datasets, expect to have to deal with technical confounders and, unfortunately, there's really no good way to get around that. You're going to have missing control samples, samples prepared and sequenced via different protocols, etc. You can still make very interesting discoveries (and please do -- we need better cancer therapies!) but just be aware of all the things that might confound your analysis and think of some things you could do to give you more confidence in your findings. In my experience (as someone who worked in an oncology lab for many years), the differences in gene expression profiles between cancer and normal tissue are huge, so you can usually find pretty strong effects in such analyses (compared to, say, comparing two closely related classes of hepatocyte tissue).
Thanks. I should have specified in my post that my goal is also to identify therapeutic targets, but for a different malignancy. I think I'll have to roll with your advice and do what I can to increase robustness.