Question

Comparing single end and paired end TPM data

0

3.9 years ago

rzm0015 • 0

I have several different databases of RNA-seq data, in a mix of TPM, RPKM, and raw counts. I plan to convert all to TPM and then to use Voom in Limma to normalize the samples and identify DE genes. The problem is that one of the databases is paired end while the rest are single end. Is it possible to make comparisons between the different chemistries? One thing that worries me is that the PE database is from a different tissue than the rest, so I would expect the PE samples to cluster separately regardless of sequencing chemistry.

I found this post, "differential expression analysis-- paired-end and single-end", which has given me some hope (my PE count data is also from HTSeq). It is my understanding that Voom will remove the batch effects of the SE vs PE chemistry. Am I correct in my thinking that I can just throw everything into Voom after calculating normalization factors and it will be OK? Or am I trying to do something impossible?

Any advice is appreciated, Thanks

Edit: I should have mentioned that my goal is to identify potential therapeutic targets. I only need to find the largest expression differences between tissues. If I miss or misassign fine-scale variation, it is of no concern.
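For reference, the TPM conversion mentioned above is simple arithmetic. Here is a minimal Python sketch, assuming the values for a single sample are held in plain dictionaries keyed by gene; the function names and the `lengths_bp` argument are illustrative and not taken from any package. RPKM is rescaled so the sample sums to one million, while raw counts are first divided by gene length in kilobases and then rescaled. This only puts values on a common scale; it does not remove differences between studies.

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm.values())
    if total <= 0:
        raise ValueError("sample has no expression signal")
    return {gene: value / total * 1e6 for gene, value in rpkm.items()}


def counts_to_tpm(counts, lengths_bp):
    """Convert one sample's raw counts to TPM.

    lengths_bp holds a (hypothetical) per-gene length in base pairs.
    """
    # Step 1: reads per kilobase, i.e. length-normalize each gene.
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000.0) for gene in counts}
    total = sum(rpk.values())
    if total <= 0:
        raise ValueError("sample has no expression signal")
    # Step 2: scale so the sample sums to one million.
    return {gene: value / total * 1e6 for gene, value in rpk.items()}
```

For example, `rpkm_to_tpm({"A": 10.0, "B": 30.0})` gives gene A a quarter of the one million total and gene B the remaining three quarters.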

RNA-Seq • 2.0k views

ADD COMMENT • link 3.9 years ago by rzm0015 • 0

score 1 · Answer 1 · 2020-05-17

1

3.9 years ago

ATpoint 82k

The problem is that one of the databases is paired end while the rest are single end.

That is by far the least of your problems.

You cannot simply merge a collection of studies. This would be problematic even if you had raw counts for all of them, because the unavoidable batch effects between studies will dominate or mask the true biological results. Mixing different types of counts (TPM, RPKM, raw) makes it even worse, and normalizing already-normalized counts adds further bias to your results.

Am I correct in my thinking that I can just throw everything into Voom after calculating normalization factors and it will be OK?

No. Analyze each study independently and then consider comparing the results, e.g. with a meta-analysis.
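One simple way to combine per-study results is Fisher's method for p-values. The sketch below is only an illustration of that idea, not a recommendation of a specific meta-analysis approach: it assumes one p-value per study for the same gene, that the studies are independent, and it ignores the direction of the effect. The function name is made up for this example.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine independent per-study p-values for one gene (Fisher's method)."""
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(pvalues)
    # Fisher's statistic is -2 * sum(ln p); under the null it follows a
    # chi-square distribution with 2k degrees of freedom. We keep half of it.
    half_stat = -sum(math.log(p) for p in pvalues)
    # Closed-form chi-square survival function for even degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single study the function returns the input p-value unchanged, which is a quick sanity check.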

Is merging everything what you did in your earlier post, "Too many differentially expressed genes in voom"?

ADD COMMENT • link 3.9 years ago by ATpoint 82k

0

Thank you for the response,

I am attempting to replicate what was done in this paper

Where they state:

For the neuroblastoma tumor and normal tissue gene level differential expression analysis, we utilized RNA sequencing data from the 126 high-risk neuroblastoma tumors (TARGET) and normal tissues (GTEx) as described above and downloaded gene level data that was previously processed by the UCSC Computational Genomics laboratory using STAR alignment and RSEM normalization using hg38 as the reference genome and GENCODE v23 gene annotation (Vivian et al., 2016). The voom procedure was used to normalize the RSEM generated expected counts followed by differential expression testing using the R package limma to obtain p values and Log-fold changes (LogFCs) (Law et al., 2014, Ritchie et al., 2015).

In the above paper the normal tissues and the neuroblastoma samples come from different databases. This is essentially what I did (or tried to do) in the older post of mine you linked (both my datasets were PE and raw counts). I guess I do not understand how what I did differs from what was described in the above paper.

My plan then was to expand what I did with more datasets, but I guess I need to put that on hold if it is invalid!

ADD REPLY • link 3.9 years ago by rzm0015 • 0

0

I cannot comment on this paper since I am not familiar with it and will not invest the time to dig deeply into the methods section. I gave my opinion above on what I consider the typical obstacle when combining independent datasets, especially if the tumors and normals are fully confounded, meaning that all tumors come from one database and all the normals come from one or more other databases.
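To make "fully confounded" concrete, here is a small Python sketch that checks a sample sheet for it. It assumes each sample is described by a (condition, source) pair, where source is the database or study it came from; the function name and this input layout are hypothetical. The design counts as fully confounded when no source contributes samples to more than one condition.

```python
from collections import defaultdict


def is_fully_confounded(samples):
    """samples: iterable of (condition, source) pairs, one per sample."""
    sources_by_condition = defaultdict(set)
    for condition, source in samples:
        sources_by_condition[condition].add(source)
    conditions = list(sources_by_condition)
    if len(conditions) < 2:
        raise ValueError("need at least two conditions to compare")
    # If any source is shared between two conditions, the condition effect
    # can at least partly be separated from the source effect.
    for i, first in enumerate(conditions):
        for second in conditions[i + 1:]:
            if sources_by_condition[first] & sources_by_condition[second]:
                return False
    return True
```

In the fully confounded case, a difference between conditions cannot be told apart from a difference between sources using the data alone.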

Still, the methods say

Specifically, a total of 60,498 genes were tested for differential expression between the neuroblastoma tumors and normal tissues using this RNA sequencing data, a total of 1,889,388 (31 normal tissues x 60,498) computations. Only genes which were differentially expressed in all 31 normal tissue comparisons were considered for subsequent interrogation.

I read this as saying that they compared the tumors independently to each of the tissues rather than just pooling them, but this is speculation. If that is true, and if they then took the strict intersection (genes that are always differential), then they found some more or less reliable genes. Again, I do not judge anything the authors did. In your case I would be careful about blindly merging datasets. The results are probably not reliable, but this is just my opinion.

ADD REPLY • link 3.9 years ago by ATpoint 82k

0

I agree with ATpoint's assessment, especially as they plot minimum logFC (so the minimum of the logFCs obtained among the 31 comparisons) and the maximum adjusted p-value.
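As a rough illustration of that summary (not the authors' actual code), the Python sketch below assumes one table of results per normal tissue, each mapping a gene to a (logFC, adjusted p-value) pair. It keeps only genes that pass a cutoff in every comparison and reports the minimum logFC and maximum adjusted p-value. The function name, the input layout, and the `alpha` cutoff are assumptions made for this example. Taking the minimum logFC is the conservative choice only for genes that are higher in the tumor.

```python
def summarize_across_tissues(results, alpha=0.05):
    """results: {tissue: {gene: (logFC, adjusted_p)}} for tumor vs. each tissue."""
    if not results:
        raise ValueError("need at least one comparison")
    # Only genes tested in every comparison can be in the strict intersection.
    shared = set.intersection(*(set(table) for table in results.values()))
    summary = {}
    for gene in shared:
        logfcs = [table[gene][0] for table in results.values()]
        adj_ps = [table[gene][1] for table in results.values()]
        worst_p = max(adj_ps)
        # Keep the gene only if even its weakest comparison is significant.
        if worst_p < alpha:
            summary[gene] = (min(logfcs), worst_p)
    return summary
```

A gene that fails the cutoff in even one tissue is dropped, which is why this kind of filter trades statistical power for consistency.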

Looking at the paper, they did a rough analysis and probably lost statistical power to detect more candidates, but that was not the goal of the paper. They wanted to prioritize a gene that could be a promising novel therapeutic target, and they succeeded in doing so. Their goal was not to create a super reliable candidate gene list, report accurate gene expression differences, find rigorous and reproducible patterns in the data, or derive a robust gene signature.

Technical confounders are most definitely present, but it would have been unlucky (and pretty unlikely) for the one gene they thought was most promising to be entirely due to a technical artifact.

My opinion: When you work with large human cancer datasets, expect to have to deal with technical confounders and, unfortunately, there's really no good way to get around that. You're going to have missing control samples, samples prepared and sequenced via different protocols, etc. You can still make very interesting discoveries (and please do -- we need better cancer therapies!) but just be aware of all the things that might confound your analysis and think of some things you could do to give you more confidence in your findings. From my experience (as someone who worked in an oncology lab for many years), the differences in gene expression profiles between cancer and normal tissue are huge, so you can usually find pretty strong effects when doing such analyses (compared to, say, comparing two different classes of closely related hepatocyte tissue).

ADD REPLY • link 3.9 years ago by dsull ★ 5.8k

0

Thanks. I should have specified in my post that my goal is also to identify therapeutic targets, but for a different malignancy. I think I'll have to roll with your advice and do what I can to increase robustness.

ADD REPLY • link 3.9 years ago by rzm0015 • 0