Question: Possible approach to select normal tissue samples for cancer RNA-Seq data without reference data for downstream analyses
7 months ago, svlachavas wrote:

Dear Community,

As part of a clinical high-throughput genomics project, we have collected a large number of RNA-Seq samples from patients with different solid tumors who underwent conventional therapy prior to sequencing. All the data have been uniformly processed in R. The major issue is that we would like to perform differential expression analysis or apply machine learning techniques to select the most differentially expressed (DE) or most informative genes relative to some reference sample group, but unfortunately we have no normal or control reference samples for the whole cohort.

My naive idea was to use an external source of normal data, such as GTEx. However, my main concern is that batch effect correction (e.g. with ComBat) might still not be applicable, because batch and sample type would be completely confounded: neither sample type is represented in both studies.
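To make the confounding concern concrete, here is a minimal sketch (plain Python, no external libraries) that tabulates samples by study and sample type and flags the case where every study contains only one sample type. The function names and the example labels ("cohort", "GTEx", "tumor", "normal") are illustrative assumptions, not part of any package.

```python
from collections import defaultdict


def confounding_table(batches, conditions):
    """Count samples for every (batch, condition) pair."""
    table = defaultdict(int)
    for batch, condition in zip(batches, conditions):
        table[(batch, condition)] += 1
    return dict(table)


def is_fully_confounded(batches, conditions):
    """Return True when there are at least two conditions and every batch
    contains exactly one of them, so batch and condition effects cannot
    be separated by any correction method."""
    conditions_per_batch = defaultdict(set)
    for batch, condition in zip(batches, conditions):
        conditions_per_batch[batch].add(condition)
    has_comparison = len(set(conditions)) > 1
    one_condition_each = all(len(c) == 1 for c in conditions_per_batch.values())
    return has_comparison and one_condition_each


# Hypothetical layout: all tumors from the in-house cohort, all normals from GTEx.
batches = ["cohort"] * 3 + ["GTEx"] * 3
conditions = ["tumor"] * 3 + ["normal"] * 3
print(confounding_table(batches, conditions))
print(is_fully_confounded(batches, conditions))
```

If the check returns True, any tumor versus normal difference is indistinguishable from a study difference, which is the situation described above.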

Any ideas or suggestions on how this issue might be addressed?



batch effect gtex rna-seq de R • 182 views

If you did not obtain any normal tissues and process them with the same kits as the tumor samples, any differences you see will most likely be caused by technical batch effects. Comparing your data with any downloaded data in the same statistical analysis is pointless. This is (sorry to say) something you should have thought about before gathering the tumor samples. The only option I see would be to gather normal samples now and process them identically in the wet lab, plus some additional tumor samples to correct for batch differences, and then run the analysis. If this is not possible, you are limited to comparisons within your cohort, e.g. splitting samples into high/low groups based on the expression of important genes.
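The within-cohort high/low split could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample IDs and expression values are made up, the median is used as the default cutoff (other cutoffs may suit a given question better), and `split_by_expression` is a hypothetical helper name.

```python
from statistics import median


def split_by_expression(expression, threshold=None):
    """Split samples into 'high' and 'low' groups for one gene.

    expression: dict mapping sample ID -> normalized expression value.
    threshold:  optional cutoff; defaults to the cohort median.
    Samples exactly at the cutoff are assigned to 'low'.
    """
    cutoff = median(expression.values()) if threshold is None else threshold
    groups = {"high": [], "low": []}
    for sample, value in expression.items():
        groups["high" if value > cutoff else "low"].append(sample)
    return groups


# Made-up normalized expression values of one gene across four tumor samples.
gene_expression = {"s1": 1.0, "s2": 5.0, "s3": 3.0, "s4": 9.0}
print(split_by_expression(gene_expression))
```

The resulting groups can then be compared against each other, since all samples share the same processing and no external batch is involved.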

written 7 months ago by ATpoint

Dear ATpoint,

thank you for your strong point. To be honest, I did not participate in the experimental design of the project, and I only became involved after the data had been generated. Unfortunately, these are older data; I joined the analysis very recently and, on going through the relevant information, I also noticed the bottleneck caused by the absence of normal samples. In addition, given the limited options, do you think a co-expression network would help to identify "important genes", or to rank the genes by some measure?
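For illustration only, the two ideas mentioned here (ranking genes by some measure and building a co-expression network) might look like this in a toy form. Variance is used as the ranking measure and absolute Pearson correlation as the edge criterion; both are assumptions chosen for simplicity, the 0.8 threshold is arbitrary, and the gene names and values are made up. Whether either approach is useful depends on the biological question.

```python
from math import sqrt
from statistics import mean, variance


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def rank_by_variance(expr):
    """Return gene names ordered from most to least variable across samples."""
    return sorted(expr, key=lambda gene: variance(expr[gene]), reverse=True)


def coexpression_edges(expr, min_abs_r=0.8):
    """Return (gene1, gene2, r) for gene pairs with |r| >= min_abs_r."""
    genes = sorted(expr)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expr[g1], expr[g2])
            if abs(r) >= min_abs_r:
                edges.append((g1, g2, r))
    return edges


# Made-up expression values: gene -> values across four samples.
expr = {"geneA": [1, 2, 3, 4], "geneB": [2, 4, 6, 8], "geneC": [5, 5, 5, 5]}
print(rank_by_variance(expr))
print(coexpression_edges(expr))
```

Both steps use only samples from the same cohort, so they do not depend on an external reference group.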

written 7 months ago by svlachavas

That fully depends on the question you want to answer. I just wanted to point out that you should not include independent datasets in the same analysis, as batch effects will dominate the results.

written 7 months ago by ATpoint