Question: Differential analysis shows different results
11 weeks ago by
Biologist70 wrote:

I'm working with data containing 47 tumor and 5 normal samples. The aim is to find genes upregulated in tumor. Before doing a differential analysis, I made a clustering heatmap to check how well the samples cluster.

For clustering:

As I have raw counts (featureCounts), I transformed the data into a vsd matrix using DESeq2.

From vsd_matrix I took the top 10% most variable genes for visualization.

vars <- apply(vsd_matrix, 1, IQR)
set <- vsd_matrix[vars > quantile(vars, 0.9), ]

With this I calculated z-scores and plotted the data as a clustering heatmap. [In the heatmap annotation, blue is normal and brown is tumor.]
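The selection and scaling steps described above can be sketched in base R. This is only an illustration: the matrix below is simulated as a stand-in for the real `vsd_matrix`, and the row-wise z-score via `scale()` is one common way to do it, not necessarily what was used for the original heatmap.

```r
# Simulated stand-in for the VST matrix (genes in rows, samples in columns).
# The real vsd_matrix would come from DESeq2; these values are an assumption.
set.seed(1)
vsd_matrix <- matrix(rnorm(1000 * 10, mean = 8), nrow = 1000,
                     dimnames = list(paste0("gene", 1:1000),
                                     paste0("s", 1:10)))

# Per-gene spread, measured by the interquartile range.
vars <- apply(vsd_matrix, 1, IQR)

# Keep genes whose IQR is above the 90th percentile (top 10% most variable).
set <- vsd_matrix[vars > quantile(vars, 0.9), ]

# Row-wise z-score: scale() centres and scales columns,
# so transpose, scale, and transpose back.
z <- t(scale(t(set)))
```

The resulting `z` has one row per selected gene, each with mean 0 and standard deviation 1 across samples, and can be passed to a heatmap function of choice.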

From the heatmap I see that some of the tumor samples do not cluster well with the others. The tumor samples form two clusters.

I removed two normals with very poor library sizes from further analysis.

When I did differential analysis on all 47 tumor and 3 normal samples, I saw only 4 genes upregulated in tumor among the differentially expressed genes.

But when I did differential expression analysis (DEA) between the 3 normal samples and the 35 tumor samples that formed one cluster, I found approximately 30 upregulated genes.

In the same way, when I did DEA between the 3 normal samples and the 12 tumor samples that formed the other cluster, I found around 60 upregulated genes.

Why do the different analyses give different results? Do I need to remove some tumor samples from the DEA based on the clustering?

Any help is appreciated

written 11 weeks ago by Biologist (70)

My big question for you is this: how are you conducting differential expression analysis? From a literal interpretation of your text, one would assume that you directly transformed your raw featureCounts counts via the variance-stabilising transformation (VST) of DESeq2, which is of course an incorrect procedure, and that you are then possibly conducting differential expression analysis on the VST-derived values, which again is not correct. Can you clarify?

Another issue is the large imbalance in sample numbers between tumour and normal, but this does not necessarily rule out a differential expression analysis.

As for the other thing that you're doing, i.e., filtering your samples and then re-generating p-values: it is perfectly logical that you will obtain different p-values by doing this. This can happen for any of several reasons, including alteration of the background data distribution, inadvertent selection of a particular cancer sub-type, etc. Incidentally, filtering samples in this way is not good practice without major justification.

Finally, depending on your cancer type, it is logical that tumours would not cluster together. Apart from the fact that each tumour cell is different, it is recognised that many cancers are sub-divided into main molecular types. For example, breast cancer is divided by IHC based on ER, PGR, and HER2 status, and has known molecular sub-types, too (luminal A/B, basal, triple-negative, etc.).

Added note: be cautious, in addition, of how you define 'normal' in the context of cancer. If you have 'normal' tissue that was merely taken from the area surrounding the tumour, then it is most likely not normal at all and will have a cancer-like profile.

modified 11 weeks ago • written 11 weeks ago by Kevin Blighe (25k)

Hi Kevin,

Thanks for the reply. I used vsd_matrix only for the clustering part. For differential analysis I didn't use VST-derived counts. I understand that tumor samples would not cluster together. But when I use all the tumor samples together I see very few upregulated genes. All the normals are normal tissues without any tumor. What options should I consider for filtering out some samples?

modified 11 weeks ago • written 11 weeks ago by Biologist (70)

Your data normalisation should be conducted on the entire cohort, i.e., all samples. By then including the relevant categorical variables in your DESeq2 design model, you can derive p-values for comparisons between different sub-groups.
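A rough sketch of what this could look like in DESeq2 follows. The counts are simulated with `makeExampleDESeqDataSet()`, and the `group` column with its levels `normal`, `tumourA` and `tumourB` is a hypothetical encoding of the two tumour clusters, not something taken from the actual data. The point is that all samples stay in one object, the transformed values are used only for clustering, and each sub-group is compared against normal through a contrast.

```r
library(DESeq2)

# Simulated raw counts standing in for the featureCounts matrix (assumption).
set.seed(1)
counts <- counts(makeExampleDESeqDataSet(n = 2000, m = 12))

# Hypothetical sample table: 'group' holds normal plus two tumour clusters.
coldata <- data.frame(
  group = factor(rep(c("normal", "tumourA", "tumourB"), times = c(3, 5, 4)),
                 levels = c("normal", "tumourA", "tumourB")),
  row.names = colnames(counts)
)

# One object for the whole cohort, built from raw counts.
dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = coldata,
                              design = ~ group)

# Transformed values are for clustering and heatmaps only.
vsd_matrix <- assay(vst(dds, blind = TRUE))

# The statistical test runs on the raw counts of all samples together.
dds <- DESeq(dds)

# Each tumour sub-group versus normal, from the same fitted model.
res_A <- results(dds, contrast = c("group", "tumourA", "normal"), alpha = 0.05)
res_B <- results(dds, contrast = c("group", "tumourB", "normal"), alpha = 0.05)

# Genes upregulated in tumourA; which() drops rows where padj is NA.
up_A <- res_A[which(res_A$padj < 0.05 & res_A$log2FoldChange > 0), ]
```

With simulated data there are no true differences, so `up_A` is expected to be empty or nearly so. On real data the same pattern gives sub-group comparisons without discarding samples before normalisation.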

May I ask what is the ultimate aim of your study? Is the data from the TCGA or is it your own data?

written 11 weeks ago by Kevin Blighe (25k)

This is my own data. I did the normalisation on the whole cohort, and I'm using edgeR with logFC 1.2 and FDR < 0.05.

Do you think a t-test can be used for differential analysis instead of edgeR or DESeq2?

modified 11 weeks ago • written 11 weeks ago by Biologist (70)

If you normalise your data via edgeR or DESeq2, then you should be using the statistical tests provided by those tools. This is also what the developers of those tools would tell you.
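For edgeR, staying within the tool's own tests could look roughly like the sketch below, which uses the quasi-likelihood pipeline. The counts and the `group` factor are simulated placeholders, and the cut-offs simply take the thresholds mentioned above (logFC 1.2 and FDR < 0.05) literally. Whether 1.2 was meant on the log2 scale is an assumption.

```r
library(edgeR)

# Simulated raw counts standing in for the real data (assumption).
set.seed(1)
gene_mu <- 2^runif(2000, 4, 10)
counts <- matrix(rnbinom(2000 * 12, mu = gene_mu, size = 5), nrow = 2000,
                 dimnames = list(paste0("gene", 1:2000), paste0("s", 1:12)))

# Hypothetical grouping: 3 normal and 9 tumour samples.
group <- factor(rep(c("normal", "tumour"), times = c(3, 9)),
                levels = c("normal", "tumour"))

y <- DGEList(counts = counts, group = group)

# Drop genes with too few reads to be tested reliably.
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalisation factors, estimated on the whole cohort.
y <- calcNormFactors(y)

# Negative binomial model fitted with edgeR's own machinery.
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Quasi-likelihood F-test for tumour versus normal (second coefficient).
qlf <- glmQLFTest(fit, coef = 2)
tab <- topTags(qlf, n = Inf)$table

# Upregulated in tumour under the thresholds quoted in the thread.
up <- tab[tab$FDR < 0.05 & tab$logFC > 1.2, ]
```

The normalisation factors and dispersion estimates feed directly into the test here, which is the link that is lost when normalised values are exported and passed to a t-test.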

written 11 weeks ago by Kevin Blighe (25k)