Question

DESeq2: high number of dysregulated genes in differential gene expression analysis

0


3.6 years ago

ginny • 0

Hi. I am a beginner analysing cancer vs normal tissue samples to find differentially expressed genes using DESeq2. However, there is one common trend in the results: almost 30-40% of the genes are reported as dysregulated. I keep getting 6000-8000 up/downregulated genes (even 10,000 in one case) with padj less than 0.05 and lfc>1, which is too many for my downstream analyses.

My question is: is this normal, or am I doing something wrong? I am also pre-filtering the genes with the following code, but I still find a huge number of genes in the results:

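library(DESeq2)

# tidy = TRUE: the first column of `data` holds the gene IDs
dds <- DESeqDataSetFromMatrix(countData = data, colData = cols, design = ~condition, tidy = TRUE)

# Keep genes with a normalized count of at least 10 in at least 3 samples
dds <- estimateSizeFactors(dds)
nc <- counts(dds, normalized = TRUE)
filter <- rowSums(nc >= 10) >= 3
dds <- dds[filter, ]
dds <- DESeq(dds)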

How do I extract biologically significant data from these results, so as to shorten the DEG list and filter out irrelevant genes? Is there some additional step in DESeq2 that I'm missing?

Additionally, I'm finding a lot of genes that have a very low base mean (around 0.8 to 2). Is it okay to include these genes? I was wondering whether a base mean cutoff should be applied, since other genes have base means in the thousands.

Secondly, the sample groups I'm comparing are unbalanced, i.e. unequal in number (100 normal samples, 40 cancer category I, 120 cancer category II,... so on). Do you think this has an effect on the DEGs that I'm getting? I know equal group sizes would be preferable, but my study requires these groups. Should I cut down my samples to make the groups equal?

I apologize for my lack of knowledge, I'm very new to this. Thanks.

RNA-Seq R DESeq2 DifferentialGeneExpression TCGA • 1.4k views

ADD COMMENT • link 3.6 years ago by ginny • 0

1


A large number of DEGs is not uncommon in cancer vs normal comparisons. I cannot comment further due to the lack of details.

As for filtering genes, you can use the lfcThreshold argument in results() to test explicitly against a certain fold change rather than zero. This ensures that significant genes have at least this fold change. One could set it to e.g. log2(1.4), which in practice leads to a minimum fold change of about 1.5 among the significant genes (one always sets the threshold slightly below the actual cutoff one wants; here, say we want 1.5, so we use 1.4). This removes significant genes with low effect sizes. You can set it even higher if you feel you are getting too many genes. The unbalanced number of samples should be no problem: in the end the samples are used to estimate the dispersions, and the more samples there are, the more accurate the estimates. Are the cancer and normal samples from the same study (same kits used etc.), so that you can exclude a strong batch effect?
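A minimal sketch of the thresholded test, run on simulated counts from DESeq2's makeExampleDESeqDataSet() rather than on the poster's TCGA data. The dataset size, betaSD, the 0.05 cutoff and the log2(1.4) threshold are illustrative assumptions, not recommendations for any particular study.

```r
library(DESeq2)

# Simulated counts stand in for the real data (assumption: 1000 genes, 12 samples)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)
dds <- DESeq(dds, quiet = TRUE)

# Default test: the null hypothesis is log2 fold change == 0
res_default <- results(dds, alpha = 0.05)

# Thresholded test: the null hypothesis is |log2 fold change| <= log2(1.4)
res_lfc <- results(dds, lfcThreshold = log2(1.4),
                   altHypothesis = "greaterAbs", alpha = 0.05)

# Count the genes called significant under each test
n_default <- sum(res_default$padj < 0.05, na.rm = TRUE)
n_lfc     <- sum(res_lfc$padj < 0.05, na.rm = TRUE)
cat("padj < 0.05, default test:    ", n_default, "\n")
cat("padj < 0.05, lfcThreshold set:", n_lfc, "\n")
```

The difference from filtering on log2FoldChange after the fact is that here the threshold is part of the hypothesis being tested, so the adjusted p-values refer to the effect size of interest.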

As for downstream analysis, there are multiple options. Perform pathway enrichment analysis, e.g. with gprofiler2, to see whether interesting terms pop up beyond the usual cancer-related ones (maybe signaling pathways); it depends on the question you want to answer. You can also cluster the DEGs and see whether interesting patterns come up. As said, it depends on the goal.
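One way the clustering idea could look, sketched in base R on a made-up matrix. The matrix expr, its dimensions, the built-in up/down pattern and the choice of k = 2 are all hypothetical; with real data one would start from transformed counts (for example the output of a variance-stabilizing transformation) restricted to the DEGs.

```r
set.seed(1)

# Hypothetical transformed expression matrix: 60 DEGs x 10 samples
# (samples s1-s5 play the role of normal, s6-s10 the role of tumour)
expr <- matrix(rnorm(600), nrow = 60,
               dimnames = list(paste0("gene", 1:60), paste0("s", 1:10)))
expr[1:30, 6:10]  <- expr[1:30, 6:10] + 4   # first 30 genes up in tumour
expr[31:60, 6:10] <- expr[31:60, 6:10] - 4  # last 30 genes down in tumour

# Row-wise z-scores so that clustering reflects the pattern, not the expression level
z <- t(scale(t(expr)))

# Correlation distance between genes, average linkage, cut into k = 2 clusters
hc <- hclust(as.dist(1 - cor(t(z))), method = "average")
clusters <- cutree(hc, k = 2)
print(table(clusters))
```

Each resulting cluster can then be passed separately to an enrichment tool, which is often more informative than testing the full DEG list at once.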

ADD REPLY • link 3.6 years ago by ATpoint 82k

0


Thank you so much for responding! I will try using the threshold and see if I get better results. As for the batch effects, I'm not sure, because I have TCGA data and I can't seem to find the batch that each sample belongs to. I'm still looking into it, and would want to incorporate it into my design if it affects my results that much. Do you have any idea where I can find the batches for TCGA? Again, thanks a lot for your time.

ADD REPLY • link 3.6 years ago by ginny • 0

0


No clue, never used TCGA. You can run PCA as described in the DESeq2 manual to explore how your data cluster; that should imho be a standard QC for every dataset you get.
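A minimal sketch of that PCA check, again on simulated data from makeExampleDESeqDataSet() (an assumption; the real input would be the poster's dds object). With a real dataset one would also pass any known batch column to intgroup.

```r
library(DESeq2)

# Simulated counts stand in for the real data (assumption: 1000 genes, 12 samples)
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 1)

# Variance-stabilizing transformation; blind = TRUE ignores the design, which suits QC
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# returnData = TRUE returns the PC coordinates instead of drawing the plot
pca <- plotPCA(vsd, intgroup = "condition", returnData = TRUE)
print(head(pca))

# Fraction of variance explained by PC1 and PC2
print(attr(pca, "percentVar"))
```

Calling plotPCA(vsd, intgroup = "condition") without returnData draws the plot directly. Samples that separate by something other than the condition of interest are a hint of a batch effect.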

ADD REPLY • link 3.6 years ago by ATpoint 82k

0


This is the PCA plot that I've got. I have five sample groups, and I know that four of them are clustering together, but I am still supposed to find DEGs between them! :/ Do you think I have strong batch effects that need to be accounted for? Thanks.


https://ibb.co/bsdvD21 PCA

ADD REPLY • link 3.5 years ago by ginny • 0

0


I also want to add this: one of my seniors from another lab performed DEG analysis on the same dataset using limma voom. The number of genes identified by limma voom is considerably smaller, and 70-80% of them are shared with what I obtained from DESeq2, although the fold changes differ. Do you think I should proceed with the genes that are common to both methods? Or choose one of the two? Oh, and my data are an HTSeq count matrix. I'm just so confused and don't know who to turn to. Any kind of help is appreciated! Thanks.

ADD REPLY • link 3.6 years ago by ginny • 0