Question

RNA-SEQ: Lowering fold change cutoff from 2 to 1.5.

0

6.2 years ago

Muha0216 • 0

Does lowering the fold change cutoff from 2 to 1.5, which will allow for more DEGs, necessarily mean a higher chance of getting enriched (Q value < 0.05) GO/pathway terms? Or does it still depend entirely on whether DEGs are overrepresented under a term?

RNA-Seq • 3.2k views

ADD COMMENT • link updated 6.2 years ago by Carlo Yague 8.6k • written 6.2 years ago by Muha0216 • 0

1

Apparently, with a lower fold change cutoff you get (as a rule) more DE genes, and with more DE genes there is a higher chance of enriching for something (IMHO).

ADD REPLY • link 6.2 years ago by grant.hovhannisyan ★ 2.6k

2

Yes, as per Grant, more genes equates to more enrichment. One could technically enrich all protein-coding genes, though, but the result would be meaningless. Your cut-offs have to be a fine balance between selecting genes that are differentially expressed and leaving enough room such that the enrichment algorithms can function adequately.

Just one piece of advice though: in no way should you base your study's conclusions on an in silico enrichment. RNA-seq is a rich resource and there are lots of things that you can do. Just doing enrichment does not do the data-type justice.

Things that you could try:

comprehensive literature search of the most differentially expressed genes (DEGs)
clustering and heatmaps using DEGs showing how they can segregate groups
develop predictive signatures (regression modelling) using your DEGs
correlate some clinical parameters to your DEGs and see which are most statistically significant

et cetera

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

1

I wouldn't say "more genes equates to more enrichment", but rather that there is a chance that some of the newly popped-up DE genes might be in the same pathway :)

ADD REPLY • link 6.2 years ago by grant.hovhannisyan ★ 2.6k

0

Well, my phrase was so general that it could be interpreted in any shape or form. I meant 'more genes equates to more enrichment terms'

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

Hi Kevin, your input above is valuable. Could you elaborate? I'm not sure I understood your suggestions well enough.

comprehensive literature search of the most differentially expressed genes (DEGs)-->

In my context, I'm trying to uncover novel as well as known mechanosensitive genes. Do you mean that I should go through papers on mechanical stress and sift out the genes that are commonly differentially expressed across mechanical stress studies, then state which of my genes are novel and which are already well-known mechanosensitive genes?

clustering and heatmaps using DEGs showing how they can segregate groups-->

I'm still learning how to interpret heatmaps, but sometimes I feel they are redundant. Yes, they help to cluster genes that are upregulated and downregulated, but if I were to put one in a PowerPoint slide or research paper, the reader can't see which genes are involved. It's just a chunk of red and green. Also, apart from the controls, I only have two treatment groups: cells treated with low pressure and cells treated with high pressure.

develop predictive signatures (regression modelling) using your DEGs-->

What is meant by this?

Correlate some clinical parameters to your DEGs and see which are most statistically significant-->

It is also not clear to me what you are suggesting here.

I hope you can elaborate further, and many thanks for giving me ideas on interpreting my data.

ADD REPLY • link updated 6.1 years ago by Kevin Blighe 87k • written 6.1 years ago by Muha0216 • 0

2

My dear friend, to answer these questions properly, I will first ask you an obvious question: what was the purpose of your study? There must have been a hypothesis or idea such that you wanted to perform RNA-sequencing on your samples. With neither a hypothesis nor leadership, a study will of course struggle to progress.

comprehensive literature search of the most differentially expressed genes (DEGs)

Yes, I just mean to look at your most differentially expressed genes and to see what has already been reported on them. Spend a full day doing this and you will get new leads and ideas. For example, if I found MS4A1 (CD20) as being highly differentially expressed in my RNA-seq study of an immune condition, I would go to Google and search for:

ncbi ms4a1 immunity

I will then find tonnes of hits because CD20 is a B-cell marker.

If I was studying an eye condition and found numerous RP genes as differentially expressed, I'd search for:

ncbi rp1 rp20 rp33 eye retina

There would be further hits because RP genes have been shown to cause different types of retinitis pigmentosa.

A word of advice: don't use the search bar in PubMed for literature searching. It's sub-standard compared to Google.

--------

clustering and heatmaps using DEGs showing how they can segregate groups

Well, you're probably the first person that I have ever met who does not appear to like heatmaps - kudos to you. You're correct in that they don't show too much, but usually people want to see how well their genes of interest can segregate cases from controls, which is played out in the dendrogram, mostly, but also the heatmap.
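As a rough illustration of the segregation point, here is a minimal sketch in base R on simulated data. The gene names, group sizes and effect size are all made up for the example; nothing here comes from the original poster's data.

```r
# Simulated example: 20 hypothetical DEGs measured in 3 controls and 3 treated samples.
set.seed(1)
n_genes <- 20
group <- rep(c("control", "treated"), each = 3)
expr <- matrix(rnorm(n_genes * 6), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0(group, 1:3)))

# Assumption for illustration: every gene is shifted upwards in treated samples.
expr[, group == "treated"] <- expr[, group == "treated"] + 4

# Convert each gene (row) to Z-scores across samples.
z <- t(scale(t(expr)))

# Cluster the samples; this is the tree drawn on top of a heatmap.
hc <- hclust(dist(t(z)), method = "complete")
clusters <- cutree(hc, k = 2)
print(table(group, clusters))

# heatmap(z) would draw the heatmap itself, with gene names as row labels.
```

If the DEGs carry real signal, the two branches of the sample dendrogram should line up with the experimental groups, which is what the cross-table checks. With a short gene list, the row labels also show which genes sit in each block of colour.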

Also, one of the 'greatest' heatmaps ever was by Charles Perou, a breast cancer pathologist, I believe, who identified gene expression signatures in breast cancer tumours and thus identified the 4 different primary breast cancer sub-types that we now know today.

You should take a look at my various postings on heatmaps:

...and my recent publication where I identified novel clusterings in metabolomics:

Vitamin D prenatal programming of childhood metabolomics profiles at age 3 y

----------

develop predictive signatures (regression modelling) using your DEGs

Again, depending on the nature of your study, one may have the intent to identify a gene signature that can define a particular condition. The DEGs that you identify, even after best efforts of normalisation and FDR thresholding, still likely comprise a large chunk of genes that provide a minimal amount of information in terms of defining the condition in which they are found as highly or lowly expressed.

Typically, one identifies a group of DEGs and then puts these to the test via regression modelling, where the endpoint may be disease status or disease classification (like tumour stage in cancer). Regression modelling can then be fed into ROC analysis where one can derive test statistics such as sensitivity and specificity, i.e., in the end, one could arrive at a gene panel that has sensitivity of 90% via ROC analysis in identifying Alzheimer's patients from blood expression data.
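To make the idea concrete, here is a minimal sketch in base R with simulated data. The two genes (geneA, geneB), the sample size and the effect sizes are hypothetical, and the AUC is computed from ranks rather than with a dedicated ROC package.

```r
# Simulated data: 100 controls (status 0) and 100 cases (status 1).
set.seed(42)
n <- 200
status <- rep(c(0, 1), each = n / 2)
geneA <- rnorm(n, mean = 1.5 * status)  # hypothetical DEG, higher in cases
geneB <- rnorm(n, mean = 1.0 * status)  # hypothetical DEG, weaker effect
dat <- data.frame(status, geneA, geneB)

# Logistic regression with disease status as the endpoint.
fit <- glm(status ~ geneA + geneB, data = dat, family = binomial)
prob <- predict(fit, type = "response")

# Sensitivity and specificity at a 0.5 probability cutoff.
pred <- as.integer(prob > 0.5)
sensitivity <- sum(pred == 1 & status == 1) / sum(status == 1)
specificity <- sum(pred == 0 & status == 0) / sum(status == 0)

# Area under the ROC curve via the Mann-Whitney rank formula.
n1 <- sum(status == 1)
n0 <- sum(status == 0)
auc <- (sum(rank(prob)[status == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(c(sensitivity = sensitivity, specificity = specificity, auc = auc))
```

Note that these figures are computed on the same samples used to fit the model, so they are optimistic; a real signature would need cross-validation or an independent test set.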

Again, I have posted resources on this on Biostars:

-------------------

Correlate some clinical parameters to your DEGs and see which are most statistically significant

For many diseases, current diagnostic and prognostic criteria are based on laboratory-based assessments, or even things like family history. For example, the PSA test for prostate cancer measures an antigen in the blood; many immune disorders are measured through immunohistochemistry (IHC) of cell markers, etc. It can be useful to correlate our expression data to these clinical markers in order to see which genes may be related to certain parameters and, therefore, which genes' expression could be used as a surrogate marker of these parameters.

Again, another posting:

CorLevelPlot - Visualise correlation results, e.g., clinical parameter correlations
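For a sense of what such a correlation screen looks like without any plotting package, here is a minimal sketch in base R. The data are simulated, the parameter is a made-up continuous measurement, and gene1 is deliberately constructed to track it.

```r
# Simulated data: 5 hypothetical DEGs in 30 samples, plus one continuous parameter.
set.seed(7)
n_samples <- 30
param <- runif(n_samples, 0, 10)
expr <- matrix(rnorm(5 * n_samples), nrow = 5,
               dimnames = list(paste0("gene", 1:5), NULL))

# Assumption for illustration: gene1 expression rises with the parameter.
expr["gene1", ] <- expr["gene1", ] + 0.8 * param

# Spearman correlation of each gene against the parameter.
res <- t(apply(expr, 1, function(x) {
  ct <- cor.test(x, param, method = "spearman")
  c(rho = unname(ct$estimate), p = ct$p.value)
}))

# Adjust for multiple testing (Benjamini-Hochberg) and rank the genes.
res <- data.frame(res, padj = p.adjust(res[, "p"], method = "BH"))
print(res[order(res$padj), ])
```

Genes at the top of the ranked table are the candidates for surrogate markers of that parameter.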

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

My objective in doing this RNA-seq study is to find novel and known mechanosensitive genes upon compressing cancer cells. I also want to study how cancer behavior might be affected, and a global RNA-seq analysis can guide me on this.

Thanks Kevin! I will consider your points. I will probably lock myself in a room exploring my RNA-seq data to find the best direction to take from here.

ADD REPLY • link 6.1 years ago by Muha0216 • 0

1

Okay, in that case, I presume that you can measure the level of applied compression, which would then be a 'clinical' parameter that you could use, of course.

Just quickly back to gene enrichment: terms that come up in gene enrichment should merely guide you as you then conduct literature searches (i.e. just include the GO term in Google as you search).

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

score 5 · Answer 1 · 2018-02-24

5

6.2 years ago

Carlo Yague 8.6k

more genes equates to more enrichment.

This is only true if the proportion of DE genes annotated to a given GO term stays the same as the fold change cutoff decreases. We can simulate that with some R code, using Fisher's exact test for enrichment of a single term:

# Build the 2x2 contingency table for a single pathway.
# Rows: gene is in the pathway (Y/N); columns: gene is DE (Y/N).
build_mat <- function(tot_genes, de_genes, pathway_genes, de_in_pathway) {
  matrix(c(de_in_pathway,                   # in pathway, DE
           de_genes - de_in_pathway,        # not in pathway, DE
           pathway_genes - de_in_pathway,   # in pathway, not DE
           tot_genes - de_genes - pathway_genes + de_in_pathway),  # neither
         nrow = 2,
         dimnames = list(pathway = c("Y", "N"), DE = c("Y", "N")))
}

tot_genes <- 20000
pathway_genes <- 100

# base case: 200 DEGs, 5 of them in the pathway
fisher.test(build_mat(tot_genes, 200, pathway_genes, 5)) # p-value = 0.003324

# twice as many DEGs, same proportion in the pathway
fisher.test(build_mat(tot_genes, 400, pathway_genes, 10)) # p-value = 3.196e-05

# twice as many DEGs, but none of the 200 additional DEGs is in the pathway
fisher.test(build_mat(tot_genes, 400, pathway_genes, 5)) # p-value = 0.05038

# extreme case: all genes are DEGs
fisher.test(build_mat(tot_genes, 20000, pathway_genes, 100)) # p-value = 1
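
One detail worth adding: fisher.test() is two-sided by default, whereas over-representation is a one-sided question. The one-sided Fisher p-value is the upper tail of a hypergeometric distribution, which the sketch below checks with the base-case numbers from above (the variable names are mine).

```r
# Numbers copied from the base case: 20000 genes, 100 in the pathway,
# 200 DEGs, 5 DEGs in the pathway.
tot_genes <- 20000
pathway_genes <- 100
de_genes <- 200
overlap <- 5

tab <- matrix(c(overlap,
                de_genes - overlap,
                pathway_genes - overlap,
                tot_genes - de_genes - pathway_genes + overlap),
              nrow = 2)

# One-sided Fisher test: is the overlap larger than expected by chance?
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Same quantity from the hypergeometric tail: P(X >= overlap).
p_hyper <- phyper(overlap - 1, pathway_genes, tot_genes - pathway_genes,
                  de_genes, lower.tail = FALSE)

print(c(fisher = p_fisher, hypergeometric = p_hyper))
```

The two values should agree up to floating-point error, which is why phyper() is a convenient shortcut when testing many terms in a loop.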

ADD COMMENT • link 6.1 years ago by Carlo Yague 8.6k

1

Thanks, a neat piece of code. It confirms why I will never base any clinical decision on gene enrichment (I know people who do), and why I'll continue to be overly cautious about drawing conclusions from enrichment in a research setting.

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

0

thank you, great illustration!

ADD REPLY • link 6.2 years ago by grant.hovhannisyan ★ 2.6k

0

The code is good but assumes that no new GO terms are added as the number of DEGs increases. In practice, this scenario is unlikely given the many hundreds of thousands of enrichment terms that exist. In my experience, lowering thresholds and incorporating more DEGs will almost always introduce a greater number of enriched terms, many of which are meaningless and could lead to misinterpretation.

As per my comment (above), enrichment should not even be the main focus of the user's RNA-seq analysis.

ADD REPLY • link 6.2 years ago by Kevin Blighe 87k

1

What do you mean by "no new GO terms are added with increased number of DEGs"? My code tests enrichment for only one GO term. In practice, the tests are usually applied to all (or a subset of) the annotated GO terms, independently of the number of DEGs.

In the hypothetical case where the thresholds are so low that all genes are DEGs (I edited my code above), no GO term can be enriched, so lowering thresholds does not always result in more enriched terms.

I agree with the last line of your comment.
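
For anyone who wants to explore the many-terms situation, here is a minimal sketch on simulated data. The term sizes, the number of terms and the planted signal are all assumptions of mine: 200 random terms of 100 genes each, with term 1 forced to contain 30 of the 200 top-ranked genes.

```r
set.seed(3)
n_genes <- 20000
ranked <- seq_len(n_genes)  # hypothetical gene IDs, ordered from most to least DE

# 200 random terms of 100 genes each.
n_terms <- 200
term_size <- 100
terms <- replicate(n_terms, sample(n_genes, term_size), simplify = FALSE)

# Planted signal: term 1 holds 30 of the top 200 genes.
terms[[1]] <- c(sample(200, 30), sample(201:n_genes, 70))

# Number of terms with BH-adjusted p < 0.05 when the top n_de genes are called DE.
count_enriched <- function(n_de) {
  de <- ranked[seq_len(n_de)]
  p <- sapply(terms, function(g) {
    k <- sum(g %in% de)
    phyper(k - 1, length(g), n_genes - length(g), n_de, lower.tail = FALSE)
  })
  sum(p.adjust(p, method = "BH") < 0.05)
}

sizes <- c(200L, 400L, 2000L, 20000L)
n_enriched <- sapply(sizes, count_enriched)
names(n_enriched) <- sizes
print(n_enriched)
```

In this toy setup the planted term is detected with the strictest list, and nothing can be enriched once every gene is called DE. What happens in between depends entirely on where the extra genes fall, so real data may behave differently.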

ADD REPLY • link 6.1 years ago by Carlo Yague 8.6k

1

Yes, I think that we were on different trains of thought. Your code example is very good and does indeed make the point for a single enrichment term / pathway. I have been even more interested in your code because I recently had an in-depth conversation with a colleague about gene enrichment and how the number of genes can affect it.

When I said that "more genes equates to more enrichment", what I meant was that the inclusion of a greater number of DEGs would result in a greater number of enriched terms due to new genes matching new enrichment terms.

Have a nice Saturday night

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k