Question: Some clarification on enrichment analyses and pathway analyses?
kirannbishwa01 (United States) wrote, 3.0 years ago:

I have analyzed my RNA-seq data and identified the genes with significant fold changes and p-values. The next step is to do enrichment analyses.

I have been doing some extensive reading for the last two weeks and have tried some analyses using command-line tools, but things are getting confusing for me because so many different packages and references are available. What is missing is a comprehensive, conceptual tutorial on how and why to do things in enrichment analyses.

Just a few questions:

  • Should I only select significant genes for my enrichment analyses, pathway analyses? Why, why not?
  • I have found several tutorials on DESeq/2, but I am not finding any one that gives a clean and comprehensive view on how to further process the data for downstream enrichment and visualization?

  • What is the difference between doing GO enrichment by CC vs. BP vs MF?

  • What is the difference between GO vs KEGG?
  • I am working with a non-model organism: in that case, is it best to do these analyses by matching the gene ID/name of my organism to the ortholog gene ID/name of a model organism? This may or may not be a good idea because certain pathways might differ between organisms, but is there any proposed solution?

Any ideas please.

modified 3.0 years ago by Kevin Blighe • written 3.0 years ago by kirannbishwa01

Kevin Blighe wrote, 3.0 years ago:

Well, gene enrichment (or 'gene-set enrichment analysis'; GSEA) is one of those things on which everyone has their own take, i.e., opinion. I've met everyone from people who don't even want to hear anything about it to those who apparently idolise it. The way that you've carefully written your question tells me that you're in between these two extremes.

The first thing to consider is that gene enrichment is an in silico analysis, but many of the enrichment terms are based on curated datasets. For the Gene Ontology terms, for example, each and every term has an assigned evidence code, which can be taken into account when interpreting a particular enrichment. Take a look at my answer here: A: Go annotation reliability ?

Should I only select significant genes for my enrichment analyses, pathway analyses? Why, why not?

The general idea of gene enrichment is that you have identified a group of genes as being statistically significantly associated with a particular condition and that you want to learn more about the potential functions, processes, pathways, et cetera, that may be altered as a result. Thus, it does not make much sense to perform the enrichment on non-significant genes.
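As a rough illustration of what enrichment on a list of significant genes usually boils down to, here is a minimal sketch of an over-representation test using the hypergeometric distribution. The function name and all numbers are made up for illustration, and real tools differ in details such as the choice of background and multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_in_set, n_significant, n_overlap):
    """Upper-tail hypergeometric p-value: P(overlap >= n_overlap).

    n_background  -- number of genes tested (the 'universe')
    n_in_set      -- genes in the universe annotated to the term/pathway
    n_significant -- significant genes in the universe
    n_overlap     -- significant genes that are annotated to the term
    """
    max_overlap = min(n_in_set, n_significant)
    total = comb(n_background, n_significant)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_significant - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10,000 tested genes, 200 significant,
# a gene set of 50 genes, 8 of which are significant.
p_value = hypergeom_enrichment_p(10_000, 50, 200, 8)
print(f"enrichment p-value: {p_value:.3g}")
```

With these toy numbers only about one significant gene would be expected in the set by chance, so an overlap of 8 gives a small p-value. Note that the result depends heavily on what is used as the background.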

Edit: 11th January 2019: some programs can specifically take all genes in your dataset, perform enrichment, and then determine degree/level of enrichment by utilising the p-values and fold-changes. These methods are more powerful, I feel.
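To make the idea of using all genes more concrete, here is a simplified sketch of a rank-based enrichment score. It ranks every gene by a signed metric built from the p-value and fold-change, then walks down the list with an unweighted, Kolmogorov-Smirnov-style running sum. This is an assumption-laden toy, not the algorithm of any particular program: the ranking metric, function names, and data are all hypothetical, and real methods typically add weighting and a permutation-based significance test.

```python
import math


def rank_metric(log2fc, p_value):
    """Signed ranking metric: -log10(p) carrying the sign of the fold-change."""
    return math.copysign(-math.log10(p_value), log2fc)


def running_sum_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up on a gene-set member, step down otherwise.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy results (made-up values): gene -> (log2 fold-change, p-value).
results = {
    "geneA": (2.5, 1e-8),
    "geneB": (1.8, 1e-5),
    "geneC": (0.2, 0.6),
    "geneD": (-0.1, 0.9),
    "geneE": (-1.5, 1e-4),
    "geneF": (-2.2, 1e-7),
}
ranked = sorted(results, key=lambda g: rank_metric(*results[g]), reverse=True)

score_up = running_sum_enrichment_score(ranked, {"geneA", "geneB"})
score_down = running_sum_enrichment_score(ranked, {"geneE", "geneF"})
print(ranked, score_up, score_down)
```

A positive score means the set's genes cluster at the up-regulated end of the ranking, and a negative score means they cluster at the down-regulated end. No significance cut-off on individual genes is needed.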

I have found several tutorials on DESeq/2, but I am not finding any one that gives a clean and comprehensive view on how to further process the data for downstream enrichment and visualization?

You will never find a 'clean and comprehensive' tutorial - everyone has their own take on it. DESeq2 is excellent at conducting analyses of [primarily] RNA-seq data but it's not a gene enrichment program.
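As one possible bridge between the two steps, here is a minimal sketch that takes a results table exported to CSV and splits it into the gene lists that enrichment tools usually ask for: a background of tested genes plus up- and down-regulated lists. The `log2FoldChange` and `padj` columns follow the DESeq2 results table; the `gene` column name, the cut-offs, and the toy values are assumptions for illustration.

```python
import csv
import io

# Toy stand-in for a results table exported to CSV (values are made up).
TOY_CSV = """gene,log2FoldChange,pvalue,padj
geneA,2.5,1e-8,1e-6
geneB,-1.8,1e-5,4e-4
geneC,0.2,0.6,0.8
geneD,3.1,0.04,NA
geneE,-0.4,0.01,0.03
"""


def split_results(handle, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Return (background, up, down) gene lists from a results CSV."""
    background, up, down = [], [], []
    for row in csv.DictReader(handle):
        if row["padj"] in ("NA", ""):
            continue  # no adjusted p-value: leave out of the tested universe
        background.append(row["gene"])
        padj = float(row["padj"])
        lfc = float(row["log2FoldChange"])
        if padj < padj_cutoff and abs(lfc) >= lfc_cutoff:
            (up if lfc > 0 else down).append(row["gene"])
    return background, up, down


background, up, down = split_results(io.StringIO(TOY_CSV))
print("background:", background)
print("up:", up)
print("down:", down)
```

Dropping genes without an adjusted p-value from the background is a design choice in this sketch, not a rule. Whichever convention you pick, the background should reflect the genes that could have been called significant.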

What is the difference between doing GO enrichment by CC vs. BP vs MF?

  • CC, cellular component
  • BP, biological process
  • MF, molecular function

Think of these as sub-classifications. Each of these will contain 1000s of gene enrichment terms that are organised in a hierarchical fashion. Most people will be interested in just BP and MF.
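To show what the hierarchical organisation means in practice, here is a small sketch in which a gene annotated to a specific term is also counted for every parent term above it. The term names, gene names, and parent links below are a made-up toy hierarchy, not real GO content.

```python
# Toy hierarchy (hypothetical terms): child term -> list of parent terms.
PARENTS = {
    "toy:heat_response": ["toy:stress_response"],
    "toy:stress_response": ["toy:root"],
    "toy:dna_repair": ["toy:stress_response", "toy:dna_metabolism"],
    "toy:dna_metabolism": ["toy:root"],
}

# Direct annotations (hypothetical): gene -> most specific terms.
DIRECT = {
    "geneA": ["toy:heat_response"],
    "geneB": ["toy:dna_repair"],
}


def ancestors(term, parents):
    """Return every term reachable by walking parent links upwards."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct_annotations, parents):
    """Map each term to the genes annotated to it or to any descendant."""
    term_to_genes = {}
    for gene, terms in direct_annotations.items():
        for term in terms:
            for t in {term} | ancestors(term, parents):
                term_to_genes.setdefault(t, set()).add(gene)
    return term_to_genes


term_to_genes = propagate(DIRECT, PARENTS)
for term in sorted(term_to_genes):
    print(term, sorted(term_to_genes[term]))
```

This is why broad terms near the top of a sub-classification collect many genes, and why related terms in an enrichment result often share most of their genes.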

What is the difference between GO vs KEGG?

These are different organisations/groups.

  • The Gene Ontology (GO) Consortium is based in the USA and is funded by the NHGRI. The consortium has been in existence for almost 20 years and its aim is to define natural/healthy biological processes, molecular functions, and components (as per the sub-classifications mentioned above). Their gene enrichment categories and terms are based on either in silico or confirmed laboratory evidence (as per the evidence codes that I mentioned above).
  • The Kyoto Encyclopaedia of Genes and Genomes (KEGG) is a consortium based in Japan. It has been in existence slightly longer than GO and is most recognised for the curation of pathways in human and other species. KEGG covers a lot of things other than pathways, though. KEGG also covers both normal/healthy and disease-related pathways.

NB - it's important to remember that some GO terms relate to pathways too.

I am working with a non-model organism: in that case, is it best to do these analyses by matching the gene ID/name of my organism to the ortholog gene ID/name of a model organism? This may or may not be a good idea because certain pathways might differ between organisms, but is there any proposed solution?

If you use an enrichment tool like DAVID, your species of interest is most likely included in this and, in addition, with DAVID, you can do enrichment on both GO and KEGG (and other databases) at the same time. On DAVID's main page, go to Functional Annotation and there you'll see a text box where you can input your genes.
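If your species turns out not to be covered and you fall back on the ortholog route described in the question, it helps to keep track of what gets lost in the mapping. Here is a minimal sketch, assuming you already have an ortholog table from some source; the function name and all identifiers are hypothetical.

```python
def map_to_orthologs(genes, ortholog_table):
    """Split genes into one-to-one mapped, ambiguous, and unmapped groups."""
    mapped, ambiguous, unmapped = {}, {}, []
    for gene in genes:
        targets = ortholog_table.get(gene, [])
        if len(targets) == 1:
            mapped[gene] = targets[0]
        elif targets:
            ambiguous[gene] = targets  # one-to-many: needs a manual decision
        else:
            unmapped.append(gene)  # no ortholog found in the model organism
    return mapped, ambiguous, unmapped


# Toy ortholog table (hypothetical IDs): my-species gene -> model-organism genes.
ORTHOLOGS = {
    "myorg_0001": ["MODEL_G1"],
    "myorg_0002": ["MODEL_G2", "MODEL_G3"],
}

mapped, ambiguous, unmapped = map_to_orthologs(
    ["myorg_0001", "myorg_0002", "myorg_0003"], ORTHOLOGS
)
print("mapped:", mapped)
print("ambiguous:", ambiguous)
print("unmapped:", unmapped)
```

Reporting the ambiguous and unmapped fractions alongside the enrichment results makes it easier to judge how much of the biology could have been missed by the mapping. The same mapping should be applied to the background genes, not only to the significant ones.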


My advice to you is to do the enrichment but to be cautious about the interpretation of the results. It is quite easy to 'cherry pick' the enrichment terms that you want to see, i.e., those that fit your hypothesis(es). If you get lucky and everything comes up for which you had hoped, I would still exercise caution. Don't get too excited by gene enrichment.

In terms of filtering enriched terms, if you use DAVID, you can filter enrichment terms based on a Benjamini P value. In terms of displaying gene enrichments, I would recommend simple displays like these:
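For reference, here is a minimal sketch of that kind of filtering, assuming that the Benjamini P value refers to the Benjamini-Hochberg adjustment. The term names, raw p-values, and the 0.05 cut-off are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy enrichment results (hypothetical terms and raw p-values).
terms = ["term1", "term2", "term3", "term4"]
raw_p = [0.001, 0.01, 0.03, 0.5]

adjusted_p = benjamini_hochberg(raw_p)
kept = [t for t, q in zip(terms, adjusted_p) if q < 0.05]
print(dict(zip(terms, adjusted_p)))
print("kept:", kept)
```

The adjustment depends on how many terms were tested, so the same raw p-value can pass or fail the filter depending on how many databases and categories were included in the run.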

[Image not preserved: Captura_de_tela_de_2017_11_12_21_07_46]

[Image not preserved: GSEA]

modified 21 months ago • written 3.0 years ago by Kevin Blighe

I think Kevin is giving you great advice here!

I would add that the underlying assumption in any pathway enrichment analysis is that the genes in the pathway are independent variables. In other words, enrichment analysis essentially only "counts" the number of DEGs on any given pathway.

Depending on your experimental design, this may be an appropriate approach, but in my experience, people are usually looking for a more comprehensive analysis approach but are unaware of one.

You might consider using SPIA (Signaling Pathway Impact Analysis) if you want an R/Bioconductor-based approach, or possibly ROntoTools. These tools take a topology-based approach, which looks at each DEG's role, position, and interactions to identify perturbed pathways rather than simply enriched ones.

If you want to use a web-based version of these tools (without any command-line use) you can try it for free in iPathwayGuide from Advaita Bioinformatics.

written 3.0 years ago by andrew510

I would add that the underlying assumption in any pathway enrichment analysis is that the genes in the pathway are independent variables. In other words, enrichment analysis essentially only "counts" the number of DEGs on any given pathway.

That is a very good point, andrew.

written 3.0 years ago by Kevin Blighe

Thank you so much, Kevin, for highlighting your points in such a comprehensive manner. It has cleared up a lot of my doubts. If I have more questions, I will let you know.

written 2.9 years ago by kirannbishwa01

Hi Kevin, can you tell me what tool was used to generate these plots, and link to a description of them so that I can read more?

Thanks,

modified 2.9 years ago • written 2.9 years ago by kirannbishwa01

Hi friend, the top plot is just an Enhanced OncoPrint, made using the ComplexHeatmap package. The plot on the left is mutation data, whilst on the right it's gene enrichment based on the genes in which the mutations are found.

The bottom plot is my own but based on the functions provided by ComplexHeatmap. It's essentially the same enrichment plot as on the top-right, but I've added a lot of annotation and have split the heatmap based on up- and down-regulated genes.

I would encourage you to devote a single day to learning ComplexHeatmap, as you will never then go back to using the other heatmap functions. It is highly flexible and the possibilities are endless. If you run into difficulty, post a question here and I should pick it up.

written 2.9 years ago by Kevin Blighe