Question: PCA on sample GO terms
4.7 years ago by
gilagalad10
United Kingdom
gilagalad10 wrote:

Hi,

I would like to cluster (or run PCA on) microarray samples across two different platforms. I am afraid that clustering on the genes common to both platforms would be influenced more by the platform (different probes measuring different sequences of the transcripts, and on different scales) than by the treatment effect. As there is generally better consistency among the upregulated processes (enriched GO terms, pathways), I would like to cluster based on GO terms.

Suppose cells are treated with compound A, B, C or D (each done in several replicates). Comparing each to the untreated control yields lists of differentially regulated genes. From these I determine the enriched GO terms (say, for upregulated genes): GO.A, GO.B, GO.C and GO.D. This would be measured on platform 1. Then I would have cells treated with compound E, compared to the untreated control in the same way, to get GO.E. This experiment would be on platform 2. I would like to know how similar the effect of treatment E is to that of A, B, C and D.

One solution that comes to mind is to first find the GO terms that are present on both platforms, then compute GO.A, GO.B, GO.C, GO.D and GO.E. GO terms that are not significantly changed (upregulated) would get a p-value of 1, so I would have p-values for all of the common GO terms. Then I would do, for example, PCA on the p-values (I think they should be scaled first) and look at the distances among the samples.
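A minimal sketch of this idea, using only the Python standard library. The GO IDs and p-values are made-up placeholders, the -log10 transform plus column centering is one possible choice of scaling (an assumption, not something fixed by the proposal), and the first principal component is obtained by plain power iteration rather than a full PCA routine.

```python
import math

# Hypothetical enrichment p-values per treatment. A term that is missing
# from a treatment is treated as "not enriched" and assigned p = 1.
go_pvalues = {
    "A": {"GO:0006954": 1e-6, "GO:0006915": 1e-3},
    "B": {"GO:0006954": 1e-5, "GO:0008283": 1e-2},
    "C": {"GO:0008283": 1e-4, "GO:0007049": 1e-4},
    "D": {"GO:0007049": 1e-5},
    "E": {"GO:0006954": 1e-4, "GO:0006915": 1e-2},
}

def build_matrix(pvals):
    """Return (samples, terms, rows), where rows hold -log10(p) per term."""
    terms = sorted({term for per_sample in pvals.values() for term in per_sample})
    samples = sorted(pvals)
    rows = [[-math.log10(pvals[s].get(t, 1.0)) for t in terms] for s in samples]
    return samples, terms, rows

def center_columns(rows):
    """Subtract each column (GO term) mean so PCA works on centered data."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def first_pc_scores(rows, iterations=200):
    """Project samples on the first principal component (power iteration)."""
    x = center_columns(rows)
    n_feat = len(x[0])
    v = [float(j + 1) for j in range(n_feat)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        proj = [sum(a * b for a, b in zip(row, v)) for row in x]  # X v
        w = [sum(row[j] * p for row, p in zip(x, proj)) for j in range(n_feat)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return [sum(a * b for a, b in zip(row, v)) for row in x]

samples, terms, rows = build_matrix(go_pvalues)
scores = first_pc_scores(rows)
for sample, score in zip(samples, scores):
    print(sample, round(score, 2))
```

Samples whose scores lie close together on this axis share a similar enrichment profile. The sign of the axis is arbitrary, so only relative positions are meaningful.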

Does this make sense? Is there a better way?

Any suggestions appreciated!

Vojta

ADD COMMENTlink modified 2.7 years ago by igor11k • written 4.7 years ago by gilagalad10

It's an interesting approach. However, I think the variables used for PCA should in principle be independent of each other. GO terms, on the other hand, are structured as a tree, and I am not sure whether this would break the principle of independence.
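One rough way to see how strongly terms depend on each other is to measure the overlap of their gene sets. The sketch below uses hypothetical term names and gene IDs and assumes that genes annotated to a child term are also counted for its parent.

```python
from itertools import combinations

# Hypothetical GO-term-to-gene annotations. The child's genes are a subset
# of the parent's, which is the assumed effect of the hierarchy.
term_genes = {
    "GO:parent": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "GO:child": {"g1", "g2", "g3"},
    "GO:unrelated": {"g7", "g8", "g9"},
}

def jaccard(a, b):
    """Jaccard index: shared genes divided by all genes in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Pairwise overlap between all terms; a high value means two terms will
# tend to be enriched together and are far from independent variables.
overlap = {
    (s, t): jaccard(term_genes[s], term_genes[t])
    for s, t in combinations(sorted(term_genes), 2)
}
for pair, value in overlap.items():
    print(pair, value)
```

Pairs with high overlap could be merged or filtered before any PCA, though whether that is enough to address the concern is an open question.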

ADD REPLYlink written 4.7 years ago by Giovanni M Dall'Olio27k

Yes, to me that is one of the concerns: whether it breaks the independence assumption. But then again, is it viable to find DEGs from a 1x1 comparison and then look at GO? If the data are cross-platform, the ideal would be cross-platform normalization followed by finding DEGs for the 4x4 samples, which gives a more statistically viable set of DEGs on which GO enrichment can be performed and then represented semantically.

ADD REPLYlink written 4.7 years ago by ivivek_ngs5.0k

Thank you for your insight. Do you think using enriched pathways instead of GO would fix this? Or do you have in mind another way to compare samples based on GO where the tree structure would not be a problem?

ADD REPLYlink written 4.7 years ago by gilagalad10

It all depends on what you want to categorize as pathways. In GO enrichment, Biological Process (BP) terms are closely associated with specific pathways, and even Molecular Function (MF) terms can be translated into pathways. So in a way you are trying to see how enriched your genes are for specific molecular functions or biological processes, and if pathways that support your hypothesis are enriched in any of the BP or MF categories, then bingo, that will help you restrict your gene list. Usually when I refer to pathways I mean pathways in KEGG, Ingenuity or Reactome, but they are more like downstream biological answers that correspond to a specific design. I guess you are looking for a preliminary approach, so go ahead with GO terms and either do a PCA on them or a correlation plot to see which terms are closely associated. However, if you are doing PCA, should it not be done on the enrichment scores rather than the p-values? You can select the significant GO terms by p-value along with their enrichment scores, make a Venn diagram to find the GO terms common to all samples, and then make a heatmap, PCA or correlation plot of the enrichment scores to understand how far apart the samples are.
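To make the distinction between the two quantities concrete, here is a small sketch that computes both for one GO term. It assumes "enrichment score" means fold enrichment from a simple over-representation test (other tools define it differently), and the counts are invented.

```python
from math import comb

def hypergeom_enrichment(k, n, K, N):
    """Fold enrichment and one-sided hypergeometric p-value for one GO term.

    k: DEGs annotated to the term       n: total number of DEGs
    K: background genes in the term     N: total background genes
    """
    # Observed fraction of DEGs in the term over the expected fraction.
    fold = (k / n) / (K / N)
    # P(X >= k): probability of seeing at least k annotated genes by chance.
    upper = min(K, n)
    p_value = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    ) / comb(N, n)
    return fold, p_value

# Hypothetical counts: 3 of 4 DEGs fall in a term covering 5 of 20 genes.
fold, p_value = hypergeom_enrichment(k=3, n=4, K=5, N=20)
print(fold, p_value)
```

Fold enrichment stays on a comparable scale across treatments with different numbers of DEGs, which is one argument for using it as the PCA input, while the p-value also depends on list size.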

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by ivivek_ngs5.0k

Some links that could be informative:

  1. Methods For Comparing Microarrays From Different Datasets
  2. Preprocessing Of Microarray Datasets Derived From Different Platforms
  3. What software is best for cross-platform microarray results comparison?
  4. Meta-analyses of data from two (or more) microarray data sets.
  5. http://barcwiki.wi.mit.edu/wiki/SOPs/normalizePublic
ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Tanvir Ahamed 290

Thanks for the links. However, they focus mainly on reducing the datasets to common and expressed genes. This still retains some bias; I have read that one should verify that the probes target the same transcript region. Generally, though, the processes upregulated/downregulated on different platforms correspond better than the individual genes do (Li et al. 2009).

I may do the GO analysis anyway and compare that manually (see side by side which lists are similar), but I thought there would be some better approach :)

ADD REPLYlink written 4.7 years ago by gilagalad10
4.7 years ago by
Philipp Bayer6.8k
Australia/Perth/UWA
Philipp Bayer6.8k wrote:

I've done something similar with csbl.go (installation works with R 3.2.3, site is a bit outdated)

It groups all genes by GO-terms and by gene expression, here's an example picture I just made with 100 of my genes and 5 conditions:

Heatmap

The object that csbl.go makes can then be interrogated to check which genes are in which GO-group. Does that help with your question?

ADD COMMENTlink written 4.7 years ago by Philipp Bayer6.8k

Philipp, thank you for the suggestion. As I understand it, one needs GO-annotated genes that are the same in all of the samples, so it clusters the samples in a way that is still "gene dependent" at the GO level. Would it be possible to extend that to multiple platforms so that it would be "gene independent" but "GO dependent"?

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by gilagalad10

Aaah I see - so you could have sample (organism) A with gene A and gene B, but sample (organism) B has gene C and gene D, so you want to cluster "purely" by shared GO-terms.

Hmm you could fake an expression level for gene A and gene B for sample B by setting it to 0, and setting it to 0 for gene C and gene D in sample A, but I'm not sure whether that would work out, you may break some key assumptions.

In that case, my suggestion is likely to not work, sorry about that.

ADD REPLYlink written 4.7 years ago by Philipp Bayer6.8k
4.7 years ago by
ivivek_ngs5.0k
Seattle,WA, USA
ivivek_ngs5.0k wrote:

First of all, if you have 2 conditions, treated and untreated, with (I believe) 4 replicates in each, then you should run the differential expression analysis on this 4x4 group to find the list of DEGs and then do GO term enrichment; that will be statistically viable. A 1x1 comparison, which is what I understand from your query, followed by GO enrichment is not statistically viable in my view. Since you are concerned that the microarrays come from 2 different platforms, there might be a batch effect, so you can check this paper. You can also use RankProd to find DEGs in microarrays coming from different platforms while accounting for cross-platform errors or biases, or take a look at this thread. Then finally use all 4x4 samples to find DEGs and do GO enrichment, and if there are too many GO terms you can do as below.

Have you tried REVIGO? It does not do PCA, but it lays out your over-represented GO terms in a semantic space. It offers several forms of representation, so check whether it might be interesting. The input is GO terms with p-values or q-values.

ADD COMMENTlink written 4.7 years ago by ivivek_ngs5.0k

Thanks for your reply. I added to the question description that there are 4 replicates per treatment, so finding DEGs and GO terms for each treatment would be viable. As I understand it, RankProd would be useful for meta-analysis of DEGs, but this is not what I want. I could reduce the dataset to the "common genes" shared by both platforms based on gene name, then run RankProd on conditions DEG.A, DEG.B... DEG.C and do dimension reduction (PCA) of the samples based on the gene ranks. I may try it, but I don't like reducing the gene list to "common genes", which still leaves some bias. REVIGO seems promising for analysis of a single dataset, such as GO term filtering, but I don't see how to extend it to compare different samples.
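A rough sketch of the "common genes plus ranks" idea, to show what the reduction involves. This is not the RankProd package's API: it only computes the geometric mean of within-platform ranks, with no permutation-based significance and no tie handling, and the gene names and fold-changes are invented.

```python
import math

# Hypothetical log2 fold-changes (treated vs. control) on each platform.
platform1 = {"TP53": 2.1, "MYC": 1.5, "EGFR": -0.3, "GAPDH": 0.1}
platform2 = {"TP53": 1.2, "MYC": 1.9, "GAPDH": -0.2, "BRCA1": 0.8}

def rank_descending(values):
    """Rank genes so the most upregulated gene gets rank 1 (ties not handled)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}

# Keep only genes measured on both platforms, matched by gene name.
common = sorted(set(platform1) & set(platform2))

# Rank within each platform separately, so the measurement scale drops out.
ranks = [
    rank_descending({gene: data[gene] for gene in common})
    for data in (platform1, platform2)
]

# Rank product: geometric mean of a gene's ranks across platforms.
rank_product = {
    gene: math.prod(r[gene] for r in ranks) ** (1 / len(ranks))
    for gene in common
}
for gene in sorted(rank_product, key=rank_product.get):
    print(gene, round(rank_product[gene], 3))
```

Ranking removes the scale difference between platforms, but EGFR and BRCA1 are silently dropped, which is exactly the bias from restricting to common genes.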

ADD REPLYlink written 4.7 years ago by gilagalad10

Yes, RankProd was designed for meta-analysis, where one wants to apply it to microarrays performed in different labs. You can use it on your samples, since they are microarrays from different platforms and so form a kind of meta-analysis, but the power of the tool might not be sufficient because you have a very small number of replicates. I believe that rather than gene lists you should consider array probes, since gene lists are skewed and more than one probe may be associated with a single gene. GO enrichment can also be performed on microarray probes; you might have to look at the tools and the kind of input they take. I am just concerned about how powerful the statistical method will be if you compare one sample against another coming from 2 different platforms. Usually it will not be, which is why we have tools that take cross-platform normalization into account.

ADD REPLYlink written 4.7 years ago by ivivek_ngs5.0k
2.7 years ago by
igor11k
United States
igor11k wrote:

I think GO-PCA may be a good answer here: https://gopca.readthedocs.io/en/latest/intro.html

GO-PCA is an unsupervised method to explore gene expression data using prior knowledge. Briefly, GO-PCA combines principal component analysis (PCA) with nonparametric GO enrichment analysis in order to define signatures, i.e., small sets of genes that are both strongly correlated and closely functionally related.

The expression profiles of all signatures generated can be conveniently visualized as a heat map. This visualization, referred to as the signature matrix, aims to provide a systematic and easily interpretable view of biologically relevant expression patterns in the data. Together with other GO-PCA visualizations, it can serve as a powerful starting point for exploratory data analysis and hypothesis generation.

ADD COMMENTlink written 2.7 years ago by igor11k

This seems to be so cool! Thanks a lot for the link

ADD REPLYlink modified 2.7 years ago • written 2.7 years ago by ivivek_ngs5.0k