Question

Validity Of Gene List Comparisons For Lists Originating From Different Platforms?

3

12.2 years ago

Adamc ▴ 680

I'm trying to assess the significance of the intersection (Venn overlap) between two gene lists. I often do this with lists from differential expression analyses that originate from the same microarray platform, but recently I've been using lists from different platforms (and also different species). Applying the hypergeometric test to calculate the probability of an observed overlap makes sense to me when the intersecting lists come from the same platform, but can it also be used when the source platforms are different?

Similar questions have been asked before, such as http://biostar.stackexchange.com/questions/15594/probability-of-gene-list-overlap and this one at Cross Validated, but I didn't feel they fully addressed this. Then again, I'm still having trouble getting into the way statisticians think.

It hadn't previously occurred to me that the significance of a list overlap could be an issue when the lists had already been statistically filtered using reasonable cutoffs, so now I'd like to modify a web service we have that generates list intersections so that it also calculates these statistics.

microarray statistics • 3.1k views

updated 12.1 years ago by Hanif Khalak ★ 1.3k • written 12.2 years ago by Adamc ▴ 680

0

This doesn't warrant a full answer, but are you sure the gene nomenclatures are the same, i.e. are "all" the gene symbols from one platform represented on the other and vice versa? Sometimes different versions of, say, RefSeq have different symbols for the same gene. This could really be a problem when comparing between species. Just a thought.

12.2 years ago by Ian 6.0k

score 3 · Answer 1 · 2012-01-26

I think the key thing you'll need to pay attention to is how to define the background gene list. For example, if Gene List 1 contains Gene X, but Gene X is not represented on the platform of Gene List 2, then it doesn't make much sense to consider it in the enrichment. So as a starting point, I'd first take the intersection of the genes on the platform as the background, and discard the data specific only to one array or the other.

By similar rationale, you should only consider genes in the analysis whose reporters had a reasonable shot of being differentially expressed. For example, if you have a dead probe set on one or the other array, it similarly should be discarded. How to define "reasonable shot"? One strategy we frequently use is to simply consider the genes that were detectably expressed on the platform (independent of whether they were differentially expressed among your comparison groups). Not perfect, but it's much better than using the entire genome's worth of genes.
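To make the two points above concrete, here is a minimal sketch of the overlap test with a restricted background. It is not anyone's production code: the function name `overlap_pvalue` and its arguments are made up for illustration, and it assumes gene identifiers have already been mapped to a common namespace across platforms (and species). The background is the set of genes present on both platforms and, optionally, detectably expressed on both; the p-value is the usual one-sided hypergeometric tail.

```python
from math import comb


def overlap_pvalue(list_a, list_b, platform_a, platform_b,
                   expressed_a=None, expressed_b=None):
    """One-sided hypergeometric p-value for the overlap of two gene lists.

    All names here are illustrative. Identifiers are assumed to be
    comparable between the two platforms already.
    Returns (overlap size, background size, p-value).
    """
    # Background: genes represented on BOTH platforms ...
    background = set(platform_a) & set(platform_b)
    # ... optionally restricted to genes detectably expressed on both.
    if expressed_a is not None:
        background &= set(expressed_a)
    if expressed_b is not None:
        background &= set(expressed_b)

    # Discard list members that fall outside the shared background.
    a = set(list_a) & background
    b = set(list_b) & background
    N, K, n, k = len(background), len(a), len(b), len(a & b)

    # P(X >= k): draw n genes without replacement from N, of which K are in list A.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, N, tail / comb(N, n)
```

Calling it once with and once without the `expressed_*` arguments shows how much the choice of background moves the p-value; a smaller background generally makes the same overlap look less surprising.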

Finally, you of course might want to think about where you set your thresholds to define your gene lists. Different platforms can have different dynamic ranges, so 2-fold in one might be equivalent to 5-fold in another. One strategy for dealing with this might be to use moving thresholds on both platforms and let the data tell you what thresholds to use to obtain maximum overlap. There is a danger of overfitting here, so use caution, but it's the same basic idea as in GSEA.
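A rough sketch of the moving-threshold idea, under stated assumptions: `fc_a` and `fc_b` are hypothetical dicts mapping gene to absolute log fold change on each platform, the cutoff grids are chosen by the user, and "best" here means the smallest hypergeometric p-value rather than the largest raw overlap. Because the best pair is picked after looking at the data, the winning p-value is optimistic; something like a permutation of gene labels would be needed to calibrate it.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n items from N, of which K are hits."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)


def scan_thresholds(fc_a, fc_b, cutoffs_a, cutoffs_b):
    """Try every pair of cutoffs; return (p, cutoff_a, cutoff_b, overlap) for the best pair.

    Illustrative only. Returns None if no pair yields two non-empty lists.
    """
    # Only genes measured on both platforms enter the background.
    background = set(fc_a) & set(fc_b)
    N = len(background)
    best = None
    for ca in cutoffs_a:
        hits_a = {g for g in background if fc_a[g] >= ca}
        for cb in cutoffs_b:
            hits_b = {g for g in background if fc_b[g] >= cb}
            if not hits_a or not hits_b:
                continue  # an empty list carries no information
            k = len(hits_a & hits_b)
            p = hypergeom_tail(N, len(hits_a), len(hits_b), k)
            if best is None or p < best[0]:
                best = (p, ca, cb, k)
    return best
```

The returned p-value is best treated as a ranking score for choosing cutoffs, not as evidence of significance in itself.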

Interesting question. Hope these thoughts help...

Ram · Answer 2 · 2012-01-26

Multivariate vs. Meta Analysis

Using gene list overlap to assess repeatability with "replicates" across platforms doesn't strike me as the best way to go about it. A lack of significant overlap between the gene lists means the "replicates" are not comparable, which raises doubts about any meta-analysis using the data you have. This usually becomes a source of confusion: which dataset (gene list) should I use? Should I do another "replicate"?

A more systematic [computational] experimental design would be to code the platform as a factor ("Affy", "Agilent") and, as Andrew Su suggests, only include genes (collapsing multiple probes per gene) that are represented on both platforms. Then, multivariate regression using limma will give you a single robust gene list, but still allow you to compare platforms. I realize this procedure may be less sensitive, but it is likely more specific, which serves the goal of removing doubt about differences between gene lists.

Alternatively, perform gene-set enrichment analysis on each gene list and compare THOSE results, which is probably more robust and interpretable.