So I'm currently interested in playing with inferring gene regulatory networks from microarray data. I've downloaded a longitudinal study from GEO and I'm playing with various available BioC packages out there. I was wondering what kind of pre-processing people use/recommend for this kind of task. I've done lots of differential expression kind of things, but have no experience with this.
Is there a different set of protocols / methods for this? Which normalisation method would people recommend?
Will kind of answered this a while ago, but I was wondering if anyone had any more suggestions. For example, in this paper (biomedcentral.com/1752-0509/1/37) they remove probes with a maximum intensity value < 5. All of these thresholds seem arbitrary to me and a bit rubbish. Any more comments / ideas gratefully received. I'm not sure of the proper etiquette for re-opening a question.
Your comment on minimum gene expression cut-offs bears some further thought. Weakly expressed probes on any microarray platform are less reliable overall than strongly expressed probes. These probes tend to show much greater variance and are less likely, overall, to provide true signal. It is desirable to reduce the influence of probes that have a low prior probability of being informative, particularly when using a statistical framework that penalizes you for the number of tests you perform. A separate but interlinked issue is that many inference techniques (e.g. Bayesian network techniques) become incredibly expensive computationally as the number of nodes becomes non-trivial.
Although some normalization methods provided by the manufacturers give you a "present/absent" call, with RMA you do not get this sort of call. As a practical solution, I often look at the normalized expression level of the negative controls for guidance as to what is likely to be an uninformative probe. These probes may represent genes that are truly expressed, genes you'd see with a more sensitive technique such as RT-PCR, but with the microarray you have to be pragmatic and rule them out because they mostly 1) contribute to false positives that cannot be replicated and 2) in bulk, greatly reduce your statistical power. I often use a simple heuristic that rules in a probe where either its mean expression across all samples is above X, or some minimum number of samples is above a higher value Y. This avoids excluding probes where most samples are at background but a few samples have high expression, as these may be biologically very interesting. The outcome may be an arbitrary-looking threshold, but in a pragmatic field this is not a disqualifying feature so long as the threshold is sensible and the authors communicate how it was arrived at.
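To make the heuristic concrete, here is a minimal sketch in plain Python. It is only an illustration: the function names, the probe names, the toy expression values and the cut-offs (`mean_cutoff` for X, `high_cutoff` for Y, `min_high_samples` for "some number of samples") are all made up for the example, and in practice you would pick X and Y by looking at your own negative controls.

```python
from statistics import mean


def keep_probe(values, mean_cutoff, high_cutoff, min_high_samples=3):
    """Return True if a probe passes the expression filter.

    values: normalized (e.g. log2) expression of ONE probe across all samples.
    mean_cutoff: X, the threshold on the probe's mean across samples.
    high_cutoff: Y (> X), the stricter per-sample threshold.
    min_high_samples: how many samples must exceed Y to rescue the probe.
    """
    # Rule 1: the probe is expressed on average.
    if mean(values) > mean_cutoff:
        return True
    # Rule 2: mostly at background, but clearly on in a few samples.
    n_high = sum(v > high_cutoff for v in values)
    return n_high >= min_high_samples


def filter_probes(expr, mean_cutoff, high_cutoff, min_high_samples=3):
    """Keep only the probes (dict: probe name -> list of values) that pass."""
    return {
        probe: values
        for probe, values in expr.items()
        if keep_probe(values, mean_cutoff, high_cutoff, min_high_samples)
    }


# Toy data, invented for illustration: three probes measured in eight samples.
expr = {
    "probe_background": [3.1, 3.0, 3.2, 2.9, 3.1, 3.0, 3.2, 3.1],
    "probe_expressed": [7.9, 8.1, 8.0, 8.3, 7.8, 8.2, 8.0, 8.1],
    "probe_spiky": [3.0, 3.1, 9.2, 9.5, 9.1, 3.2, 3.0, 2.9],
}

# Hypothetical cut-offs; not recommendations.
kept = filter_probes(expr, mean_cutoff=6.0, high_cutoff=8.0, min_high_samples=3)
print(sorted(kept))
```

Here `probe_spiky` has a mean below X but is kept by the second rule, which is exactly the case a plain mean (or a too-strict maximum) filter would handle badly.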