Question

Understanding Microarray Datasets

13.3 years ago

Linda ▴ 150

Hello,

Could anyone point me to a resource for understanding microarray datasets? Typically, when I look at publicly available datasets, I find tab-delimited files of expression data. In some cases the files seem to report fold changes in expression levels. In others, though, that does not make sense, since most of the numbers are in the thousands. How does one convert those values to fold changes?

microarray dataset format • 8.1k views

updated 13.3 years ago by boczniak767 ▴ 850 • written 13.3 years ago by Linda ▴ 150

score 13 · Answer 1 · 2010-12-22

Linda, if you're looking at datasets on ArrayExpress or GEO, then the FAQs and the information associated with the datasets should be enough. However, I think you are making an incorrect assumption about what the values might be.

Typically an array dataset (let's stick with single-colour arrays for now) will report the unique probe ID (generally a gene, or a region of a gene), an expression value (intensity level), and perhaps a detection p-value, followed by some annotation information.

What you're most likely looking at is the expression value. This is not necessarily a fold change. A fold change is a ratio calculated between 2 samples, so in a single-colour experiment it has no meaning for a single chip. Whilst it is certainly possible to see fold changes in the 1000s when you compare 2 chips, if the majority of the data is in that range it is likely to be the un-normalised (raw) expression value.
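To make that layout concrete, here is a minimal Python sketch that reads such a table. The column names (`ID_REF`, `VALUE`, `DETECTION_PVAL`), the probe IDs, the numbers and the 0.05 cut-off are all invented for illustration; real files vary by platform and submitter, so check the header of the file you actually have.

```python
import csv
import io

# Hypothetical single-colour table: probe ID, intensity, detection p-value.
EXAMPLE = (
    "ID_REF\tVALUE\tDETECTION_PVAL\n"
    "probe_1\t5230.4\t0.001\n"
    "probe_2\t87.1\t0.40\n"
    "probe_3\t1912.0\t0.01\n"
)


def read_expression_table(handle, max_pval=0.05):
    """Return {probe_id: intensity} for probes whose detection p-value is <= max_pval."""
    detected = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["DETECTION_PVAL"]) <= max_pval:
            detected[row["ID_REF"]] = float(row["VALUE"])
    return detected


table = read_expression_table(io.StringIO(EXAMPLE))
print(table)
```

The values kept here are plain intensities for one chip; nothing in this table is a fold change yet.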

With two-colour data, a single chip is hybridised with 2 samples, and the ratio of the intensity between these samples is often reported - effectively a fold change.
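As a rough sketch of that ratio, assuming the two channels are simply called `red` and `green` (which sample sits in which channel depends on the experiment design, so treat the names as placeholders), the per-spot log2 ratio can be computed like this:

```python
import math


def log2_ratio(red, green):
    """log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


print(log2_ratio(4000.0, 1000.0))  # 2.0, i.e. four-fold higher in the red channel
print(log2_ratio(500.0, 1000.0))   # -1.0, i.e. two-fold lower in the red channel
```

Reporting the ratio on a log2 scale makes up- and down-regulation symmetric around zero, which is why many two-colour files contain small positive and negative numbers rather than intensities in the thousands.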

So you need to ascertain what the platform is (two-colour or single-colour) and what is being reported. This is further confused by the fact that the data can be reported raw or normalised. It is good practice to deposit raw data values rather than normalised ones, to allow researchers to apply their own normalisation should they wish to re-analyse the data.

To convert expression values into fold changes, you need to bring the raw data into an analysis package, normalise it, annotate the samples with their relevant phenotypic information (drug treated, control etc.) and then compare two treatments to generate a fold change. Be aware, though, that fold change on its own is not always a good discriminator for differential gene expression. Something like a volcano plot, which plots a t-test p-value against fold change for every probe, is often more useful.

If the dataset in question is available online and you provide a link to it, we should be able to clarify the situation further.
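The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intensities, sample names and group labels are made up, and per-sample median centring on the log2 scale stands in for whatever normalisation an analysis package would really apply (it is not what any particular package does). The p-value half of a volcano plot is left out; it would need a per-probe t-test from a statistics library such as SciPy's `scipy.stats.ttest_ind`.

```python
import math
import statistics

probes = ["p1", "p2", "p3", "p4", "p5"]

# Hypothetical raw intensities, one list per sample, in probe order.
samples = {
    "control_1": [100.0, 200.0, 400.0, 800.0, 1600.0],
    "control_2": [200.0, 400.0, 800.0, 1600.0, 3200.0],   # same profile, brighter scan
    "treated_1": [100.0, 200.0, 400.0, 800.0, 6400.0],
    "treated_2": [200.0, 400.0, 800.0, 1600.0, 12800.0],  # same profile, brighter scan
}

# Phenotype annotation: which samples belong to which treatment.
groups = {
    "control": ["control_1", "control_2"],
    "treated": ["treated_1", "treated_2"],
}


def median_centre(values):
    """Log2-transform one sample and subtract its median (a deliberately simple normalisation)."""
    logged = [math.log2(v) for v in values]
    median = statistics.median(logged)
    return [x - median for x in logged]


def log2_fold_changes(samples, groups, probes, case="treated", reference="control"):
    """Return {probe: mean(case) - mean(reference)} on the normalised log2 scale."""
    normalised = {name: median_centre(vals) for name, vals in samples.items()}
    result = {}
    for i, probe in enumerate(probes):
        case_mean = statistics.mean([normalised[s][i] for s in groups[case]])
        ref_mean = statistics.mean([normalised[s][i] for s in groups[reference]])
        result[probe] = case_mean - ref_mean
    return result


fold_changes = log2_fold_changes(samples, groups, probes)
for probe in probes:
    print(probe, round(fold_changes[probe], 3))
```

In this toy data only `p5` differs between the groups, with a log2 fold change of about 2 (roughly four-fold), even though the raw intensities of the brighter scans are twice those of the dimmer ones. That is the point of normalising before comparing.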

score 13 · Answer 2 · 2010-12-22

I had the exact same confusion three years ago when I started working with microarrays. Here is a list of things that I wish somebody had told me back then:

DNA Microarray Virtual Lab shows you in a fun way what a microarray experiment is and how it is done.
Wikipedia can also be helpful: the Gene Expression page and the Expression Profiling page are two examples.
My background is in CS, so at the beginning reading and understanding material like the two Wikipedia pages above was impossible for me. So I started learning the basics of biology. Learn Genetics, made by the University of Utah, is a great place to start. Khan Academy on YouTube is aimed at high school kids but is still very helpful, for example this one. And if you are a programmer then you will like DNA seen through the eyes of a coder.
After learning the basics about microarrays you can start playing around with them. The best place to start, in my opinion, is Gene Expression Omnibus (GEO). Read their documentation and FAQ briefly. Then take a look at the DataSet Browser. They have a Data Analysis tool that is very useful for doing some simple statistical tests and getting a feel for the data.
There is a lot of software and many online tools for working with gene expression microarray data. To list a couple of them: GEPAS, MIADAW, Expression Profiler, Array Track, some tools developed in the Eisen lab, and a bunch more at Gene Ontology and Stanford Microarray Database. To be honest, I have never used any of them, as I prefer to use programming tools. Any comments here from other people?
And finally R and Bioconductor, two great programming tools for doing almost anything you want. Bioconductor has a workflow for working with microarrays, yet I still believe the best resource here is a book called Bioconductor Case Studies (Use R), published by Springer; you can read it online here. I cannot emphasize enough how good this book is.

If I remember something else I will update this post.

Hope this helps and good luck.

score 3 · Answer 3 · 2011-11-29

Hi all, my advice is that each experiment's data type (as well as its other characteristics) must be interpreted independently. Even if the experiment description conforms to MIAME guidelines, the attached "raw" data files (in some cases just tables or txt files) sometimes don't have strictly defined columns (dye used, sample, etc.), so in these cases, without help from the submitter, it is unreliable to say whether the data is raw or log2, whether it is background corrected, and so on. We can guess whether data is log2 transformed just by looking at the range, but it's preferable to get this information from the person responsible for the experiment. I have also had bad experiences with the Experiment Design Images provided on the ArrayExpress website: for big experiments they are really messy.

Of course, sometimes "raw data" really is raw, i.e. gpr or CEL files acquired after scanning. Sometimes the submitter is even kind enough to send TIFFs on request.
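The range check mentioned in this answer can be sketched as below. It is only a rule of thumb, resting on the assumption that intensities come from a typical 16-bit scanner (maximum 65535, so log2 values stay at or below about 16); the `ceiling` of 30 is an arbitrary illustrative threshold, not a standard, and the submitter's own description should win whenever it is available.

```python
def looks_log2_transformed(values, ceiling=30.0):
    """Guess whether intensities are already on a log2 scale.

    Heuristic only: log2 of a 16-bit intensity cannot exceed 16, so any
    value above `ceiling` suggests the data is still on the raw scale.
    """
    finite = [v for v in values if v == v]  # drop NaN (NaN is never equal to itself)
    if not finite:
        raise ValueError("no values to inspect")
    return max(finite) <= ceiling


print(looks_log2_transformed([6.2, 8.9, 13.1]))         # True
print(looks_log2_transformed([85.0, 1200.0, 43000.0]))  # False
```

A file that fails this check, with most values in the thousands, matches the raw or un-logged intensities described in the question.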