Question

Microarray Dataset Formats: What Should My Program Know How To Read?

2

13.4 years ago

David Quigley 11k

I've written an application that biologists can use to analyze microarray data. Often the hardest part of using someone else's tool is getting the data into it in the first place. I've defined my own simple data format, but that may be a barrier to entry for non-technical users. I'm writing an "Import Dataset" function designed to read in expression data, sample data, and probe descriptions from various formats. My question:

What well-defined formats should my software know how to read?

Assume the expression data are normalized already (e.g. an NCBI GEO Matrix file) rather than CEL files or some other raw data. Text-based formats popularized by well-established tools that encapsulate both sample and probe data would be most helpful.

microarray format software • 3.4k views

ADD COMMENT • link updated 13.3 years ago by User 59 13k • written 13.4 years ago by David Quigley 11k

score 6 · Answer 1 · 2010-12-03

6

13.4 years ago

User 59 13k

Hmm. If the data is normalised already, chances are it could have been generated by any one of a number of packages, for any number of platforms. I'd argue that this is a harder case to cover than just taking in the original data and processing it yourself, especially for something like an Affy chip. Parsers for the probe-level and gene-level outputs from the Illumina platform should be straightforward, and are essential. GEO and ArrayExpress parsers too, though those are already implemented in Bioconductor etc. anyway. I don't know whether you're going to want, or need, converters for the various ID types; to be honest, I quite often get data with very little annotation information in it at all. Most people would like to be able to attach more in this case, and you can't rely on the underlying data source to have it.

I don't really know of any standard formats for describing the setup of the experiment, other than the phenodata-style tab-delimited descriptions used by Bioconductor packages.
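As a rough illustration of what reading such a phenodata-style sheet might involve, here is a minimal Python sketch. It assumes one row per sample, with the sample name in the first column and one column per covariate. The function name `read_sample_sheet` and the example column names are hypothetical, not taken from any package. It also tolerates the R `write.table` habit of leaving the row-name column out of the header line.

```python
import csv
import io


def read_sample_sheet(handle):
    """Read a tab-delimited sample sheet into {sample: {covariate: value}}.

    Assumed layout (hypothetical): first column is the sample name,
    remaining columns are covariates, first line is the header.
    """
    rows = [r for r in csv.reader(handle, delimiter="\t") if r]
    if not rows:
        return {}
    header, body = rows[0], rows[1:]
    # R's write.table omits the header cell for the row-name column,
    # so the header can be one field shorter than the data rows.
    if body and len(header) == len(body[0]) - 1:
        header = ["sample"] + header
    covariates = header[1:]
    return {r[0]: dict(zip(covariates, r[1:])) for r in body}


# Tiny made-up example; the values are illustrative only.
example = io.StringIO(
    "sample\ttreatment\ttimepoint\treplicate\n"
    "S1\ttreated\t0h\t1\n"
    "S2\tuntreated\t0h\t1\n"
)
sheet = read_sample_sheet(example)
print(sheet["S1"]["treatment"])  # treated
```

All values are kept as strings here; deciding which covariates are numeric or categorical is left to the caller.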

Let's face it: if you're dealing with biologists and their data, you'd need to write an import function for Excel files which could also read their minds as to what the contents of said file might be ;)

ADD COMMENT • link 13.4 years ago by User 59 13k

1

+1 for starting with non-normalized data because of better-defined file format standards

ADD REPLY • link 13.4 years ago by Michael Schubert ★ 7.1k

0

The software was written as a stand-alone executable and currently doesn't require R, so it doesn't have native access to Bioconductor. While I generally prefer to start from raw data, it's not always available, and sometimes for a quick look I am okay with using Matrix files from GEO. Mostly I need to describe the samples (e.g. "Mutant vs. WT", "Treated vs. Untreated") and know what platform was used.
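For the GEO case, a minimal Python sketch of a Series Matrix reader might look like the following. It relies on the usual layout of those files: tab-separated metadata lines starting with `!` (such as `!Sample_title` and `!Series_platform_id`), followed by an expression table between `!series_matrix_table_begin` and `!series_matrix_table_end`. The function name `parse_series_matrix` and the return structure are assumptions for illustration, and the example data is made up.

```python
import csv
import io


def parse_series_matrix(handle):
    """Split a GEO Series Matrix file into metadata, sample IDs, and values.

    Returns (metadata, samples, expression), where metadata maps a key
    such as "Sample_title" to a list of value lists (keys can repeat).
    """
    metadata, samples, expression = {}, [], {}
    state = "meta"
    # csv.reader strips the double quotes GEO puts around text fields.
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "":
            continue
        first = row[0]
        if first == "!series_matrix_table_begin":
            state = "header"
        elif first == "!series_matrix_table_end":
            state = "done"
        elif state == "header":
            samples = row[1:]  # first cell is "ID_REF"
            state = "table"
        elif state == "table":
            expression[first] = [to_float(v) for v in row[1:]]
        elif state == "meta" and first.startswith("!"):
            metadata.setdefault(first[1:], []).append(row[1:])
    return metadata, samples, expression


def to_float(text):
    """Convert a cell to float; treat anything unparseable as missing."""
    try:
        return float(text)
    except ValueError:
        return None


# Made-up miniature file for illustration.
EXAMPLE = (
    '!Series_platform_id\t"GPL570"\n'
    '!Sample_title\t"WT rep1"\t"Mutant rep1"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM1"\t"GSM2"\n'
    '"1007_s_at"\t7.5\t8.25\n'
    '"1053_at"\t\t6.0\n'
    "!series_matrix_table_end\n"
)
meta, samples, expr = parse_series_matrix(io.StringIO(EXAMPLE))
print(meta["Series_platform_id"], samples)
```

The sample titles and platform ID come straight from the metadata lines, which covers the "describe the samples and know the platform" need without R. Probe annotation would still have to come from the platform record separately.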

ADD REPLY • link 13.4 years ago by David Quigley 11k

0

Fair enough, sounds like quite an endeavour anyway! You don't always need to capture much more than treatment, timepoint and replicate information, at least not in my experience. I still don't think there's a standard format for this, but I like the way GeneSpring uses 'conditions' and 'interpretations' to capture and condense this data. You still end up typing it all in, though.

ADD REPLY • link 13.4 years ago by User 59 13k