Hi, I'm looking through Affymetrix array data on GEO and using NetAffx to determine which probes I should search for, based on the platform each dataset claims to use (for example, [HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array or [HG-U133A] Affymetrix Human Genome U133A Array). What I'm finding, though, is that the data in the SOFT files from GEO are often missing a large number of probe IDs that NetAffx says they should have, especially for things like lncRNAs. Not only that, but GEO datasets using the same platform are often missing probe IDs that the other has. Are there different versions of the same type of array, and would that be described somewhere in GEO where I could check? Or is it just standard practice to truncate GEO datasets in some way?
I'd consider just looking at the raw data (CEL files) when possible and going from there. I wouldn't necessarily trust the already-normalized data uploaded by the submitters, because it's really at their discretion what they did to it. We've done a ton of curation work on our OncoLand (TCGA and more) (http://www.omicsoft.com/oncoland-service) and ImmunoLand (http://www.omicsoft.com/immunoland), which pull heavily from GEO and ArrayExpress, and we found it was best to just go back to the unnormalized Affymetrix data when possible.
> Are there different versions of the same type of array, and would it be described in GEO somewhere I could check?
Yes, the designs you mention have different numbers of probes, and represent different generations of the HG-U133 family of designs. The "A" array was one half of a pair (A and B) and the _Plus_2 more or less combined the A and B designs into a single array. GEO Array platforms are described here:
For each platform the full list of probe IDs is provided.
HG-U133_Plus_2 here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL570 (54675 probes)
HG-U133A 2.0 here: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL571 (22277 probes)
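
If you want to check which probe IDs a given dataset is missing relative to its platform, a set difference between the two ID lists is enough. Below is a minimal sketch in Python, assuming you have a SOFT-formatted file whose tables are delimited by the `!platform_table_begin` / `!platform_table_end` and `!sample_table_begin` / `!sample_table_end` marker lines. The `table_ids` helper and the toy records (`GPL_TOY`, `GSM_TOY`) are made up for illustration and are not part of any GEO tool; with real data you would pass in the lines of the downloaded file instead.

```python
def table_ids(lines, begin_marker, end_marker):
    """Return the set of first-column IDs found in every table that is
    delimited by begin_marker / end_marker (union over all such tables)."""
    ids = set()
    in_table = False
    header_seen = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(begin_marker):
            in_table, header_seen = True, False
            continue
        if line.startswith(end_marker):
            in_table = False
            continue
        if not in_table:
            continue
        if not header_seen:
            # First row inside a table is the column header (ID or ID_REF).
            header_seen = True
            continue
        if line:
            ids.add(line.split("\t")[0])
    return ids


# Toy SOFT-like content (hypothetical accessions and values, three probes only).
toy_soft = (
    "^PLATFORM = GPL_TOY\n"
    "!platform_table_begin\n"
    "ID\tDESCRIPTION\n"
    "1007_s_at\ttoy\n"
    "1053_at\ttoy\n"
    "117_at\ttoy\n"
    "!platform_table_end\n"
    "^SAMPLE = GSM_TOY\n"
    "!sample_table_begin\n"
    "ID_REF\tVALUE\n"
    "1007_s_at\t8.1\n"
    "117_at\t5.3\n"
    "!sample_table_end\n"
).splitlines()

platform_ids = table_ids(toy_soft, "!platform_table_begin", "!platform_table_end")
sample_ids = table_ids(toy_soft, "!sample_table_begin", "!sample_table_end")

# Probes the platform defines but the submitted data table does not contain.
missing = sorted(platform_ids - sample_ids)
print(missing)  # ['1053_at']
```

If the missing list is large for a processed dataset, that would be consistent with the submitter having filtered probes before upload, which you can confirm by going back to the CEL files.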