What are the conventions/how to tell what a GSM has had done to it?
14 months ago
James • 0

I'm struggling to understand how to tell what a GSM of methylation data has had done to it. How can I determine, in a concrete manner, whether the methylation values have been normalized in some manner, and if they have been, how can I tell exactly what has been done? Is it the case that:

1. GSM files should always be raw data.
2. You have whatever information is shared on the GEO accession page and nothing else, leaving you to guess what 'normalized beta values' actually means.
3. There is a programmatic way to tell exactly what has been done from the GSM file itself.

I'm a bit at sea, because I want to compare several datasets to reproduce another scientist's experimental findings (for verification), but it seems to me that the information I would obtain from downloading only the GSE/GSM files is ambiguous and would therefore confound any cross-study analysis. For example, the pData of the sample I have open in front of me has a column labeled "data_processing" which simply contains the word 'minfi', and this is the only indication I can see of what kind of normalization the samples have undergone.

GSE GSM methylation minfi GEOquery • 461 views
14 months ago

Hi James,

Basically, there is no guarantee that the data will be normalised. The authors may have described their data processing steps, in which case that description is shown on each sample's page, which you reach by clicking the GSM ID on the main accession page.
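If clicking through every sample is impractical, the same description can be read from the SOFT-format text that GEO provides, where it appears as `!Sample_data_processing` lines under each `^SAMPLE` entry. The following is only a minimal sketch: the helper name `sample_fields` and the example excerpt (including its accessions) are made up for illustration, and it can only report what the submitters chose to write.

```python
from collections import defaultdict


def sample_fields(soft_text, field):
    """Map each GSM accession to the values of its '!Sample_<field>' lines."""
    wanted = "!Sample_" + field
    values = defaultdict(list)
    current = None  # accession of the ^SAMPLE block being read, if any
    for line in soft_text.splitlines():
        if line.startswith("^"):
            # A new entity begins; only SAMPLE entities are of interest here.
            kind, _, accession = line.partition("=")
            current = accession.strip() if kind.strip() == "^SAMPLE" else None
        elif current is not None and line.startswith("!"):
            key, _, value = line.partition("=")
            if key.strip() == wanted:
                values[current].append(value.strip())
    return dict(values)


# Invented excerpt for illustration only; not taken from a real GEO record.
example = """\
^SERIES = GSE0000000
!Series_title = Hypothetical methylation study
^SAMPLE = GSM0000001
!Sample_data_processing = minfi
^SAMPLE = GSM0000002
!Sample_data_processing = Raw IDATs read with minfi
!Sample_data_processing = Functional normalization, then beta values
"""

processing = sample_fields(example, "data_processing")
for gsm, steps in processing.items():
    print(gsm, steps)
```

A sample whose only entry is a single word such as 'minfi' tells you which package was used but not which normalization, so in that case the description alone does not settle the question.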

You can also take guidance from the automatically-generated code via GEO2R.

Kevin

12 months ago
Marc ▴ 10

Authors upload their own data to NIH GEO to create GSM / GSE datasets, and there is no real quality control on what they submit, so there's no guarantee that the raw data is actually unmodified.

Because I wanted to compare data sets in GEO without fussing with all the format variance, I added a function called beta bake to methylprep (which I maintain); it is available from the command line interface (CLI). It should help you retrieve more data sets that are posted in nonstandard formats, and it will warn you when there's no raw data posted.
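To illustrate the "no raw data posted" check in isolation, here is a rough sketch that looks for IDAT files among a sample's supplementary file names. This is not how methylprep does it; the function names and the file names are hypothetical, and it assumes raw Illumina array data would be posted as `.idat` or `.idat.gz` files.

```python
def raw_idat_files(supplementary_files):
    """Return the entries that look like IDAT files (optionally gzipped)."""
    suffixes = (".idat", ".idat.gz")
    return [name for name in supplementary_files if name.lower().endswith(suffixes)]


def describe_raw_data(accession, supplementary_files):
    """Build a one-line message saying whether raw IDATs appear to be posted."""
    idats = raw_idat_files(supplementary_files)
    if idats:
        return f"{accession}: {len(idats)} IDAT file(s) listed"
    return f"{accession}: WARNING, no IDAT files listed; values may already be processed"


# Hypothetical accessions and file names, for illustration only.
with_raw = ["GSM0000001_R01C01_Grn.idat.gz", "GSM0000001_R01C01_Red.idat.gz"]
without_raw = ["NONE"]

print(describe_raw_data("GSM0000001", with_raw))
print(describe_raw_data("GSM0000002", without_raw))
```

Finding IDATs only tells you that raw intensities are available to reprocess yourself; it says nothing about how any accompanying beta values were produced.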