Hello
I'm building an application for inferring gene regulatory networks (GRNs) from GEO datasets in SOFT format (not full SOFT). The method I'm using (described in http://www.ncbi.nlm.nih.gov/pubmed/22088843) only needs the numerical data and no deeper understanding of what the values represent, but I have run into some issues that need a closer look. I have rather little biological knowledge in this field, and I sometimes can't distinguish between things that are specific to the file format and common concepts (ones that everyone working in the field recognizes, and which are therefore not described in file format definitions like http://www.ncbi.nlm.nih.gov/geo/info/soft-seq.html). I work with a biologist who helps me but can't answer all of my questions, so I want to fill in the knowledge I need and cross-check what I already know against other sources.
1) Sometimes a SOFT file representing a GDS contains a very large number of data rows, even more than the number of genes in the organism in question. For example, GDS878 has over 70k data rows (about 39k unique IDENTIFIERs). There are even platforms with about a million rows (like GPL17585). Of course, there can be many entries for the same IDENTIFIER, control data, damaged data to be omitted, etc., but what exactly does a unique IDENTIFIER value stand for? Since the objects I'm interested in are genes, my question is: how can I find out whether several different IDENTIFIERs represent the same gene? Also, is there a naming convention that makes it possible to distinguish IDENTIFIERs that are gene names from the others?
2) While viewing some GDS SOFT files, I found some things that I suppose indicate damaged or control data. Those are:
- ID_REF value beginning with 'AFFX-'
- IDENTIFIER value '--Control'
- IDENTIFIER value 'NO INFORMATION'
- 'null' in a data row
- IDENTIFIER value beginning with 'NC_'
I've been told that these are control, damaged or otherwise unwanted data lines that need to be omitted when inferring a gene regulatory network. Is it true that they are of no use to me? Are there other such records that should be omitted?
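To make the filtering in question 2 concrete, this is a minimal sketch of what I have in mind, assuming the data table is tab-delimited with ID_REF and IDENTIFIER in the first two columns. The function names (`is_unwanted`, `filter_rows`) are my own, and the rules are just the five patterns listed above, not a confirmed list of what should be dropped.

```python
def is_unwanted(id_ref, identifier, values):
    """Return True if the row matches one of the patterns from question 2.

    These rules are assumptions taken from the list above, not an
    authoritative definition of control or damaged rows.
    """
    if id_ref.startswith("AFFX-"):
        return True
    if identifier in ("--Control", "NO INFORMATION"):
        return True
    if identifier.startswith("NC_"):
        return True
    # Any 'null' in the row means at least one sample has no value.
    if any(v.strip().lower() == "null" for v in values):
        return True
    return False


def filter_rows(lines):
    """Split tab-delimited table lines into a header and the kept rows.

    Each kept row is (id_ref, identifier, [float values]).
    """
    header = None
    kept = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
            continue
        if len(fields) < 3:
            continue  # not a data row
        id_ref, identifier, values = fields[0], fields[1], fields[2:]
        if is_unwanted(id_ref, identifier, values):
            continue
        kept.append((id_ref, identifier, [float(v) for v in values]))
    return header, kept
```

Dropping a whole row because of a single 'null' is the strictest choice; keeping the row and handling the missing value later would be an alternative.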
3) For each subset, should I create a separate GRN, using only the values that correspond to samples belonging to that subset? (In the example for issue #4, this would mean creating two GRNs, one using data from the first two columns and the second using data from the last two columns.)
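For reference, this is roughly how I map subsets to table columns, as a minimal sketch that assumes the `^SUBSET` / `!subset_...` layout shown in the example for issue #4. The helper names (`parse_subsets`, `subset_columns`) are hypothetical, and I have not checked this against every GDS file.

```python
def parse_subsets(lines):
    """Collect the sample IDs of each ^SUBSET block.

    Returns a dict: subset name -> {"description": str, "samples": [str]}.
    """
    subsets = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("^"):
            # Any new entity line ends the previous subset block.
            current = None
            if line.startswith("^SUBSET") and "=" in line:
                current = line.split("=", 1)[1].strip()
                subsets[current] = {"description": "", "samples": []}
        elif current is not None and line.startswith("!") and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "!subset_description":
                subsets[current]["description"] = value
            elif key == "!subset_sample_id":
                subsets[current]["samples"] = [
                    s.strip() for s in value.split(",") if s.strip()
                ]
    return subsets


def subset_columns(header, samples):
    """Return the column indices in the table header for the given samples."""
    return [header.index(sample) for sample in samples]
```

With the GDS1934 example, `subset_columns` gives columns 2 and 3 for the wild type subset and columns 4 and 5 for the TCR transgenic subset (counting ID_REF as column 0).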
4) When more than one data row has the same IDENTIFIER value, does it mean that they represent several microarray spots detecting the same DNA strand? And what should I do with this data, given that I use only one row of data per IDENTIFIER? I've been told to choose the row with the highest mean of values, but I don't know how to treat columns belonging to samples from different subsets. Consider this example:
(...)
^SUBSET = GDS1934_1
!subset_dataset_id = GDS1934
!subset_description = wild type
!subset_sample_id = GSM89493,GSM89494
!subset_type = genotype/variation
^SUBSET = GDS1934_2
!subset_dataset_id = GDS1934
!subset_description = TCR transgenic
!subset_sample_id = GSM89495,GSM89496
!subset_type = genotype/variation
(...)
ID_REF IDENTIFIER GSM89493 GSM89494 GSM89495 GSM89496
(...)
104824_at 2210011C24Rik -237.500 -179.300 -494.900 -156.900
104825_g_at 2210011C24Rik 327.100 25.900 183.100 948.900
104826_at 2210011C24Rik 1570.100 1057.900 2031.600 2683.800
(...)
Should I compute the mean over all values in the row, or for the two subsets separately?
Also, should I use absolute values of the values in the data table? If so, when I compute the mean, should I take absolute values before averaging, or take the absolute value of the mean?
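To make the two options in question 4 concrete, here is a minimal sketch of the "keep the row with the highest mean" rule as I understand it. The function name `collapse_by_max_mean` and its `columns` parameter are my own, and the sketch uses the raw values, without absolute values, since that is exactly the part I'm unsure about.

```python
from statistics import mean


def collapse_by_max_mean(rows, columns=None):
    """Keep one row per IDENTIFIER: the one with the highest mean.

    rows    -- list of (id_ref, identifier, values) tuples
    columns -- indices into `values` used for the mean; None means all
               samples, a subset's indices mean "this subset only"
    Returns a dict: identifier -> (id_ref, values).
    """
    best = {}
    for id_ref, identifier, values in rows:
        used = values if columns is None else [values[i] for i in columns]
        score = mean(used)
        if identifier not in best or score > best[identifier][0]:
            best[identifier] = (score, id_ref, values)
    return {ident: (id_ref, values) for ident, (_, id_ref, values) in best.items()}


# The three rows from the GDS1934 example above.
rows = [
    ("104824_at", "2210011C24Rik", [-237.5, -179.3, -494.9, -156.9]),
    ("104825_g_at", "2210011C24Rik", [327.1, 25.9, 183.1, 948.9]),
    ("104826_at", "2210011C24Rik", [1570.1, 1057.9, 2031.6, 2683.8]),
]

overall = collapse_by_max_mean(rows)                    # mean over all four samples
wild_type = collapse_by_max_mean(rows, columns=[0, 1])  # first subset only
```

For these three rows both variants pick 104826_at, but in general the per-subset choice could select a different row for each subset, which is why I'm asking which one is appropriate.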
With regard to question #1, not all data in GEO represent gene expression. There are also copy number, methylation, ChIP-seq data, etc.