Question

Using Geo Soft File To Infer Gene Regulatory Networks

0

10.6 years ago

piotr.smietana • 0

Hello

I'm building an application that infers gene regulatory networks from GEO datasets in (not full) SOFT format. The method I'm using (described in http://www.ncbi.nlm.nih.gov/pubmed/22088843) only needs the numerical data, without a deeper understanding of what they represent, but I have run into some issues that need a closer look. I have rather little biological knowledge in this field, and I sometimes can't distinguish between records that are specific to the file format and common concepts (which everyone interested in the field recognizes and which are therefore not described in file format definitions such as http://www.ncbi.nlm.nih.gov/geo/info/soft-seq.html). I work with a biologist who helps me but can't answer all of my questions, so I want to fill in the knowledge I need and check what I already know against other sources.

1) Sometimes the SOFT file representing a GDS contains a very large number of data rows, even exceeding the number of genes in the organism in question. For example, GDS878 has over 70k data rows (about 39k unique IDENTIFIERs). There are even platforms with about a million rows (like GPL17585). Of course, there can be many entries for the same IDENTIFIER, control data, damaged data to be omitted, etc., but what exactly does a unique IDENTIFIER value stand for? As my objects of interest are genes, how can I find out whether several different IDENTIFIERs represent the same gene? Also, is there a naming convention that distinguishes IDENTIFIERs that are gene names from the others?

2) While viewing some GDS SOFT files, I found some things that I suppose mean damaged or control data. Those are:
- ID_REF value beginning with 'AFFX-'
- IDENTIFIER value '--Control'
- IDENTIFIER value 'NO INFORMATION'
- 'null' in a data row
- IDENTIFIER value beginning with 'NC_'

I've been told that those are control, damaged or otherwise unwanted data lines that need to be omitted when inferring a gene regulatory network. Is it true that they're of no use to me? Are there other such records that should be omitted?

3) For each subset, should I create a separate GRN, using only the values that correspond to samples belonging to that subset? (In the example for issue #4, this would mean creating two GRNs, one using data from the first two columns, the second using data from the last two columns.)

4) When more than one data row has the same IDENTIFIER value, does it mean that they represent several microarray spots detecting the same DNA strand? And what should I do with this data, as I use only one row of data per IDENTIFIER? I've been told to choose the row with the largest mean of values, but I don't know how to treat columns belonging to samples from different subsets. Consider this example:

(...)
^SUBSET = GDS1934_1
!subset_dataset_id = GDS1934
!subset_description = wild type
!subset_sample_id = GSM89493,GSM89494
!subset_type = genotype/variation
^SUBSET = GDS1934_2
!subset_dataset_id = GDS1934
!subset_description = TCR transgenic
!subset_sample_id = GSM89495,GSM89496
!subset_type = genotype/variation
(...)    
ID_REF    IDENTIFIER    GSM89493    GSM89494    GSM89495    GSM89496
(...)
104824_at    2210011C24Rik    -237.500    -179.300    -494.900    -156.900
104825_g_at    2210011C24Rik    327.100    25.900    183.100    948.900
104826_at    2210011C24Rik    1570.100    1057.900    2031.600    2683.800
(...)

Should I compute the mean from all values in the row, or for the two subsets separately?

Also, should I use the absolute values of the values in the data table? If so, when I compute the mean, should I take absolute values before computing it, or take the absolute value of the mean?

geo expression • 3.5k views

updated 10.6 years ago by Devon Ryan 104k • written 10.6 years ago by piotr.smietana • 0

1

With regard to question #1, not all data in GEO represent gene expression. There is also copy number, methylation, ChIP-seq, etc.

10.6 years ago by Sean Davis 26k

score 2 · Answer 1 · 2013-09-20

1) They can stand for anything. They needn't be unique, and if you're using a file where the identifier is a transcript ID (I'm sure they're out there) and you're interested in genes, then you'll get wrong results. It's probably easiest to just use the annotation packages in Bioconductor. BTW, the identifiers in your example are from Riken.
2) Whether control probe signals are useful for you depends on how your method works. You could use them to normalize arrays, though perhaps your method either doesn't do that or does it in a different way. Identifiers starting with NC_ can also be NCBI chromosome IDs, though I wouldn't expect to see those used as identifiers.
3) It sort of depends on how many samples you have. The meaningfulness of your network predictions is partly dependent on sample number, so if a subset lacks sufficient size then you'll need to group subsets. Otherwise, it might depend on the treatment that differentiates the networks. You'll need to create some networks and do some experiments.
4) They likely detect different parts of the same feature. If that feature is a gene, they're likely different exons or transcripts. There's no one-size-fits-all solution to how to deal with this.
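
For the duplicate-IDENTIFIER case, here is a minimal sketch of the "keep the row with the largest mean" rule that the question mentions. It is only one possible heuristic, not a recommended or standard method. The function names, the `columns` parameter and the `AFFX-example` row are made up for illustration, and the control filter just encodes the patterns listed in the question rather than any official GEO rule.

```python
from statistics import mean

# Rows copied from the example in the question: (ID_REF, IDENTIFIER, values).
rows = [
    ("104824_at", "2210011C24Rik", [-237.5, -179.3, -494.9, -156.9]),
    ("104825_g_at", "2210011C24Rik", [327.1, 25.9, 183.1, 948.9]),
    ("104826_at", "2210011C24Rik", [1570.1, 1057.9, 2031.6, 2683.8]),
    # Hypothetical control row, added only to exercise the filter.
    ("AFFX-example", "--Control", [10.0, 12.0, 11.0, 9.0]),
]


def looks_like_control(id_ref, identifier):
    """Heuristic taken from the question; whether to drop these rows
    depends on the inference method."""
    return (id_ref.startswith("AFFX-")
            or identifier in ("--Control", "NO INFORMATION"))


def collapse_by_highest_mean(rows, columns=None):
    """Keep, per IDENTIFIER, the row with the largest mean over `columns`.

    `columns` is a list of sample column indices (for example the samples
    of one subset); None means all samples.
    """
    best = {}
    for id_ref, identifier, values in rows:
        if looks_like_control(id_ref, identifier):
            continue
        picked = values if columns is None else [values[i] for i in columns]
        score = mean(picked)
        if identifier not in best or score > best[identifier][0]:
            best[identifier] = (score, id_ref, values)
    return {ident: (id_ref, values)
            for ident, (_, id_ref, values) in best.items()}


collapsed_all = collapse_by_highest_mean(rows)
collapsed_wild_type = collapse_by_highest_mean(rows, columns=[0, 1])
print(collapsed_all["2210011C24Rik"][0])
```

Passing `columns` lets you try the per-subset variant and compare it with the all-samples variant; in this particular example both pick the same probe, but that won't be true in general.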

Whether there are negative values or not will depend on how things were processed prior to creating the file. Whatever you do, don't just blindly take the absolute value, that'd be a terrible idea. You should read a few papers to find out how these values are actually arrived at.
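
To make the point about absolute values concrete, here is a small arithmetic sketch using two rows from the question plus one invented row (`mixed`). It makes no claim about what the negative values mean biologically; it only shows what the two orders of operations do to the numbers.

```python
from statistics import mean

negative_probe = [-237.5, -179.3, -494.9, -156.9]  # 104824_at from the question
positive_probe = [327.1, 25.9, 183.1, 948.9]       # 104825_g_at from the question
mixed = [-100.0, 100.0]                            # invented row with mixed signs

# Plain means keep the sign, so the two probes stay clearly apart.
plain_negative = mean(negative_probe)   # about -267.15
plain_positive = mean(positive_probe)   # about 371.25

# After taking absolute values the all-negative probe looks like a
# moderately strong positive signal.
abs_negative = mean(abs(v) for v in negative_probe)  # about 267.15

# The order of operations also matters once signs are mixed.
mean_of_abs_mixed = mean(abs(v) for v in mixed)  # 100.0
abs_of_mean_mixed = abs(mean(mixed))             # 0.0
```

Neither variant is right by default, which is why it is worth finding out how the values were produced first.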

SOFT files have a structured format, but the actual data type held in them can differ widely.
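
As an illustration of that structure, here is a minimal parsing sketch covering only the constructs visible in the excerpt from the question: `^SUBSET` headers, `!subset_*` attributes and a data table whose header row starts with `ID_REF`. The function name `parse_gds_soft` is made up, the table is assumed to be tab-separated, and every value column is assumed to be numeric or `null`, which won't hold for every dataset. For real work the Bioconductor tooling mentioned above is probably the safer route.

```python
def parse_gds_soft(lines):
    """Split GDS SOFT text into subset metadata, table header and table rows."""
    subsets = {}
    header = None
    table = []
    current = None  # name of the subset whose attributes are being read
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        if line.startswith("^"):
            kind, _, name = line[1:].partition("=")
            current = name.strip() if kind.strip() == "SUBSET" else None
            if current is not None:
                subsets[current] = {}
        elif line.startswith("!"):
            key, sep, value = line[1:].partition("=")
            # Attribute lines without '=' (table begin/end markers) are skipped.
            if current is not None and sep:
                subsets[current][key.strip()] = value.strip()
        elif line.startswith("ID_REF"):
            header = line.split("\t")
        elif header is not None:
            fields = line.split("\t")
            values = [None if f == "null" else float(f) for f in fields[2:]]
            table.append((fields[0], fields[1], values))
    return subsets, header, table


example = [
    "^SUBSET = GDS1934_1",
    "!subset_description = wild type",
    "!subset_sample_id = GSM89493,GSM89494",
    "ID_REF\tIDENTIFIER\tGSM89493\tGSM89494\tGSM89495\tGSM89496",
    "104824_at\t2210011C24Rik\t-237.500\t-179.300\t-494.900\t-156.900",
]
subsets, header, table = parse_gds_soft(example)

# Map a subset's sample IDs to indices into each row's value list
# (the first two table columns are ID_REF and IDENTIFIER).
samples = subsets["GDS1934_1"]["subset_sample_id"].split(",")
columns = [header.index(s) - 2 for s in samples]
```

The `columns` list is what you would use to restrict a computation to one subset's samples.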