Question: GEO2R - significant DEGs
Asked 14 months ago by hcv:

For an assignment, I am analysing data from accession GSE54536. However, when considering the adjusted p-value, no DEGs are found. For the next assignment, however, I need a list of DEGs from this dataset. Should I not look at the adjusted p-value but just the p-value? Thanks in advance for any help.

Tags: deg, geo2r, geo

Could you expand on the reasoning that led you to log-transform the dataset, please? That is, what was it about the "distribution of the data in the Value Distribution section" that led you to log-transforming?

There are a couple of red flags in there for me: the box plot given in the Value Distribution section indicates lower quartiles around zero for most of the samples in the version I'm looking at; it's less than zero for a couple of the samples.

Unless you made it yourself, you can rarely be sure where the whiskers end on a box plot, but the lower whisker will lie somewhere within the range of the data and it is negative for every one of those samples.

If there's a negative value in a dataset, what will happen to it upon log-transformation?
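To make that concrete, here is a minimal Python sketch (the intensity values are invented; R's log2(), which GEO2R uses, behaves the same way on non-positive inputs - NaN for negatives, -Inf for zero):

```python
import math

# Hypothetical intensity values like those hinted at by the Value
# Distribution box plot: some are zero or negative, which already
# suggests the data were transformed before upload.
values = [1523.7, 0.0, -2.4]

def safe_log2(x):
    """Return log2(x); mimic R's log2(): -Inf for zero, NaN for negatives."""
    if x > 0:
        return math.log2(x)
    return float("-inf") if x == 0 else float("nan")

transformed = [safe_log2(v) for v in values]
print(transformed)  # the negative value becomes NaN, the zero becomes -inf
```

So any probe with a negative value is silently turned into a missing value by the transformation.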

— russhh, 14 months ago

@OP: what was your cutoff? Regarding the data transformation and the consequent results, I suggest you read the methods in the paper and reproduce the authors' results by going through their publication. This would help you put your analysis in perspective.

— cpad0112, 14 months ago

Would following the author's protocols for working with the raw data be of much value if the data have been transformed before upload to GEO? I'd strongly suggest that OP reads some of the metadata for the GSMxxxx files first to work out what manipulations were performed before upload.

— russhh, 14 months ago

The only metadata in the files is 'normalized signal intensity', which is consistent with the box plot, since the values are comparable across samples. Upon log-transformation, NaNs are produced, but the data also become normally distributed, which is needed for the statistical tests performed in limma. My guess is that I should focus on the p-value rather than the adjusted p-value, since the adjustment method used for DEG selection is not specified in the M&M of the paper.

— hcv, 14 months ago

Why are you log-transforming the normalized signal? It is not necessary.

What is your adjusted p-value cutoff to define a DEG?

— theobroma22, 14 months ago

You have to be careful with your wording here:

When you say that the data is normally distributed following log-transformation, do you mean the 'distribution of intensities across all probes for a given sample' or the 'distribution of intensities for a given probe across all samples' was normally distributed?

(I suspect you've looked at the former, which is largely irrelevant to a statistical model that applies across samples.)

Fundamentally, limma doesn't require that the data be normally distributed across all the samples. It is an assumption of the model that there is some normally-distributed noise around the fitted values. That is, in an A-vs-B experiment like this, the values for a given probe should be (approximately) normally distributed within group A, and also within group B (they don't need to look normally distributed in the amalgamation of groups A and B).

So if you want to check whether your data are OK to be put into limma, you should look at a few different probes and, for each of those probes, do a box plot split by the experimental arms.
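A minimal sketch of that per-probe, per-arm inspection (this is not the limma workflow itself; the probe ID and values are made up for illustration, and a real check would be per-arm box plots or QQ plots rather than summary statistics):

```python
from statistics import mean, stdev

# One probe's (hypothetical) intensities, split by experimental arm.
probe_values = {
    "PROBE_0001": {"A": [7.1, 7.4, 6.9, 7.2], "B": [8.0, 8.3, 7.8, 8.1]},
}

for probe, groups in probe_values.items():
    for arm, vals in groups.items():
        # Within each arm, the values should look roughly normal around
        # their own mean; the arms need not look normal when pooled.
        print(probe, arm, round(mean(vals), 2), round(stdev(vals), 2))
```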

Theoretically, the reason for working with logged data is that fold-changes (multiplicative differences between groups) in the original space correspond to additive differences between groups in the logged data; and linear models estimate additive differences between groups. So if you're applying linear models to data that has been logged twice, the coefficients you estimate no longer correspond to a fold-change in the original space - so you HAVE to know whether the data has been transformed.
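A tiny numeric demonstration of that correspondence (the values are arbitrary):

```python
import math

# A 4-fold (multiplicative) change between groups in the original space
# becomes an additive difference of 2 on the log2 scale.
a, b = 400.0, 100.0
fold_change = a / b                       # 4.0 in the original space
log_diff = math.log2(a) - math.log2(b)    # 2.0 on the log2 scale
assert math.isclose(log_diff, math.log2(fold_change))

# Logging a second time breaks the correspondence: this difference no
# longer maps back to a fold-change in the original space.
double_logged_diff = math.log2(math.log2(a)) - math.log2(math.log2(b))
print(fold_change, log_diff, double_logged_diff)
```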

Practically, in this dataset - if you're transforming your data into a bunch of missing values, you're probably applying the wrong transformation.

As a rule of thumb, if you look at a microarray dataset and it contains negative values, or the maximum value is much lower than 10000, or the difference between the max and min values is less than a thousand, it's probably been log-transformed already.
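That rule of thumb can be written down directly (a rough heuristic only, with the cutoffs taken from the paragraph above; the example values are invented):

```python
# Flag a set of intensities as "probably already log-transformed" if it
# contains negatives, its maximum is well below 10000, or the difference
# between its max and min is less than a thousand.
def looks_log_transformed(values,
                          max_cutoff=10000.0,
                          range_cutoff=1000.0):
    lo, hi = min(values), max(values)
    return lo < 0 or hi < max_cutoff or (hi - lo) < range_cutoff

raw_like = [35.2, 18077.0, 5400.0, 120.9]   # typical raw intensities
logged_like = [-0.8, 5.1, 12.4, 7.0]        # typical logged values
print(looks_log_transformed(raw_like))      # False
print(looks_log_transformed(logged_like))   # True
```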

Good luck with your work. But as another rule of thumb: check the published paper after you've analysed the dataset (correctly) yourself. You'd be surprised how many papers could have been a lot better... (As an aside, I have no experience or connection with this dataset or paper, which may well be perfectly good.)

A third rule of thumb: all cut-offs are arbitrary. Any downstream analysis you do could be critically dependent upon an arbitrarily set significance threshold at an early step in your analysis, so if you do plan to do GO/KEGG/genefriends/IPA-type stuff, cut the data at multiple thresholds and check that your downstream analysis is robust to your arbitrary cutpoints.
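One way to sketch that robustness check: adjust the p-values (Benjamini-Hochberg, which is what limma's topTable reports by default as adj.P.Val) and see how the DEG count varies across several cutoffs. The p-values below are invented for illustration; in practice they would come from your limma fit.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (same scheme as R's
    p.adjust(..., method = "BH"))."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.74]
adj = bh_adjust(pvals)
for cutoff in (0.01, 0.05, 0.10):
    n_deg = sum(a < cutoff for a in adj)
    print(f"adj.P < {cutoff}: {n_deg} DEGs")
```

If the downstream enrichment results change drastically between these cutoffs, the conclusions are fragile.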

— russhh, 14 months ago

Thank you very much for your elaborate reply; it's helping me tremendously to understand precisely what I am doing and what I should be doing.

— hcv, 14 months ago
Powered by Biostar version 2.3.0