Question: Discussion Regarding P-Value
1
2.8 years ago by
mora.jason2250 wrote:

Good Morning, Afternoon, and Evening wherever you are in the world!

Small background: My name is Jason Mora and I stumbled here from /r/Bioinformatics. I am a recent graduate of California State University, East Bay, where I completed an undergraduate degree in mathematics with a minor in computer science.

Currently I am working on a bioinformatics project and would like some clarification. The data I am working with consist of 16 features: 8 are the conditions (the control plus the various stress conditions), whose values are gene expression values averaged per gene, and the other 8 are the respective p-values for those conditions, e.g. Control and Control.Pvalue, or Stress1 and Stress1.PValue. The rows are all of the genes for the organism. I was not involved in the calculation of the p-values, so I am assuming the data are ready for data mining and interpretation. I have read various papers in the research area we are studying, as well as papers on approaches to analyzing biological data, and I come across statements along the lines of "The log2FC is 3.45 (p < 0.01) ....". When exploring the data, I see that some genes have significant p-values (using < 0.05 for now) across all 8 conditions, while other genes have p-values that are significant for the stress conditions but not for the control condition.

## Let's say Gene A:

Gene A Control Gene Exp: ###
Gene A Control P-Value: 0.31
Gene A Stress 1 Gene Exp: ######
Gene A Stress 1 P-Value: 0.001253
Gene A Stress 2 Gene Exp: ####
Gene A Stress 2 P-Value: 0.002512

My question is the following: when I want to calculate the fold change, must both the control's p-value and the stress condition's p-value satisfy the significance level applied? Or does it apply to the stress condition only?

Again, I am not working on the assays themselves; the data were given to me in an Excel file (as described above). I simply want to know whether I can calculate fold changes between the control and stress conditions only when both are within a significance level, or whether it's safe to assume that the significance level for the control can be 'ignored'. I am using Python and several Python libraries to help me visualize and analyze this data set.

Any clarification would be greatly appreciated! I will try my best to answer any questions and clarify any concerns you may have. Thank you for taking the time to read this!

modified 2.8 years ago • written 2.8 years ago by mora.jason2250
3

The clarifications have to come first from the data-generating side:
- What are the expression values, and what do they measure?
- What test was used to generate the p-values? What was the null hypothesis?
To me, at the moment, you have a bunch of meaningless numbers.
You can always compute a fold change; that has nothing to do with p-values. But you need to make sure that you have appropriate measures of expression. I could imagine that your expression values are already expressed as fold changes relative to something, with the associated p-values being a measure of confidence in those changes.
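
To illustrate that a fold change is just arithmetic on two expression values, here is a minimal sketch. It assumes the values are on a linear (not log) scale and strictly positive, which has not been confirmed for this data set. The function name `log2_fold_change` and all numbers are hypothetical, made up for illustration.

```python
import math


def log2_fold_change(treated, control):
    """Return log2(treated / control) for linear-scale, strictly positive values."""
    if treated <= 0 or control <= 0:
        raise ValueError("expression values must be positive on a linear scale")
    return math.log2(treated / control)


# Hypothetical average signals (linear scale); not taken from the real data set.
signals = {
    "GeneA": {"Control": 100.0, "Stress1": 800.0, "Stress2": 50.0},
}

# One log2 fold change per stress condition, each relative to the control.
fold_changes = {
    gene: {
        condition: log2_fold_change(values[condition], values["Control"])
        for condition in values
        if condition != "Control"
    }
    for gene, values in signals.items()
}
print(fold_changes)
```

No p-value enters the calculation at any point; whether the result is meaningful depends entirely on what the expression values actually measure.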

1

Thank you very much for your response:
1. The data given to me are gene expression values and not fold changes, because the description for those columns says X.AVG_Signal, which I am assuming is the average signal for that gene.
2. I am not sure how the p-values were generated because, as I stated, I was not involved in that process, and unfortunately the person who did the microarray analysis for the project no longer works at that laboratory. Also I do not know the null-hypothesis. All of this microarray analysis was contracted to another laboratory, and the person in charge of that laboratory has transferred to the Univ. of Chicago. I could ask my P.I. how this was done and maybe get back with all the logistics. As far as I know, the rows of the matrix are all genes for the organism, and the columns are the control and the stress conditions applied to those genes (these are the average signals, which I am assuming are the average gene expression values), together with the respective p-value for each condition, just as in the small example above. I simply want to see how a gene changed for, say, Irradiated / Control by calculating its log2FC. Since I have the corresponding p-value for each gene in all conditions, I want to know whether both conditions must have a statistically significant p-value (pVal[Control] < 0.05 and pVal[Irradiated] < 0.05), or whether I only care about the gene's pVal[Irradiated] < 0.05. To me, it would not make sense to calculate the log2FC if pVal[Control] is not significant.

2

Ah, it's microarray data. The p-values presumably then come from a comparison to some of the control probes. A probe with a significant p-value then has a signal significantly higher, which just means, "detected at some level above 0". You can ignore the p-values for calculating fold-changes.

1

Thank you so much Devon for your insights! Can you elaborate on that, please? Does that mean that if a gene was significantly expressed in the irradiated state but not in the control state, I can still calculate its fold change to look for a significant change? Or do both states need to be significant to do this calculation?

1

Once again, the p-values are of absolutely positively no use for you. Delete those columns from the Excel table. Consequently, use the values regardless of what their associated useless p-value is.
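
If you would rather drop those columns in Python than in the spreadsheet, a rough sketch could look like the following. It assumes each row has been loaded as a dictionary keyed by column name and that the p-value columns end in `.Pvalue` or `.PValue`, as in the column names quoted in the question. The helper `drop_pvalue_columns` and the expression values are hypothetical.

```python
def drop_pvalue_columns(rows, suffix=".pvalue"):
    """Return copies of the rows without columns whose name ends in `suffix`.

    The match is case-insensitive, so both 'Control.Pvalue' and
    'Stress1.PValue' are removed.
    """
    return [
        {
            name: value
            for name, value in row.items()
            if not name.lower().endswith(suffix)
        }
        for row in rows
    ]


# Hypothetical row; the expression values are invented for illustration.
rows = [
    {
        "Gene": "GeneA",
        "Control": 6.6,
        "Control.Pvalue": 0.31,
        "Stress1": 9.6,
        "Stress1.PValue": 0.001253,
    },
]

cleaned = drop_pvalue_columns(rows)
print(cleaned)
```

The original rows are left untouched, so the p-value columns remain available if their meaning is ever documented.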

1

There's no information to confirm what these p-values are. Saying they result from comparison with some probe is an assumption and even if true, you don't know what this probe was. So the only thing to do with these p-values is to ignore them.

1

... which I am assuming the average gene expression values

When it comes to data analysis, don't assume, be certain. If the data were generated with some sort of data management system in place, all necessary information should be available and should not depend on the person who did the work still being around.

I do not know the null-hypothesis

You don't know what these p-values are, so you can't use them. Ask yourself: what exactly does p < x mean here? What was tested?

We can't give you good advice if we also don't know what the data is.

1

http://imgur.com/a/ZPclv

This is a screenshot of the data. I will take your advice and ask for all the logistics of the project.

2

Note that your values already appear to be log transformed (no clue what the base is).
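
This matters for the fold change: if the values are already on a log scale, the log fold change is a difference of the two values, not the log of their ratio. A small sketch, assuming both values share the same (unknown) log base; the function names and numbers are hypothetical:

```python
import math


def log_fold_change(log_treated, log_control):
    """Log fold change for values already on a log scale: a plain difference.

    The result is in the same log base as the inputs.
    """
    return log_treated - log_control


def to_log2(log_fc, base):
    """Convert a log fold change from an assumed `base` to base 2."""
    return log_fc * math.log2(base)


# Hypothetical log-scale signals; not taken from the screenshot.
control = 6.5
irradiated = 9.0

lfc = log_fold_change(irradiated, control)
print(lfc)  # 2.5, in whatever base the data use

# Only if the base were known to be 10 (an assumption) would this apply:
print(to_log2(lfc, base=10))
```

Until the base is confirmed, the difference can be ranked and compared within the data set, but it should not be reported as a log2FC.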

2

tldr: Do not continue with this dataset until whoever gave it to you can clearly state where these values came from.

I want to strongly reiterate what Jean-Karim Heriche wrote: you can completely ignore the p-values, they are completely and utterly useless. Given that those are present, I would not blindly rely on the expression values being either correct or useful for anything either. At this point, your time would be better spent going to get a coffee (or 10) until whoever is responsible for this gets back to you.

1
2.8 years ago by
mora.jason2250 wrote:

Thank you very much for your replies, Jean and Devon. I appreciate your insights and advice, and I will do as recommended. This has definitely been a learning experience, and I appreciate you taking time out of your busy day to answer my question!