Question

Forum: Is it valid to apply several cutoff thresholds to a data matrix used for coexpression analysis until reaching the desired result?

3.2 years ago

cyntsc10 • 0

Hi everybody

I ran a standard WGCNA analysis on normalized RNA-Seq data (log2). I set up a signed network with dynamic cut-off (Pearson), following most of the recommendations.

At the beginning I didn't get a good correlation score for the scale-free topology fit with the soft thresholds. To reach this goal, I built several matrices in which I applied a data cutoff based on the quartiles until I reached a decent correlation score (0.82). During this process the original matrix obviously shrank (from ~27000 genes to ~2000) across 17 samples. In theory this is fine because I only want to keep the highest expression scores, but mathematically I am not sure whether I am biasing the experiment by applying this criterion.
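A minimal sketch of the kind of quartile filter described above, assuming the cutoff is applied to each gene's mean log2 expression across samples (the post does not say which statistic was used). It is written in plain Python rather than the R that WGCNA uses, and `filter_by_quartile` and the toy matrix are hypothetical names and data for illustration only.

```python
from statistics import mean, quantiles


def filter_by_quartile(expr, keep_above=3):
    """Keep genes whose mean expression exceeds a quartile cut point.

    expr: dict mapping gene id -> list of per-sample log2 expression values.
    keep_above: 1, 2 or 3, selecting the Q1, median or Q3 cut point
                of the per-gene means (an assumed filtering criterion).
    """
    gene_means = {gene: mean(values) for gene, values in expr.items()}
    cut_points = quantiles(gene_means.values(), n=4)  # [Q1, median, Q3]
    threshold = cut_points[keep_above - 1]
    kept = {gene: expr[gene] for gene, m in gene_means.items() if m > threshold}
    return kept, threshold


# Toy matrix (made-up values): 8 genes, 3 samples each.
toy = {f"gene{i}": [float(i)] * 3 for i in range(1, 9)}

# Report how much of the matrix survives each progressively stricter cutoff.
for q in (1, 2, 3):
    kept, threshold = filter_by_quartile(toy, keep_above=q)
    print(f"cut point Q{q} = {threshold}: {len(kept)} of {len(toy)} genes kept")
```

Each stricter cutoff changes the set of genes on which the network and its scale-free fit are computed, so recording the number of genes kept at every step makes it easier to report exactly what was done.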

So my question is: is applying several cutoffs to a data matrix until it shows the desired behavior a normal practice, or am I biasing the experiment? The thing is that in the end I have proper results, but I want to be sure that these results are also valid.

I would highly appreciate any comments.

These are the distributions at each cutoff, for your reference:

[Figure: distributions]

RNA-Seq matrix-filtering coexpression-analysis • 1.1k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.2 years ago by cyntsc10 • 0

Hi Sudbery
So in that sense, the best practice would be to work with the least manipulated matrix that comes closest to the expected result?
The issue with the original matrix is that the clusters I get are quite large, too large to analyze one by one. I am really stuck at this point, because I have not found a practical strategy to get what I am looking for without losing resolution in my data. I see your point, but I am not sure I have a specific answer. If you have anything to add, please share it with me.
Thanks
Cynthia

ADD REPLY • link 3.2 years ago by cyntsc10 • 0

Please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized. This comment belongs under @Ian's answer.

SUBMIT ANSWER is for new answers to original question.

ADD REPLY • link 3.2 years ago by GenoMax 141k

score 1 · Answer 1 · 2021-02-22

There are a couple of ways to deal with this, and a lot of it comes down to what question you are trying to answer. If your conclusions are in any way connected or correlated with expression, correlation score or scale-free topology, then yes, you've biased the data. If they are not, then you have not.

Let's say you want to know which pathway a gene, GeneA, functions in. The plan would be to take the cluster/network containing GeneA and run a pathway enrichment analysis. Any pathway analysis you did would have to take the reduced size of the matrix into account by setting the enrichment background to only the 2000 genes still in the matrix. If you did a pathway analysis and got PathwayA out, you could be fairly confident that the association was real. However, if it matters that PathwayB didn't come out of the analysis, then you have problems. It's entirely possible that you've reduced the power to detect enrichment in PathwayB by excluding too many of its genes, and that its absence from the results is a false negative.
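To illustrate why the background matters, here is a minimal sketch of a one-sided hypergeometric enrichment test in plain Python. The function name `enrichment_pvalue` and all the counts below (cluster size, pathway sizes, overlap) are made-up assumptions for illustration; only the 2000 and ~27000 gene totals come from the question. Dedicated enrichment tools may use a different test or apply multiple-testing correction.

```python
from math import comb


def enrichment_pvalue(background_size, pathway_in_background, cluster_size, overlap):
    """One-sided hypergeometric P(X >= overlap).

    background_size: number of genes that could have ended up in the cluster.
    pathway_in_background: pathway genes present in that background.
    cluster_size: number of genes in the cluster containing the gene of interest.
    overlap: pathway genes observed in the cluster.
    """
    upper = min(pathway_in_background, cluster_size)
    total = comb(background_size, cluster_size)
    tail = sum(
        comb(pathway_in_background, k)
        * comb(background_size - pathway_in_background, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: a 100-gene cluster with 10 PathwayA genes in it.
# Assume PathwayA has 60 genes in the full matrix, 40 of which survive filtering.
p_filtered = enrichment_pvalue(2000, 40, 100, 10)   # background = retained genes
p_full = enrichment_pvalue(27000, 60, 100, 10)      # background = all genes

print(f"p-value with filtered background: {p_filtered:.3g}")
print(f"p-value with full background:     {p_full:.3g}")
```

With these made-up counts, the full background gives a much smaller p-value than the filtered one, which is how an overly large background can overstate enrichment. The same function also shows the power problem for PathwayB: if filtering leaves only a handful of its genes in the background, the largest possible overlap is small and so is the evidence the test can ever produce.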