I have seen questions about normalizing RNA-seq data from TCGA raised before, but I am not sure they answer my question, so I hope someone can comment on the results I am observing.
I wish to partition BRCA tumors into those with high or low levels of gene X and those with high or low levels of gene Y.
So I partition the tumors into four groups: 1. lowX, 2. highX, 3. lowY, 4. highY.
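For concreteness, here is a minimal sketch of one way such a low/high split could be made. The post does not say which cutoffs are used, so the quartile thresholds, the function name `split_low_high`, and the toy values are all assumptions for illustration only.

```python
from statistics import quantiles

def split_low_high(expression):
    """Split tumors into low and high groups for one gene.

    expression: dict mapping tumor ID -> expression value of the gene.
    Assumption: "low" means at or below the first quartile and "high"
    means at or above the third quartile; other cutoffs are possible.
    """
    q1, _median, q3 = quantiles(expression.values(), n=4)
    low = sorted(t for t, v in expression.items() if v <= q1)
    high = sorted(t for t, v in expression.items() if v >= q3)
    return low, high

# Hypothetical expression values of gene X in eight tumors.
toy_x = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 4.0,
         "t5": 5.0, "t6": 6.0, "t7": 7.0, "t8": 8.0}
low_x, high_x = split_low_high(toy_x)
```

The same function would be applied a second time to the values of gene Y to obtain the lowY and highY groups.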
I started from the unnormalized gene counts and normalized them by the column sum ("library size") so that I can compare the same gene between samples. I get similar results when I use the RSEM normalized gene values. [Please note, I prefer to start with unnormalized counts because I want to know exactly which steps I am taking in processing my data.]
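The column-sum normalization described above can be sketched as follows. This is only an illustration: the function name `library_size_normalize`, the scaling to counts per million, and the toy counts are assumptions, not the exact code used for the figure.

```python
def library_size_normalize(counts, scale=1_000_000):
    """Divide each raw count by its sample's total count.

    counts: dict mapping sample ID -> dict of gene -> raw count.
    scale:  assumed scaling factor (counts per million); any constant
            gives the same between-sample comparison.
    """
    normalized = {}
    for sample, gene_counts in counts.items():
        library_size = sum(gene_counts.values())  # the column sum
        if library_size == 0:
            raise ValueError(f"sample {sample!r} has no reads")
        normalized[sample] = {
            gene: scale * count / library_size
            for gene, count in gene_counts.items()
        }
    return normalized

# Hypothetical counts: tumor_B is sequenced ten times deeper than tumor_A.
toy_counts = {
    "tumor_A": {"X": 10, "Y": 30, "Z": 60},
    "tumor_B": {"X": 100, "Y": 300, "Z": 600},
}
cpm = library_size_normalize(toy_counts)
```

In this toy example the two tumors differ only in sequencing depth, so after normalization gene X has the same value in both.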
Please see figure below:
As expected, expression of X is low in the lowX group and high in the highX group.
However, strangely, expression of X is high when Y is high. Similarly, expression of Y is high when X is high (or normal).
One possibility is that in tumors where X is high, everything else is high as well (a global effect), and similarly for Y.
The alternative possibility is that my normalization is not doing a proper job. What I want is to normalize the data so that genes are measured relative to one another within each tumor, and I think a z-score is the proper way to do this. It would have to be a within-sample z-score, i.e. I would take all the gene expression values within a sample, normalize them to mean 0 and standard deviation 1, and then look at the z-scores of genes X and Y.
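A minimal sketch of the within-sample z-score described above, assuming the expression values of one tumor are held in a dict of gene -> value. The function name `within_sample_zscores` and the toy values are hypothetical, and the population standard deviation is an assumed choice. Whether the values should be log-transformed first is left open here.

```python
from statistics import mean, pstdev

def within_sample_zscores(gene_values):
    """Standardize all genes of ONE sample to mean 0 and std 1.

    gene_values: dict mapping gene -> expression value in a single tumor.
    """
    values = list(gene_values.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        raise ValueError("all genes have the same value; z-score undefined")
    return {gene: (v - mu) / sigma for gene, v in gene_values.items()}

# Hypothetical expression values of three genes in one tumor.
toy_sample = {"X": 2.0, "Y": 4.0, "Z": 6.0}
z = within_sample_zscores(toy_sample)
```

The function is applied to each tumor separately, after which the z-scores of X and Y can be compared across the four groups.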
Does this sound reasonable? I would appreciate any suggestions or advice. Thank you.