I also want to try out Paradigm but am confused by this same issue. You posted your question 11 weeks ago; are you any closer to deciphering how Paradigm's discretization works? Here are a few quotes from the original paper (Vaske et al. 2010) that might shed light on this issue:
"These variables represent the differential state of each entity in comparison with a ‘control’ or normal level rather than the direct concentrations of the molecular entities. This representation allows us to model many high-throughput datasets, such as gene expression detected with DNA microarrays, that often either directly measure the differential state of a gene or convert direct measurements to measurements relative to matched controls."
This paragraph suggests to me that Paradigm assumes the data in the input expression matrix have already been processed to represent a differential state (either a fold-change or a P-value). If so, I think this is a real weakness of the method, since suitable controls (matched or otherwise) do not exist for patient gene expression data in cancer. In cBioPortal's presentation of TCGA expression data, for example, a Z-score is instead reported for each gene-sample combination by standardizing each gene against its own mean (and standard deviation) across samples, which I think is probably the best that can be done.
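For concreteness, that per-gene standardization amounts to the following (a minimal sketch with toy numbers, not cBioPortal's actual pipeline):

```python
import numpy as np

# Toy expression matrix: rows = genes, columns = samples.
expr = np.array([
    [5.0, 6.0, 7.0, 20.0],
    [1.0, 1.1, 0.9, 1.0],
])

# Per-gene Z-score: standardize each gene to its own mean and
# standard deviation across samples -- no matched control needed.
mean = expr.mean(axis=1, keepdims=True)
std = expr.std(axis=1, ddof=1, keepdims=True)
z = (expr - mean) / std
```

Each gene then has mean 0 across samples, so "high" and "low" are defined relative to the cohort rather than to a (non-existent) matched normal.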
Several passages in the original paper and elsewhere support the notion that Paradigm discretizes over all variables jointly. In the paper they analyze a breast cancer microarray dataset (Naderi et al.), for which no matched normal expression is available, and report the following normalization approach in the methods:
"All data were non-parametrically normalized using a ranking procedure including all sample-probe values and each gene-sample pair was given a signed P-value based on the rank. A maximal P-value of 0.05 was used to determine gene-samples pairs that were significantly altered."
From the above, my guess is that the P-values are used to discretize the data into activated (0 < signed P ≤ 0.05), deactivated (-0.05 ≤ signed P < 0), and unchanged (otherwise). Of course this is problematic, since genes with high ranks are not necessarily activated but merely highly expressed. They also analyze a glioblastoma cohort from TCGA, in which they take advantage of a small number of "normal" samples from tissue adjacent to the tumour:
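My reading of the procedure, in code, is something like the following. This is a reconstruction based on the quoted methods text, not Paradigm's actual implementation, and the exact mapping from rank to signed P-value is my guess:

```python
import numpy as np

def discretize(expr, alpha=0.05):
    """Rank every gene-sample value jointly, convert each rank to a
    signed P-value (positive = upper tail, negative = lower tail),
    then threshold at |P| <= alpha.  A guess at the paper's method,
    not Paradigm's actual code."""
    flat = expr.ravel()
    n = flat.size
    # rank 1..n over the entire matrix (ties broken arbitrarily)
    ranks = np.empty(n)
    ranks[np.argsort(flat)] = np.arange(1, n + 1)
    q = ranks / (n + 1)  # quantile of each value
    # signed P: distance into the nearer tail, signed by direction
    signed_p = np.where(q > 0.5, 1.0 - q, -q)
    states = np.zeros(n, dtype=int)
    states[(signed_p > 0) & (signed_p <= alpha)] = 1     # "activated"
    states[(signed_p < 0) & (signed_p >= -alpha)] = -1   # "deactivated"
    return states.reshape(expr.shape), signed_p.reshape(expr.shape)
```

Run on any matrix, this will always call roughly the top 5% of values "activated" and the bottom 5% "deactivated", which is exactly the concern: highly expressed, not differentially expressed.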
"The glioblastoma data from TCGA (2008) was obtained from the TCGA data portal providing gene expression for 230 patient samples and 10 adjacent normal tissues on the Affymetrix U133A platform. The probes for the patient samples were normalized to the normal tissue by subtracting the median normal value of each probe. ... [The dataset was] non-parametrically normalized using the same procedure as the breast cancer data"
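The median-normal subtraction they describe reduces to a simple per-probe operation, something like this (toy numbers, not TCGA data):

```python
import numpy as np

# Toy probe-by-sample matrices: rows = probes, columns = samples.
tumour = np.array([[8.0, 9.0, 7.5],
                   [3.0, 2.5, 3.5]])
normal = np.array([[6.0, 6.5],
                   [3.0, 3.2]])

# Per-probe median over the adjacent-normal samples, subtracted
# from every tumour sample of that probe.
normal_median = np.median(normal, axis=1, keepdims=True)
tumour_rel = tumour - normal_median
```

Note that every tumour sample is compared against the same pooled normal reference, which is the step I take issue with below.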
Trying to obtain differential information for a tumour sample by pooling a set of adjacent normals is flawed (or at least controversial), since the cell-type composition of surrounding normal tissue will differ strongly from that of the tumour itself, which is thought to be composed largely of undifferentiated, dividing progenitors that are present only at low abundance in normal tissue. Hence, the change in expression between tumour and normal will be confounded by the differing expression profiles of stem cells and differentiated cells.
Finally, there is this pretty clear statement on the Paradigm help page:
"Internally the system will attempt to remove platform biases by performing a non-parametric normalization technique. This is accomplished by rank-ordering the entire datafile and assign new values to each point of rank / total. That would make the lowest value in a datafile with 10 samples and 10 genes 1/100, and the highest value 100/100. The understanding of the discrete values each gene/sample is going to be assigned is useful for interpreting the results, and changing the discrete cutoffs can cause the pathway results to vary widely."
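Taken literally, that normalization is just the following (a sketch of the arithmetic the help page describes, not Paradigm's code):

```python
import numpy as np

# A toy "datafile": 10 genes x 10 samples = 100 values, matching
# the example in the help page.
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 10))

flat = data.ravel()
n = flat.size  # 100
ranks = np.empty(n)
ranks[np.argsort(flat)] = np.arange(1, n + 1)
# each value becomes rank / total: lowest -> 1/100, highest -> 100/100
normalized = (ranks / n).reshape(data.shape)
```

The key point is that the ranking runs over the entire file, so a gene's normalized value reflects its position among all gene-sample values, not its change relative to any control.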
So it appears that the whole file is used for discretization. Unless the data have already been processed to represent differential states (a dicey proposition for patient tumour samples), the inference won't be based on "activated" and "deactivated" genes but rather on highly and lowly expressed ones. I would expect this to limit the algorithm's overall ability to detect consistent perturbations across pathways and between patients.