Possible methodology in R/Bioconductor packages to integrate and analyze clinical data with gene expression data
7.1 years ago
svlachavas ▴ 760

Dear Biostars Community,

I would like to ask about a specific data integration procedure that I want to implement in R/Bioconductor using any relevant package or methodology. I have acquired clinical data accompanying my gene expression microarray data, which I have already preprocessed and analyzed (Affymetrix colorectal cancer datasets, 60 samples: paired cancer and control samples). The clinical data are Positron Emission Tomography (PET) measurements on each sample (for each patient, both cancer and adjacent control tissue), such as SUV (Standardized Uptake Value), Fractal Dimension (FD) and other kinetic parameters: 8 variables in total, all continuous measurements (numbers with units). A very small subset of these clinical data is shown below for illustration (the variables shown are SUV, VB, FD and K1):

Parameter   Unit             Sample_1   Sample_2   Sample_3
SUV                          8.085      10.255     3.2744
VB                           0.00595    0.063967   0.032291
FD                           1.3546     1.3923     1.2349
K1          ml/ml Tiss/min   0.6953     0.4653     0.3942
...


So my main question is: is there an appropriate methodology, implemented in any R package, for integrating my gene expression data with the corresponding PET data and then analyzing them together to look for interesting correlations or patterns? I would also like to be able to apply any necessary transformation to the PET data (for example scaling or normalizing the continuous variables, and removing samples with many missing values). The only package I have come across is FactoMineR, but as I have no experience with this kind of analysis I don't know whether it could be used for my purposes. I insist on R/Bioconductor because I analyzed my gene expression data in R, so I would like to use the same platform for these goals as well.

Any suggestions, comments or help would be beneficial!

R data-integration clinical-data microarray-data • 2.7k views

Dear Vassiak,

Thank you for your answer. I highlighted R because I generally perform my data analysis in R and use a few other tools for functional enrichment analysis. I have heard of the other tools you mention (MeV, STATA), but I am a bit reluctant to use them, as I would like to have complete control over every analysis step, although, as you accurately point out, this is more time consuming. I also didn't know that Cytoscape has such plugins for data integration. I will look into it in detail.

7.1 years ago
Irsan ★ 7.6k

Put your data in an ExpressionSet object, defined in the Biobase package (Bioconductor). Then annotate it with feature data (gene annotation) and phenotype data (your PET data). I would then recommend first doing unsupervised hierarchical clustering of the 60 samples with different linkage methods (Ward, complete, average, McQuitty) on Pearson distances (1 minus the Pearson correlation). Check how the sample clusters relate to your PET results. Then do supervised analyses: identify differentially expressed genes between relevant PET contrasts and make clustering heatmaps of the differential genes.
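As a minimal sketch of the first step, the snippet below builds an ExpressionSet whose phenoData carries PET variables. Everything in it is simulated: the matrix `expr`, the table `pet` and its values are hypothetical stand-ins, not the poster's data. The ExpressionSet class itself comes from Biobase, which affy loads as a dependency.

```r
library(Biobase)  # provides ExpressionSet, AnnotatedDataFrame, pData

set.seed(1)
# Toy expression matrix: 100 probes x 6 samples (stand-in for the real data)
expr <- matrix(rnorm(600), nrow = 100,
               dimnames = list(paste0("probe_", 1:100), paste0("Sample_", 1:6)))

# Hypothetical PET table, one row per sample, deliberately in reverse order
pet <- data.frame(SUV = runif(6, 3, 19),
                  FD  = runif(6, 1.2, 1.4),
                  row.names = rev(colnames(expr)))

# Reorder the PET rows so they line up with the expression columns
pet <- pet[colnames(expr), , drop = FALSE]
stopifnot(identical(rownames(pet), colnames(expr)))

eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(pet))
head(pData(eset))
```

For an ExpressionSet that already exists, a single PET column could be added in the same spirit with `pData(eset)$SUV <- pet[sampleNames(eset), "SUV"]`, again matching on sample names rather than on row order.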


Dear Irsan,

Thank you for your interesting approach! Regarding your answer, do you mean incorporating the PET information into my phenoData? Here is a small subset of the phenotype information of my expression set:

head(pData(eset_COMBAT))
                Disease   Location        Meta_factor   Study
St_1_WL57.CEL   Normal    sigmoid_colon   0             hgu133plus2
St_2_WL57.CEL   Cancer    sigmoid_colon   0             hgu133plus2
St_N_EC59.CEL   Normal    sigmoid_colon   0             hgu133plus2
St_T_EC59.CEL   Cancer    sigmoid_colon   0             hgu133plus2
St_N_EJ58.CEL   Normal    cecum           0             hgu133plus2
St_T_EJ58.CEL   Cancer    cecum           0             hgu133plus2


So I would then have 8 more variables in my phenotype information, that is, continuous variables in addition to the categorical ones? I would also like to ask whether it is necessary to transform these measurements first (i.e. scaling or normalizing) before including them in my phenotype information and performing any type of clustering on the samples.

Finally, regarding the supervised approach you point out, I believe that limma can handle both categorical and continuous variables?


Yes, that's what phenoData is for. All types of variables will fit, and limma can handle many different study designs. For almost every scenario limma provides examples, so take the time to read the docs. BTW, for unsupervised clustering do

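library(Biobase)  # for exprs()
# Pearson distance between the ROWS of x (usable as distfun in heatmap functions)
distPearson <- function(x) as.dist(1 - cor(t(x), method = "pearson"))
# To cluster samples, samples must be in rows, so transpose the expression matrix
yourClustering <- hclust(distPearson(t(exprs(yourExpressionSet))), method = "complete")
plot(yourClustering)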

I guess the samples first divide by the tissue they were obtained from, then by disease state, then ... ?
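To try the different linkage methods suggested earlier in the thread and check how the resulting clusters relate to tissue or to a PET variable, a base-R sketch could look like the following. The data are simulated, and the names `expr`, `tissue` and `suv` are hypothetical. Note that Ward linkage formally assumes Euclidean distances, so treat its result on correlation distances as exploratory.

```r
set.seed(42)
# Simulated stand-in: 200 genes x 12 samples, two tissue groups
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene_", 1:200), paste0("S", 1:12)))
tissue <- factor(rep(c("Normal", "Cancer"), each = 6))
# Shift 50 genes in the cancer samples so that there is structure to find
expr[1:50, tissue == "Cancer"] <- expr[1:50, tissue == "Cancer"] + 3
# Hypothetical continuous PET variable, higher in the cancer samples
suv <- ifelse(tissue == "Cancer", runif(12, 8, 19), runif(12, 3, 8))

# cor() correlates columns, so this is a distance between samples
d <- as.dist(1 - cor(expr, method = "pearson"))

# Cut each dendrogram into two clusters, one column per linkage method
methods <- c("ward.D2", "complete", "average", "mcquitty")
clusters <- sapply(methods, function(m) cutree(hclust(d, method = m), k = 2))

# Cross-tabulate cluster membership against tissue for every method
for (m in methods) {
  cat("\nLinkage:", m, "\n")
  print(table(cluster = clusters[, m], tissue = tissue))
}

# Compare a continuous PET variable between the clusters of one method
tapply(suv, clusters[, "complete"], median)
```

If the methods agree on the main split, the grouping is probably not an artifact of one particular linkage choice.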


Dear Irsan,

I have used limma extensively for many scenarios, but not for one like this, with both categorical and continuous data, where it may even be useful to fit a general non-linear regression with splines. Anyway, my first goal is to incorporate the new variables and their values appropriately. In that respect, would you also scale and/or normalize the values before merging them?

• Also, regarding your code, you mention supervised, but isn't hierarchical clustering unsupervised?
• Finally, regarding my phenotype information, I care mostly about Disease (the tissue you mentioned), and Location is also an interesting variable, as it gives the anatomic location of each colorectal tumor. The Study variable was used to inspect any batch effects after merging the two datasets into one, and Meta_factor just indicates synchronous liver metastases, which I have not used in the gene expression analysis either (only the primary tumor cases).

I don't think I understand exactly what you mean by scaling and normalizing your values. Are you talking about one continuous PET variable? Or are all PET variables continuous, as in your example at the top? If so, I think it largely depends on what you want to do with the variable (covariate, variable of interest, ...), the sample size, the normality and variance of the variable, and maybe more. I also think it very much depends on how clinicians use these PET data. For example, if they binarize SUV because in practice they treat it as dichotomous, you might want to do so as well.
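For what the transformation step might look like in practice, here is a small base-R sketch. It assumes a hypothetical `pet` data frame with samples in rows. The 50% missingness threshold and the median cutoff for SUV are arbitrary illustrations, not clinically validated choices.

```r
set.seed(7)
# Hypothetical PET table: 10 samples x 4 continuous variables
pet <- data.frame(SUV = runif(10, 3, 19),
                  VB  = runif(10, 0.005, 0.07),
                  FD  = runif(10, 1.2, 1.4),
                  K1  = runif(10, 0.3, 0.7),
                  row.names = paste0("Sample_", 1:10))
pet[2, c("VB", "FD", "K1")] <- NA  # a sample with most values missing
pet[5, "K1"] <- NA                 # a sample with a single missing value

# Drop samples with more than half of their PET variables missing
frac_missing <- rowMeans(is.na(pet))
pet_kept <- pet[frac_missing <= 0.5, , drop = FALSE]

# Z-score each variable; scale() omits NAs when computing mean and sd
pet_z <- as.data.frame(scale(pet_kept))

# Optional dichotomization of SUV at the sample median (arbitrary cutoff)
suv_group <- factor(ifelse(pet_kept$SUV > median(pet_kept$SUV), "high", "low"))

summary(pet_z)
table(suv_group)
```

Z-scoring mainly matters when several PET variables with different units enter one distance or one multivariate model. For a single covariate in a linear model it only changes the scale of the coefficient.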

Yes, the code was for unsupervised clustering (I edited the comment).

If you are interested in a cancer vs normal contrast, you should put that in the design matrix for limma and use Location and batch as covariates/blocking factors.
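A hedged sketch of such a design in limma follows, using simulated data. The `pheno` table, the batch labels `batch1`/`batch2` and the `SUV` column are hypothetical. The second model shows that a continuous variable can enter the design in the same way, in which case its coefficient is a slope (change in log-expression per unit of SUV). Because the real samples are paired, blocking on patient is an alternative worth considering; with a patient factor in the design, Location and Study would be absorbed by it.

```r
library(limma)

set.seed(3)
n_pairs <- 6
# Hypothetical phenotype table: 6 patients, normal and cancer tissue each
pheno <- data.frame(
  Disease  = factor(rep(c("Normal", "Cancer"), n_pairs),
                    levels = c("Normal", "Cancer")),
  Location = factor(rep(c("sigmoid_colon", "cecum"), each = n_pairs)),
  Study    = factor(rep(rep(c("batch1", "batch2"), each = 2), 3))
)

# Simulated log-expression: 100 genes x 12 samples, 10 genes up in cancer
expr <- matrix(rnorm(100 * 12), nrow = 100,
               dimnames = list(paste0("gene_", 1:100), paste0("S", 1:12)))
is_cancer <- pheno$Disease == "Cancer"
expr[1:10, is_cancer] <- expr[1:10, is_cancer] + 2

# Cancer vs normal, adjusting for Location and Study (batch)
design <- model.matrix(~ Location + Study + Disease, data = pheno)
fit <- eBayes(lmFit(expr, design))
top <- topTable(fit, coef = "DiseaseCancer", number = 5)
top

# A continuous PET variable as the variable of interest
pheno$SUV <- runif(12, 3, 19)
design_suv <- model.matrix(~ Disease + SUV, data = pheno)
fit_suv <- eBayes(lmFit(expr, design_suv))
top_suv <- topTable(fit_suv, coef = "SUV", number = 5)
```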


I checked the PET variables again, and all of them consist of continuous values, including SUV, which for instance ranges from about 3 to 19. Yes, my main goal is to perform some kind of "correlation" analysis (maybe it is called multifactor analysis, or something like that) to find interesting patterns in the gene expression data that correlate with the PET variables: for instance, maybe the cancer samples have a relationship with a specific PET value, or something similar. That is why I highlighted the possible need to transform these values first.
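One simple way to start on this kind of correlation analysis is a gene-by-gene rank correlation with a single PET variable, followed by multiple-testing correction. The sketch below uses simulated data, and `expr` and `suv` are hypothetical names. Spearman correlation is rank-based, so it is unaffected by monotone transformations of the PET values. With paired cancer and control samples it may be safer to run this within one tissue type, or to adjust for disease status in a linear model, so that the cancer/normal difference does not drive the correlations.

```r
set.seed(11)
n <- 30
suv <- runif(n, 3, 19)  # hypothetical SUV values, one per sample
expr <- matrix(rnorm(500 * n), nrow = 500,
               dimnames = list(paste0("gene_", 1:500), NULL))
# Make the first 5 genes track SUV so that there is something to detect
expr[1:5, ] <- expr[1:5, ] + matrix(0.3 * suv, nrow = 5, ncol = n, byrow = TRUE)

# Spearman correlation of every gene with SUV
res <- t(apply(expr, 1, function(g) {
  ct <- cor.test(g, suv, method = "spearman")
  c(rho = unname(ct$estimate), p = ct$p.value)
}))

# Benjamini-Hochberg correction across all genes
res <- data.frame(res, fdr = p.adjust(res[, "p"], method = "BH"))
head(res[order(res$p), ])
```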

7.1 years ago
vassialk ▴ 200

Use MeV, Expander, Genesis, JMP or STATA first (they have many tricks you will enjoy), then experiment with R packages, which is always a rather time-consuming task, even after years of writing code. You can use Cytoscape plugins to build mathematical models from various clinical and laboratory data inputs.