Question: Possible methodology implemented in R/Bioconductor packages to integrate and analyze clinical data with gene expression data in R
5.1 years ago by
svlachavas680 wrote:

Dear Biostars Community,

I want to address a specific data integration procedure that I would like to implement in R/Bioconductor using any relevant package or methodology. In detail, I have acquired clinical data accompanying my gene expression microarray data (which I have preprocessed and analyzed: Affymetrix colorectal cancer datasets, 60 samples, paired cancer and control samples). The clinical data are Positron Emission Tomography (PET) measurements on each sample (for each patient, both cancer and adjacent control), such as SUV (Standardized Uptake Value), Fractal Dimension (FD) and other kinetic parameters: in total 8 "variables" (parameters) with continuous measurements (numbers with units). A very small subset of these clinical data, just for illustration, is presented below (the first column holds the variable names):

Parameter   Unit             Sample_1   Sample_2   Sample_3
SUV                          8.085      10.255     3.2744
VB                           0.00595    0.063967   0.032291
FD                           1.3546     1.3923     1.2349
K1          ml/ml Tiss/min   0.6953     0.4653     0.3942 ........


Thus, my crucial question is whether there is an appropriate methodology implemented in any R package to integrate my gene expression data with the corresponding PET data and then analyze them together, in order to search for interesting correlations or patterns. I would also like to be able to perform any necessary transformation of the above data (maybe scaling or normalization of the continuous variables, and removal of any samples with a lot of missing values). The only package I have noticed is the FactoMineR R package, but as I have no experience with this kind of analysis I don't know if it could be used for my specific purposes. I insist on R/Bioconductor because I have analyzed my gene expression data in R, so I would like to use the same platform/language for the above goals as well.

Any suggestions, comments or help would be appreciated!



modified 5.1 years ago by Irsan • written 5.1 years ago by svlachavas

Dear Vassiak,

Thank you for your answer. I highlighted R because I generally perform data analysis mainly in R and use some other tools for functional enrichment analysis. I have heard of the other tools you mention (MeV, STATA), but I'm a bit reluctant to use them, as I would like to have complete control over every analysis step performed, although this, as you accurately point out, is more time consuming. Also, I didn't know that Cytoscape has such plugins for data integration. I will look into it in detail.

written 5.1 years ago by svlachavas
5.1 years ago by
Irsan7.2k wrote:

Put your data in an ExpressionSet object, which is defined in the Biobase package (Bioconductor). Then annotate it with feature data (gene annotation) and phenotype data (your PET data). Then I would recommend first doing unsupervised hierarchical clustering on the 60 samples with different linkage methods (Ward, complete, average, McQuitty) on Pearson distances (one minus the Pearson correlation). Check how the sample clusters relate to your PET results. Then do supervised analyses by identifying differentially expressed genes between relevant PET contrasts, and make clustering heatmaps of the differential genes.
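
A minimal sketch of the annotation step, assuming the PET table is stored with parameters in rows and samples in columns (as in the question) and that its column names match the sample names of the expression matrix. All object names (`expr`, `pheno`, `pet`) and all values are made up for illustration, and the Biobase step only runs if that package is installed.

```r
# Toy expression matrix: 5 probes x 4 samples (hypothetical values)
set.seed(1)
samples <- c("S1_N", "S1_T", "S2_N", "S2_T")
expr <- matrix(rnorm(20), nrow = 5,
               dimnames = list(paste0("probe", 1:5), samples))

# Existing categorical phenotype data, one row per sample
pheno <- data.frame(Disease = c("Normal", "Cancer", "Normal", "Cancer"),
                    row.names = samples)

# PET table: parameters in rows, samples in columns,
# deliberately in a different sample order than the expression matrix
pet <- data.frame(S2_T = c(10.2, 1.39), S1_N = c(3.3, 1.23),
                  S1_T = c(8.1, 1.35), S2_N = c(2.9, 1.21),
                  row.names = c("SUV", "FD"))

# Transpose to samples x parameters and reorder to match the expression matrix
pet_by_sample <- as.data.frame(t(pet))[colnames(expr), , drop = FALSE]
stopifnot(identical(rownames(pet_by_sample), colnames(expr)))

# Combine categorical and continuous phenotype columns
pheno_all <- cbind(pheno, pet_by_sample)
print(pheno_all)

# Optional: wrap everything in an ExpressionSet (requires Biobase)
if (requireNamespace("Biobase", quietly = TRUE)) {
  eset <- Biobase::ExpressionSet(
    assayData = expr,
    phenoData = Biobase::AnnotatedDataFrame(pheno_all))
}
```

Matching on sample names rather than on position guards against the PET columns being in a different order than the CEL files.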

modified 11 months ago by RamRS • written 5.1 years ago by Irsan

Dear Irsan,

Thank you for your interesting approach! Regarding your answer, do you mean incorporating the PET information into my phenoData? Here is a small subset of the phenotype information of my expression set:

                 Disease   Location        Meta_factor   Study
St_1_WL57.CEL    Normal    sigmoid_colon   0             hgu133plus2
St_2_WL57.CEL    Cancer    sigmoid_colon   0             hgu133plus2
St_N_EC59.CEL    Normal    sigmoid_colon   0             hgu133plus2
St_T_EC59.CEL    Cancer    sigmoid_colon   0             hgu133plus2
St_N_EJ58.CEL    Normal    cecum           0             hgu133plus2
St_T_EJ58.CEL    Cancer    cecum           0             hgu133plus2

and then I would have 8 more "variables" in my phenotype information? That is, continuous variables in addition to the categorical ones? Also, I would like to ask whether it is necessary to transform the above measurements first (i.e. scaling or normalizing) before including them in my phenotype information and performing any type of clustering on the samples?

Finally, regarding the supervised approach you point out, I believe that limma could handle both categorical and continuous variables?

modified 11 months ago by RamRS • written 5.1 years ago by svlachavas

Yes, that's what phenoData is for. All types of variables will fit, and limma can handle many different study designs. It provides examples for almost every scenario, so take the time to read the docs. BTW, for unsupervised clustering of the samples do

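library(Biobase)  # provides exprs() to extract the genes x samples matrix
# Pearson distance between samples (the columns of x): 1 - correlation
distPearson <- function(x) as.dist(1 - cor(x, method = "pearson"))
yourClustering <- hclust(distPearson(exprs(yourExpressionSet)), method = "complete")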

I guess the samples first split according to the tissue they were obtained from, then disease state, then ... ?
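
One possible way to check how the sample clusters relate to a PET variable is sketched below. It uses base R only and simulated data: the matrix `expr`, the vector `suv`, the choice of two clusters, and the Kruskal-Wallis test are all assumptions made for illustration, not part of the original analysis.

```r
set.seed(42)
# Toy data: 50 genes x 12 samples; the two sample groups get opposite profiles
expr <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(paste0("g", 1:50), paste0("s", 1:12)))
expr[1:25, 1:6]   <- expr[1:25, 1:6] + 3
expr[26:50, 7:12] <- expr[26:50, 7:12] + 3

# Hypothetical PET variable, one value per sample
suv <- c(runif(6, 2, 5), runif(6, 8, 15))
names(suv) <- colnames(expr)

# Pearson distance between samples (columns), then hierarchical clustering
d  <- as.dist(1 - cor(expr, method = "pearson"))
hc <- hclust(d, method = "average")

# Cut the tree into two clusters and compare SUV between them
cl <- cutree(hc, k = 2)
print(table(cl))
print(tapply(suv, cl, median))
print(kruskal.test(suv ~ factor(cl)))
```

With real data the number of clusters is not known in advance, so it is worth inspecting the dendrogram (`plot(hc)`) before choosing `k`.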

modified 5.1 years ago • written 5.1 years ago by Irsan

Dear Irsan,

I have used limma extensively for many scenarios, but not for one like this, with both categorical and continuous data, for which it may even be useful to use a general non-linear regression with splines. Anyway, my first goal is to incorporate the values of the new variables appropriately. Regarding this aspect, would you also scale and/or normalize the values before merging them?

  • Also, regarding your code, you mention supervised, but isn't hierarchical clustering unsupervised?
  • Finally, regarding my phenotype information, I care mostly about disease (the tissue you mentioned); Location is also interesting, as it gives the anatomic location of each colorectal tumor. The variable Study was used to inspect any batch effects after merging the two datasets into one, and Meta_factor just indicates synchronous liver metastases (but I have not used any in the gene expression analysis either, only the primary tumor cases).
modified 11 months ago by RamRS • written 5.1 years ago by svlachavas

I don't think I understand exactly what you mean by scaling and normalizing your values. Are you talking about one continuous PET variable? Or are all PET variables continuous, as in your example at the top? If so, I think it largely depends on what you want to do with the variable (covariate, variable of interest, ...), the sample size, the normality and variance of the variable, and maybe more ... Also, I think it very much depends on how clinicians use these PET data. For example, if they binarize SUV because the SUV numbers are dichotomous, you might want to do so as well.
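
For reference, the two preprocessing steps raised in the question (dropping samples with many missing values and putting the PET variables on a common scale) could look roughly like this in base R. The first three samples reuse the values from the question; `Sample_4`, `Sample_5` and the 50% missingness cutoff are hypothetical choices for illustration, and whether z-scoring is appropriate at all depends on the points above.

```r
# PET table with samples in rows and parameters in columns
pet <- data.frame(
  SUV = c(8.085, 10.255, 3.2744, NA, 5.1),
  VB  = c(0.00595, 0.063967, 0.032291, NA, 0.02),
  FD  = c(1.3546, 1.3923, 1.2349, NA, 1.30),
  row.names = paste0("Sample_", 1:5))

# Drop samples with more than half of their PET values missing
# (the 0.5 cutoff is an arbitrary illustration)
keep <- rowMeans(is.na(pet)) <= 0.5
pet_kept <- pet[keep, , drop = FALSE]

# Z-score each parameter: subtract the column mean, divide by the column SD
pet_scaled <- scale(pet_kept)
print(round(pet_scaled, 3))
```

Rank-based methods (for example Spearman correlation) are unaffected by this kind of scaling, so the step mainly matters for distance-based or multivariate methods such as PCA.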

Yes, the code was for unsupervised clustering (I edited the comment).

If you are interested in a cancer vs normal contrast you should use that in the design matrix for limma and use Location and batch as covariates/blocking factors.
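
A sketch of such a design matrix, assuming a phenotype table with columns named as in the example above. The sample layout, the `Study` labels `A` and `B`, and the simulated expression values are hypothetical; the limma calls only run if limma is installed.

```r
# Hypothetical phenotype table for 8 samples
pheno <- data.frame(
  Disease  = factor(rep(c("Normal", "Cancer"), 4),
                    levels = c("Normal", "Cancer")),  # Normal = reference
  Location = factor(rep(c("sigmoid_colon", "cecum"), each = 4)),
  Study    = factor(c("A", "A", "B", "B", "A", "A", "B", "B")))

# Disease is the effect of interest; Location and Study enter as covariates
design <- model.matrix(~ Disease + Location + Study, data = pheno)
print(colnames(design))

# Optional: fit the model with limma on simulated expression data
if (requireNamespace("limma", quietly = TRUE)) {
  set.seed(1)
  expr <- matrix(rnorm(100 * 8), nrow = 100)
  fit <- limma::eBayes(limma::lmFit(expr, design))
  print(limma::topTable(fit, coef = "DiseaseCancer", number = 5))
}
```

A continuous PET variable could be added to the formula in the same way as the factors (for example `~ Disease + SUV`), in which case its coefficient is a slope per unit of that variable.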

modified 5.1 years ago • written 5.1 years ago by Irsan

I checked the PET variables again, and all of them consist of continuous values, including SUV, which for instance ranges from ~3 to 19. Yes, my main goal is to perform some kind of "correlation" analysis (maybe it is called multifactor analysis or something like that) to find interesting patterns in the gene expression data that correlate with the PET variables: for instance, maybe the cancer samples have a relationship with a specific PET value, or similar. That is why I highlighted the possible need to transform these values first.
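
One simple form of such a correlation analysis is a gene-by-gene correlation with a single PET variable. The sketch below uses simulated data with one planted signal; the names `expr` and `suv`, the use of Spearman correlation, and the Benjamini-Hochberg correction are assumptions for illustration rather than a recommended pipeline.

```r
set.seed(7)
n <- 20
suv <- runif(n, 3, 19)   # hypothetical SUV values in the range mentioned above
expr <- matrix(rnorm(100 * n), nrow = 100,
               dimnames = list(paste0("g", 1:100), paste0("s", 1:n)))
expr[1, ] <- 0.5 * suv + rnorm(n, sd = 0.5)   # plant one correlated gene

# Spearman correlation of every gene with SUV (rank-based, so no scaling needed)
rho <- apply(expr, 1, function(g) cor(g, suv, method = "spearman"))
pval <- apply(expr, 1, function(g)
  cor.test(g, suv, method = "spearman", exact = FALSE)$p.value)

# Collect results and adjust for testing many genes
res <- data.frame(rho = rho, p = pval, fdr = p.adjust(pval, method = "BH"))
print(head(res[order(res$p), ], 5))
```

Because the cancer and control samples are paired and differ strongly in expression, it may be worth running this separately within the cancer samples, or including Disease as a covariate in a linear model instead.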

written 5.1 years ago by svlachavas
5.1 years ago by
vassialk190 wrote:

Use MeV, Expander, Genesis, JMP, STATA software first (they have many tricks you will enjoy), then experiment with R packages, which is always a rather time consuming task, despite the years of code writing. You can use Cytoscape plugins for building math models with various input of clinical and laboratory data.

written 5.1 years ago by vassialk