Question: Can ordinal logistic regression be a good method to test correlation?
0
gravatar for lykosherdshep
5 weeks ago by
lykosherdshep70 wrote:

I'm sorry, the description may be too long.
I wanted to find gene whose expresison values positively correlate to gleason grade in prostate cancer using TCGA data, the gleason grade is divided into five group according to the gleason scores(Grade I: gleason scores <= 6; Grade II: gleason scores = 3+4; Grade III: gleason scores = 4+3; Grade IV: gleason scores = 4+4|3+5|5+3; Grade V: gleason scores = 9|10), the variable gleason grade should be a orderd factors. the code for above opetation is below:

## download data from TCGA througn TCGAbiolinks
mrnaquery <- GDCquery(project = "TCGA-PRAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "HTSeq - Counts")
 GDCdownload(mrnaquery, files.per.chunk = 8)
 mrna_se <- GDCprepare(mrnaquery)
 ## filter low tumor purity samples and normal samples
 mrnatp <- TCGAquery_SampleTypes(getResults(mrnaquery)$cases,
                            typesample = "TP")
  mrna_se <- mrna_se[,mrna_se$barcode %in% 
                 c(TCGAtumor_purity(mrnatp,
                                    estimate = 0,
                                    absolute = 0,
                                    ihc = 0,
                                    lump = 0,
                                    cpe = 0.7)$pure_barcodes)]
   rm(mrnatp, mrnant)
   ## filter FFPE sampels
    mrna_se <- mrna_se[,!mrna_se$is_ffpe]
    ## check if there are some duplicated samples
    mrna_se$bcr_patient_barcode[mrna_se$shortLetterCode == "TP"] %>% 
       duplicated() %>% 
       sum()
    ## omit samples without gleason scores
    mrna_se <- mrna_se[,!is.na(mrna_se$subtype_Clinical_Gleason)]

    # construct a group factor according to gleason score
     mrna_se$gleason_grade5g <- ifelse(mrna_se$subtype_Clinical_Gleason == "3+3"| 
     mrna_se$subtype_Clinical_Gleason == "2+4",
                                   "I", 
                                   ifelse(mrna_se$subtype_Clinical_Gleason == "3+4", 
                                          "II", 
                                          ifelse(mrna_se$subtype_Clinical_Gleason == "4+3", 
                                                 "III", 
                                                 ifelse(mrna_se$subtype_Clinical_Gleason == "4+4"| 
     mrna_se$subtype_Clinical_Gleason == "3+5" | mrna_se$subtype_Clinical_Gleason == "5+3", 
                                                        "IV", 
                                                        "V"))))
     mrna_se$gleason_grade5g <- factor(mrna_se$gleason_grade5g, 
                              levels = c("I", "II", "III", "IV", "V"))

I tried two different methods to find my desied gene.

the first I fitted a ordinal logistic regression across all gene using ordinal package in R where the gleason grade variable is a dependent variable and gene expression values which has been transformed using the function vst() in DESeq2 package is a independent variable. and I could obtain the coefficient of each gene in the regression and it's corresponding p values, after obtaining all gene's coefficients and pvalues I calculated the adjusted P using "BH" method, I thought the gene whose coefficient is above zero and adjusted P is below 0.05 is what I need.

here is my code for my first methods:

mrna_dds5g <- DESeqDataSet(mrna_se, design = ~gleason_grade5g)
## filter low expression gene
mrna_dds5g <- mrna_dds5g[rowSums(assay(mrna_dds5g)) >= 10,]
## Data transformation
mrna_vst5g <- vst(mrna_dds5g, blind = F)
tcga_gg_ord <- as.ordered(mrna_vst5g$gleason_grade5g)
## construct data for ordinal logistic regression
tcga_gg_ord <- data.frame(samples = colnames(mrna_vst5g),
                                       gleason_grade = tcga_gg_ord,
                                       stringsAsFactors = F)
### construct a data.frame with the first and second column respectively
###  representing the sample barcode and gleason grade and the other columns
### standing for the gene expression values
tcga_logit_data <- assay(mrna_vst5g) %>% 
    as.data.frame() %>% 
    rownames_to_column(var = "gene_ensembl") %>% 
### add a column corresponding to the gene symbol 
    mutate(Symbol = mapIds(org.Hs.eg.db,
                     keys = gene_ensembl,
                     column = "SYMBOL",
                     keytype = "ENSEMBL"),) %>% 
### filter gene without gene symbol and omit gene_emsembl column
    dplyr::filter(!is.na(Symbol)) %>% 
    dplyr::select(-gene_ensembl) %>% 
    dplyr::select(Symbol, everything()) %>% 
 ### there are some gene ensembl mapping to the same gene symbol
 ### retain the gene symbol rows whose mean expression values is the most
    mutate(sum = rowSums(.[-1])) %>% 
    arrange(desc(sum)) %>% 
    distinct(Symbol, .keep_all = T) %>% 
###  transform data to a format where the first and second column respectively
###  representing the sample barcode and gleason grade and the other columns
###  standing for the gene expression values
    column_to_rownames(var = "Symbol") %>% 
    t() %>% as.data.frame() %>% rownames_to_column(var = "samples") %>% 
    inner_join(tcga_gg_ord, by = "samples") %>% 
    dplyr::select(samples, gleason_grade,everything())


### apply clm() to each gene 
tcga_clm_logit <- apply(tcga_logit_data[-(1:2)], 2,
                   function(gene){
                     logit <-  clm(tcga_logit_data$gleason_grade ~ gene)
                     logit_sum <- summary(logit)
                     ord_logit <- list(clm = logit, summary = logit_sum)
                     return(ord_logit)
                   })
### take out coefficients, standard error, z value, and p value for every gene
tcga_clm_res <- sapply(tcga_clm_logit, function(gene){
temp <- gene$summary$coefficients["gene", ]
 })
### fetch the results data.frame which have six columns 
### gene symbol, standard error, z value, and p value, adjusted pvalues
tcga_clm_res <- t(tcga_clm_res) %>% 
    as.data.frame() %>% 
   `names<-`(c("Estimate", "Std.error", "z value", "pvalue")) %>% 
     rownames_to_column(var = "gene") %>% 
     mutate(padj = p.adjust(pvalue, method = "BH"))
tcga_clm_upgene <- filter(tcga_clm_res, Estimate > 0, padj < 0.01) %>% 
     arrange(padj)
tcga_clm_downgene <- filter(tcga_clm_res, Estimate < 0, padj < 0.01) %>% 
     arrange(padj)

of course, I also test the proportional assumption using nominal_test() function, but weird thing is that all p values is NA. and I don't know if the logistic regression can be a good method in such circumstance?

the second method is clumsy but I think it can be proper though the results were not so well. I made a differential expression among all gleason grade group. V vs IV, IV vs III, III vs II, II vs I. and I take out all results from above compare and intersect them but there is no one existing in all group compare.

the code for the second method is blew:

## differential expression between 5 groups
    mrna_gleason5g_de <- DESeq2::DESeq(mrna_dds5g)

### up
res_IIvsI <- results(mrna_gleason5g_de, 
                 contrast = c("gleason_grade5g", "II", "I"),
                 alpha = 0.01)
summary(res_IIvsI)
IIvsI_gene <- tidy.DESeqResults(res_IIvsI) %>% 
      filter(estimate > 1, p.adjusted < 0.01) %>% 
      .$gene
res_IIIvsII <- results(mrna_gleason5g_de, 
                   contrast = c("gleason_grade5g", "III", "II"),
                   alpha = 0.01)
summary(res_IIIvsII)
IIIvsII_gene <- tidy.DESeqResults(res_IIIvsII) %>% 
       filter(estimate > 1, p.adjusted < 0.01) %>% 
      .$gene
res_IVvsIII <- results(mrna_gleason5g_de, 
                   contrast = c("gleason_grade5g", "IV", "III"),
                   alpha = 0.01)
 summary(res_IVvsIII)
 IVvsIII_gene <- tidy.DESeqResults(res_IVvsIII) %>% 
      filter(estimate > 1, p.adjusted < 0.01) %>% 
      .$gene
  res_VvsIV <- results(mrna_gleason5g_de, 
                 contrast = c("gleason_grade5g", "V", "IV"),
                 alpha = 0.01)
 summary(res_VvsIV)
 intersect(IIvsI_gene, IIIvsII_gene)
 intersect(IIIvsII_gene, IVvsIII_gene)
 intersect(IVvsIII_gene, VvsIV_gene)

for some question summarization:
1. Can the ordinal regression be a good methods for correlation between a ordered variable and a continuous variable?
2. what does the NA p value returned by nominal_test() function from R ordinal package mean? I Googled it but it seems no one come to such situation.

Thank you all!

ADD COMMENTlink modified 5 weeks ago by Kevin Blighe51k • written 5 weeks ago by lykosherdshep70
1
gravatar for Kevin Blighe
5 weeks ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Ordinal regression seems like a good choice, but describing the process as "correlation between a ordered variable and a continuous variable" does not seem correct to me. You may just want to say that you conducted an ordinal regression between the Gleason Grades and gene expression.

It's difficult to 'proof' every part of your workflow but things like the following can influence your end result:

  • sample numbers overall and per Gleason group
  • pre-filtering of low count genes
  • outliers (<- these can result in, for example, a completely erroneous interpretation from a regression model)

Also, for this code chunk:

### apply clm() to each gene 
tcga_clm_logit <- apply(tcga_logit_data[-(1:2)], 2,
                   function(gene){
                     logit <-  clm(tcga_logit_data$gleason_grade ~ gene)
                     logit_sum <- summary(logit)
                     ord_logit <- list(clm = logit, summary = logit_sum)
                     return(ord_logit)
                   })

I am not sure that this code is doing what you want (?). For one, you do not have to remove gleason_grade from the data - you should leave it within the data and then have the formula: gleason_grade ~ gene

Also, double check that tcga_logit_data$gleason_grade is indeed still encoded as a categorical variable and is ordered correctly. Also spot-check the gene expression data to ensure that they are encoded as continuous variables.

I think that your code should be something like:

tcga_clm_logit <- lapply(
  colnames(tcga_logit_data)[3:ncol(tcga_logit_data)],
  function(gene) {
    formula <- as.formula(paste0('gleason_grade ~ ', gene))
    logit <- clm(formula, data = tcga_logit_data)
    logit_sum <- summary(logit)
    ord_logit <- list(clm = logit, summary = logit_sum)
    return(ord_logit)
})

--------------------------------------------

Regarding the pairwise differential expression between the Gleason grades, you should be making use of the lfcshrink() function after running results(). Also... I'm not sure... would you expect the statistically significantly differentially expressed genes to appear across all pairwise comparisons? - I would not.

Kevin

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Kevin Blighe51k

Thank you, Kevin. Sorry for my unproper description and thanks for your correcting me.

I have just tried your code, but it returns me a error- Error in eval(predvars, data, env) : can't find object 'NKX3', It seems the gene symbol name cause such problems.

and my code tcga_clm_logit <- apply(tcga_logit_data[-(1:2)], 2, function(gene){ logit <- clm(tcga_logit_data$gleason_grade ~ gene) logit_sum <- summary(logit) ord_logit <- list(clm = logit, summary = logit_sum) return(ord_logit) }) can give a result.

for my differential expression operation. I just thought the gene having a up differential expression in all the group compare(for this, they were II vs I, III vs II, IV vs III, V vs IV) imply their positive correlation with gleason grade.
Kevin, the ordinal logistic regression assume the effects of any explanatory variables are consistent or proportional across the different thresholds I test the proportional assumption using nominal_test() function, but weird thing is that all p values is NA. do you know what it mean? and if this can influence my results?

ADD REPLYlink written 5 weeks ago by lykosherdshep70

Hey again, so, you are saying that the apply(... function produces results? Do they make sense?

I am not quite sure about the NA p-values. The issue has been reported previously: https://github.com/runehaubo/ordinal/issues/15

ADD REPLYlink written 5 weeks ago by Kevin Blighe51k
1

Hey, Kevin, I solved it.
the colnames in my data are just gene symbol containing non-syntactic character, and R allow non-syntactic column names with backticks surrounding them. so I just add backticks to each gene symbol and it works. I changed the code to below:

tcga_clm_logit <- lapply(colnames(tcga_logit_data)[3:ncol(tcga_logit_data)],
                     function(gene) {
                       formula0 <- paste0("gleason_grade ~", "`", gene, "`")
                       logit <- clm(as.formula(formula0), data = tcga_logit_data)
                       logit_sum <- summary(logit)
                       ord_logit <- list(gene = logit_sum)
                       return(ord_logit)
                     })

and it works. Thanks for your help.

ADD REPLYlink written 5 weeks ago by lykosherdshep70

Yes, only the apply function give me a result, I tried the code you gave me again and I can understand the code, but it return a error(Error in eval(predvars, data, env) : object 'NKX3' not found). Maybe there are some conflicts with the package ordinal. I'll try another ordinal logistic regression package. Thanks for your patient instruction.

ADD REPLYlink written 5 weeks ago by lykosherdshep70

I have checked tcga_logit_data$gleason_grade and it's a ordered factor. and the number for each group respectively 23, 95, 52, 38, 65 for gleason grade I, II, III, IV, V. the number seems unbalanced.

ADD REPLYlink written 5 weeks ago by lykosherdshep70

Sorry, I also wanted to confirm my uncertain question, I thinks I should stablize the variance to do other analysis with expression values. in my ordinal regression workflow, the gene expression values I use were transformed by vst() function in DESeq2 package. Can this be a acceptable method?
and I have read a post in biostar concerning survival analysis where using voom().
I think the both can be used, Am I right? or there are other methods best to me? Thanks Kevin.

ADD REPLYlink written 5 weeks ago by lykosherdshep70

The transformed expression levels produced by vst() should be fine; however, you can also try those produced by rlog().

ADD REPLYlink written 5 weeks ago by Kevin Blighe51k

Thanks for your help, Kevin.

ADD REPLYlink written 5 weeks ago by lykosherdshep70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1398 users visited in the last hour