Question: Differential Gene expression between low survival and high survival samples from RNAseq
0
gravatar for lawarde.ankita1
14 months ago by
lawarde.ankita130 wrote:

Hello,

I have one question regarding the approach of differential gene expression analysis from RNAseq data of cancer.

Approach:

1) To find out the Defferentially Expressed Genes between low survived and high survived samples of patients. (example: here in my case, patients survived less than 1 year and patients survived more than 3 years)

I am using GDC data of lung cancer (TCGA-LUAD) and using this above approach i want to find the DEGs. So for this analysis i am using and following TCGAbiolinks package and its tutorials. but even if the TCGAbiolinks analysis is giving the DEG number but the heatmap of those genes is not showing any differentiating pattern. So i did the same analysis using edgeR analysis and limma using the expression set obtained from summarizedExperiment for the DEG, but then this analysis doesn't give any DEGs between less than 1 year and more than 3 years samples of patients.

SO my question is,

1) Is this approach of finding DEGs between less than 1 year and more than 3 years is correct? or is there anything i am missing in the analysis?

Suggestions regarding this approach will be very helpful and it will surly help me to clear my confusions. I look forward to hear valuable suggestions.

Thank you in advance.

rna-seq survival deg R • 628 views
ADD COMMENTlink written 14 months ago by lawarde.ankita130

Can you please detail further how you used edgeR? and the connection to limma..

ADD REPLYlink written 14 months ago by roy.granit790

Hello roy.granit,

here i provide with the R code i used for the DEG analysis usign edgeR, limma,

#RNAseq- analysis,
library(edgeR)
library(limma)
library(gplots)
library(org.Hs.eg.db) #for annotations
library(RColorBrewer)
library(TCGAbiolinks)
library(DT)

query.exp <- GDCquery(project = "TCGA-LUAD", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

luad.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "TCGA_LUAD_Exp.rda", 
                       summarizedExperiment = TRUE)

save(luad.exp,file = "luad_exp.RData")

load(file = "luad_exp.RData") #brca.exp
luad.exp

library(SummarizedExperiment)
# get expression matrix
data <- assay(luad.exp)

I have clinical info from the datatype follow_up and i have made two files containing patient clinical data for days to death more than three years and less than three years. the following code i used to match those barcodes of patients with the sample barcodes.

more3_years <- read.delim(file = "more_3_years.txt", sep = "\t", header = TRUE)

less1_year <- read.delim(file = "less_than_or_equal_1_year.txt", sep = "\t", header = TRUE)

data_days_to_death_3year <- subset(data, select = substr(colnames(data),1,12) %in% more3_years$bcr_patient_barcode)

data_days_to_death_1year <- subset(data, select = substr(colnames(data),1,12) %in% less1_year$bcr_patient_barcode)

data_1year_3year <- cbind(data_days_to_death_1year,data_days_to_death_3year)

Following is the RNA seq analysis, this im following according to this tutorial ( http://combine-australia.github.io/RNAseq-R/06-rnaseq-day1.html )

#Filtering to remove lowly expressed genes
myCPM <- cpm(data_1year_3year)
head(myCPM)

# Which values in myCPM are greater than 0.5?
thresh <- myCPM > 0.5

head(thresh)

table(rowSums(thresh))
# we would like to keep genes that have at least 2 TRUES in each row of thresh
keep <- rowSums(thresh) >= 2

# Subset the rows of countdata to keep the more highly expressed genes
counts.keep <- data_1year_3year[keep,]

# DGElist object
yy <- DGEList(counts.keep)

#add annotations 
ensemble_id <- rownames(yy)

genes <-select(org.Hs.eg.db, keys=ensemble_id ,columns = "SYMBOL", keytype="ENSEMBL")

genes <- genes[!duplicated(genes$ENSEMBL),]
yy$genes <- genes

group <- c(rep("less_than_1",51),rep("more_than_3",40))
group <- factor(group)

#Normalisation for composition bias
# Apply normalisation to DGEList object
y <- calcNormFactors(yy)

#Differential expression with limma-voom

# Specify a design matrix without an intercept term
design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)
desig
#Voom transform the data
v <- voom(y,design,plot = TRUE)

#Testing for differential expression
# Fit the linear model

fit <- lmFit(v)
cont.matrix <- makeContrasts(diff= more_than_3-less_than_1, levels=design)
fit2 <- eBayes(contrasts.fit(fit, cont.matrix))
summa.fit <- decideTests(fit2)
summary(summa.fit)
dia_ven <- vennDiagram(summa.fit,include = c("up","down"))

tt_1 <- topTable(fit2, coef =   ,p.value = 0.05,lfc = 2, number = nrow(v$E))
dim(tt_1)

The topTable gives zero DEGs If this process to find DEG is wrong then please let me know that, it would really help me.

ADD REPLYlink written 14 months ago by lawarde.ankita130

Sorry to add this code as a separate comment, but the number of charcters in the post was exceeding than the 5000

Below is the code where i have followed TCGAbiolinks tutorial,

library(TCGAbiolinks)
library(DT)

query.exp <- GDCquery(project = "TCGA-LUAD", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts",
                      experimental.strategy = "RNA-Seq")

GDCdownload(query.exp)

luad.exp <- GDCprepare(query = query.exp, save = TRUE, save.filename = "TCGA_LUAD_Exp.rda", 
                       summarizedExperiment = TRUE)


#get clinical info

query <- GDCquery(project = "TCGA-LUAD", 
                  data.category = "Clinical")

GDCdownload(query)

clinical <- GDCprepare_clinic(query, clinical.info = "follow_up")

datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)

names(clinical)

write.table(clinical, file = "clinical_followup.txt",sep = "\t")

more3_years <- read.delim(file = "more_3_years.txt", sep = "\t", header = TRUE)
colnames(more3_years)

less1_year <- read.delim(file = "less_than_or_equal_1_year.txt", sep = "\t", header = TRUE)
colnames(less1_year)

dataPrep <- TCGAanalyze_Preprocessing(object = luad.exp , cor.cut = 0.6, filename = "LUAD_correlation_sample.png")

dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = TCGAbiolinks::geneInfoHT,
                                      method = "gcContent")

dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                  qnt.cut =  0.25) 

dataFilt_3year <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% more3_years$bcr_patient_barcode)
dataFilt_1year <- subset(dataFilt, select = substr(colnames(dataFilt),1,12) %in% less1_year$bcr_patient_barcode)
dim(dataFilt_1year)
dim(dataFilt_3year)

dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt_3year,
                            mat2 = dataFilt_1year,
                            Cond1type = "more than 3 years",
                            Cond2type = "less than 1 year",
                            fdr.cut = 0.05 ,
                            logFC.cut = 2,
                            method = "glmLRT")

these two methods i used to find the DEG, if there is anything wrong with one or the method please let me know as this would help me clear my doubt. And please tell the approach I'm using to find the DEG is correct or not correct.

Thank you so much for the help.

ADD REPLYlink written 14 months ago by lawarde.ankita130
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1678 users visited in the last hour