Not able to get DEG from FireBrowse data using R
0
1
Entering edit mode
3.0 years ago

Hello everyone,

I'm working on LUAD (Lung adenocarcinoma) data from FireBrowse. I want to find Differentially expressed genes according to the stage of the tumor. i have the clinical data with me and from that data i was able to take the samples which belongs to stage I(sub types a and b), Stage II (subtypes a and b), Stage III (subtypes a and b) and Stage IV. LUAD.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data this is the file of RNAseq data with around 56000 genes and 570 samples.

Below i have pasted the R code i used to find the DEG number, reference link i used (Survival analysis of TCGA patients integrating gene expression (RNASeq) data)

But I'm not able to understand what is going wrong in this analysis as i'm not able to get the DEG in tumor samples of stage I. II , II ,IV.

1) If my analysis is wrong please let me know what wrong steps i'm using. 2) what particular Package i'm suppose to use? 3)Any sample code of R would really help me.

Any suggestions would be very helpful to clear my doubts.

#read RNA file

# remove the first row as we don't need it,

rna <- rna[-1,]
dim(rna)

#remove genes whose expression is ==0 there are 50% of such samples,
rem <- function(x){
x <- as.matrix(x)
x<- t(apply(x,1,as.numeric))
r <- as.numeric(apply(x,1,function(i)sum(i == 0)))
remove <- which(r > dim(x)[2]*0.5)
return(remove)
}
remove <- rem(rna)
rna <- rna[-remove,]
dim(rna)
colnames(rna)

colnames(rna) <- gsub('\\.','-',substr(colnames(rna),1,28))
library(xlsx)

Tumor_stage <- read.xlsx(file = "Clinical/Tumor_stage_data.xlsx", sheetIndex = 1)
colnames(Tumor_stage)
Tumor_stage$Stage_I_Barcode <- toupper(Tumor_stage$Stage_I_Barcode)

Tumor_stage$Stage_II_Barcode <- toupper(Tumor_stage$Stage_II_Barcode)
Tumor_stage$Stage_III_Barcode <- toupper(Tumor_stage$Stage_III_Barcode)
Tumor_stage$Stage_IV_Barcode <- toupper(Tumor_stage$Stage_IV_Barcode)

#select only tumor samples

Tumor_Samples <- subset(rna, select = substr(colnames(rna),14,14) == '0')
dim(Tumor_Samples)

#subset the stages of tumor samples from Tumor_samples
Stage_I <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_I_Barcode) dim(Stage_I) stage_I_colnames <- as.data.frame(colnames(Stage_I)) stage_I_colnames write.xlsx(stage_I_colnames, file = "samples_stage.xlsx") Stage_II <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_II_Barcode)
dim(Stage_II)
stage_II_colnames <- as.data.frame(colnames(Stage_II))
stage_II_colnames

write.xlsx(stage_II_colnames, file = "samples_stageII.xlsx")

Stage_III <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_III_Barcode) dim(Stage_III) stage_III_colnames <- as.data.frame(colnames(Stage_III)) #stage_II_colnames write.xlsx(stage_III_colnames, file = "samples_stageIII.xlsx") Stage_IV <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_IV_Barcode)
dim(Stage_IV)
stage_IV_colnames <- as.data.frame(colnames(Stage_IV))
#stage_II_colnames

write.xlsx(stage_IV_colnames, file = "samples_stageIV.xlsx")

Tumor_stage_data <- cbind(Stage_I, Stage_II, Stage_III, Stage_IV)
dim(Tumor_stage_data)

#groups of samples
groups <- c(rep("Stage_I",277), rep("Stage_II",122), rep("Stage_III",84),rep("Stage_IV",26))
group <- factor(groups)

group

#design matrix
design <- model.matrix(~ 0 + group)
design
colnames(design)<- c("Stage_I", "Stage_II", "Stage_III","Stage_IV" )
tail(design)
dim(design)
dim(Tumor_stage_data)

library(limma)
x <- t(apply(Tumor_stage_data,1,as.numeric))

#Testing for differential expression
# Fit the linear model

fit <- lmFit(x, design)
names(fit)
group
cont.matrix <- makeContrasts(I_II = Stage_I - Stage_II,
I_III = Stage_I - Stage_III,
I_IV = Stage_I - Stage_IV,
II_III = Stage_II- Stage_III,
II_IV = Stage_II - Stage_IV,
III_IV = Stage_III - Stage_IV, levels=design)
cont.matrix

fit2 <- eBayes(contrasts.fit(fit, cont.matrix))
dim(fit2)
colnames(fit2)
summa.fit <- decideTests(fit2)
summary(summa.fit)

tt_I_II <- topTable(fit2, coef ="I_II", adjust.method = "BH",p.value = 0.05,lfc = 2, number = nrow(x))
dim(tt_I_II)

tt_I_III <- topTreat(fit2, coef ="I_III",adjust.method = "BH",p.value = 0.05,lfc = 2, number = nrow(x))
dim(tt_I_III)

tt_I_IV <- topTable(fit2, coef ="I_IV",adjust.method = "BH",p.value = 0.05,lfc = 2, number = nrow(x))
dim(tt_I_IV)

tt_II_III <- topTable(fit2, coef ="II_III",adjust.method = "BH",p.value = 0.05,lfc = 2, number = nrow(x))
dim(tt_II_III)

tt_II_IV <- topTable(fit2, coef ="II_IV",adjust.method = "BH",p.value = 0.05,lfc = 2, number = nrow(x))
dim(tt_II_IV)

tt_III_IV <- topTable(fit2, coef ="III_IV",adjust.method = "BH",p.value = 0.05,lfc = 2, number = nrow(x))
dim(tt_III_IV)

##heatmap
colnames(x) <- c(rep("Stage_I",277), rep("Stage_II",122), rep("Stage_III",84),rep("Stage_IV",26))

idx_I_III <- rownames(tt_I_III)
DEG_I_III <- x[idx_I_III,]
dim(DEG_I_III)

hr <- hclust(as.dist(1-cor(t(DEG_I_III), method="pearson")),method="complete")
plot(as.dendrogram(hr))

hc <-  hclust(as.dist(1-cor(DEG_I_III, method="spearman")), method="complete")
plot(as.dendrogram(hc))
colnames(DEG_I_III)


similar for remaining Contrast groups to plot the heatmap as above.

When i follow above code i get DEG in four contrasting groups but when i plot the heatmap there is no differentiating pattern (i know the number of samples is more than 100, but if there is any valid reason behind it please let me know.)

there is one more way i tried to find DEG by only taking Stage I and IV, Stage III and IV, Stage II and IV but here, in this case the number DEG changed and in case of Stage III and IV the DEG number was zero (and for this same group in first code the DEG number was 160)

The heatmap images i have attached with this question

R RNA-Seq FireBrowse DEG • 1.1k views