I'm new to RNA-seq analysis and would appreciate it if I could get some feedback on whether my workflow makes sense. I've been trying to learn on my own whilst doing a lab-based PhD... I've realised I enjoy this more than pipetting. Anyway, thanks to this community for all the help and support.
I'm analysing the data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32863 But I'm working with the non_normalised data
library(dplyr) library(limma) library(edgeR) readr::local_edition(1) library(GEOquery) my_id <- "GSE32863" ## extract geo data Sys.setenv(VROOM_CONNECTION_SIZE = 256000000) expr <- getGEO(my_id)[] sampleInfo <- pData(expr) annot <- fData(expr) #annotation data ####read in the data and convert to an ElistRaw object baseDir <- './' #import data x <- read.table(paste0(baseDir, 'GSE32863_non-normalized.txt.gz'), header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0) x <- x[order(x$ID_REF),] #extract pvalues detectionpvalues <- x[,grep('Detection.Pval', colnames(x))] rownames(detectionpvalues) <- x$ID_REF #remove pvalues from dataframe x <- x[,-grep('Detection.Pval', colnames(x))] #extract probes, 1st column probes <- x$ID_REF #convert to matrix x <- data.matrix(x[,2:ncol(x)]) #columns and rows names rownames(x) <- probes colnames(x) <- colnames(edata) colnames(detectionpvalues) <- colnames(x) #verify rownames in x match those in annot #dataframes should have same number of observations and rows annot <- annot[which(annot$ID %in% rownames(x)),] # create a custom EListRaw object gse<- new('EListRaw') gse@.Data[] <- 'illumina' gse@.Data[] <- sampleInfo gse@.Data[] <- NULL gse@.Data[] <- x gse@.Data[] <- NULL gse$E <- x gse$targets <- sampleInfo gse$genes <- annot gse$other$Detection <- detectionpvalues ########### Background correction / normalisation gse.bgcorrect.norm <- neqc(gse, offset = 16) # filter out control probes, those with no symbol, and those that failed Control <- gse.bgcorrect.norm$genes$Source=="ILMN_Controls" NoSymbol <- gse.bgcorrect.norm$genes$Symbol == "" isexpr <- rowSums(detectionpvalues <= 0.05) >= 3 gse.bgcorrect.norm.filt <- gse.bgcorrect.norm[!Control & !NoSymbol & isexpr, ] dim(gse.bgcorrect.norm) # 48803 116 dim(gse.bgcorrect.norm.filt) # 26225 116 ##summarise across genes by mean gse.bgcorrect.norm.filt.mean <- avereps(gse.bgcorrect.norm.filt, ID = gse.bgcorrect.norm.filt$genes$Symbol) dim(gse.bgcorrect.norm.filt.mean) # 19485 116 library(limma) design<- model.matrix(~0 + sampleInfo$source_name_ch1) ## the column names are a bit ugly, so we will rename colnames(design) <- c("Non_tumor","Adenocarcinoma") #calculate cutoff cutoff <- median(gse.bgcorrect.norm.filt.mean$E) #filter is_expressed <- gse.bgcorrect.norm.filt.mean$E > cutoff keep <- rowSums(is_expressed) > 2 gse.bgcorrect.norm.filt.mean$E <- gse.bgcorrect.norm.filt.mean$E[keep,] #fit the model to the data #The result of which is to estimate the expression level in each of the groups that we specified. fit <- lmFit(gse.bgcorrect.norm.filt.mean$E, design) contrasts<-makeContrasts(Adenocarcinoma-Non_tumor,levels = design) fit2 <- contrasts.fit(fit, contrasts) #apply the empirical Bayes step to get our differential expression statistics and p-values. fit2 <- eBayes(fit2) topTable(fit2) decideTests(fit2) tfit <- treat(fit2, lfc=1) dt <- decideTests(tfit) adeno_vs_non <- topTreat(tfit, coef=1, n=Inf)
I am also uploading the plots from this session
Indeed, this is not an RNA-seq analysis but Microarray. There are, however, various RNA-seq workflows for R on Bioconductor that may serve as a template.
If you at some point want to run an analysis with hundreds or even thousands of samples, R and the Bioconductor workflows are unfortunately not going to cut it. In that case, you will find mature RNA-seq pipelines written in Nextflow, Snakemake, Common Workflow Language etc.
@Matthias since your post is not directly addressing the question I have moved it to a comment.
Sure. I just assumed that the actual purpose of the question was to teach themselves RNA-seq analysis and therefore pointed out the respective workflows to retrace instead. But I should have asked instead, if the OP is nonetheless interested in feedback regarding the script.