HELP- analysis RNa-seq from GEO
2
0
Entering edit mode
8 weeks ago
PM • 0

Hello,

I'm new to RNA-seq analysis and would appreciate it if I could get some feedback on whether my workflow makes sense. I've been trying to learn on my own whilst doing a lab-based PhD... I've realised I enjoy this more than pipetting. Anyway, thanks to this community for all the help and support.

I'm analysing the data from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32863 But I'm working with the non_normalised data

library(dplyr)
library(limma)
library(edgeR)
library(GEOquery)

my_id <- "GSE32863"

## extract geo data
Sys.setenv(VROOM_CONNECTION_SIZE = 256000000)
expr <- getGEO(my_id)[[1]]

sampleInfo <- pData(expr)
annot <- fData(expr) #annotation data

####read in the data and convert to an ElistRaw object
baseDir <- './'

#import data
x <- read.table(paste0(baseDir, 'GSE32863_non-normalized.txt.gz'), header = TRUE, sep = '\t', stringsAsFactors = FALSE, skip = 0)
x <- x[order(x$ID_REF),] #extract pvalues detectionpvalues <- x[,grep('Detection.Pval', colnames(x))] rownames(detectionpvalues) <- x$ID_REF

#remove pvalues from dataframe
x <- x[,-grep('Detection.Pval', colnames(x))]
#extract probes, 1st column
probes <- x$ID_REF #convert to matrix x <- data.matrix(x[,2:ncol(x)]) #columns and rows names rownames(x) <- probes colnames(x) <- colnames(edata) colnames(detectionpvalues) <- colnames(x) #verify rownames in x match those in annot #dataframes should have same number of observations and rows annot <- annot[which(annot$ID %in% rownames(x)),]

# create a custom EListRaw object
gse<- new('EListRaw')
gse@.Data[[1]] <- 'illumina'
gse@.Data[[2]] <- sampleInfo
gse@.Data[[3]] <- NULL
gse@.Data[[4]] <- x
gse@.Data[[5]] <- NULL
gse$E <- x gse$targets <- sampleInfo
gse$genes <- annot gse$other$Detection <- detectionpvalues ########### Background correction / normalisation gse.bgcorrect.norm <- neqc(gse, offset = 16) # filter out control probes, those with no symbol, and those that failed Control <- gse.bgcorrect.norm$genes$Source=="ILMN_Controls" NoSymbol <- gse.bgcorrect.norm$genes$Symbol == "" isexpr <- rowSums(detectionpvalues <= 0.05) >= 3 gse.bgcorrect.norm.filt <- gse.bgcorrect.norm[!Control & !NoSymbol & isexpr, ] dim(gse.bgcorrect.norm) # 48803 116 dim(gse.bgcorrect.norm.filt) # 26225 116 ##summarise across genes by mean gse.bgcorrect.norm.filt.mean <- avereps(gse.bgcorrect.norm.filt, ID = gse.bgcorrect.norm.filt$genes$Symbol) dim(gse.bgcorrect.norm.filt.mean) # 19485 116 library(limma) design<- model.matrix(~0 + sampleInfo$source_name_ch1)
## the column names are a bit ugly, so we will rename

#calculate cutoff
cutoff <- median(gse.bgcorrect.norm.filt.mean$E) #filter is_expressed <- gse.bgcorrect.norm.filt.mean$E > cutoff
keep <- rowSums(is_expressed) > 2

gse.bgcorrect.norm.filt.mean$E <- gse.bgcorrect.norm.filt.mean$E[keep,]

#fit the model to the data
#The result of which is to estimate the expression level in each of the groups that we specified.
fit <- lmFit(gse.bgcorrect.norm.filt.mean\$E, design)

fit2 <- contrasts.fit(fit, contrasts)
#apply the empirical Bayes step to get our differential expression statistics and p-values.
fit2 <- eBayes(fit2)
topTable(fit2)

decideTests(fit2)
tfit <- treat(fit2, lfc=1)
dt <- decideTests(tfit)


microarray • 440 views
0
Entering edit mode

Indeed, this is not an RNA-seq analysis but Microarray. There are, however, various RNA-seq workflows for R on Bioconductor that may serve as a template.

If you at some point want to run an analysis with hundreds or even thousands of samples, R and the Bioconductor workflows are unfortunately not going to cut it. In that case, you will find mature RNA-seq pipelines written in Nextflow, Snakemake, Common Workflow Language etc.

0
Entering edit mode

@Matthias since your post is not directly addressing the question I have moved it to a comment.

0
Entering edit mode

Sure. I just assumed that the actual purpose of the question was to teach themselves RNA-seq analysis and therefore pointed out the respective workflows to retrace instead. But I should have asked instead, if the OP is nonetheless interested in feedback regarding the script.

1
Entering edit mode
8 weeks ago

Analysis looks fine (although I'm not too familiar with this old school tech).

My biggest concern would be referring to microarray data as RNA-seq...

0
Entering edit mode

Sorry! Yes, got my wires crossed! They’re not the same thing!

Thanks for pointing that out

0
Entering edit mode

Please edit the original post/title and remove the references to RNAseq since that does not apply.

0
Entering edit mode
8 weeks ago
Gordon Smyth ★ 5.1k

I'm not quite clear on your purpose. Are you analysing this Illumina microarray dataset because you are interested in the biological results from this experiment or because you thought it would teach you about RNA-seq analysis?

If the former, then there are several ways in which your analysis could be improved or simplified. For one thing, neqc needs to know the detection p-values but you removed them for some reason. For another, this experiment is a paired design but you have not taken into account the pairing in your analysis.

If the latter, then there are RNA-seq workflows for the packages you are using here:

Traffic: 2256 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.