Hi everyone, this is an R code I've been studying based on this tutorial.
This tutorial of mine has helped a lot, but I want to learn more about expression analysis using R.
My doubts are regarding these codes:
- nrow(topTable(fit.eb, coef=1, number=10000, lfc=2))
What the lfc=2 means and how to use it right?
- contrast.matrix <- makeContrasts(disease-control, levels=design)
In this part, I've seen it just like that,disease-control, , but also so: disease-control, control-disease
What would be the difference between the two ways of using it?
- How do I adjust the p-value and FDR in this code? Can I do this manually in the end?
- Some genes get the symbol "NA", how to solve this problem?
- How to make this code better? I appreciate all the tips
Best regards, Leite
#Install the core bioconductor packages, if not already installed >source("http://bioconductor.org/biocLite.R") > >biocLite() #Install additional bioconductor libraries, if not already installed >biocLite("GEOquery") >biocLite("affy") >biocLite("gcrma") >biocLite("affyPLM") >biocLite("RColorBrewer") >biocLite("limma") >biocLite("hgu133plus2.db") >biocLite("annotate") >biocLite("simpleaffy") #Load the necessary libraries >library(GEOquery) >library(affy) >library(gcrma) >library(affyPLM) >library(RColorBrewer) >library(hgu133plus2.db) >library(annotate) >library(simpleaffy) #Downloading the file >getGEOSuppFiles("**MyData**") #Before starting to work, you have to unzip the files. >untar("**MyData**/**MyData**_RAW.tar", exdir="data") >cels <- list.files("data/", pattern = "[gz]") >sapply(paste("data", cels, sep="/"), gunzip) >cels #Loading Files #Phenodata must have names have to be the same as file names >celfiles <- read.affy(covdesc="phenodata.txt", path="data") #Normalising the data >celfiles.rma <- rma(celfiles) #Checking the quality #Load colour libraries >library(RColorBrewer) #Set colour palette >cols <- brewer.pal(8, "Set1") #Plot a boxplot of unnormalised intensity values >boxplot(celfiles, col=cols) #Now a normalised boxplot >boxplot(celfiles.rma, col=cols) #Plot a density vs log intensity histogram for the unnormalised data >hist(celfiles, col=cols) #Plot a density vs log intensity histogram for the normalised data >hist(celfiles.rma, col=cols) #The first step of the analysis is to filter non-informative data >celfiles.filtered <- nsFilter(celfiles.rma, require.entrez=FALSE, remove.dupEntrez=FALSE) #Find differentially expressed probes >samples <- celfiles.rma$Target #Convert into factors >samples <- as.factor(samples) # Check factors have been assigned >samples #Set up the experimental design >design <- model.matrix(~0 + samples) >colnames(design) <- c("disease", "control") #Inspect the experiment design >design #At this point we have normalized filtered data and a description of the data and the samples and the experimental design. #Fit the linear model to the filtered expression set >fit <- lmFit(exprs(celfiles.filtered$eset), design) #Set up a contrast matrix to compare disease v control >contrast.matrix <- makeContrasts(**disease-control,** levels=design) #Now, the contrast matrix is combined with the linear fit model by set of probes. >fit.con <- contrasts.fit(fit, contrast.matrix) >fit.eb <- eBayes(fit.con) >nrow(topTable(fit.eb, coef=1, number=50000, **lfc=2**)) >probeset.list <- topTable(fit.eb, coef=1, number=10000, lfc=2) #Annotating the results with associated gene symbols >gene.symbols <- getSYMBOL(rownames(probeset.list), "hgu133plus2") >results <- cbind(probeset.list, gene.symbols) >head(results) #To save results > write.table(results, "results.txt", sep="\t", quote=FALSE)