Question

How can I improve my R code?

0

Entering edit mode

7.7 years ago

Leite ★ 1.3k

Hi everyone, this is an R code I've been studying based on this tutorial.

This tutorial of mine has helped a lot, but I want to learn more about expression analysis using R.

My doubts are regarding these codes:

nrow(topTable(fit.eb, coef=1, number=10000, lfc=2))

What the lfc=2 means and how to use it right?

contrast.matrix <- makeContrasts(disease-control, levels=design)

In this part, I've seen it just like that,disease-control, , but also so: disease-control, control-disease

What would be the difference between the two ways of using it?

How do I adjust the p-value and FDR in this code? Can I do this manually in the end?
Some genes get the symbol "NA", how to solve this problem?
How to make this code better? I appreciate all the tips

Best regards, Leite

#Install the core bioconductor packages, if not already installed
>source("http://bioconductor.org/biocLite.R")
>
>biocLite()

#Install additional bioconductor libraries, if not already installed
>biocLite("GEOquery")
>biocLite("affy")
>biocLite("gcrma")
>biocLite("affyPLM")
>biocLite("RColorBrewer")
>biocLite("limma")
>biocLite("hgu133plus2.db")
>biocLite("annotate")
>biocLite("simpleaffy")

#Load the necessary libraries
>library(GEOquery)
>library(affy)
>library(gcrma)
>library(affyPLM)
>library(RColorBrewer)
>library(hgu133plus2.db)
>library(annotate)
>library(simpleaffy)

#Downloading the file
>getGEOSuppFiles("**MyData**")

#Before starting to work, you have to unzip the files.
>untar("**MyData**/**MyData**_RAW.tar", exdir="data")
>cels <- list.files("data/", pattern = "[gz]")
>sapply(paste("data", cels, sep="/"), gunzip)
>cels

#Loading Files
#Phenodata must have names have to be the same as file names
>celfiles <- read.affy(covdesc="phenodata.txt", path="data")

#Normalising the data
>celfiles.rma <- rma(celfiles)

#Checking the quality
#Load colour libraries
>library(RColorBrewer)

#Set colour palette
>cols <- brewer.pal(8, "Set1")

#Plot a boxplot of unnormalised intensity values
>boxplot(celfiles, col=cols)

#Now a normalised boxplot
>boxplot(celfiles.rma, col=cols)

#Plot a density vs log intensity histogram for the unnormalised data
>hist(celfiles, col=cols)

#Plot a density vs log intensity histogram for the normalised data
>hist(celfiles.rma, col=cols)

#The first step of the analysis is to filter non-informative data
>celfiles.filtered <- nsFilter(celfiles.rma, require.entrez=FALSE, remove.dupEntrez=FALSE)

#Find differentially expressed probes
>samples <- celfiles.rma$Target

#Convert into factors
>samples <- as.factor(samples)

# Check factors have been assigned
>samples

#Set up the experimental design
>design <- model.matrix(~0 + samples)
>colnames(design) <- c("disease", "control")

#Inspect the experiment design
>design

#At this point we have normalized filtered data and a description of the data and the samples and the experimental design.
#Fit the linear model to the filtered expression set
>fit <- lmFit(exprs(celfiles.filtered$eset), design)

#Set up a contrast matrix to compare disease v control
>contrast.matrix <- makeContrasts(**disease-control,** levels=design)

#Now, the contrast matrix is combined with the linear fit model by set of probes.
>fit.con <- contrasts.fit(fit, contrast.matrix)
>fit.eb <- eBayes(fit.con)
>nrow(topTable(fit.eb, coef=1, number=50000, **lfc=2**))
>probeset.list <- topTable(fit.eb, coef=1, number=10000, lfc=2)

#Annotating the results with associated gene symbols
>gene.symbols <- getSYMBOL(rownames(probeset.list), "hgu133plus2")
>results <- cbind(probeset.list, gene.symbols)
>head(results)

#To save results
> write.table(results, "results.txt", sep="\t", quote=FALSE)

RNA-Seq R gene rna-seq • 2.7k views

ADD COMMENT • link updated 7.7 years ago by Devon Ryan 105k • written 7.7 years ago by Leite ★ 1.3k

score 5 · Accepted Answer · 2017-11-01

lfc=2 sets a threshold where the results output by topTable() will be filtered such that only rows with a log fold-change greater than 2 are returned. Note that this is a fold change in either direction, so > 2 and < -2. Since this is log2, that corresponds to a fold-change of at least 4 in either direction, which is rather large.
The order in the contrasts will end up affecting the sign of the fold changes. If there's a 2x difference in disease vs. control then a contrast of disease - control will result in a fold-change of 1, whereas control - disease will result in a fold-change of -1. If you've actually seen disease - control, control - disease specified, then that's just demonstrating that you can give multiple contrasts to makeContrasts(), where commas are used as separators. You would then specify which one to access in topTable(). Note that there's no point to using something like a - b, b - a as contrasts, since the two are identical with signs reversed.
topTable() already adjusts the p-values, it's the adj.P.values column. You're always free to use p.adjust() yourself on the unadjusted p-values, of course (you'll get the same values).
Some of your probes don't have genes associated with them. You might want to filter these out.
A cursory look doesn't reveal anything obviously wrong with the code.