Question: How can I improve my R code?
0
gravatar for Leite
2.0 years ago by
Leite890
São Paulo - Brazil - Unifesp
Leite890 wrote:

Hi everyone, this is an R code I've been studying based on this tutorial.

This tutorial of mine has helped a lot, but I want to learn more about expression analysis using R.

My doubts are regarding these codes:

  1. nrow(topTable(fit.eb, coef=1, number=10000, lfc=2))

What the lfc=2 means and how to use it right?

  1. contrast.matrix <- makeContrasts(disease-control, levels=design)

In this part, I've seen it just like that,disease-control, , but also so: disease-control, control-disease

What would be the difference between the two ways of using it?

  1. How do I adjust the p-value and FDR in this code? Can I do this manually in the end?
  2. Some genes get the symbol "NA", how to solve this problem?
  3. How to make this code better? I appreciate all the tips

Best regards, Leite

#Install the core bioconductor packages, if not already installed
>source("http://bioconductor.org/biocLite.R")
>
>biocLite()

#Install additional bioconductor libraries, if not already installed
>biocLite("GEOquery")
>biocLite("affy")
>biocLite("gcrma")
>biocLite("affyPLM")
>biocLite("RColorBrewer")
>biocLite("limma")
>biocLite("hgu133plus2.db")
>biocLite("annotate")
>biocLite("simpleaffy")

#Load the necessary libraries
>library(GEOquery)
>library(affy)
>library(gcrma)
>library(affyPLM)
>library(RColorBrewer)
>library(hgu133plus2.db)
>library(annotate)
>library(simpleaffy)

#Downloading the file
>getGEOSuppFiles("**MyData**")

#Before starting to work, you have to unzip the files.
>untar("**MyData**/**MyData**_RAW.tar", exdir="data")
>cels <- list.files("data/", pattern = "[gz]")
>sapply(paste("data", cels, sep="/"), gunzip)
>cels

#Loading Files
#Phenodata must have names have to be the same as file names
>celfiles <- read.affy(covdesc="phenodata.txt", path="data")

#Normalising the data
>celfiles.rma <- rma(celfiles)

#Checking the quality
#Load colour libraries
>library(RColorBrewer)

#Set colour palette
>cols <- brewer.pal(8, "Set1")

#Plot a boxplot of unnormalised intensity values
>boxplot(celfiles, col=cols)

#Now a normalised boxplot
>boxplot(celfiles.rma, col=cols)

#Plot a density vs log intensity histogram for the unnormalised data
>hist(celfiles, col=cols)

#Plot a density vs log intensity histogram for the normalised data
>hist(celfiles.rma, col=cols)

#The first step of the analysis is to filter non-informative data
>celfiles.filtered <- nsFilter(celfiles.rma, require.entrez=FALSE, remove.dupEntrez=FALSE)

#Find differentially expressed probes
>samples <- celfiles.rma$Target

#Convert into factors
>samples <- as.factor(samples)

# Check factors have been assigned
>samples

#Set up the experimental design
>design <- model.matrix(~0 + samples)
>colnames(design) <- c("disease", "control")

#Inspect the experiment design
>design

#At this point we have normalized filtered data and a description of the data and the samples and the experimental design.
#Fit the linear model to the filtered expression set
>fit <- lmFit(exprs(celfiles.filtered$eset), design)

#Set up a contrast matrix to compare disease v control
>contrast.matrix <- makeContrasts(**disease-control,** levels=design)

#Now, the contrast matrix is combined with the linear fit model by set of probes.
>fit.con <- contrasts.fit(fit, contrast.matrix)
>fit.eb <- eBayes(fit.con)
>nrow(topTable(fit.eb, coef=1, number=50000, **lfc=2**))
>probeset.list <- topTable(fit.eb, coef=1, number=10000, lfc=2)

#Annotating the results with associated gene symbols
>gene.symbols <- getSYMBOL(rownames(probeset.list), "hgu133plus2")
>results <- cbind(probeset.list, gene.symbols)
>head(results)

#To save results
> write.table(results, "results.txt", sep="\t", quote=FALSE)
rna-seq R gene • 1.2k views
ADD COMMENTlink modified 2.0 years ago by Devon Ryan92k • written 2.0 years ago by Leite890
4
gravatar for Devon Ryan
2.0 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:
  1. lfc=2 sets a threshold where the results output by topTable() will be filtered such that only rows with a log fold-change greater than 2 are returned. Note that this is a fold change in either direction, so > 2 and < -2. Since this is log2, that corresponds to a fold-change of at least 4 in either direction, which is rather large.
  2. The order in the contrasts will end up affecting the sign of the fold changes. If there's a 2x difference in disease vs. control then a contrast of disease - control will result in a fold-change of 1, whereas control - disease will result in a fold-change of -1. If you've actually seen disease - control, control - disease specified, then that's just demonstrating that you can give multiple contrasts to makeContrasts(), where commas are used as separators. You would then specify which one to access in topTable(). Note that there's no point to using something like a - b, b - a as contrasts, since the two are identical with signs reversed.
  3. topTable() already adjusts the p-values, it's the adj.P.values column. You're always free to use p.adjust() yourself on the unadjusted p-values, of course (you'll get the same values).
  4. Some of your probes don't have genes associated with them. You might want to filter these out.
  5. A cursory look doesn't reveal anything obviously wrong with the code.
ADD COMMENTlink written 2.0 years ago by Devon Ryan92k
1

Our posts crossed by 1 minute Devon. Great to see that our answers agree well though :)

ADD REPLYlink written 2.0 years ago by Kevin Blighe51k
1

Great minds think alike, as they say.

ADD REPLYlink written 2.0 years ago by Devon Ryan92k
1

Dear Kevin and Dear Devon, thank you very much for the information, this helped me a lot to understand the code.

ADD REPLYlink written 2.0 years ago by Leite890

You're welcome, but if my answer also assisted you, you should accept that too! Our answers are virtually the same!

ADD REPLYlink written 2.0 years ago by Kevin Blighe51k

Thanks Leite. This is to help other users know which answers were helpful. Many of these questions/answers will pop up again in search engines long into the future.

ADD REPLYlink written 2.0 years ago by Kevin Blighe51k
3
gravatar for Kevin Blighe
2.0 years ago by
Kevin Blighe51k
Kevin Blighe51k wrote:

Good evening / morning,

nrow(topTable(fit.eb, coef=1, number=10000, lfc=2))

What the lfc=2 means and how to use it right?

It is the absolute log (base 2) fold change difference that will be used by toptable for the purposes of filtering. Note that for any R function, you can bring up the help page by just writing, for example, ?toptable. Here is what it says for 'lfc':

lfc: minimum absolute log2-fold-change required. ‘topTable’ and ‘topTableF’ include only genes with (at least one) absolute log-fold-changes greater than ‘lfc’. ‘topTreat’ does not remove genes but ranks genes by evidence that their log-fold-change exceeds ‘lfc’.

contrast.matrix <- makeContrasts(disease-control, levels=design)

In this part, I've seen it just like that,disease-control, , but also so: disease-control, control-disease

What would be the difference between the two ways of using it?

This determines the nominator and denominator used for calculating the fold changes. The P values from these will be the same. If we have:

disease=5 ;

control=2.5 ;

'disease-control' would return a fold change of 2.0 (or +1 when logged)

'control-disease' would return a fold change of 0.5 (or -1 when logged).

How do I adjust the p-value and FDR in this code? Can I do this manually in the end?

You do not have to do it manually. Just add the 'adjust.method' parameter to toptable. This will then adjust the values for you. Options include: "none", "BH", "BY", and ‘"holm"’.

NB - if you don't set this parameter, then toptable automatically adjusts the values by BH (Benjamini-Hochberg). If you just want nominal (undjusted) P values, set it to none.

NB - when you produce adjusted P values, the other parameter 'p.value' is then assumed to be your adjusted P value cutoff (similar idea to the lfc parameter). If you select 'none' for 'adjust.method', then 'p.value' relates to the nominal P value.

Some genes get the symbol "NA", how to solve this problem?

Check again your probeset.list object. The names may be stored in the first column and not the rownames.

How to make this code better?

We just did ;)

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Kevin Blighe51k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2123 users visited in the last hour