Can Limma package be used for RNA-Seq Differential gene expression analysis for Quantile normalized log transformed RPKM data
1
5
Entering edit mode
8.9 years ago
David_emir ▴ 490

Hello All,

I have retrieved the data matrix from TCGA breast invasive carcinoma (BRCA) - expression data, The data is Level_3 Data (file names: *.rsem.genes.normalized_results) downloaded from TCGA DCC, log2(x+1) transformed, and processed data

I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.

Sample data is as follows:

Normalised data matrix for 113 Normal Vs 60 Breast cancer patients

Gene_ID    TCGA-A7-A0CE-11  TCGA-BH-A0AY-11
ARHGEF10L  9.998            9.768
HIF3A      6.1594           7.7241
RNF17      4.1813           0
RNF10      11.7961          11.7325

I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.

I'm using the following R-script:

library(limma)
raw.data <- read.delim("INPUT-QuantileNormalizedLog2Transformed-RPKM-Data.txt")
attach(raw.data)
names(raw.data)
d <- raw.data[, 2:5]
rownames(d) <- raw.data[, 1]
##To design matrix---
Group<-factor(pheno$Status,levels=levels(pheno$Status))
design<-model.matrix(~0+Group)  
###Assigning colnames###
colnames(design)<-c("Control", "Tumor")
###To assign the designed matrix in linear model using limma
fit <-lmFit(eset,design)
###Designing contrast matrix#
cont.wt<-makeContrasts("Tumor-Control",levels=design)
fit2 <-contrasts.fit(fit,cont.wt)
## eBayes step for calculating p-values, fold change etc###
fit3<-eBayes(fit2)

Example of my Pheno File:

Sample            Status    pheno
TCGA-A1-A0SN-01   Tumor     1
TCGA-A8-A08S-01   Tumor     1
TCGA-AC-A23C-01   Tumor     1
TCGA-AQ-A0Y5-01   Tumor     1
TCGA-AQ-A1H2-01   Tumor     1
TCGA-A7-A0CE-11   Control   0
TCGA-BH-A0AY-11   Control   0
TCGA-A7-A0CH-11   Control   0
TCGA-BH-A0BW-11   Control   0
TCGA-BH-A208-11   Control   0
sessionInfo() 
R version 3.2.0 beta (2015-04-01 r68134) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.6 (Final) 
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:

 [1] parallel  stats     graphics  grDevices utils     datasets  methods
 [8] base

other attached packages:

 [1] limma_3.24.3        affy_1.46.0         Biobase_2.28.0
 [4] BiocGenerics_0.14.0

loaded via a namespace (and not attached):

 [1] zlibbioc_1.14.0       BiocInstaller_1.18.1  tools_3.2.0
 [4] affyio_1.36.0   preprocessCore_1.30.0

I am getting the following output:

GENE_SYMBOL  coefficients  df.residual  sigma         stdev.unscaled  Amean     s2.post    t         df.total    p.value    lods       F          F.p.value
ARHGEF10L    -0.608719705  171          0.495810737   0.159737986     9.65395   0.244892   -7.70056  172.05824   1.02E-12   17.86426   59.29856   1.02E-12
HIF3A        -5.727480133  171          0.495810737   0.159737986     9.65395   0.244892   -7.70056  172.05824   1.02E-12   17.86426   59.29856   1.02E-12

My Concern/Question:

  1. Can someone please let me know what I'm doing is correct procedure for finding out Differential gene expression analysis for two Groups (control Vs Normal in this case) or where I am going wrong here.
  2. I am bit confused about the Out put here, I am failing to sort data. Based on what stats should I take cut off for genes which are differentially Expressed (please have a look at the output, how to sort the data?).

Thanks a ton for your kind help

-Ateeq Khaliq

RNA-Seq limma TCGA • 11k views
ADD COMMENT
0
Entering edit mode

I am aware that this thread's been dormant for nearly 2 years but I've been dropped into something of a similar nature and found it interesting. In their user guide the authors of the limma package say: "In the limma approach to RNA-seq, read counts are converted to log2-counts-per-million (logCPM) and the mean-variance relationship is modelled either with precision weights or with an empirical Bayes prior trend. The precision weights approach is called “voom” and the prior trend approach is called “limma-trend”. " It would be my understanding that RSEM values could be used as well since they're based on log-normalised count values - or am I mistaken?

ADD REPLY
1
Entering edit mode
8.9 years ago

Doesn't TCGA give expected counts? That'd be preferable to RPKMs. Just preprocess them with voom().

ADD COMMENT
0
Entering edit mode

Thanks Devon, can you please Elobrate more on voom(), I never used it, at what point should I use voom()? Can preprocessed data be used for this?

Edit: I forgot to mention, I have only this matrix file with me, can you please tell me where will I get the count file?

Is it not possible to conduct DGE with this type of Data?

Sorry, I may sound stupid, but I am learning. Thanks for help ...

-Ateeq

ADD REPLY
2
Entering edit mode

voom() comes with limma and is intended to preprocess RNAseq data. Ideally, you would feed it unnormalized data and just have it do the quantile normalization. Don't use RPKMs as the input, estimated counts would be preferable (rsem produces them and since TCGA is using RSEM I assume they're available). For more on voom, see the limma vignette.

ADD REPLY
0
Entering edit mode

Thanks Devon,

I'm trying to understand the method of differential expression analysis described in: Stem cell transcriptome profiling via massive-scale mRNA sequencing Nicole Cloonan et al NATURE METhODS | VOL.5 NO.7 | JULY 2008 | 613 http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1223.html

Section: Statistical Analysis To calculate differential expression of SQRL tag data we analyzed the normalized gene signals (tags per Refseq transcript, length- normalized) for each library using the Limma package in R. After Quantile normalization, we used Limma to fit a linear model to the log2-transformed data using an empirical Bayes method32 to moderate standard errors. Differentially expressed genes were defined as those with a B statistic > zero.


I will try getting unnormalised data.

-Ateeq

ADD REPLY
1
Entering edit mode

Their method is just another way of writing, "we used limma/voom with quantile normalization." Of course what I wrote sounds less impressive than if one writes out the actual steps for using limma (log2 transformed data, empirical Bayes, etc.).

BTW, don't use B statistic >0, there's a reason that topTable gives adjusted p-values.

ADD REPLY
0
Entering edit mode

Thanks a Lot Devon, Its a great help.

One more hiccup, I am not able to locate the unnormalised data in TCGA. can you help me with this?

ADD REPLY
1
Entering edit mode

I don't normally need to deal with TCGA's datasets, I just have a passing familiarity with them. Hopefully others can help you there. Obviously if all you can get are RPKMs then that's what you're stuck with, but one would certainly hope that that's not the case.

ADD REPLY
0
Entering edit mode

Thanks a lot Devon, hopefully I will try to get the non normalised data ...

ADD REPLY
1
Entering edit mode

When you download RNASeqV2 data from TCGA, you should get a set of directories like RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3. In the Level_3 directory, you'll have six files for each sample. They are: junction_quantification.txt, rsem.genes.results, rsem.isoforms.results, rsem.genes.normalized_results, rsem.isoforms.normalized_results and bt.exon_quantification.txt files. The second kind (rsem.genes.results) are the non normalized files.

ADD REPLY

Login before adding your answer.

Traffic: 2644 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6