Question: Can Limma package be used for RNA-Seq Differential gene expression analysis for Quantile normalized log transformed RPKM data
3
gravatar for David_emir
3.8 years ago by
David_emir310
India
David_emir310 wrote:

Hello All,

I have retrieved the data matrix from TCGA breast invasive carcinoma (BRCA) – expression data, The data is Level_3 Data (file names: *.rsem.genes.normalized_results) downloaded from TCGA DCC, log2(x+1) transformed, and processed data

I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.

Sample data is as follows:

<caption>Normalised data matrix for 113 Normal Vs 60 Breast cancer patients</caption>
Gene_ID
TCGA-A7-A0CE-11
TCGA-BH-A0AY-11
ARHGEF10L
9.998
9.768
HIF3A
6.1594
7.7241
RNF17
4.1813
0
RNF10
11.7961
11.7325

I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.

I'm using the following R-script:

library(limma)
raw.data <- read.delim("INPUT-QuantileNormalizedLog2Transformed-RPKM-Data.txt")
attach(raw.data)
names(raw.data)
d <- raw.data[, 2:5]
rownames(d) <- raw.data[, 1]
##To design matrix---
Group<-factor(pheno$Status,levels=levels(pheno$Status))
design<-model.matrix(~0+Group)  
###Assigning colnames###
colnames(design)<-c("Control", "Tumor")
###To assign the designed matrix in linear model using limma
fit <-lmFit(eset,design)
###Designing contrast matrix#
cont.wt<-makeContrasts("Tumor-Control",levels=design)
fit2 <-contrasts.fit(fit,cont.wt)
## eBayes step for calculating p-values, fold change etc###
fit3<-eBayes(fit2)

Example of my Pheno File:

Sample Status pheno
TCGA-A1-A0SN-01 Tumor 1
TCGA-A8-A08S-01 Tumor 1
TCGA-AC-A23C-01 Tumor 1
TCGA-AQ-A0Y5-01 Tumor 1
TCGA-AQ-A1H2-01 Tumor 1
TCGA-A7-A0CE-11 Control 0
TCGA-BH-A0AY-11 Control 0
TCGA-A7-A0CH-11 Control 0
TCGA-BH-A0BW-11 Control 0
TCGA-BH-A208-11 Control 0

 

   
> sessionInfo() 
R version 3.2.0 beta (2015-04-01 r68134) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.6 (Final) 
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages: [1] parallel  stats     graphics  grDevices utils     datasets  methods[8] base     

other attached packages:[1] limma_3.24.3        affy_1.46.0         Biobase_2.28.0 [4] BiocGenerics_0.14.0

loaded via a namespace (and not attached):[1] zlibbioc_1.14.0       BiocInstaller_1.18.1  tools_3.2.0 [4] affyio_1.36.0   preprocessCore_1.30.0

I am getting the following output:

GENE_SYMBOL
coefficients
df.residual
sigma
stdev.unscaled
Amean
s2.post
t
df.total
p.value
lods
F
F.p.value
ARHGEF10L
-0.608719705
171
0.495810737
0.159737986
9.65395
0.244892
-7.70056
172.05824
1.02E-12
17.86426
59.29856
1.02E-12
HIF3A
-5.727480133
171 0.495810737 0.159737986 9.65395 0.244892
-7.70056
172.05824 1.02E-12 17.86426 59.29856 1.02E-12

 

My Concern/Question:

1. Can someone please let me know what I'm doing is correct procedure for finding out Differential gene expression analysis for two Groups (control Vs Normal in this case) or where i am going wrong here.

2. I am bit confused about the Out put here, i am failing to sort data. Based on what stats should i take cut off for genes which are differentially Expressed (please have a look at the output, how to sort the data?).

Thanks a Ton for your kind help

-Ateeq Khaliq

rna-seq limma tcga • 7.1k views
ADD COMMENTlink modified 22 months ago by biomiha10 • written 3.8 years ago by David_emir310

I am aware that this thread's been dormant for nearly 2 years but I've been dropped into something of a similar nature and found it interesting. In their user guide the authors of the limma package say: "In the limma approach to RNA-seq, read counts are converted to log2-counts-per-million (logCPM) and the mean-variance relationship is modelled either with precision weights or with an empirical Bayes prior trend. The precision weights approach is called “voom” and the prior trend approach is called “limma-trend”. " It would be my understanding that RSEM values could be used as well since they're based on log-normalised count values - or am I mistaken?

ADD REPLYlink written 22 months ago by biomiha10
1
gravatar for Devon Ryan
3.8 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

Doesn't TCGA give expected counts? That'd be preferable to RPKMs. Just preprocess them with voom().

ADD COMMENTlink written 3.8 years ago by Devon Ryan88k

Thanks Devon, can you please Elobrate more on voom(), i never used it, at what point should i use voom()? can preprocessed data be used for this? 

Edit: I forgot to mention, i have only this matrix file with me, can you please tell me where will i get the count file?

Is it not possible to conduct DGE with this type of Data?

sorry , i may sound stupid, but i am learning.Thanks for help ...

-Ateeq

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by David_emir310
2

voom() comes with limma and is intended to preprocess RNAseq data. Ideally, you would feed it unnormalized data and just have it do the quantile normalization. Don't use RPKMs as the input, estimated counts would be preferable (rsem produces them and since TCGA is using RSEM I assume they're available). For more on voom, see the limma vignette.

ADD REPLYlink written 3.8 years ago by Devon Ryan88k

Thanks Devon, 

I'm trying to understand the method of differential expression analysis described in : * * *Stem cell transcriptome profiling via massive-scale mRNA sequencing* *Nicole Cloonan et al* *NATURE METhODS | VOL.5 NO.7 | JULY 2008 | 613* http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1223.html

*Section: Statistical Analysis * To calculate differential expression of SQRL tag data we analyzed the normalized gene signals (tags per Refseq transcript, length- normalized) for each library using the Limma package in R. After Quantile normalization, we used Limma to fit a linear model to the log2-transformed data using an empirical Bayes method32 to moderate standard errors. Differentially expressed genes were defined as those with a B statistic > zero.

----

I will try getting unnormalised data...

-Ateeq

ADD REPLYlink written 3.8 years ago by David_emir310
1

Their method is just another way of writing, "we used limma/voom with quantile normalization." Of course what I wrote sounds less impressive than if one writes out the actual steps for using limma (log2 transformed data, empirical Bayes, etc.).

BTW, don't use B statistic >0, there's a reason that topTable gives adjusted p-values.

ADD REPLYlink written 3.8 years ago by Devon Ryan88k

Thanks a Lot Devon, Its a great help. 

One more hiccupp, i am not able to locate the unnormalised data in TCGA. can you help me with this ?

ADD REPLYlink written 3.8 years ago by David_emir310
1

I don't normally need to deal with TCGA's datasets, I just have a passing familiarity with them. Hopefully others can help you there. Obviously if all you can get are RPKMs then that's what you're stuck with, but one would certainly hope that that's not the case.

ADD REPLYlink written 3.8 years ago by Devon Ryan88k

Thanks a lot Devon, hopefully i will try to get the non normalised data ...

ADD REPLYlink written 3.8 years ago by David_emir310
1

When you download RNASeqV2 data from TCGA, you should get a set of directories like RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3. In the Level_3 directory, you'll have six files for each sample. They are: junction_quantification.txt, rsem.genes.results, rsem.isoforms.results, rsem.genes.normalized_results, rsem.isoforms.normalized_results and bt.exon_quantification.txt files. The second kind (rsem.genes.results) are the non normalized files.

ADD REPLYlink written 3.8 years ago by blakeoft10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1150 users visited in the last hour