Can Limma package be used for RNA-Seq Differential gene expression analysis for Quantile normalized log transformed RPKM data
1
3
Entering edit mode
6.6 years ago
David_emir ▴ 390

Hello All,

I have retrieved the data matrix from TCGA breast invasive carcinoma (BRCA) – expression data, The data is Level_3 Data (file names: *.rsem.genes.normalized_results) downloaded from TCGA DCC, log2(x+1) transformed, and processed data

I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.

Sample data is as follows:

<caption>Normalised data matrix for 113 Normal Vs 60 Breast cancer patients</caption>
 Gene_ID
 TCGA-A7-A0CE-11
 TCGA-BH-A0AY-11
 ARHGEF10L
 9.998
 9.768
 HIF3A
 6.1594
 7.7241
 RNF17
 4.1813
0
 RNF10
 11.7961
 11.7325

I have quantile normalized log2-transformed RPKM data from TCGA and I wanted to find DE genes based on B statistic and log2foldchange.

I'm using the following R-script:

library(limma)
attach(raw.data)
names(raw.data)
d <- raw.data[, 2:5]
rownames(d) <- raw.data[, 1]
##To design matrix---
Group<-factor(pheno$Status,levels=levels(pheno$Status))
design<-model.matrix(~0+Group)
###Assigning colnames###
colnames(design)<-c("Control", "Tumor")
###To assign the designed matrix in linear model using limma
fit <-lmFit(eset,design)
###Designing contrast matrix#
cont.wt<-makeContrasts("Tumor-Control",levels=design)
fit2 <-contrasts.fit(fit,cont.wt)
## eBayes step for calculating p-values, fold change etc###
fit3<-eBayes(fit2)

Example of my Pheno File:

 Sample Status pheno TCGA-A1-A0SN-01 Tumor 1 TCGA-A8-A08S-01 Tumor 1 TCGA-AC-A23C-01 Tumor 1 TCGA-AQ-A0Y5-01 Tumor 1 TCGA-AQ-A1H2-01 Tumor 1 TCGA-A7-A0CE-11 Control 0 TCGA-BH-A0AY-11 Control 0 TCGA-A7-A0CH-11 Control 0 TCGA-BH-A0BW-11 Control 0 TCGA-BH-A208-11 Control 0
> sessionInfo()
R version 3.2.0 beta (2015-04-01 r68134) Platform: x86_64-unknown-linux-gnu (64-bit) Running under: CentOS release 6.6 (Final)
locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel  stats     graphics  grDevices utils     datasets  methods[8] base

other attached packages:[1] limma_3.24.3        affy_1.46.0         Biobase_2.28.0 [4] BiocGenerics_0.14.0

loaded via a namespace (and not attached):[1] zlibbioc_1.14.0       BiocInstaller_1.18.1  tools_3.2.0 [4] affyio_1.36.0   preprocessCore_1.30.0

I am getting the following output:

 GENE_SYMBOL
 coefficients
 df.residual
 sigma
 stdev.unscaled
 Amean
 s2.post
 t
 df.total
 p.value
lods
 F
 F.p.value
 ARHGEF10L
 -0.60872
171
 0.495811
 0.159738
 9.65395
 0.244892
 -7.70056
 172.058
 1.02e-12
 17.8643
 59.2986
 1.02e-12
 HIF3A
 -5.72748
171 0.495810737 0.159737986 9.65395 0.244892
 -7.70056
172.05824 1.02E-12 17.86426 59.29856 1.02E-12

My Concern/Question:

1. Can someone please let me know what I'm doing is correct procedure for finding out Differential gene expression analysis for two Groups (control Vs Normal in this case) or where i am going wrong here.

2. I am bit confused about the Out put here, i am failing to sort data. Based on what stats should i take cut off for genes which are differentially Expressed (please have a look at the output, how to sort the data?).

Thanks a Ton for your kind help

-Ateeq Khaliq

RNA-Seq limma TCGA • 9.6k views
0
Entering edit mode

I am aware that this thread's been dormant for nearly 2 years but I've been dropped into something of a similar nature and found it interesting. In their user guide the authors of the limma package say: "In the limma approach to RNA-seq, read counts are converted to log2-counts-per-million (logCPM) and the mean-variance relationship is modelled either with precision weights or with an empirical Bayes prior trend. The precision weights approach is called “voom” and the prior trend approach is called “limma-trend”. " It would be my understanding that RSEM values could be used as well since they're based on log-normalised count values - or am I mistaken?

1
Entering edit mode
6.6 years ago

Doesn't TCGA give expected counts? That'd be preferable to RPKMs. Just preprocess them with voom().

0
Entering edit mode

Thanks Devon, can you please Elobrate more on voom(), i never used it, at what point should i use voom()? can preprocessed data be used for this?

Edit: I forgot to mention, i have only this matrix file with me, can you please tell me where will i get the count file?

Is it not possible to conduct DGE with this type of Data?

sorry , i may sound stupid, but i am learning.Thanks for help ...

-Ateeq

2
Entering edit mode

voom() comes with limma and is intended to preprocess RNAseq data. Ideally, you would feed it unnormalized data and just have it do the quantile normalization. Don't use RPKMs as the input, estimated counts would be preferable (rsem produces them and since TCGA is using RSEM I assume they're available). For more on voom, see the limma vignette.

0
Entering edit mode

Thanks Devon,

I'm trying to understand the method of differential expression analysis described in : * * *Stem cell transcriptome profiling via massive-scale mRNA sequencing* *Nicole Cloonan et al* *NATURE METhODS | VOL.5 NO.7 | JULY 2008 | 613* http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1223.html

*Section: Statistical Analysis * To calculate differential expression of SQRL tag data we analyzed the normalized gene signals (tags per Refseq transcript, length- normalized) for each library using the Limma package in R. After Quantile normalization, we used Limma to fit a linear model to the log2-transformed data using an empirical Bayes method32 to moderate standard errors. Differentially expressed genes were defined as those with a B statistic > zero.

----

I will try getting unnormalised data...

-Ateeq

1
Entering edit mode

Their method is just another way of writing, "we used limma/voom with quantile normalization." Of course what I wrote sounds less impressive than if one writes out the actual steps for using limma (log2 transformed data, empirical Bayes, etc.).

BTW, don't use B statistic >0, there's a reason that topTable gives adjusted p-values.

0
Entering edit mode

Thanks a Lot Devon, Its a great help.

One more hiccupp, i am not able to locate the unnormalised data in TCGA. can you help me with this ?

1
Entering edit mode

I don't normally need to deal with TCGA's datasets, I just have a passing familiarity with them. Hopefully others can help you there. Obviously if all you can get are RPKMs then that's what you're stuck with, but one would certainly hope that that's not the case.

0
Entering edit mode

Thanks a lot Devon, hopefully i will try to get the non normalised data ...

1
Entering edit mode

When you download RNASeqV2 data from TCGA, you should get a set of directories like RNASeqV2/UNC__IlluminaHiSeq_RNASeqV2/Level_3. In the Level_3 directory, you'll have six files for each sample. They are: junction_quantification.txt, rsem.genes.results, rsem.isoforms.results, rsem.genes.normalized_results, rsem.isoforms.normalized_results and bt.exon_quantification.txt files. The second kind (rsem.genes.results) are the non normalized files.