Question

Different results with edgeR when adding gene symbols

0

Entering edit mode

7.0 years ago

urmi208 ▴ 30

Hello,

I am trying to get the gene names along with the Ensembl gene ids in the final edgeR output. The problem is: I am seeing different results when I add the gene symbols and when I don't. Here is example code without gene symbols

library(edgeR)
exp2 <-read.table('exp2',header=TRUE)
y <- readDGE(exp2,labels=exp2$sample,group=exp2$treatment)
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
bcv=0.4
et_D0_D1 <- exactTest(y, pair=c("D0","D1"),dispersion=bcv^2)
write.table(topTags(et_D0_D1,n=Inf),file="test1.txt",col.names=TRUE,quote=FALSE)

Result is: ENSGACG00000003380 1.00049178363978 4.49563701685595 0.234971858102264 1

Here is the one with gene symbols added using biomart:

library(edgeR)
library(biomaRt)
exp2 <-read.table('exp2',header=TRUE)
y <- readDGE(exp2,labels=exp2$sample,group=exp2$treatment)
geneid <-rownames(y)
ga82 <-  useEnsembl(biomart="ensembl",dataset="gaculeatus_gene_ensembl",version=82)
genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),filters='ensembl_gene_id',values=geneid,mart=ga82)
y$genes <- genes
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
bcv=0.4
et_D0_D1 <- exactTest(y, pair=c("D0","D1"),dispersion=bcv^2)
write.table(topTags(et_D0_D1,n=Inf),file="test2.txt",col.names=TRUE,quote=FALSE)

Here the result is: ENSGACG00000003380 NA -0.135739707808875 2.95905945166089 0.874200189025877 1

SessionInfo() output is given below:

R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] biomaRt_2.24.1 edgeR_3.10.4   limma_3.24.15

loaded via a namespace (and not attached):
 [1] IRanges_2.2.9        DBI_0.3.1            parallel_3.2.1
 [4] RCurl_1.95-4.7       Biobase_2.28.0       AnnotationDbi_1.30.1
 [7] RSQLite_1.0.0        S4Vectors_0.6.6      BiocGenerics_0.14.0
[10] GenomeInfoDb_1.4.3   stats4_3.2.1         bitops_1.0-6
[13] XML_3.98-1.3

I am bit puzzled. Why would addition of gene names in y$genes above change the results. Any help you could provide will be very helpful. Many thanks in advance.

RNA-Seq edgeR biomart • 4.0k views

ADD COMMENT • link updated 7.0 years ago by John Ma ▴ 310 • written 7.0 years ago by urmi208 ▴ 30

0

Entering edit mode

how many rows do you have before and after adding the gene names?

ADD REPLY • link 7.0 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Thanks for your reply. I have 22460 rows both before and after adding the gene names

ADD REPLY • link 7.0 years ago by urmi208 ▴ 30

0

Entering edit mode

please also post top 5-10 results for before and after cases.

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

0

Entering edit mode

Before adding gene names:

ensembl_gene_id logFC logCPM PValue FDR
ENSGACG00000007157 3.24915717386252 1.07342942182991 0.000618728695167078 1
ENSGACG00000013912 2.3468671394398 0.373720501233804 0.0113007333301586 1
ENSGACG00000001321 2.21861945153189 0.972005925991169 0.0131377136373496 1
ENSGACG00000020320 -2.01587102395319 2.1704178289833 0.0207851586047353 1

After adding genenames:

ensembl_gene_id external_gene_name logFC logCPM PValue FDR
ENSGACG00000007401 si:dkey-34d22.1 3.24915717386252 1.07342942182991 0.000618728695167078 1
ENSGACG00000014287  2.3468671394398 0.373720501233804 0.0113007333301586 1
ENSGACG00000001351 ttf2 2.21861945153189 0.972005925991169 0.0131377136373496 1
ENSGACG00000020359 pafah1b2 -2.01587102395319 2.1704178289833 0.0207851586047353 1

ADD REPLY • link 7.0 years ago by urmi208 ▴ 30

1

Entering edit mode

Your gene-annotation is somehow not correct. See that the numbers (FC, Pvalue etc) are same (before and after), but the ensembl_gene_id is different!

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

0

Entering edit mode

Aha! There is merging issue. Any ideas how can I fix this?

ADD REPLY • link 7.0 years ago by urmi208 ▴ 30

0

Entering edit mode

?merge()

PS: This just to keep away the min-char bot

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

1

Entering edit mode

Thanks Santosh. Very helpful

ADD REPLY • link 7.0 years ago by urmi208 ▴ 30

0

Entering edit mode

Please close this thread by accepting @MMa answer (green check mark to the left) if it solved your problem. Else, you can also write your own answer and accept it.

ADD REPLY • link 7.0 years ago by Santosh Anand 5.7k

score 3 · Accepted Answer · 2017-05-09

3

Entering edit mode

7.0 years ago

John Ma ▴ 310

Remember, the output of getBM is not always in the order of values, thus the error.

What I'd do is:

geneid <-rownames(y)
ga82 <-  useEnsembl(biomart="ensembl",dataset="gaculeatus_gene_ensembl",version=82)
genes <- getBM(attributes=c('ensembl_gene_id','external_gene_name'),filters='ensembl_gene_id',values=geneid,mart=ga82)
rownames (genes) <- genes$ensembl_gene_id
y$genes <- genes [rownames(y), ]