Question: GO enrichment analysis using biomaRt
gravatar for preetomregon
2.8 years ago by
preetomregon0 wrote:

Hi, I am very new to the RNA Seq data analysis and I am trying to do the GO enrichment analysis using the biomaRt package of R. I want to merge the GO annotation to the DESeq2 data. I run the following script to obtain the GO terms


m <- useMart("plants_mart", host="")
m <- useMart("plants_mart", dataset="osativa_eg_gene", host="")
go <- getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id", "start_position", "end_position","go_id","name_1006"), mart=m)

write.table(go, "GOannotationsBiomart.txt", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")`

Now I want merge the above GO details with DESeq2 data. The final table of DESeq2 data are as follows

write.csv(data, file=paste0(outputPrefix,"_results_with_normalized_final.csv"))
write.table(read.csv(results_csv), gsub(".csv",".txt",results_csv))
a<-read.table(results_txt, head=TRUE)

Any Suggestion and alternative tutorial will be appreciated. Thanks

rna-seq biomart • 2.2k views
ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by preetomregon0
gravatar for Kevin Blighe
2.8 years ago by
Kevin Blighe70k
Republic of Ireland
Kevin Blighe70k wrote:

Just do this and you'll get what you [possibly] want:

data.frame(go, a[match(go$ensembl_gene_id, a$gene),])

Small example:

  ensembl_gene_id    name
1           BRCA1  Repair
2           BRCA1     Rep
3           BRCA1       R
4            TP53      CC
5             ATM Control
6             ATM   Cntrl

   gene pvalue
1 BRCA1  1e-04
2   ATM  1e-03
3  TP53  1e-02

data.frame(go, a[match(go$ensembl_gene_id, a$gene),])
    ensembl_gene_id    name  gene pvalue
1             BRCA1  Repair BRCA1  1e-04
1.1           BRCA1     Rep BRCA1  1e-04
1.2           BRCA1       R BRCA1  1e-04
3              TP53      CC  TP53  1e-02
2               ATM Control   ATM  1e-03
2.1             ATM   Cntrl   ATM  1e-03


ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Kevin Blighe70k

Thanks, Kevin for your kind help. It's exactly what I am looking for. But it hasn't worked in my case. I assumed that it's due to the data frame of my GO terms as I have multiple GO terms for a single gene and transcript.

Following is the output of my result.

data.frame(a, go[match(a$gene, go$ensembl_gene_id),])

ADD REPLYlink written 2.8 years ago by preetomregon0

Hello, can you paste an actual example of your data here? Once you paste it, highlight it and encapsulate it by using the 101 010 button. It will make it easier for me to test.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Kevin Blighe70k
   ensembl_gene_id ensembl_transcript_id      go_id                                name_1006
1     Os08g0254300       Os08t0254300-00 GO:0016021           integral component of membrane
2     Os08g0254300       Os08t0254300-00 GO:0016020                                 membrane
3     Os08g0254300       Os08t0254300-00 GO:0015031                        protein transport
5     Os04g0672100       Os04t0672100-01 GO:0005524                              ATP binding
6     Os04g0672100       Os04t0672100-01 GO:0016021           integral component of membrane
7     Os04g0672100       Os04t0672100-01 GO:0006468                  protein phosphorylation
8     Os04g0672100       Os04t0672100-01 GO:0004674 protein serine/threonine kinase activity
9     Os04g0672100       Os04t0672100-01 GO:0004672                  protein kinase activity
10    Os04g0672100       Os04t0672100-01 GO:0016020                                 membrane
11    Os04g0672100       Os04t0672100-01 GO:0016740                     transferase activity

Thank you for your reply. For example, I have 3 multiple go_ids for the Os08g0254300 gene which may be the problem. Is it possible to merge the multiple "go_id" and the "name_1006" for Os08g0254300 gene? For example

gene            log2FoldChange      ensembl_gene_id     go_id                               name_1006
Os08g0254300    3.641396            Os08g0254300        GO:0016021, GO:0016020, GO:0015031  integral component of membrane, membrane, protein transport

I am quite new in this field and I hope my question is relevant, to make you understand. Kindly suggest anything if I am wrong. Thanks

ADD REPLYlink written 2.8 years ago by preetomregon0

Okay, let me take a look later!

ADD REPLYlink written 2.8 years ago by Kevin Blighe70k

Sorry, my time was limited right now. To help you a bit, this code will at leasat collapse the GO IDs to have just a single record per gene:

uniqueGenes <- as.character(unique(df$ensembl_gene_id)) <- data.frame(row.names=uniqueGenes)

for (i in 1:length(uniqueGenes))
{ <- rbind(, t(data.frame(paste(df[which(df$ensembl_gene_id %in% uniqueGenes[i]), "go_id"], collapse=","))))
  rownames([i] <- uniqueGenes[i]

Os08g0254300                                             GO:0016021,GO:0016020,GO:0015031
Os04g0672100 GO:0005524,GO:0016021,GO:0006468,GO:0004674,GO:0004672,GO:0016020,GO:0016740

Note that the gene names are stored as rownames here. This should now make it easier to merge with your other data-frame of statistical values and fold-changes.

ADD REPLYlink written 2.8 years ago by Kevin Blighe70k

Thank you again for your time. I tried several ways to merge the two data frames of small sizes but did not get through. Following are the my examples-

> deq
          gene FoldChange
1 OS08G0460000   5.028469
2 OS02G0770800   3.641396

Os02g0770800 GO:0005829,GO:0020037,GO:0030151,GO:0042128,GO:0006809,GO:0043546,GO:0009703,GO:0050464,GO:0046872,GO:0055114,GO:0016491
Os08g0460000                                             GO:0048046,GO:0030145,GO:0005618,GO:0045735,GO:0046872,GO:0005576,GO:0031012

I followed several examples to merge the data frames but unable to merge in my case. for example

cbind(, deq[,"gene"][match(rownames(deq), rownames(])
merge(cbind(, row=row.names(, deq, by.x="row", by.y="gene", all.x=T)
deq = cbind("id"=rownames(deq),deq) = cbind("id"=rownames(,

Some outputs are with warnings but error free and the resulting data frame not as per my requirement. The cbind(deq, code work great only if, the no. of rows are same and also if the position of the similar rownames are same in the data frames. Any suggestion will be appreciated. Thanks

ADD REPLYlink written 2.8 years ago by preetomregon0

I see. In this case, you also have an issue with upper and lower case characters. For example: OS08G0460000 versus Os08g0460000

There is a function toupper, though:

matchIndices <- match(toupper(deq$gene), toupper(rownames(
data.frame(deq[matchIndices,], rownames(,

  gene           FoldChange V1
2 OS02G0770800   3.641396     Os02g0770800     GO:0005829,GO:0020037,GO:0030151,GO:00...
1 OS08G0460000   5.028469     Os08g0460000     GO:0048046,GO:0030145,GO:0005618,GO:00...

I'm aware that these merge functions can be difficult. There are also a few different ways of doing it, include merge, match, and which. I just happen to be familiar with match, but I know others who use merge. I would just encourage you to study well how they operate (because they are each different) and to also double-check the output. Years ago a a junior I frequently got caught out by mis-using these commands. One requires many 'checks' and 'balances', like the US government.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Kevin Blighe70k

Thank you for the code and suggestions

ADD REPLYlink written 2.8 years ago by preetomregon0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2213 users visited in the last hour