Question

GO enrichment analysis using biomaRt

0

Entering edit mode

7.2 years ago

preetomregon • 0

Hi, I am very new to the RNA Seq data analysis and I am trying to do the GO enrichment analysis using the biomaRt package of R. I want to merge the GO annotation to the DESeq2 data. I run the following script to obtain the GO terms

library(DESeq2)
library(biomaRt)


listMarts(host="plants.ensembl.org")
m <- useMart("plants_mart", host="plants.ensembl.org")
m <- useMart("plants_mart", dataset="osativa_eg_gene", host="plants.ensembl.org")
go <- getBM(attributes=c("ensembl_gene_id","ensembl_transcript_id", "start_position", "end_position","go_id","name_1006"), mart=m)
go[1:10,]

https://ibb.co/gCSAGT

write.table(go, "GOannotationsBiomart.txt", quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")`

Now I want merge the above GO details with DESeq2 data. The final table of DESeq2 data are as follows

write.csv(data, file=paste0(outputPrefix,"_results_with_normalized_final.csv"))
results_csv<-"O_sativa_DESeq2_results_with_normalized_final.csv"
write.table(read.csv(results_csv), gsub(".csv",".txt",results_csv))
results_txt<-"O_sativa_DESeq2_results_with_normalized_final.txt"
a<-read.table(results_txt, head=TRUE)

https://ibb.co/bySPAo

Any Suggestion and alternative tutorial will be appreciated. Thanks

RNA-Seq biomaRT • 4.6k views

ADD COMMENT • link 7.2 years ago by preetomregon • 0

score 1 · Answer 1 · 2018-05-23

1

Entering edit mode

7.2 years ago

Kevin Blighe 89k

Just do this and you'll get what you [possibly] want:

data.frame(go, a[match(go$ensembl_gene_id, a$gene),])

Small example:

go
  ensembl_gene_id    name
1           BRCA1  Repair
2           BRCA1     Rep
3           BRCA1       R
4            TP53      CC
5             ATM Control
6             ATM   Cntrl


a
   gene pvalue
1 BRCA1  1e-04
2   ATM  1e-03
3  TP53  1e-02


data.frame(go, a[match(go$ensembl_gene_id, a$gene),])
    ensembl_gene_id    name  gene pvalue
1             BRCA1  Repair BRCA1  1e-04
1.1           BRCA1     Rep BRCA1  1e-04
1.2           BRCA1       R BRCA1  1e-04
3              TP53      CC  TP53  1e-02
2               ATM Control   ATM  1e-03
2.1             ATM   Cntrl   ATM  1e-03

Kevin

ADD COMMENT • link 7.2 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks, Kevin for your kind help. It's exactly what I am looking for. But it hasn't worked in my case. I assumed that it's due to the data frame of my GO terms as I have multiple GO terms for a single gene and transcript.

https://ibb.co/ggbCWT

Following is the output of my result.

data.frame(a, go[match(a$gene, go$ensembl_gene_id),])

https://ibb.co/eNV6rT

ADD REPLY • link 7.2 years ago by preetomregon • 0

0

Entering edit mode

Hello, can you paste an actual example of your data here? Once you paste it, highlight it and encapsulate it by using the 101 010 button. It will make it easier for me to test.

ADD REPLY • link 7.2 years ago by Kevin Blighe 89k

0

Entering edit mode

   ensembl_gene_id ensembl_transcript_id      go_id                                name_1006
1     Os08g0254300       Os08t0254300-00 GO:0016021           integral component of membrane
2     Os08g0254300       Os08t0254300-00 GO:0016020                                 membrane
3     Os08g0254300       Os08t0254300-00 GO:0015031                        protein transport
5     Os04g0672100       Os04t0672100-01 GO:0005524                              ATP binding
6     Os04g0672100       Os04t0672100-01 GO:0016021           integral component of membrane
7     Os04g0672100       Os04t0672100-01 GO:0006468                  protein phosphorylation
8     Os04g0672100       Os04t0672100-01 GO:0004674 protein serine/threonine kinase activity
9     Os04g0672100       Os04t0672100-01 GO:0004672                  protein kinase activity
10    Os04g0672100       Os04t0672100-01 GO:0016020                                 membrane
11    Os04g0672100       Os04t0672100-01 GO:0016740                     transferase activity

Thank you for your reply. For example, I have 3 multiple go_ids for the Os08g0254300 gene which may be the problem. Is it possible to merge the multiple "go_id" and the "name_1006" for Os08g0254300 gene? For example

gene            log2FoldChange      ensembl_gene_id     go_id                               name_1006
Os08g0254300    3.641396            Os08g0254300        GO:0016021, GO:0016020, GO:0015031  integral component of membrane, membrane, protein transport

I am quite new in this field and I hope my question is relevant, to make you understand. Kindly suggest anything if I am wrong. Thanks

ADD REPLY • link 7.2 years ago by preetomregon • 0

0

Entering edit mode

Okay, let me take a look later!

ADD REPLY • link 7.2 years ago by Kevin Blighe 89k

1

Entering edit mode

Sorry, my time was limited right now. To help you a bit, this code will at leasat collapse the GO IDs to have just a single record per gene:

uniqueGenes <- as.character(unique(df$ensembl_gene_id))

df.new <- data.frame(row.names=uniqueGenes)

for (i in 1:length(uniqueGenes))
{
  df.new <- rbind(df.new, t(data.frame(paste(df[which(df$ensembl_gene_id %in% uniqueGenes[i]), "go_id"], collapse=","))))
  rownames(df.new)[i] <- uniqueGenes[i]
}

df.new

                                                                                       V1
Os08g0254300                                             GO:0016021,GO:0016020,GO:0015031
Os04g0672100 GO:0005524,GO:0016021,GO:0006468,GO:0004674,GO:0004672,GO:0016020,GO:0016740

Note that the gene names are stored as rownames here. This should now make it easier to merge with your other data-frame of statistical values and fold-changes.

ADD REPLY • link 7.2 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you again for your time. I tried several ways to merge the two data frames of small sizes but did not get through. Following are the my examples-

> deq
          gene FoldChange
1 OS08G0460000   5.028469
2 OS02G0770800   3.641396

> df.new
                                                                                                                                   V1
Os02g0770800 GO:0005829,GO:0020037,GO:0030151,GO:0042128,GO:0006809,GO:0043546,GO:0009703,GO:0050464,GO:0046872,GO:0055114,GO:0016491
Os08g0460000                                             GO:0048046,GO:0030145,GO:0005618,GO:0045735,GO:0046872,GO:0005576,GO:0031012

I followed several examples to merge the data frames but unable to merge in my case. for example

cbind(df.new, deq[,"gene"][match(rownames(deq), rownames(df.new))])
merge(cbind(df.new, row=row.names(df.new)), deq, by.x="row", by.y="gene", all.x=T)
deq = cbind("id"=rownames(deq),deq)
df.new = cbind("id"=rownames(df.new),df.new)
merge(deq,df.new,by="id",all=T)

Some outputs are with warnings but error free and the resulting data frame not as per my requirement. The cbind(deq,df.new) code work great only if, the no. of rows are same and also if the position of the similar rownames are same in the data frames. Any suggestion will be appreciated. Thanks

ADD REPLY • link 7.2 years ago by preetomregon • 0

1

Entering edit mode

I see. In this case, you also have an issue with upper and lower case characters. For example: OS08G0460000 versus Os08g0460000

There is a function toupper, though:

matchIndices <- match(toupper(deq$gene), toupper(rownames(df.new)))
data.frame(deq[matchIndices,], rownames(df.new), df.new)

  gene           FoldChange   rownames.df.new. V1
2 OS02G0770800   3.641396     Os02g0770800     GO:0005829,GO:0020037,GO:0030151,GO:00...
1 OS08G0460000   5.028469     Os08g0460000     GO:0048046,GO:0030145,GO:0005618,GO:00...

I'm aware that these merge functions can be difficult. There are also a few different ways of doing it, include merge, match, and which. I just happen to be familiar with match, but I know others who use merge. I would just encourage you to study well how they operate (because they are each different) and to also double-check the output. Years ago a a junior I frequently got caught out by mis-using these commands. One requires many 'checks' and 'balances', like the US government.