Question: Biomart R package - help writing biomart results into existing file.
3
gravatar for jsgorams
5.0 years ago by
jsgorams30
United States
jsgorams30 wrote:

Just starting to learn R so bear with me. We've conducted an RNA-seq experiment and are working on getting gene ontology information on the sequenced genes. I've gotten the biomart R package installed and have been able to successfully request GO terms given a list of ensembl gene IDs. The issue is that I'd like the results to be placed in a file listing each gene ID in a column and then the GO terms in rows corresponding to each gene they describe. For example:

Gene ID:                     GO term:

AGAP000002           Proteolysis, oxidation-reduction, protein-binding, etc

AGAP006543           chitin biosynthesis, intracellular, etc

When I just export the biomart results, it lists each GO number separately and doesn't show which gene they correspond to. Is there an easy way to get this exported correctly? If I need to include anymore information to assist you in answering, just let me know. Thanks. 

rna-seq next-gen R • 2.7k views
ADD COMMENTlink modified 5.0 years ago by David Fredman990 • written 5.0 years ago by jsgorams30
5
gravatar for David Fredman
5.0 years ago by
David Fredman990
University of Bergen, Norway
David Fredman990 wrote:

Here's a solution using the Bioconductor GO.db package, and Hadley Wickham's dplyr package

1. get GO ids for your genes from Ensmart (from my understanding, this you have already done). This query gets the complete list of genes for a given species (zebrafish in this example) - you may want to filter it.


library(biomaRt)
ensmart = useMart('ensembl', 'drerio_gene_ensembl')
dat = getBM(attributes=c('go_id', 'ensembl_gene_id'), mart = ensmart)

2. Get corresponding GO names for each GO term and concatenate them by gene

library(GO.db)
library(dplyr)

report <- filter(dat, !go_id == "") %>% # remove empty go_ids
  mutate(name = Term(go_id)) %>%        # get corresponding GO term names from GO.db by go_id
  group_by(ensembl_gene_id) %>%         # sort by gene id
  summarize(go_descriptions=do.call(paste, c(as.list(name), sep=","))) #concatenate GO names to one field

write.table(report,file="geneId_goDescription.txt")

 

Alternatively, if you are using a mart that holds the go names in the database as well as the accessions, eg the plant mart

ens_plant_mart = useMart('ENSEMBL_MART_PLANT', 'osativa_eg_gene')
attrs = c("ensembl_gene_id", "go_accession", "go_name_1006", "go_linkage_type")
results = getBM(attributes = attrs,  mart = ens_plant_mart)

... then you don't need the GO.db package, and can modify the second part (aggregation) as follows (this also concatenates a second field, the accession number)

report = filter(results, !go_accession == "") %>%
         group_by(ensembl_gene_id) %>%
         summarize(go_accessions = do.call(paste, c(as.list(go_accession), sep=",")),
                   go_names = do.call(paste, c(as.list(go_name_1006), sep=",")))

ADD COMMENTlink modified 5.0 years ago • written 5.0 years ago by David Fredman990

I think this is on the right track but I can't quite get it to work. I used the biomaRt package to retrieve the GO terms:

results <- getBM(attributes = c("ensembl_gene_id", "go_accession", "go_name_1006", "go_linkage_type"), filters="ensembl_gene_id", values=geneID[1], mart=mart)

This worked fine and it resulted in this format:

head(results)
  ensembl_gene_id go_accession                     go_name_1006 go_linkage_type
1      AGAP004677   GO:0016491          oxidoreductase activity                
2      AGAP004677   GO:0044281 small molecule metabolic process                
3      AGAP004677   GO:0009058             biosynthetic process                
4      AGAP004677   GO:0003674               molecular_function                
5      AGAP004677   GO:0055114      oxidation-reduction process             IEA
6      AGAP004677   GO:0003824               catalytic activity             IEA

Since there are multiple GO terms for each gene ID, I'd like to group gene IDs together and so there's only one row for each gene ID with the corresponding GO terms place in a single cell. So the above result would then be:

ensembl_gene_id           go_accession                         go_name_1006
AGAP004677  GO:0016491, GO:0044281, GO:0009058, GO:0003674, GO:0055114, GO:0003824 oxidoreductase activity, small molecule metabolic process, biosynthetic process, molecular_function, oxidation-reduction process, catalytic activity

How should I go about doing this?  

ADD REPLYlink written 5.0 years ago by jsgorams30

If you are using a mart that holds the go names in the db as well as the accessions, eg the plant mart, and additionally want to output the accession numbers concatenated (not included in the original question) you can skip using the GO.db package, and modify the second part (aggregation) as follows
 

report = filter(results, !go_accession == "") %>%
         group_by(ensembl_gene_id) %>%
         summarize(go_accessions = do.call(paste, c(as.list(go_accession), sep=",")),
                   go_names = do.call(paste, c(as.list(go_name_1006), sep=",")))

Note that dplyr outputs results using a modified data.frame object that hides variables not fitting on the screen (your long concatenated strings). If you export it to a textfile, and it'll all be there. If you want to view all of it, you can cast it to a regular data.frame object by adding an additional "%.%  as.data.frame()" at the end of the pipe.

I'll modify my answer above to include this solution.

ADD REPLYlink modified 5.0 years ago • written 5.0 years ago by David Fredman990
2
gravatar for lkmklsmn
5.0 years ago by
lkmklsmn890
United States
lkmklsmn890 wrote:

You should tell biomaRt to also return the geneIDs

e.g. getBM(attributes=c('GOterm','geneID'),filters='geneID',value=values,mart=ensembl)

ADD COMMENTlink written 5.0 years ago by lkmklsmn890

Yeah I did that as well. So I have the results, I just need to get them in the right layout for output. Do you have any input?

ADD REPLYlink written 5.0 years ago by jsgorams30

Ok. Now I understand. I think you should use the split() function in R.

In your example you would do:  

asplit<-aplit(1:nrow(results),results[,1])

This will give you a list of indices for each gene. Now you can iterate over this list to combine all entries for the same gene.

goAccession<-lapply(asplit,function(x) results[x,2])

goName<-lapply(asplit,function(x) results[x,3])

I havent run the code, but I think you could almost copy and paste this and it should be close to what you want (at least from my understanding).

Hope this helps

 

ADD REPLYlink written 5.0 years ago by lkmklsmn890
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1406 users visited in the last hour