I have a list of about 80 HGNC symbols.
I want to find their descriptions and function(s) using biomaRt.
My first approach is to retrieve everything with biomaRt. Here is an example for three genes:
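# Example HGNC symbols (three of the ~80)
gL <- c("LTC4S", "ALOX5", "NAT2")
library(biomaRt)
# Connect to the Ensembl human gene dataset
mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# Retrieve every GO ID annotated to each symbol
results <- getBM(attributes = c("go_id", "hgnc_symbol"), filters = "hgnc_symbol", values = gL, mart = mart)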
For 3 HGNC symbols we get 46 GO IDs. With 80 symbols, as you can imagine, that comes to over 1500 GO categories.
However I'm only interested in the leaf terms, i.e. the most specific terms, for each branch of the GO with which the gene is annotated.
Is there any easy way to get this using biomaRt or other R tools like GO.db?
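One possible way to do the leaf-term filtering is sketched below. For each gene, it drops every annotated term that is an ancestor of another term annotated to the same gene. The function name `most_specific`, the toy term IDs (`T1` to `T4`) and the toy gene names are hypothetical and only there to keep the example self-contained; they are not real GO terms or HGNC symbols.

```r
# Return the most specific terms in go_ids: those that are not an
# ancestor of any other term in the same set.
# `ancestors` is a named list mapping each term ID to all of its ancestors.
most_specific <- function(go_ids, ancestors) {
  # biomaRt can return empty strings for genes without GO annotation
  go_ids <- unique(go_ids[!is.na(go_ids) & go_ids != ""])
  known <- intersect(go_ids, names(ancestors))
  # Every term that is an ancestor of at least one annotated term
  covered <- unique(unlist(ancestors[known], use.names = FALSE))
  setdiff(go_ids, covered)
}

# Hypothetical toy hierarchy (NOT real GO terms): T1 is the root,
# T2 and T4 are children of T1, and T3 is a child of T2.
toy_ancestors <- list(
  T1 = character(0),
  T2 = "T1",
  T3 = c("T2", "T1"),
  T4 = "T1"
)

# Toy table with the same columns as the getBM() result
toy_results <- data.frame(
  hgnc_symbol = c("GENE1", "GENE1", "GENE1", "GENE2", "GENE2"),
  go_id = c("T1", "T2", "T3", "T1", "T4"),
  stringsAsFactors = FALSE
)

# Apply the filter gene by gene
leaves <- lapply(
  split(toy_results$go_id, toy_results$hgnc_symbol),
  most_specific,
  ancestors = toy_ancestors
)
print(leaves)
```

With real data, the ancestor lookup could presumably be built from GO.db, e.g. `c(as.list(GOBPANCESTOR), as.list(GOMFANCESTOR), as.list(GOCCANCESTOR))`, and passed in together with `results$go_id` split by `results$hgnc_symbol`. Because the three ontologies share no ancestors apart from the artificial root, this should keep the most specific terms in each branch separately.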
Is there any way to get the most "interesting" terms? I guess this would use semantic similarity and information content, or something like that, although of course "interesting" is going to be rather subjective. It must be a very common task though: given a list of genes, annotate them with "their function" in a digestible way.
What you find the "most interesting" is not necessarily what someone else finds the "most interesting". GO terms are designed to give you the function in a digestible way; it just turns out that the most effective way to do this is to assign a number of GO terms to something.
I agree, interesting is subjective, and I certainly don't dispute that GO, or using a DAG, is an effective way to provide annotation. But looking at >1500 terms is not really a feasible option in this case. I will keep looking to see if there are ways to whittle it down to the most informative terms, e.g. using information content. Thanks!
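For what it's worth, ranking by information content could be sketched roughly as follows. The IC of a term is taken here as -log of the fraction of genes annotated to it, assuming the annotations have already been propagated up to ancestor terms. The function names, the toy genes (`g1` to `g4`) and the toy terms (`root`, `mid`, `leaf`) are hypothetical, and the background here is just the toy gene set rather than a whole genome. Packages such as GOSemSim implement IC-based measures properly; this is only meant to show the idea.

```r
# Information content of each term: -log(fraction of genes annotated to it).
# `annotations` needs columns `gene` and `term`, and is assumed to be
# already propagated to ancestors (a gene with a term also has its parents).
information_content <- function(annotations) {
  ann <- unique(data.frame(
    gene = as.character(annotations$gene),
    term = as.character(annotations$term),
    stringsAsFactors = FALSE
  ))
  n_genes <- length(unique(ann$gene))
  genes_per_term <- table(ann$term)
  ic <- -log(as.numeric(genes_per_term) / n_genes)
  names(ic) <- names(genes_per_term)
  ic
}

# Keep the n terms with the highest IC for each gene
top_terms <- function(annotations, ic, n = 3) {
  by_gene <- split(as.character(annotations$term), as.character(annotations$gene))
  lapply(by_gene, function(terms) {
    terms <- unique(terms)
    head(terms[order(ic[terms], decreasing = TRUE)], n)
  })
}

# Hypothetical toy annotations (NOT real genes or GO terms)
toy <- data.frame(
  gene = c("g1", "g2", "g3", "g4", "g1", "g2", "g1"),
  term = c("root", "root", "root", "root", "mid", "mid", "leaf"),
  stringsAsFactors = FALSE
)

ic <- information_content(toy)
best <- top_terms(toy, ic, n = 1)
print(ic)
print(best)
```

Rare, specific terms get a high IC and broad terms shared by every gene get an IC of zero, so keeping the top few per gene is one (still subjective) way to cut the list down.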