I am currently looking at the evolution of the genome content of a particular species. I have >15 newly sequenced and annotated species, as well as a species tree and clusters of orthologs/paralogs ("orthogroups") across all species. Let's also say that I have, for each node in the species tree, a list of orthogroups that are lost/gained at this point in the tree. I have also already drawn nice annotated trees with charts at the nodes, giving me some overview of how much happened where. So far, so good.
What I am interested in is whether this can help me find out which functions are gained/lost, or which pathways are introduced/disabled, at each node in the tree, based on the appearance or disappearance of orthogroups at that node. Note that the annotations of these organisms are not in public databases, but I have inferred functional annotations (GO, KEGG, Reactome, ...) for them based on InterPro hits and transfers from reference species. I have also mapped these functional terms to each orthogroup by simply taking the union of the terms annotated on each orthogroup's member genes.
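To make the term mapping concrete, here is a minimal Python sketch of what I mean by "taking the union". The function name, the dictionary layout and the toy gene/orthogroup IDs are made up for illustration and are not my real data.

```python
def terms_per_orthogroup(orthogroup_members, gene_terms):
    """Map each orthogroup to the union of terms annotated on its member genes."""
    og_terms = {}
    for og, genes in orthogroup_members.items():
        terms = set()
        for gene in genes:
            # Genes without any annotation simply contribute nothing.
            terms |= set(gene_terms.get(gene, ()))
        og_terms[og] = terms
    return og_terms


# Hypothetical toy input: orthogroup -> member genes, gene -> annotated terms.
orthogroup_members = {"OG1": ["spA_g1", "spB_g7"], "OG2": ["spA_g2"]}
gene_terms = {
    "spA_g1": {"GO:0008150"},
    "spB_g7": {"GO:0003674", "GO:0008150"},
}

og_terms = terms_per_orthogroup(orthogroup_members, gene_terms)
```

One consequence of the union is that a term annotated on a single member gene counts for the whole orthogroup, which may or may not be what one wants.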
My idea was to take the lists of orthogroups that are lost/gained at each node and use them as the lists of 'interesting items' in functional gene set enrichment tools like topGO, with the full set of orthogroups and their terms as the background. I have done this for topGO, but I am not sure that this is the right approach or that I can trust the results... especially given that I'm looking at many species here.
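Stripped of any tool, the test I have in mind for a single node is roughly a one-sided hypergeometric (Fisher-style) over-representation test per term, with all orthogroups as the background. The sketch below is only my understanding of the idea, not what topGO does internally (for GO, topGO can additionally take the term hierarchy into account, which this ignores). All names and the toy data are hypothetical, and the Benjamini-Hochberg correction is my own choice.

```python
from collections import Counter
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n orthogroups out of N, of which K carry the term."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def term_enrichment(interesting, og_terms):
    """Test every term for over-representation among the interesting orthogroups.

    og_terms maps each background orthogroup to a set of terms.
    Returns term -> (hits with term, background with term, p, adjusted p).
    """
    background = set(og_terms)
    hits = set(interesting) & background
    N, n = len(background), len(hits)
    with_term = Counter(t for terms in og_terms.values() for t in terms)
    hit_with_term = Counter(t for og in hits for t in og_terms[og])
    terms = sorted(with_term)
    pvals = [hypergeom_sf(hit_with_term[t], N, with_term[t], n) for t in terms]
    padj = benjamini_hochberg(pvals)
    return {t: (hit_with_term[t], with_term[t], p, q)
            for t, p, q in zip(terms, pvals, padj)}


# Hypothetical toy data: 10 orthogroups, "T2" on all of them, "T1" on three.
og_terms = {f"OG{i}": {"T2"} for i in range(10)}
for i in range(3):
    og_terms[f"OG{i}"].add("T1")

# Pretend OG0..OG2 are the orthogroups gained at one node.
result = term_enrichment(["OG0", "OG1", "OG2"], og_terms)
```

In this toy case `T1` covers exactly the gained orthogroups and gets a small p-value, while `T2` is on every orthogroup and gets p = 1. The correction here is applied per node only; testing many nodes would add a further layer of multiple testing.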
Also, can anyone recommend other functional enrichment tools, in particular for metabolic pathways, that:
- are well documented,
- are not restricted to human/mouse/... but able to work with custom data sets, and
- do not require expression data (which I don't have -- all I have is lists of 'interesting' groups as they are lost or gained).
Looking at the usual suspects like GAGE, clusterProfiler etc. -- these tools look like they are meant to drive differential expression analysis in the usual popular organisms; if you need more generic functionality, or don't have the right kind of input, the documentation is very sparse.
I am absolutely not afraid of writing code (as I'm mainly a developer) or munging the data in any form, I'm just quite new to data analysis ;)