Like many other newcomers to bioinformatics, I have found a plethora of methods that perform the same task. What is even more problematic for a biologist with little bioinformatics experience is judging how the final results differ between them. My supervisor asked me how we can be sure about a result if it changes depending on the software used, and I cannot find a convincing answer.
I am here to ask whether there is something like a "standard" pipeline for GO and KEGG enrichment analysis in non-model organisms. I have found two main camps: people who use Blast2GO (B2G) and trust the black box, and people who go a little (or a lot) further and use R/Bioconductor packages, which allow much deeper configuration. The B2G path is obviously "easy to standardize", but I cannot pay for a license, so I want to (or rather, have no choice but to) go down the Bioconductor path. To obtain GO data for my annotation, I use the UniProt mapping tool to link accession codes to the rest of the information in their databases, which gives me both GO and KEGG annotations. Is that a good way to do it?
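In case it helps to see what I mean, this is roughly how I turn the UniProt export into a gene-to-GO mapping. It is only a sketch in Python: the column names "Entry" and "Gene Ontology IDs", the ";" separator, and the example accessions are assumptions about my own export, not something I can guarantee for every UniProt download. The output uses the tab-separated "gene, then comma-separated GO IDs" layout that, as far as I understand, topGO's readMappings expects.

```python
import csv
import io


def uniprot_tsv_to_gene2go(handle, id_col="Entry", go_col="Gene Ontology IDs"):
    """Build {accession: set of GO IDs} from a tab-separated UniProt export.

    The column names are assumptions; adjust them to the actual header.
    """
    gene2go = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row.get(go_col) or ""
        terms = [t.strip() for t in raw.split(";") if t.strip()]
        if terms:  # skip accessions without any GO annotation
            gene2go.setdefault(row[id_col], set()).update(terms)
    return gene2go


def format_for_topgo(gene2go):
    """One line per gene: ID, a tab, then comma-separated GO IDs."""
    lines = []
    for gene in sorted(gene2go):
        lines.append(gene + "\t" + ", ".join(sorted(gene2go[gene])))
    return "\n".join(lines)


# Made-up two-row export; the second accession has no GO terms.
example = (
    "Entry\tGene Ontology IDs\n"
    "P12345\tGO:0005739; GO:0006457\n"
    "Q00000\t\n"
)
mapping = uniprot_tsv_to_gene2go(io.StringIO(example))
topgo_text = format_for_topgo(mapping)
print(topgo_text)
```

Genes without GO terms are dropped here, which already changes the gene universe used for the enrichment test, so this step alone could explain part of the difference between tools.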
For example, in GO analysis I found big differences caused by a single parameter when I compared B2G with topGO, and I cannot find out whether B2G does something equivalent or not. Maybe you can help me clarify my thoughts. topGO has a parameter called nodeSize, which looks important to me because in a de novo assembly not every GO term is necessarily well represented. For example, I sometimes find GO terms associated with just 1 or 2 genes. If those genes are in my set of differentially expressed genes for an experimental group, the term will be reported as a "very" strongly over-represented GO category when it really reflects just 1 over-represented gene. B2G, if I am correct, will show these categories without filtering them out, but if I set this parameter in topGO to nodeSize=10, only GO terms associated with at least 10 "genes" will be considered. Is that how it works?
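To make the concern concrete, here is a minimal sketch of the arithmetic (in Python, just for illustration, not how topGO or B2G compute things internally). It uses a plain one-sided hypergeometric test, which I assume is equivalent to the classic Fisher test for over-representation. The universe size, the number of differentially expressed genes, and the term names are all made-up values, and filter_by_node_size is a hypothetical helper that mimics what I understand nodeSize to do.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): chance of seeing at least k annotated genes among n
    selected genes, when K of the N genes in the universe carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def filter_by_node_size(term_to_genes, node_size):
    """Keep only terms annotated to at least node_size genes
    (a hypothetical stand-in for the effect of topGO's nodeSize)."""
    return {t: g for t, g in term_to_genes.items() if len(g) >= node_size}


N = 10000  # made-up gene universe from the assembly
n = 200    # made-up number of differentially expressed genes

# A term annotated to only 2 genes, both of them differentially expressed.
p_small = hypergeom_upper_tail(2, 2, n, N)
print(p_small)  # about 4e-4, although only 2 genes are involved

# Made-up annotation: one tiny term and one reasonably sized term.
terms = {
    "GO:tiny": {"g1", "g2"},
    "GO:large": {"g" + str(i) for i in range(3, 15)},
}
kept = filter_by_node_size(terms, 10)
print(sorted(kept))  # only the term with at least 10 genes remains
```

With these numbers the tiny term gets a p-value around 4e-4 from just two genes, which is exactly the kind of hit that a size filter would remove before testing.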
Is it common in this particular field (functional enrichment) to find methods whose outputs differ substantially? (I think the answer is obviously "yes" when the methods differ in their foundations, but I need confirmation from experts.)
And what about the GO level? Few tools make it possible to take this parameter into account in the analysis. Is it a controversial parameter, or does it depend on the method?
Thank you for your attention.