I've performed differential expression analyses. I set my p-value threshold at 0.05 and used a Bonferroni correction across my multiple analyses, so my threshold is now 0.0125. My question is, should I use this threshold of 0.0125 for selecting my GO terms? Or is a threshold of 0.05 sufficient for selecting my enriched GO terms?

The real thing you should worry about in terms of multiple testing for GO analysis is accounting for the fact that you are testing many GO terms. A typical GO analysis involves testing thousands of gene categories, and each one carries a chance of making a type I error. This makes the cumulative chance of making a type I error very high in a GO analysis. If the categories were independent, then with 1000 categories, you'd expect 50 categories to come up at an alpha of 0.05 even if there were no real enrichment. However, GO categories are not independent tests, as the gene sets overlap (and in some cases are complete subsets of one another), and thus it's not clear what the best way to correct for multiple testing is. For this reason, most GO testing frameworks report nominal p-values, rather than adjusted ones.
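To make the arithmetic above concrete, here is a small sketch in Python. It assumes the idealised case described in the text (independent tests, no real enrichment anywhere); the function names are made up for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of type I errors when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests, alpha):
    """Chance of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# 1000 independent categories tested at alpha = 0.05
print(expected_false_positives(1000, 0.05))
print(familywise_error_rate(1000, 0.05))
```

Because real GO categories overlap, these numbers are only a rough guide to the scale of the problem, not an exact description of a GO analysis.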

I tend to use BH (Benjamini-Hochberg) correction on my GO testing; however, I recognise that this is conservative.
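For reference, a minimal sketch of the BH adjustment in plain Python. The function name and the p-values are made up for illustration; in practice you would normally call a library routine rather than write this yourself.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values for four GO terms.
nominal = [0.01, 0.04, 0.03, 0.005]
print(bh_adjust(nominal))
```

Each adjusted value is the smallest false discovery rate at which that term would be called significant, so the output can be compared directly against a chosen FDR such as 0.05.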

The effect of this is so much bigger than the fact that you've done 4 different analyses that I probably wouldn't worry about the difference between 0.05 and 0.0125, when the real question is between 0.05 and 5*10^-6.
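The two scales being compared can be sketched as Bonferroni thresholds. Note the assumption: the 5*10^-6 figure is read here as 0.05 divided by roughly 10,000 GO terms, which is a round number chosen for illustration and not a count given in this thread.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Correcting for the 4 separate differential expression analyses.
across_analyses = bonferroni_threshold(0.05, 4)

# Correcting for the GO terms themselves (10,000 is an assumed count).
across_go_terms = bonferroni_threshold(0.05, 10_000)

print(across_analyses, across_go_terms)
```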

That said, yes, in theory you should correct for multiple testing between different analyses (although I'm not sure many people do).

I was wondering about the exact same issue. One should definitely correct for multiple testing with GO enrichment analysis, because making one test for each GO category adds up to many tests. But, as you pointed out, GO categories are not independent. They are usually related by a tree-like structure with ancestor and offspring terms.
Theoretically, one option could be to count only child terms and use this as the number of tests for the multiple testing correction (regardless of the method one decides to use). Nevertheless, I am not aware of anyone having done this.
To clarify: when I run GO enrichment, I usually extract all GO terms from my annotation and then add all the "ancestor" terms (the parents and all the parents of the parents), which are the broader categories, and test all GOs. One could correct using only the number of original GOs as the number of tests, since all the additional "ancestor" terms must be related to the original child terms.
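A toy sketch of that idea in Python, to show how the two counts differ. The ontology, term names, and annotation below are entirely made up; a real analysis would load the GO graph from an ontology file.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
parents = {
    "root": [],
    "metabolism": ["root"],
    "signalling": ["root"],
    "lipid_metabolism": ["metabolism"],
    "kinase_signalling": ["signalling"],
    "lipid_kinase_activity": ["lipid_metabolism", "kinase_signalling"],
}


def ancestors(term, parents):
    """Collect every ancestor of a term by walking up the parent links."""
    found = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents[current])
    return found


# Terms attached directly to genes in a made-up annotation.
annotated_terms = {"lipid_kinase_activity", "lipid_metabolism"}

# Terms actually tested once all ancestors are added.
tested_terms = set(annotated_terms)
for term in annotated_terms:
    tested_terms |= ancestors(term, parents)

n_original = len(annotated_terms)  # proposed number of tests
n_tested = len(tested_terms)       # number of tests actually run
print(n_original, n_tested)
```

Using `n_original` instead of `n_tested` as the correction factor is the proposal above; whether that is statistically justified is an open question in this thread, not something the sketch demonstrates.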

It's not quite true that no one has produced a GO enrichment tool with a principled approach to multiple testing. The GOstats package can use permutation testing to get a corrected p-value accounting for the non-independence. But it doesn't account for things like gene-length bias, which I like to correct for in my GO analyses.
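To illustrate the general idea of a permutation-based correction, here is a generic sketch in Python. It is not the GOstats implementation: it assumes a one-sided hypergeometric test per category and a single-step "min-p" adjustment over random gene lists drawn from the universe. All gene names, categories, and function names are made up, and gene-length bias is not handled.

```python
import math
import random


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K are in the category."""
    total = math.comb(N, n)
    upper = min(K, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def min_p_adjust(universe, selected, categories, n_perm=200, seed=0):
    """Adjust p-values using the null distribution of the smallest p-value."""
    rng = random.Random(seed)
    N, n = len(universe), len(selected)

    def pvalues(gene_set):
        return {
            name: hypergeom_upper_tail(len(members & gene_set), N, len(members), n)
            for name, members in categories.items()
        }

    observed = pvalues(set(selected))
    # Smallest p-value across all categories for each random gene list.
    # Overlap between categories is preserved because the same random list
    # is scored against every category.
    null_min = [
        min(pvalues(set(rng.sample(universe, n))).values())
        for _ in range(n_perm)
    ]
    # Adjusted p: share of random lists whose best category does at least as well.
    return {
        name: (1 + sum(m <= p for m in null_min)) / (1 + n_perm)
        for name, p in observed.items()
    }


# Made-up universe of 100 genes and three overlapping or disjoint categories.
universe = [f"g{i}" for i in range(100)]
categories = {
    "cat_a": set(universe[:10]),
    "cat_b": set(universe[:20]),   # contains cat_a entirely
    "cat_c": set(universe[50:60]),
}
selected = universe[:8] + universe[60:62]

adjusted = min_p_adjust(universe, selected, categories)
print(adjusted)
```

The point of the design is that the correlation between categories is carried by the data rather than assumed away, which is why this kind of approach can account for non-independence where a plain Bonferroni count cannot.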

You might also find this page interesting: How GO::TermFinder calculates P-values, although I think some of the formulas may not have survived an update at some point. The GO Term Finder tool at Princeton is more or less the exact same tool as at SGD.

Great point about "GO categories are not independent tests". I had never considered that.
