Assuming the tools use the same enrichment test (e.g. the hypergeometric distribution), differences in their output generally come down to the "universe" (the background) they use. The best choice of universe is the GO annotation of the genome under analysis. Even when two tools use the same universe, their results can still differ, for instance because of how they treat unclassified genes and how they implement the enrichment calculation over a hierarchical annotation such as GO. The background distribution of GO categories in the genome matters a great deal, because it is what tells you whether the distribution of functions you observe could plausibly be the result of random sampling from the background. A wrong universe is likely to give wrong enrichments.
In some cases, I found that the background used was that of another organism, or the entire database with multiple organisms in it. I strongly advise against tools that do this, unless you don't have the full genome of the organism you are studying. Even in that case, I would carefully choose the closest available genome as the background. In principle, one could also derive an "evolutionarily robust" universe from a set of genomes, which would give you confidence intervals for your enrichment p-values, but today it is probably easier to get the genome of your beast instead.
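To make the effect of the universe concrete, here is a minimal sketch of a one-sided hypergeometric enrichment test in plain Python. The function name `hypergeom_enrichment_p` and all the gene counts are hypothetical, chosen only for illustration; a real tool would also have to deal with the GO hierarchy and with unannotated genes, which this sketch ignores.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, annotated_in_universe,
                           sample_size, annotated_in_sample):
    """Probability of drawing at least `annotated_in_sample` annotated genes
    when sampling `sample_size` genes without replacement from the universe."""
    total = comb(universe_size, sample_size)
    upper = min(annotated_in_universe, sample_size)
    tail = sum(
        comb(annotated_in_universe, k)
        * comb(universe_size - annotated_in_universe, sample_size - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Hypothetical gene list: 200 genes, 20 of them annotated with the GO term.
# Background 1 (assumed): the annotated genome, 300 of 5000 genes carry the term.
p_genome = hypergeom_enrichment_p(5000, 300, 200, 20)
# Background 2 (assumed): a multi-organism database, 1000 of 50000 genes carry the term.
p_database = hypergeom_enrichment_p(50000, 1000, 200, 20)

print(f"genome background:   p = {p_genome:.3g}")
print(f"database background: p = {p_database:.3g}")
```

The gene list is identical in both calls; only the background changes. With these made-up counts the term is rarer in the database than in the genome, so the same observation looks far more significant against the database background.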
Take-home message: if you want to do comparisons, use the same tool!