I'm working on a wide range of problems, from combinatorial to continuous optimization, including multi-objective and multi-modal optimization. I'm also applying the heuristics to benchmark functions (such as BBOB and LSGO) as well as to real-world applications such as molecular docking. Thus, I was hoping for a general and broad discussion with different opinions from different domains (instead of posting one question for each specific problem).
Your enumeration of problems suggests that you are mostly working on optimization. There are several possibilities for assessing the quality of an optimization algorithm. Among them are:
1. The time until the algorithm stops
2. The time until the algorithm finds a global optimum
3. The time until the algorithm finds a feasible point whose objective function value is within x% of the global optimum
4. The time until the algorithm finds the first feasible point
5.-8. The number of evaluations of the objective/constraint functions until each of the events in 1.-4. above (instead of time)
9. The objective function value / constraint violation after N function evaluations
10. The objective function value / constraint violation after time T
11. The maximal problem dimension for which the algorithm produced a result
12. The order p such that the solution time scales as O(n^p) with the problem dimension n (a small fitting sketch follows this list)
13. The percentage of problems for which the algorithm failed (e.g. did not find a feasible point, crashed, did not stop, etc.)
14. The percentage of problems for which the algorithm got stuck far away from the global minimum
15. The percentage of problems for which the algorithm was the optimal solver
16. The percentage of problems for which the algorithm was within x% of the optimal solver (in time, objective function value, constraint violation, etc.)
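For measure 12, here is a minimal sketch (in Python) of how the exponent p could be estimated, assuming you have recorded solution times for a few problem dimensions; the numbers below are purely illustrative:

```python
# Sketch for measure 12: estimate the empirical scaling exponent p from
# recorded (dimension, solution time) pairs, assuming t ~ c * n^p.
# The timings are made up for illustration.
import numpy as np

dims = np.array([10, 20, 40, 80, 160])          # problem dimensions n
times = np.array([0.4, 1.7, 6.9, 28.0, 115.0])  # measured solution times [s]

# Linear fit in log-log space: log t = p * log n + log c
p, log_c = np.polyfit(np.log(dims), np.log(times), 1)
print(f"estimated order p ~ {p:.2f}")
```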
In comparison tests you can determine the "optimal solver", i.e., the algorithm that produced the best solution among all algorithms you have tested, and compare your results against this optimal solver. You should, however, make sure that at least one of the state-of-the-art solvers (CMA-ES, Baron, etc.) is among the solvers you compare; for that you might want to look at the comparison tests by Nick Sahinidis and Nikolaus Hansen.
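As an illustration of measures 15 and 16, the following sketch (assuming minimization; all numbers and algorithm names are made up) counts how often each algorithm matches, or comes within x% of, the best result obtained by any tested algorithm:

```python
# Sketch for measures 15/16: one row per problem, one column per algorithm;
# entries are the best objective values found (minimization). All numbers
# and algorithm names are illustrative.
import numpy as np

algorithms = ["my-heuristic", "CMA-ES", "other-solver"]
results = np.array([
    [0.01, 0.00, 0.30],
    [1.25, 1.10, 1.12],
    [5.00, 4.80, 4.79],
])

best = results.min(axis=1, keepdims=True)    # per-problem "optimal solver" value
x = 0.05                                     # tolerance: within 5%
is_best = np.isclose(results, best)          # measure 15
within = results <= best + x * np.abs(best)  # measure 16

for j, name in enumerate(algorithms):
    print(f"{name}: best on {is_best[:, j].mean():.0%} of problems, "
          f"within {x:.0%} on {within[:, j].mean():.0%}")
```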
Given all these different ways to assess an algorithm, you can probably understand my question about the application: depending on your application, one or the other of the above measures is appropriate. If you are developing an algorithm for very expensive objective functions (e.g. large simulations, physical measurements, etc.), you will compare the number of function evaluations rather than solution time, because in the real application the function evaluations will eventually dominate the overall solution time, in complete contrast to the situation where your function evaluations are cheap.
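If function evaluations are the relevant budget, a simple counting wrapper around the objective is often enough; a minimal sketch (the sphere function is just a placeholder for your own objective):

```python
# Sketch of an evaluation-counting wrapper: hand the wrapped objective to
# the optimizer and read off the number of evaluations afterwards.
class CountingObjective:
    def __init__(self, func):
        self.func = func
        self.evaluations = 0

    def __call__(self, x):
        self.evaluations += 1
        return self.func(x)

def sphere(x):                      # placeholder objective
    return sum(xi * xi for xi in x)

f = CountingObjective(sphere)
f([1.0, 2.0, 3.0])                  # the optimizer would call f like this
print("evaluations used:", f.evaluations)
```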
In real-time applications, it might be interesting to measure how long an algorithm takes to find the first feasible point, or how good its best point is after 10 ms, because in the real application the algorithm will always be asked for its result after 10 ms.
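A minimal sketch of such an "anytime" measurement, using plain random search as a stand-in for the algorithm under test and a 10 ms wall-clock budget:

```python
# Sketch of a budgeted run: keep the best value found so far and stop when
# the wall-clock budget is exhausted. Random search is only a stand-in.
import random
import time

def sphere(x):                      # placeholder objective
    return sum(xi * xi for xi in x)

def run_with_budget(func, dim, budget_seconds=0.010):
    best = float("inf")
    deadline = time.perf_counter() + budget_seconds
    while time.perf_counter() < deadline:
        x = [random.uniform(-5, 5) for _ in range(dim)]
        best = min(best, func(x))
    return best

print("best value after 10 ms:", run_with_budget(sphere, dim=10))
```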
If the algorithm is used as the heuristic starting phase of a complete (i.e. deterministic) global optimization solver (e.g. branch-and-bound), it will be interesting to know in what percentage of problems it actually finds the global optimum.
There are several more aspects that need to be considered for a good comparison test. Taking a close look at already existing comparison tests might provide additional information (http://coco.gforge.inria.fr/doku.php?id=bbob-2010, https://plus.google.com/photos/101835671426479336232/albums?banner=pwa, http://archimedes.cheme.cmu.edu/?q=dfocomp, etc.)
1. Measuring performance. You must decide what performance means for you (convergence, CPU time, robustness, etc.) and measure it.
2. Comparing the performance of different algorithms. As we are dealing with heuristic and, as your question suggests, stochastic algorithms, you need to repeat the experiments a sufficient number of times and apply statistical tests to compare the results of the methods (see the sketch below).
This topic has to do with the Design of Experiments (http://en.wikipedia.org/wiki/Design_of_experiments) area.
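A minimal sketch of such a comparison for two algorithms on one problem, using a non-parametric test from scipy.stats (the result vectors are made up):

```python
# Sketch of a statistical comparison: final objective values of several
# independent runs per algorithm, compared with a Mann-Whitney U test.
# For more than two algorithms, a Friedman test with post-hoc procedures
# (as discussed in the Garcia & Herrera paper below) is the usual route.
from scipy.stats import mannwhitneyu

algo_a = [0.12, 0.10, 0.15, 0.09, 0.11, 0.14, 0.10, 0.13, 0.12, 0.11]
algo_b = [0.20, 0.18, 0.22, 0.19, 0.25, 0.17, 0.21, 0.23, 0.20, 0.19]

stat, p_value = mannwhitneyu(algo_a, algo_b, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```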
I think that these readings can be helpful to clarify these points:
Cohen, P. R. (1995). Empirical Methods for Artificial Intelligence (Vol. 139). Cambridge: MIT Press.
Bartz-Beielstein, T. (2006). Experimental Research in Evolutionary Computation: The New Experimentalism. Springer. (http://link.springer.com/book/10.1007%2F3-540-32027-X)
Garcia, S., & Herrera, F. (2008). An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons. Journal of Machine Learning Research, 9, 2677–2694. (http://www.jmlr.org/papers/v9/garcia08a.html)