I would agree with Joachim's answer (i.e. show that your algorithm reflects some independent, possibly subjective, measure of similarity better than the alternatives do), but I would add some kind of runtime benchmark for computing the measure over sets of N items. Path similarity metrics can be computationally expensive, and a very intuitive metric is no good if we can't actually deploy it on real-world data. A sketch of such a benchmark follows below.
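To make the benchmarking point concrete, here is a minimal sketch of the kind of measurement I mean. Everything in it is an assumption for illustration: the `similarity` function is just a character-bigram Jaccard stand-in for whatever metric is under test, and the items are random strings. The only part that matters is timing the all-pairs computation as N grows.

```python
import time
import random
import string

# Stand-in for the metric under test; replace with your own
# path-similarity function. Character-bigram Jaccard is used
# here only so the script runs end to end.
def similarity(a: str, b: str) -> float:
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def benchmark(n: int) -> None:
    # Random 8-character items stand in for real data.
    items = ["".join(random.choices(string.ascii_lowercase, k=8))
             for _ in range(n)]
    start = time.perf_counter()
    for i in range(n):
        for j in range(i + 1, n):
            similarity(items[i], items[j])
    elapsed = time.perf_counter() - start
    pairs = n * (n - 1) // 2
    print(f"n={n:5d}: {elapsed:7.3f}s total, {pairs / elapsed:10.0f} pairs/s")

# Doubling n shows how the cost scales (all-pairs is O(n^2) calls).
for n in (100, 200, 400, 800):
    benchmark(n)
```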
One other option I've seen, though I can't quite flesh it out from the details in your question, is to show that your algorithm drives better performance than the alternatives in some specific application - that is, that your metric makes it easier to solve some other problem. Not a terribly specific suggestion, I know, but worth mentioning?
Usually a similarity metric is used to do something else, like clustering, classification, or ranking. So you need to find a suitable problem involving, e.g., clustering. You choose an accepted clustering method and an appropriate evaluation method for the cluster quality. Then you repeat the experiment several times, keeping everything constant and changing only the similarity measure between the items. If the new measure gives the best results, it was the most appropriate measure for the given data set and the given task. Ideally you should repeat the experiment for several tasks and several data sets. A minimal sketch of this protocol is given below.
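Here is a small sketch of that protocol, assuming Python with SciPy and scikit-learn. The toy items, gold labels, and the character-bigram similarity are placeholders for a real data set and the measure under evaluation, and the adjusted Rand index stands in for whichever cluster-quality evaluation you pick. The point is that the clustering method and the evaluation stay fixed while only the similarity function is swapped between runs.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

# Toy items with gold-standard cluster labels; in a real experiment
# these come from a benchmark data set.
items = ["apple", "apples", "applet", "orange", "oranges", "orangeade"]
gold = [0, 0, 0, 1, 1, 1]

# Stand-in for the similarity measure under evaluation.
def similarity(a, b):
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def evaluate(sim):
    n = len(items)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - sim(items[i], items[j])
    # Keep the clustering method fixed (average-link hierarchical);
    # only the similarity measure changes between experimental runs.
    tree = linkage(squareform(dist), method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")
    # Fixed evaluation: agreement with the gold clustering.
    return adjusted_rand_score(gold, labels)

print(f"Adjusted Rand Index: {evaluate(similarity):.3f}")
```

To compare measures, call `evaluate` once per candidate similarity function and rank them by the resulting score.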
For many tasks there exist established benchmark data sets, and it is best to use one of them. For word similarity, for example, you should include the TOEFL synonym questions (http://aclweb.org/aclwiki/index.php?title=TOEFL_Synonym_Questions_%28State_of_the_art%29).
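For the TOEFL synonym task specifically, the evaluation loop is simple enough to sketch: for each question, pick the candidate most similar to the stem word and report the fraction answered correctly. The two items below are only illustrative placeholders (the real questions come from the data set linked above), and the bigram similarity is again a stand-in for the measure under test.

```python
# Hypothetical TOEFL-style items: (stem, candidate answers, gold index).
questions = [
    ("enormously",
     ["appropriately", "uniquely", "tremendously", "decidedly"], 2),
    ("provisions",
     ["stipulations", "interrelations", "jurisdictions", "interpretations"], 0),
]

# Stand-in similarity; plug in the measure under test.
def similarity(a, b):
    grams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

def toefl_accuracy(sim):
    # A question counts as correct if the most similar candidate
    # is the gold synonym.
    correct = sum(
        max(range(len(options)), key=lambda k: sim(stem, options[k])) == gold
        for stem, options, gold in questions
    )
    return correct / len(questions)

print(f"Accuracy: {toefl_accuracy(similarity):.2%}")
```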
We have done some small experiments following this pattern to compare different similarity measures: