First of all, remember that O(N log N) performance is noticeable only when N is large. If you are looking at a Verilog implementation, where will data be stored? Will you assume that it will be in an external memory and implement the core of the merge sort algorithm as a Verilog module? Think about this aspect. Earlier, people have proposed systolic arrays for sorting.
Thanks C.P. Ravikumar now I know how this architecture is called. Today I managed to implement merge sort with sistolic arrays with sorting time equal to the length of array positive edges of clock signal. On the attached picture sorting 32 elements array in 31 steps
It is good to know this. Please also look for odd-even transposition sort. n processors can be used to sort n elements in n-1 steps. The processors are quite simple, mostly comparators and some registers.