Assume a memory-access-bound workload such as graph analytics, machine learning, or Monte Carlo simulations. Assume a high-end single-chip GPU of the current generation (2020).
Since it is not clear what the task is or what the code looks like (or I may not have understood the question that well), one approach that may work is the following: if you have an explicit thread that handles synchronization, then the synchronization time should be the total computation time minus the time taken by the fastest GPU to finish its share of the work (see the sketch below).
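As a rough illustration of that first approach, here is a minimal sketch, assuming one host worker thread per GPU, a hypothetical `work_kernel` standing in for the real workload, and host-side wall clocks; the joining thread plays the role of the explicit synchronization thread. It is not meant as a definitive implementation, just the timing arithmetic spelled out:

```cpp
// Sketch: estimate sync/imbalance time as total wall time minus the fastest
// GPU's own compute time. work_kernel is a hypothetical placeholder.
#include <cuda_runtime.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

__global__ void work_kernel() { /* placeholder for the real workload */ }

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    std::vector<double> gpu_seconds(num_gpus, 0.0);
    auto total_start = std::chrono::steady_clock::now();

    // One host thread per GPU; each measures only its own device's work.
    std::vector<std::thread> workers;
    for (int dev = 0; dev < num_gpus; ++dev) {
        workers.emplace_back([dev, &gpu_seconds]() {
            cudaSetDevice(dev);
            auto start = std::chrono::steady_clock::now();
            work_kernel<<<1024, 256>>>();
            cudaDeviceSynchronize();  // wait for this GPU only
            auto end = std::chrono::steady_clock::now();
            gpu_seconds[dev] = std::chrono::duration<double>(end - start).count();
        });
    }
    for (auto& t : workers) t.join();  // acts as the synchronization thread

    double total = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - total_start).count();
    double fastest = *std::min_element(gpu_seconds.begin(), gpu_seconds.end());

    printf("total %.3f s, fastest GPU %.3f s, sync estimate %.3f s\n",
           total, fastest, total - fastest);
    return 0;
}
```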
A lower-level approach would be to let each GPU report the time it took to complete its task and then subtract the time taken by the fastest GPU (i.e. the minimum time) from the total time. The difference would represent the synchronization time.
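A sketch of that second approach using CUDA events, again with a hypothetical `work_kernel`: each device records its own start/stop events, and the total time is approximated here by the slowest device's elapsed time, since all kernels are launched back-to-back from one host loop. Adapt as needed for your real launch pattern:

```cpp
// Sketch: each GPU reports its own elapsed time via CUDA events; the host
// subtracts the minimum (fastest GPU) time from the overall span.
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <vector>

__global__ void work_kernel() { /* placeholder for the real workload */ }

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    std::vector<cudaEvent_t> start(num_gpus), stop(num_gpus);

    // Launch asynchronously on every device, bracketed by events.
    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaEventCreate(&start[dev]);
        cudaEventCreate(&stop[dev]);
        cudaEventRecord(start[dev]);
        work_kernel<<<1024, 256>>>();
        cudaEventRecord(stop[dev]);
    }

    // Collect per-GPU times; the slowest device approximates the total span.
    std::vector<float> ms(num_gpus);
    float total_ms = 0.0f;
    for (int dev = 0; dev < num_gpus; ++dev) {
        cudaSetDevice(dev);
        cudaEventSynchronize(stop[dev]);
        cudaEventElapsedTime(&ms[dev], start[dev], stop[dev]);
        total_ms = std::max(total_ms, ms[dev]);
    }
    float fastest_ms = *std::min_element(ms.begin(), ms.end());

    printf("slowest GPU %.2f ms, fastest GPU %.2f ms, sync estimate %.2f ms\n",
           total_ms, fastest_ms, total_ms - fastest_ms);
    return 0;
}
```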