It's very common to use multiple losses. People usually multiply each individual loss by a trade-off factor and sum them up, like the example below (a generator loss in WGAN-GP):
g_loss = -diff + lambda1 * gradient_penalty + lambda2 * mse_loss
g_loss.backward()
So the problem arises: how do I appraise the effect of each loss, so that I can tune the trade-off factor lambda? In the WGAN-GP case above, the last term mse_loss is a custom loss I added to the total loss. How should I adjust the factor lambda2 so that mse_loss takes effect but does not become excessively dominant?
Of course, a hyper-parameter search could solve this problem, but I'm looking for a more elegant solution: I want to appraise the effect of each loss directly and quantitatively, and set the factor according to that appraisal.
At first glance, my intuition was something like this:
OK, I would plot the curve of each loss over training. After comparing the magnitudes of the losses, I would assign a larger factor λ to the smaller one to promote it.
But on deeper thought, I realized this is wrong and makes no sense. It is the gradient of the loss that really matters, and basic calculus tells me that the value of a function f(x) says nothing about its derivative df(x)/dx. Therefore, a loss with a larger magnitude does not promise a larger gradient back-propagated to the network, and so does not promise a larger effect.
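A toy example makes this concrete (purely illustrative numbers, nothing to do with my actual model): the first loss has a value around 1000 but a tiny gradient, while the second has a value of 1 but a gradient norm four orders of magnitude larger.

```python
import torch

x = torch.ones(10, requires_grad=True)

loss_big_flat = 1000.0 + 1e-3 * x.sum()   # value ~ 1000.01, d/dx = 0.001 per element
loss_small_steep = 10.0 * x.sum() - 99.0  # value = 1.0,     d/dx = 10    per element

g1 = torch.autograd.grad(loss_big_flat, x)[0]
g2 = torch.autograd.grad(loss_small_steep, x)[0]

print(loss_big_flat.item(), g1.norm().item())     # ~1000.01, ~0.0032
print(loss_small_steep.item(), g2.norm().item())  # 1.0,      ~31.6
```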
I couldn't figure it out, so I've come to ask: is there any good way to appraise the effect of each loss directly and quantitatively? Do I have to print the gradient of each loss and analyse them?
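By "print the gradient of each loss" I mean something like the sketch below: back-propagating each term separately and comparing the gradient norms it induces on the generator's parameters. This is only a rough sketch; netG is a placeholder name for the generator, and diff, gradient_penalty, and mse_loss are assumed to be the scalar tensors from the snippet above.

```python
import torch

def grad_norm(term, params):
    # L2 norm of d(term)/d(params). retain_graph lets us query several
    # terms from the same forward pass; allow_unused handles terms that
    # do not touch every parameter.
    grads = torch.autograd.grad(term, params, retain_graph=True, allow_unused=True)
    flat = [g.reshape(-1) for g in grads if g is not None]
    return torch.cat(flat).norm() if flat else torch.tensor(0.0)

params = [p for p in netG.parameters() if p.requires_grad]
n_w = grad_norm(-diff, params)              # Wasserstein term
n_gp = grad_norm(gradient_penalty, params)  # gradient-penalty term
n_mse = grad_norm(mse_loss, params)         # my custom MSE term
print(f"|g_w|={n_w.item():.4f}  |g_gp|={n_gp.item():.4f}  |g_mse|={n_mse.item():.4f}")
```

If this is the right idea, lambda2 could then be set so that lambda2 * n_mse stays at some fixed fraction of n_w over training, but I'm not sure whether that is reliable in practice or whether there is a more standard approach.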