When the recognition rates of the two modalities differ significantly, the sum rule should become a weighted sum rule.
The normal sum rule looks like this: SumConf = (Conf1 + Conf2) / 2
A weighted sum rule then looks like this: WeightedSumConf = Weight1 * Conf1 + Weight2 * Conf2, with Weight1 + Weight2 = 1.
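The two rules can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes both confidences are already on the same scale.

```python
def sum_rule(conf1, conf2):
    """Plain sum rule: the unweighted mean of the two confidences."""
    return (conf1 + conf2) / 2


def weighted_sum_rule(conf1, conf2, weight1):
    """Weighted sum rule; weight2 is derived so that weight1 + weight2 = 1."""
    if not 0.0 <= weight1 <= 1.0:
        raise ValueError("weight1 must lie in [0, 1]")
    weight2 = 1.0 - weight1
    return weight1 * conf1 + weight2 * conf2
```

With weight1 = 0.5 the weighted rule reduces to the plain sum rule; a larger weight1 gives the first modality more influence on the fused score.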
You should choose the weights that give the best fused performance, for example by trying a range of weights on a validation set and keeping the best one. Alternatively, you can relate the weights to the performance of the individual modalities, so that the more accurate modality gets the larger weight.
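Both options could look roughly like the sketch below. It assumes you have normalized validation scores from each modality with genuine/impostor labels (1/0), and that a fused score at or above a threshold counts as an accept. The function names, the 0.5 threshold, and the grid size are assumptions for illustration; in practice you may prefer to optimize a different metric such as the equal error rate.

```python
def fused_accuracy(scores1, scores2, labels, weight1, threshold=0.5):
    """Fraction of samples classified correctly by the weighted sum rule."""
    correct = 0
    for s1, s2, label in zip(scores1, scores2, labels):
        fused = weight1 * s1 + (1.0 - weight1) * s2
        predicted = fused >= threshold
        correct += int(predicted == bool(label))
    return correct / len(labels)


def best_weight(scores1, scores2, labels, steps=20, threshold=0.5):
    """Option 1: grid search over weight1 in [0, 1] on validation data."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda w: fused_accuracy(scores1, scores2, labels, w, threshold),
    )


def weights_from_accuracy(acc1, acc2):
    """Option 2: weights proportional to each modality's own accuracy."""
    total = acc1 + acc2
    return acc1 / total, acc2 / total
```

The grid search should be run on data that is kept separate from the final test set, otherwise the reported performance will be optimistic.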
Have you normalized both of the scores? If the two scores are on different scales, the weights will not mean what you expect. The weights should also not be adjusted arbitrarily. Instead, you should run experiments to estimate the appropriate weights.
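As a minimal sketch of the normalization step, here is min-max scaling to [0, 1]. Min-max is only one common choice and is an assumption here, as is the function name. For simplicity the bounds are taken from the list passed in, whereas in a real system you would normally estimate them once from training scores.

```python
def min_max_normalize(scores):
    """Rescale a list of raw matcher scores linearly to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no spread to rescale, so map everything to 0.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```

Apply it to each modality's scores separately before fusing them with the weighted sum rule.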