It strongly depends on the task the neural networks perform (classification, clustering, or regression). For a detailed list of evaluation metrics and their calculation formulas, I suggest reading:
The problem is especially apparent when you compare two NN models and the metrics show different tendencies. For example, suppose you trained two models and obtained the following evaluation results:
Metric              Model 1    Model 2
Training Loss       0.4195     0.3354
Training Acc        0.8325     0.8583
Training Recall     0.7266     0.7791
Validation Loss     0.4331     0.3791
Validation Acc      0.8483     0.8493
Validation Recall   0.7400     0.7773
Test Acc            0.7864     0.8254
Test Recall         0.7657     0.7602
The problem here is that although the second model achieves a lower loss (and a higher test accuracy), the first model achieves better recall on the test set. If you base the comparison on the loss alone, you would conclude that the second model is the better one, even though the first model wins on test recall.
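As a concrete illustration, here is a minimal sketch of computing these three metrics side by side for two models on the same test set. The label and probability arrays are placeholder values, not the numbers from the table above:

```python
# Compare two binary classifiers on the same test set across loss,
# accuracy, and recall. All data below is placeholder, for illustration.
import numpy as np
from sklearn.metrics import log_loss, accuracy_score, recall_score

y_test = np.array([0, 1, 1, 0, 1, 0, 1, 1])                    # true labels
proba_1 = np.array([0.2, 0.7, 0.6, 0.4, 0.8, 0.3, 0.4, 0.9])   # model 1 P(class=1)
proba_2 = np.array([0.1, 0.8, 0.7, 0.2, 0.9, 0.2, 0.6, 0.7])   # model 2 P(class=1)

for name, proba in [("Model 1", proba_1), ("Model 2", proba_2)]:
    preds = (proba >= 0.5).astype(int)   # threshold probabilities at 0.5
    print(f"{name}: loss={log_loss(y_test, proba):.4f}, "
          f"acc={accuracy_score(y_test, preds):.4f}, "
          f"recall={recall_score(y_test, preds):.4f}")
```

Seeing all three numbers together, rather than one at a time, makes the kind of disagreement shown in the table above obvious at a glance.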
Briefly, a comparison of the performance (e.g. accuracy) of two neural networks should relate directly to the difference in their architectures (assuming both networks are trained on the same dataset, with a regularization technique such as early stopping used to stop training). So, when comparing two neural networks on a particular task, first consider how similar they are and where they differ in terms of model components.
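If you go this route, it helps to train both networks under an identical early-stopping regime so that neither is favored by a longer training budget. A minimal Keras sketch, assuming hypothetical `model_1`, `model_2`, `x_train`, and `y_train` already exist:

```python
# Train two compiled Keras models under the same early-stopping rule.
# model_1, model_2, x_train, y_train are hypothetical placeholders.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # stop when validation loss stops improving
    patience=5,                  # tolerate 5 stagnant epochs before stopping
    restore_best_weights=True,   # roll back to the best weights seen
)

for model in (model_1, model_2):
    model.fit(x_train, y_train,
              validation_split=0.2,   # same held-out fraction for both models
              epochs=100,
              callbacks=[early_stop])
```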
Once you identify the model components that make each neural network unique, an ablation experiment can help you learn which components actually contribute to performance. It can also reveal redundant model components (or model parameters) and other hyper-parameters worth tuning to increase performance. But of course, a deep understanding of the task, or prior experience, also helps with hyper-parameter tuning.
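As a rough illustration, an ablation experiment can be as simple as toggling each distinguishing component on and off and retraining. The sketch below assumes the components under study are a dropout layer and a second hidden layer (illustrative choices, not tied to any particular architecture), with synthetic data standing in for your dataset:

```python
# Minimal ablation sketch: retrain one variant per component combination
# and compare validation accuracy. Data and components are illustrative.
from itertools import product
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x, y = rng.normal(size=(512, 20)), rng.integers(0, 2, size=512)

def build_model(use_dropout, use_second_layer):
    layers = [tf.keras.Input(shape=(20,)),
              tf.keras.layers.Dense(32, activation="relu")]
    if use_dropout:
        layers.append(tf.keras.layers.Dropout(0.5))
    if use_second_layer:
        layers.append(tf.keras.layers.Dense(32, activation="relu"))
    layers.append(tf.keras.layers.Dense(1, activation="sigmoid"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Train every on/off combination of the two components being ablated.
for use_dropout, use_second_layer in product([True, False], repeat=2):
    model = build_model(use_dropout, use_second_layer)
    hist = model.fit(x, y, validation_split=0.2, epochs=5, verbose=0)
    print(f"dropout={use_dropout}, second_layer={use_second_layer}: "
          f"val_acc={hist.history['val_accuracy'][-1]:.3f}")
```

A variant whose score barely changes when a component is removed tells you that component is likely redundant; a variant that degrades sharply tells you the component is doing real work.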