Firstly, we need to simulate in some way how the model will perform on unseen cases (since typically the model will be learned from, or calibrated using, some historical software projects). The usual approach is a cross-validation procedure in which the data are split into training and validation cases (i.e. software projects), and this is repeated many times, either randomly or with some stratification procedure.
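To make that concrete, here is a minimal sketch of a repeated cross-validation loop in Python (the linear model, the array names and the fold settings are just illustrative assumptions, not a recommendation):

```python
# Minimal sketch of repeated k-fold cross-validation for an effort model.
# Assumes X (project features) and y (actual effort) are numpy arrays;
# the estimator is illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold

def cross_validate_effort(X, y, n_splits=10, n_repeats=20, seed=1):
    """Return the absolute residuals pooled over all validation folds."""
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    abs_residuals = []
    for train_idx, val_idx in rkf.split(X):
        # fit on the training projects only, then predict the held-out ones
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[val_idx])
        abs_residuals.append(np.abs(y[val_idx] - predictions))
    return np.concatenate(abs_residuals)
```

Pooling the absolute residuals over all the folds gives you the raw material for the accuracy statistic discussed next.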
Secondly, we need some statistic to describe the performance of the model on each validation set (and then to summarise it in some way - ideally with a measure of centre and a measure of spread). In the past researchers used something like MMRE (mean magnitude of relative error), but there is an extensive literature on why this is a biased statistic that will lead you to systematically prefer models that underestimate. I suggest you use something like Standardised Accuracy (SA), as this will give you the ratio of improvement of your model over random guessing. This is described in a 2012 paper by myself and Stephen MacDonell:
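To illustrate what that involves, here is a rough sketch of computing SA: MAR is the mean absolute residual of the model, and the random-guessing baseline MAR_P0 is estimated by Monte Carlo, where each project is 'predicted' by the actual effort of another, randomly chosen project. The function names and the number of runs below are my own choices, not from the paper; see the paper for the full definition.

```python
# Rough sketch of Standardised Accuracy (SA); names and defaults are mine.
import numpy as np

def mar(actual, predicted):
    """Mean absolute residual of a set of predictions."""
    return np.mean(np.abs(np.asarray(actual) - np.asarray(predicted)))

def mar_p0(actual, n_runs=1000, seed=1):
    """Monte Carlo estimate of the mean absolute residual of random
    guessing: each case is 'predicted' by the actual effort of another,
    randomly chosen, case."""
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual)
    n = len(actual)
    run_means = []
    for _ in range(n_runs):
        # for each case i, pick a donor case j != i and use its actual value
        donors = np.array([rng.choice(np.delete(np.arange(n), i)) for i in range(n)])
        run_means.append(np.mean(np.abs(actual - actual[donors])))
    return np.mean(run_means)

def standardised_accuracy(actual, predicted, n_runs=1000):
    """SA = 1 - MAR / MAR_P0, expressed here as a percentage."""
    return (1.0 - mar(actual, predicted) / mar_p0(actual, n_runs)) * 100.0
```

An SA close to zero means the model is doing little better than guessing, and a negative value means it is actually doing worse.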