This is an interesting topic in deep learning, with the typical question: ‘Now that I’ve created (generative adversarial networks) or modified (super-resolution or de-noising) an image, how close is it to the true image or gold standard?’ There are many methods for evaluating images: Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), Mean Squared Error (MSE), Visual Information Fidelity (VIF), Universal Quality Index (UQI), Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID). These metrics, however, cannot be relied upon as the sole proxy for human judgement of image quality (Armanious et al. 2019).
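As a minimal sketch of the two simplest metrics above, MSE and PSNR can be computed directly with NumPy (PSNR is defined as 10·log10(MAX²/MSE), where MAX is the maximum possible pixel value, e.g. 255 for 8-bit images). Function names here are illustrative, not from any particular library:

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two images of the same shape."""
    return float(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2))

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    m = mse(a, b)
    if m == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / m)

# Toy example: a flat 4x4 image with a single corrupted pixel.
ref = np.full((4, 4), 100, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 110  # error of 10 at one of 16 pixels

print(mse(ref, noisy))   # 10**2 / 16 = 6.25
print(round(psnr(ref, noisy), 2))
```

For SSIM, LPIPS or FID, established implementations (e.g. scikit-image for SSIM) are preferable to hand-rolled code, since those metrics involve windowed statistics or pretrained networks.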
Armanious et al. (2019). MedGAN: Medical Image Translation using GANs. Downloaded on 2019061x from: https://arxiv.org/abs/1806.06397
Heusel et al. (2017). GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. https://arxiv.org/abs/1706.08500
Zhang et al. (2018). The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. https://arxiv.org/abs/1801.03924v2
Two interesting sources that discuss evaluating 3D images:
Taha & Hanbury (2015). Metrics for Evaluating 3D Medical Image Segmentation: Analysis, Selection, and Tool. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4533825/
Murphy (2011). Development and Evaluation of 2D and 3D Image Quality Metrics.