Galton (1868) famously described great variation between individuals in the vividness with which retrieved visual memories of scenes are typically re-visualized, ranging from "as vivid as perceptual experiencing" to no discernible imagery at all. Seashore's early research revealed similar individual variation in the vividness of recalled (audiated) imagery for musical sounds, and moreover concluded that an individual's characteristic level of vividness does not appear to be enhanceable by short-term or prolonged ear-training regimes. Vividness implies both the fidelity or completeness of detailed feature-representation and the intensity or loudness of recalled images. While I know from my own experimentation with various intensive ear-training tactics that the timbral richness and clarity of temporal onset of retrieved imagery can be much increased thereby, my findings bear out Seashore's regarding loudness: adult individuals have a permanently fixed personal ceiling level that is not affected by the content of the imagery.
The brain's encoding of imagery loudness seems to be poorly understood and not extensively researched. Neural firing rate in primary auditory cortex (A1) has been shown to co-vary with the loudness of a sound stimulus, and Penfield et al.'s famed experiments with electrical stimulation of exposed A1 surfaces in awake surgery patients evoked auditory imagery at life-like levels. A1, however, does not appear to store encoded representations of loudness, and I would presume that higher, cognitive levels are responsible for that. There would also be factors to consider concerning modulation of cortical activation/arousal level by projections from midbrain nuclei. I guess the question of what neural substrates actually determine imagery loudness boils down to identifying which of those modulatory projections have the most impact on experienced loudness when artificially stimulated or suppressed. Would anyone reading be able to point me to any published research on that, please?