So, basically, you are attempting something similar to this article: "Transforming remote sensing images to textual descriptions"?
All the criteria for evaluation are heavily dependent on "fitness for use" in some specific situation or scenario, especially on whether the sentence collection is meant to stand alone or to accompany the image. In either case, though, the sentence collection should be bracketed by an intro and an outro describing the scene's hierarchy and contents, not just presented as a list of relationships between the individual designated objects.
(There is a wealth of literature on this, for example on interpreting spatial information for people with visual impairments, on giving walking directions, etc.)
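
To make the bracketing idea concrete, here is a minimal Python sketch. It is only an illustration, not taken from the article: the function name `describe_scene`, the `groups` dictionary standing in for the scene hierarchy, and the sentence templates are all assumptions made up for this example.

```python
def describe_scene(scene_name, groups, relations):
    """Return a list of sentences: intro, pairwise relations, outro.

    scene_name: short label for the scene (hypothetical input)
    groups:     dict mapping a group label to a list of object names,
                a stand-in for the scene hierarchy (assumed structure)
    relations:  list of (subject, relation, object) tuples
    """
    # Intro: name the scene and summarize its top-level contents.
    contents = ", ".join(f"{label} ({len(objs)})" for label, objs in groups.items())
    intro = f"Scene: {scene_name}. It contains: {contents}."

    # Body: one sentence per relationship between designated objects.
    body = [f"{subj} is {rel} {obj}." for subj, rel, obj in relations]

    # Outro: restate the overall structure so the text can stand alone.
    total = sum(len(objs) for objs in groups.values())
    outro = f"In total: {total} objects in {len(groups)} groups."

    return [intro, *body, outro]


sentences = describe_scene(
    "harbor",
    {"ships": ["ship A", "ship B"], "piers": ["pier 1"]},
    [("ship A", "docked at", "pier 1")],
)
print("\n".join(sentences))
```

Whether an intro and outro like these are adequate still depends on the intended use, for example whether the reader also sees the image.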