Yes, it is possible to detect visual objects and attempt to describe a video using its metadata. However, metadata alone may not always be enough for a thorough and accurate description of a video file. It would also be useful to extract additional information aimed at understanding the high-level meaning of the visual data, possibly by translating computable low-level multimedia features (such as colour histograms, shape, and texture) into high-level semantic concepts that humans can relate to.
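
To make the idea of going from a low-level feature to a semantic concept more concrete, here is a minimal sketch in plain Python. It computes a coarse colour histogram for a frame and assigns the label of the closest reference histogram. Everything in it is an assumption made for illustration: the function names, the two concept labels ("sky" and "grass"), the synthetic pixel lists, and the nearest-prototype rule itself. A real system would decode actual video frames and would most likely use a trained classifier over richer features.

```python
from math import sqrt


def colour_histogram(pixels, bins_per_channel=2):
    """Return a normalised RGB histogram for a list of (r, g, b) pixels (0..255)."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Quantise each channel into bins_per_channel buckets.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def distance(a, b):
    """Euclidean distance between two histograms of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_frame(pixels, prototypes):
    """Pick the concept whose reference histogram is closest to the frame's."""
    hist = colour_histogram(pixels)
    return min(prototypes, key=lambda name: distance(hist, prototypes[name]))


# Hypothetical reference data: synthetic "frames" standing in for labelled examples.
SKY = [(40, 90, 220)] * 80 + [(240, 240, 250)] * 20
GRASS = [(30, 160, 40)] * 90 + [(90, 60, 30)] * 10
PROTOTYPES = {
    "sky": colour_histogram(SKY),
    "grass": colour_histogram(GRASS),
}

# A new, unlabelled frame that is mostly blue with some white.
frame = [(50, 100, 200)] * 70 + [(230, 230, 240)] * 30
print(label_frame(frame, PROTOTYPES))
```

The histogram is the computable low-level feature, and the label is the human-readable concept. Colour alone is a weak cue, which is why shape and texture features are usually combined with it.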