Traditional object detection and scene segmentation models can extract features from video, but as far as I understand, those features cannot be used to restore the pixel information of the original video (please correct me if I have this wrong). Which papers introduce a method that extracts video features and then uses those features for video reconstruction, that is, a feature-to-pixel mapping? In other words, can the extracted video features both carry semantic information and be used to reconstruct the video?