Hello everyone,

While reimplementing a video summarization model, I noticed something unexpected: my reproduced results give higher F1 scores than the baseline reported in the original paper. I made no intentional architectural changes; I only fixed some minor bugs (e.g., in data handling).

My questions are:

  • Is it common for reimplementations to outperform the reported baseline due to bug fixes, evaluation inconsistencies, or skipped videos during testing?
  • Could evaluation protocols (e.g., averaging vs. taking the maximum F1 over a video's multiple reference summaries) also explain such differences? (See the sketch after this list for what I mean.)
  • In general, how should one interpret these improvements — as a genuine enhancement or as an artifact of different evaluation setups?
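For concreteness, here is a minimal sketch of the two per-video aggregation choices I have in mind. This is my own illustrative code, not the original paper's evaluation script; it assumes each video has several reference summaries (one per annotator) represented as binary keyshot indicator vectors:

```python
import numpy as np

def f1_score(pred, ref):
    """F1 between a predicted and a reference binary keyshot vector."""
    overlap = np.sum(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / np.sum(pred)
    recall = overlap / np.sum(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(pred_summaries, ref_summaries, per_video="avg"):
    """
    pred_summaries: dict video_id -> binary np.array (predicted keyshots)
    ref_summaries:  dict video_id -> list of binary np.arrays (one per annotator)
    per_video: "avg" or "max" -- how to aggregate over the multiple
               reference summaries of a single video; benchmarks differ
               in which convention they use.
    """
    scores = []
    for vid, pred in pred_summaries.items():
        f1s = [f1_score(pred, ref) for ref in ref_summaries[vid]]
        scores.append(np.mean(f1s) if per_video == "avg" else np.max(f1s))
    # The dataset-level score is the mean over videos; silently skipping
    # videos here (e.g. because they failed to load) also changes the result.
    return float(np.mean(scores))
```

On the same predictions, switching per_video between "avg" and "max" (or dropping a few videos from the loop) can shift the dataset-level F1 noticeably, which is part of why I suspect the evaluation setup rather than the model itself.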

Any insights from those who have reimplemented models in video summarization (or related areas) would be really helpful.

Thank you!
