Why might reimplementation results in higher F1 scores than the reported baseline?

More Sarah Helal's questions See All

Absorption coefficient of methane?

Hello, Can anyone provide me with the absorption coefficient of methane gas at 7.7 um? Any reference?

06 August 2024 980 5 View

How are Large Models Exploring and Outputting Knowledge Understanding in Specific Content Areas, and What Does Academic Research Say About It?

Hello everyone！ I am currently exploring the performance of large models in understanding knowledge in specific domains, and attempting to construct a knowledge framework similar to what...

05 August 2024 5,729 2 View

Regarding a model for simulating battery charge and discharge, what do you consider to be high fidelity?

Regarding a model for simulating battery charge and discharge, what do you consider to be high fidelity? What is the acceptable percentage of error (regardless of the metric)? Could you suggest...

03 August 2024 5,358 0 View

How do i get an account to upload my published papers?

need to open an account to upload my published papers

01 August 2024 9,255 1 View

What is the problem with these tissue culture plants?

All plants are green but some of these plants becomes yellow. I did not found any reason. Please help me to find out the real problem.

01 August 2024 589 4 View

How to correctly use the UTE and ZTE pulse sequences in Bruker's ParaVision software?

I am using a Bruker 600M solid-state NMR spectrometer with a Micro 2.5 microimaging system. The test sample is a tube of 1M LiCl aqueous solution, and the nucleus detected is 1H. I am trying to...

01 August 2024 9,227 1 View

Is artifacts in XPS possible to build high deviation in binding energy larger than 5 eV??

Hello. Thanks for your consideration to see my question. Recently, I conducted XPS anaylsis of g-CN that is prepared from thermal polycondensation of DCDA, so-called conventional bulk-g-CN,...

30 July 2024 9,824 2 View

Tips for teaching Orton Gillingham+ to students with autism?

I am beginning to use OG+ in my district and noticed during training there are a lot of materials and routines, which may be overwhelming for my students, especially those on the autism spectrum....

30 July 2024 5,695 0 View

Which statistical test should we use?

N=6 Comparing pre and post test likert scale responses. Participants are mix of practicing & preservice teachers.

30 July 2024 7,233 4 View

How to build my own lab made four point probe set up?

Hello, I'm trying to measure the conductivity of semiconductor films but since I don't have a commercial four point probe set up I would like to build one on my own in my lab. I have generators,...

30 July 2024 906 2 View

Baseline drift in HPLC? What causes this?

Hello, Why do i see this baseline drift when i compare my blank (black) to the sample (blue)? Any suggestions as to why this happened? Thank you!

11 August 2024 3,770 4 View

Some new emerging problems on application of RL for scheduling in IoT networks?

I have seen plenty of existing works on applied Reinforcement Learning (RL) policies for optimized scheduling in IoT networks including Q-learning, DQNs, and the newer ones including PPO for...

01 August 2024 8,754 2 View

Hi there, someone has the SeinFit software for windows because I cannot download it?

DOS version.

29 July 2024 6,064 1 View

There is possible way to calculate experimental DOS from the UPS spectrum of the material. How to calculate the DOS from UPS?

Here, I have attached the UPS graph. I'm trying to calculate the DOS/DOVS from the UPS.

29 July 2024 4,971 1 View

What are needed modules for an IoT waterlevel monitoring system?

I want to know the modules needed for an IoT project for water level monitoring

27 July 2024 1,502 3 View

Can we eliminate the stress singularity at the tip of the crack by manipulating the elastic constants?

The aim of the research here is to prevent the propagation of the crack in the fabricated elastic medium with useful applications.

25 July 2024 9,976 3 View

How do we pick data for determination of Validation Acceptance Criteria?

Hello, colleagues! There is commenting open for new upcoming edition of USP 1033. Validation target acceptance criteria is now different from what it used to be and it doesn't include Cpm....

23 July 2024 7,292 3 View

Can water conserved in micro watershed is used only for life saving irrigation but not for post monsoon cropping?

21 July 2024 4,797 0 View

Can harvested rainwater be stored & used for year & why rainwater harvesting is most effective means to conserve water n rain-fed agriculture areas?

Can harvested rainwater be stored and used for year and why rainwater harvesting is the most effective means to conserve water n rain-fed agriculture areas?

19 July 2024 6,612 3 View

Who wants opportunities for scientific cooperation?

Dear Colleagues, I hope this message finds you well. My name is Noor Al-Huda K. Hussein,and I am a researcher specializing in deep learning applications in genetic data analysis. I am currently...

18 July 2024 5,562 0 View

Ahmed Ali

It's actually a common, if poorly discussed, phenomenon in research: a careful reimplementation of a model often outperforms the original paper's results. This usually isn't due to a secret breakthrough, but to the quiet, unglamorous work of fixing hidden bugs and standardizing evaluation.

The reasons are almost always practical. Small oversights in the original code, like misaligned video frames, inconsistent data preprocessing, or even a handful of missing samples, can subtly hamper performance. When these are corrected in a reimplementation, the model finally performs as it was always meant to, making it seem "better."

However the biggest factor is the evaluation methodology itself. Seemingly minor choices, like how to average F1 scores across videos or which random seed to use, can swing results by several percentage points. A higher score might not mean a smarter model, but a more precise and consistent measurement.

So, an improved score should be seen less as a new claim and more as a sign of the field's reproducibility challenges. The most valuable contribution in such cases is transparency: meticulously documenting what was changed, clarifying the evaluation protocol, and sharing the code. This allows the community to understand whether the gain is a genuine step forward or simply the result of a more rigorous and careful setup.

Khadersab Adamsab

Reimplementation may yield higher F1 scores than the reported baseline due to differences in preprocessing steps, hyperparameter tuning, updated libraries, or computational environments. Improved optimization methods and hardware can also enhance training stability and performance. Additionally, subtle variations in random dataset splits may lead to better generalization, making the reimplemented model outperform the original baseline.