Hello everyone,
I am trying to develop an RL agent for a simple double integrator system. Unfortunately, my agent does not converge to the maximum reward. The attached figure shows the average episode reward versus episode number. What explains the gap from the maximum reward and the oscillatory behavior of the collected reward?