I'm training an agent to accomplish a reaching task. The agent controls a multi-joint robotic arm and has to reach for a target. So far, I've had some success with vanilla policy gradient but, to my surprise, I can't get it to work with actor-critic.

I'm wondering how I can find out what makes it fail. I've tried various reward functions, but none of them was robust enough, so I'd like to monitor the agent more closely during training. What values do you think might give some insight?

I've thought of:

  • Critic/Value loss
  • Actor/Policy mean loss and variance
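To make the question concrete, here is a minimal sketch of the kind of per-update diagnostics I have in mind, assuming a Gaussian policy over joint torques. The function and variable names are just placeholders, not code from my actual agent:

```python
import numpy as np

def actor_critic_diagnostics(returns, values, advantages, log_stds):
    """Compute a few training diagnostics for an actor-critic agent.

    returns    : empirical (discounted) returns for each state in the batch
    values     : critic predictions V(s) for the same states
    advantages : advantage estimates used in the policy update
    log_stds   : log standard deviations of the Gaussian policy (one per action dim)
    """
    returns, values, advantages = map(np.asarray, (returns, values, advantages))

    # Critic quality: fraction of the variance in the returns the critic explains.
    # ~1.0 means the critic fits the returns well; <= 0 means it is useless or diverging.
    var_returns = np.var(returns)
    explained_var = np.nan if var_returns == 0 else 1.0 - np.var(returns - values) / var_returns

    # Critic loss (MSE between predictions and returns).
    value_loss = np.mean((returns - values) ** 2)

    # Advantage statistics: if the mean drifts far from 0 or the std explodes,
    # the policy gradient is likely dominated by a few transitions.
    adv_mean, adv_std = advantages.mean(), advantages.std()

    # Entropy of a diagonal Gaussian policy: entropy collapsing early usually
    # means premature convergence to a (possibly bad) deterministic policy.
    log_stds = np.asarray(log_stds)
    entropy = np.sum(log_stds + 0.5 * np.log(2 * np.pi * np.e))

    return dict(explained_variance=explained_var,
                value_loss=value_loss,
                advantage_mean=adv_mean,
                advantage_std=adv_std,
                policy_entropy=entropy)


if __name__ == "__main__":
    # Dummy batch just to show the call; in practice these come from the rollout buffer.
    rng = np.random.default_rng(0)
    returns = rng.normal(size=256)
    values = returns + rng.normal(scale=0.5, size=256)   # a critic that fits reasonably well
    advantages = returns - values
    log_stds = np.log([0.3, 0.3, 0.3])                   # e.g. 3 joint torques
    print(actor_critic_diagnostics(returns, values, advantages, log_stds))
```

Are these the right quantities to track, and are there others (gradient norms, KL between successive policies, etc.) that are more informative?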

Thanks!
