If we train a robot to walk along a line, we can use the (negative) distance to the line as the reward. The closer the robot stays to the line, the higher its reward, with the maximum of zero achieved when it walks exactly on the line.
The problem with such a reward definition in practice is that, during training, all initial Q-values (for every action) are set to zero, so almost any action taken drives its Q-value negative.
In other words, in each state the robot will try a new action (exploration) instead of taking an action it has already experienced (exploitation).
For example, consider 4 possible actions for each state. Before training, all Q-values are initialized to zero, so in state 1:
Q(1)=0, Q(2)=0, Q(3)=0, Q(4)=0
So the agent takes an action, say action 3, which leads to a negative reward and a negative updated Q-value.
Updated Q-values:
Q(1)=0, Q(2)=0, Q(3)= -1.05, Q(4)=0
Now, when the agent faces this state again, it will take the action with the maximum Q-value, which is one of the actions it has not yet tried: 1, 2, or 4.
Therefore, the agent is effectively always in exploration mode. With millions of states and many actions per state, this makes training very hard to converge.
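To make the issue concrete, here is a minimal sketch of what I mean (a hypothetical single-state toy problem with an illustrative distance-style penalty, and a simple one-step update without the bootstrap term, not my actual setup): a purely greedy agent with zero-initialized Q-values cycles through the untried actions first, because their Q-values are still 0 while every tried action has a negative value.

```python
# Toy illustration only: single state, 4 actions, always-negative reward.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
q = np.zeros(n_actions)   # Q-values for the single state, all start at 0
alpha = 0.7               # learning rate (illustrative value)

for step in range(6):
    # Greedy selection: argmax prefers the untried actions, whose Q is still 0
    action = int(np.argmax(q))
    # Distance-style penalty: always negative
    reward = -abs(rng.normal(1.0, 0.3))
    # Simplified one-step update (no next-state bootstrap, for brevity)
    q[action] += alpha * (reward - q[action])
    print(f"step {step}: took action {action}, Q = {np.round(q, 2)}")
```

Running this, the agent tries actions 0, 1, 2, 3 once each before it ever revisits an action, which is exactly the "always exploring" pattern described above.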
Would you please share your ideas about how to solve this problem?
(One idea could be to initialize the Q-values to a large negative value so that any taken action improves the Q-value, but then the algorithm ends up doing only exploitation.)
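For reference, here is the same toy sketch with that pessimistic initialization (the -100 value is just an illustrative assumption): after the first action is tried, its Q-value jumps well above the untried ones, so greedy selection locks onto it and never explores again.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
q = np.full(n_actions, -100.0)   # large negative initial Q-values
alpha = 0.7

for step in range(6):
    action = int(np.argmax(q))            # greedy selection
    reward = -abs(rng.normal(1.0, 0.3))   # still always negative
    q[action] += alpha * (reward - q[action])
    print(f"step {step}: took action {action}, Q = {np.round(q, 2)}")
```

Here the agent picks action 0 on the first step, its Q-value rises to roughly -31 while the others stay at -100, and from then on it repeats action 0 forever, i.e. pure exploitation.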