I have recently started with the very basics of reinforcement learning, in particular Q-learning, and I have encountered a problem in the following very simple scenario:
Consider a one-dimensional chain of 50 sites with periodic boundary conditions. On this chain two distinct terminal states T1 (reward +100 when the agent gets there) and T2 (reward +50 when the agent gets there) are given. Once the agent hits a terminal state the episode is finished. Moving to any non-terminal state yields a reward of -1 (move penalty). The agent can either move left or right to update its state.
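To make the setup concrete, here is a minimal sketch of the environment as I understand it (the positions of T1 and T2 and the helper name `step` are chosen only for illustration):

```python
N_SITES = 50          # length of the periodic chain
T1, T2 = 10, 35       # example positions of the two terminal states (arbitrary here)
R_T1, R_T2 = 100, 50  # terminal rewards
R_MOVE = -1           # move penalty for landing on a non-terminal site
ACTIONS = (-1, +1)    # move left / move right

def step(state, action):
    """One move on the periodic chain; returns (next_state, reward, done)."""
    next_state = (state + action) % N_SITES   # periodic boundary conditions
    if next_state == T1:
        return next_state, R_T1, True
    if next_state == T2:
        return next_state, R_T2, True
    return next_state, R_MOVE, False
```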
I have convinced myself that the agent's optimal strategy is to reach the terminal state T1 regardless of the random initial state, unless the initial state already is T2, which finishes the episode instantly. Otherwise the agent should of course avoid reaching the terminal state T2, i.e. it may have to take a detour to maximize its return and reach T1.
I tried to solve this very simple reinforcement learning problem using the most basic Q-learning approach (no epsilon-greedy strategy, learning rate = 0.5, discount = 0.99, 20000 episodes, just a simple Q-table together with the update rule). For a single terminal state this works perfectly fine and the agent always finds the shortest path: there is a well-defined decision boundary where the agent's policy switches from "always move left" to "always move right". However, for two terminal states the agent does not seem to find the optimal strategy. Instead it always heads for the terminal state that is closest to the initial state.
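To show what I mean by "the most basic approach", here is a minimal sketch of the training loop, reusing the environment sketch from above (the exact structure of my own code may differ slightly, but this is the idea):

```python
import random

ALPHA = 0.5       # learning rate
GAMMA = 0.99      # discount factor
N_EPISODES = 20000

# Q-table: one row per site, one entry per action (left, right)
Q = [[0.0, 0.0] for _ in range(N_SITES)]

for _ in range(N_EPISODES):
    # start each episode from a random non-terminal site
    state = random.randrange(N_SITES)
    while state in (T1, T2):
        state = random.randrange(N_SITES)

    done = False
    while not done:
        # purely greedy action selection (no epsilon); ties broken at random
        q_left, q_right = Q[state]
        if q_left == q_right:
            a_idx = random.randrange(2)
        else:
            a_idx = 0 if q_left > q_right else 1

        next_state, reward, done = step(state, ACTIONS[a_idx])

        # standard Q-learning update; terminal transitions bootstrap with 0
        target = reward if done else reward + GAMMA * max(Q[next_state])
        Q[state][a_idx] += ALPHA * (target - Q[state][a_idx])
        state = next_state

# resulting greedy policy: 'L' or 'R' for every site
policy = ['L' if q[0] >= q[1] else 'R' for q in Q]
```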
Of course, my result is probably due to an incorrect implementation, but I really cannot find any mistake so far. I have not seen any literature on Q-learning with two terminal states. Does anybody have deeper insight into whether there is a reason why Q-learning with multiple terminal states might not work? Can it be related to the Bellman equation?
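For reference, the update rule I am using is the standard tabular one,

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],$$

which, as far as I understand, is just a sampled version of the Bellman optimality equation $Q^*(s,a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s',a') \right]$.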
Thank you very much!