I have recently started with the very basics of reinforcement learning, in particular Q-learning, and I ran into a problem in the following very simple scenario:

Consider a one-dimensional chain of 50 sites with periodic boundary conditions. On this chain two distinct terminal states T1 (reward +100 when the agent gets there) and T2 (reward +50 when the agent gets there) are given. Once the agent hits a terminal state the scenario is finished. Every move onto a non-terminal state yields a reward of -1 (move penalty). The agent can move either left or right to update its state.
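
To make the setup concrete, the environment roughly corresponds to a step function like the sketch below (the positions of T1 and T2 are placeholders; in my runs they are fixed but otherwise arbitrary sites):

```python
N_SITES = 50          # length of the chain
T1, T2 = 10, 35       # placeholder positions of the two terminal states
R_T1, R_T2 = 100, 50  # terminal rewards
MOVE_PENALTY = -1     # reward for every move onto a non-terminal state

def step(state, action):
    """Take action (-1 = move left, +1 = move right) on the periodic chain.

    Returns (next_state, reward, done).
    """
    next_state = (state + action) % N_SITES   # periodic boundary conditions
    if next_state == T1:
        return next_state, R_T1, True
    if next_state == T2:
        return next_state, R_T2, True
    return next_state, MOVE_PENALTY, False
```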

I have convinced myself that the agent's optimal strategy is to reach the terminal state T1 regardless of the random initial state, unless the initial state already is T2, in which case the scenario finishes instantly. Otherwise the agent should of course avoid reaching the terminal state T2, i.e. it may have to take a detour to maximize its return and reach T1.

I tried to solve this very simple reinforcement learning problem using the most basic Q-learning approach (no epsilon-greedy strategy, learning rate 0.5, discount 0.99, 20000 episodes, just a simple Q-table together with the update rule). With a single terminal state this works perfectly fine and the agent always finds the shortest path: there is a well-defined decision boundary where the agent's uni-directional movement switches from "always move left" to "always move right". With two terminal states, however, the agent does not seem to find the optimal strategy; instead it always settles for the terminal state closest to the initial state.
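
Schematically, my training loop corresponds to something like the following sketch (again with placeholder terminal positions; purely greedy action selection with random tie-breaking is what I mean by "no epsilon-greedy strategy"):

```python
import random

import numpy as np

# Constants and step() repeated from the environment sketch above.
N_SITES, T1, T2 = 50, 10, 35

def step(state, action):
    s = (state + action) % N_SITES          # periodic boundary conditions
    if s == T1:
        return s, 100, True
    if s == T2:
        return s, 50, True
    return s, -1, False

ALPHA, GAMMA, EPISODES = 0.5, 0.99, 20000
ACTIONS = (-1, +1)                          # move left / move right

Q = np.zeros((N_SITES, len(ACTIONS)))       # Q-table: one row per site

for _ in range(EPISODES):
    state = random.randrange(N_SITES)       # random initial site
    if state in (T1, T2):                   # starting on a terminal state ends the episode
        continue
    for _ in range(20 * N_SITES):           # safety cap on episode length
        # greedy action with random tie-breaking ("no epsilon-greedy")
        best = np.flatnonzero(Q[state] == Q[state].max())
        a = random.choice(best)
        nxt, reward, done = step(state, ACTIONS[a])
        # standard tabular Q-learning update
        target = reward + (0.0 if done else GAMMA * Q[nxt].max())
        Q[state, a] += ALPHA * (target - Q[state, a])
        state = nxt
        if done:
            break

# read off the greedy policy ("L" or "R") for every site
policy = ["L" if np.argmax(row) == 0 else "R" for row in Q]
```

Reading the greedy policy off the Q-table row by row is how I locate the decision boundary mentioned above.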

Of course, my result is probably due to an incorrect implementation, but so far I cannot find any mistake. I have not seen any literature on Q-learning with two terminal states. Does anybody have deeper insight into whether there is a reason why Q-learning with multiple terminal states might not work? Can it be related to Bellman's equation?
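
For reference, by Bellman's equation I mean the optimality equation that tabular Q-learning tries to approximate,

$$
Q^*(s,a) \;=\; \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^*(s',a') \,\right],
$$

together with the corresponding update rule

$$
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,\right],
$$

where the bootstrap term $\max_{a'} Q(s',a')$ is taken to be zero whenever $s'$ is terminal.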

Thank you very much!
