Whenever I have a sequential decision problem (one where previous choices influence future choices) and some parts of the problem are unknown (the reward function, or the transition function between states), I have a reinforcement learning problem.

If all my samples are independent (not sequential) I don't have a reinforcement learning problem.

The agent observes the state s, takes action a, gets reward r, and the environment transitions to state s'.

However, let's suppose that s' isn't necessarily the state where the next action takes place.

  • Can the agent still learn how to maximize the discounted reward? Or, in other words, is this allowed in reinforcement learning?
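To make the contrast explicit, this is roughly the loop I have in mind for the standard case; the toy environment, reward, and horizon here are just made-up placeholders:

```
import random

# Toy stand-in environment, only to show where s' sits in the standard loop.
def env_step(state, action):
    next_state = (state + action) % 5      # deterministic toy transition: this is s'
    reward = -abs(next_state - 2)          # toy reward
    return next_state, reward

state = 0
for t in range(10):
    action = random.choice([0, 1])         # the agent acts in state s
    next_state, reward = env_step(state, action)
    # in the standard formulation, the next action is taken in exactly this s'
    state = next_state
```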

Let's suppose I have an agent and two (finite) queues. Each queue can offload data to the other queue in order to avoid becoming saturated. Whenever a new data packet arrives at one queue, the agent can decide to offload it to the other queue or keep it where it is. So the transition from s to s', given action a, is deterministic. However, at the next step, the observed state may not be s'.

This is because, before the next packet arrives, one or both queues might have processed some data.
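To make the setting concrete, here is a rough sketch of the kind of environment I mean (the class name, service probability, and reward are just illustrative choices, not my actual system):

```
import random

class TwoQueueEnv:
    """Sketch of the two-queue offloading setting described above.

    The effect of the action on the queues is deterministic (that is s'),
    but both queues may process packets before the next arrival, so the
    state observed at the next decision point can differ from s'.
    """

    def __init__(self, capacity=10, service_prob=0.5):
        self.capacity = capacity            # finite queue size
        self.service_prob = service_prob    # chance each queue serves one packet per step
        self.q = [0, 0]                     # current queue lengths

    def step(self, arrival_queue, action):
        # deterministic effect of the action: keep (0) or offload (1) the new packet
        target = arrival_queue if action == 0 else 1 - arrival_queue
        self.q[target] = min(self.q[target] + 1, self.capacity)
        s_prime = tuple(self.q)

        # the environment keeps evolving before the next packet arrives:
        # each queue may process part of its backlog
        for i in range(2):
            if self.q[i] > 0 and random.random() < self.service_prob:
                self.q[i] -= 1

        next_arrival = random.randint(0, 1)     # queue where the next packet shows up
        next_obs = (self.q[0], self.q[1], next_arrival)

        reward = -sum(self.q)                   # e.g. penalize total backlog
        return s_prime, next_obs, reward
```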

My concern is that the agent cannot learn to predict what the next observed state will actually be. Given state s and action a, the agent can easily predict s', but s' is not the state it will actually observe at the next decision point.
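Running the sketch above a couple of times shows this mismatch directly:

```
env = TwoQueueEnv()
s_prime, next_obs, reward = env.step(arrival_queue=0, action=1)
print(s_prime)    # queue lengths right after the action, e.g. (0, 1)
print(next_obs)   # queue lengths at the next decision point, possibly already different
```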
