When the reward function has the form R(a,s), the belief-state value function is defined as V(b) = max_{a in A} (rho(a,b) + ...), where A is the set of actions, b is the current belief state, and rho(a,b) = sum_{s in S} b(s) R(a,s) is the expected immediate reward with respect to b. But what does the definition of the value function look like when the reward function has the form R(s,a,s'), where action a is performed in state s and the agent ends up in state s'?
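For concreteness, here is a minimal sketch of the expected immediate reward rho(a,b) in the R(a,s) case described above; the belief vector and reward table are illustrative toy values, not from any particular library:

```python
# Expected immediate reward under a belief state, for a reward
# function of the form R(a, s): rho(a, b) = sum_s b(s) * R(a, s).

def rho(a, b, R):
    """Expected immediate reward of action a under belief b."""
    return sum(b[s] * R[a][s] for s in range(len(b)))

# Toy problem: 2 actions, 2 states.
b = [0.6, 0.4]               # belief: probability of being in each state
R = [[1.0, 0.0],             # R[a][s]: reward for taking action a in state s
     [0.0, 2.0]]

print(rho(0, b, R))          # 0.6*1.0 + 0.4*0.0 = 0.6
print(rho(1, b, R))          # 0.6*0.0 + 0.4*2.0 = 0.8
```

The question is then how this expectation (and hence the max over actions) must be generalized when the reward additionally depends on the successor state s'.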
