As I've already studied, Q-learning doesn't have to know anything about transition probabilities! So how can an agent determine its new state after choosing an action without knowing anything about the probability of transitioning to other states?
Given the current state, s, the agent does not have to guess its new state s'. It receives the new state s' from the environment. What is missing is that you don't know the probability of moving from s to s' (given a particular action, of course), even if you know that this transition s->s' is what has happened. Q-learning can learn without estimating transition probabilities. An alternative, model-based approach would be to learn those transition probabilities first and then solve the MDP (at that point you won't need Q-learning, although you could still use it).
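To make that concrete, here is a minimal sketch of the tabular Q-learning update in Python (the function name and the dict-based Q-table are my own choices, just for illustration). Notice that the transition probability P(s'|s,a) never appears; the update only uses the sampled next state s' and reward r that the environment returned.

```python
# A minimal sketch of the tabular Q-learning update (names are illustrative).
# Q is a dict mapping (state, action) -> estimated value.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Greedy estimate of the next state's value, built from the sampled s' only.
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    # Move Q(s, a) towards the sampled target; P(s'|s,a) is never used.
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
```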
you are in a state, you choose an action, and you must have access to an "environment" which tells you which state you are now in and what your reward is
this "environment" may be a real system (duly equipped with sensors so as to determine the new state etc) or a computer simulation of a real system
.
and you do not need to know or evaluate the transition probabilities when using Q-learning
(only the environment knows them; it does not tell you, but Q-learning allows you not to care!)
I became convinced that transition probabilities are not needed as an input to the Q-learning algorithm. But I think a reward function is needed as an input to Q-learning. Isn't it? If not, how is the immediate reward computed in Q-learning?
the reward (and the new state) is given to you by the "environment" ...
you are not doing Q-learning "in abstracto"; you are trying to optimize your actions relative to a given "system" (or "environment")
.
more concretely, say you want to solve a labyrinth problem ("gridworld"): of course you do not know the map of the labyrinth!
you are in a position, you have 4 actions (North, East, South, West); you choose an action, say North; the system tells you your new state (depending on the transition probabilities, you might end up south of your current position ... commands might be noisy!) and gives you your reward, -1 if you are still in the labyrinth, +1000 if you have reached the way out
from that, you can update your Q-table and learn how to reach the way out as fast as possible
.
to sum up, there is an "environment" (a real system or a computer simulation) which implements the transition probabilities and rewards, and you play with this environment so as to maximize your long-term reward: when it is your move, you observe your state and choose an action; when it is the environment's move, it tells you the next state and your reward (and at this point you update your Q-table and go on to the next move)
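As a rough sketch of that interaction loop for the labyrinth example (the env object with its reset()/step() methods is a hypothetical placeholder, and the epsilon-greedy action choice is just one common way to pick actions):

```python
import random

def run_episode(env, Q, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
    """One episode against an environment exposing reset() and step(action).
    step() is assumed to return (next_state, reward, done); e.g. reward is -1
    per move and +1000 on reaching the exit, as in the labyrinth example."""
    s = env.reset()
    done = False
    while not done:
        # Your move: pick an action (epsilon-greedy over the current Q-table).
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda a2: Q.get((s, a2), 0.0))
        # Environment's move: it applies its hidden (possibly noisy) dynamics
        # and tells you only the next state and the reward.
        s_next, r, done = env.step(a)
        # Update the Q-table from the observed transition, then continue.
        best_next = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        s = s_next
    return Q
```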
In reinforcement learning, the lack of a model is compensated for by making observations of the environment.
When executing the Q-learning algorithm on a computer, you usually have a simulated environment that replaces the transition function, or you can even call a transition function directly to obtain the state s' reached by the agent after executing action a from state s.
In mobile robotics, when working with a real robot, the resulting state s' is obtained by making observations with its sensors, which have previously been mapped into states. No transition function is needed; the agent must learn from its environment, and this is the key of reinforcement learning.
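For illustration, such a mapping of observations into states can be as simple as discretising the raw readings; the sensor names and thresholds below are invented for the example:

```python
def observation_to_state(front_distance_m, bumper_pressed):
    """Map raw sensor readings to one of a few discrete states.
    The sensor set and thresholds are only an example."""
    if bumper_pressed:
        return "collision"
    if front_distance_m < 0.2:
        return "obstacle_near"
    if front_distance_m < 1.0:
        return "obstacle_far"
    return "clear"
```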
As to the reward function, you do need to map (s,a,s') into rewards, since the task the agent has to learn is mostly defined by the rewards it obtains. However, if you deal with many states, defining every reward R(s,a,s') individually could be impractical.
When working with real robots, even with very few states, I find it very useful to map the sensor observations directly into rewards with simple if-else structures. For example, if the mobile robot is learning a wandering task, bumper collisions get negative rewards, wheel encoder readings above a certain threshold without colliding get the highest positive reward, etc. This is a simpler way of defining the reward function, with a few lines of code, and it stays closer to the definition of the task the agent has to learn.
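Here is a hedged sketch of that kind of if-else reward mapping for a wandering task; the sensor names, thresholds and reward values are made up, only the structure matters:

```python
def reward_from_sensors(bumper_pressed, left_encoder_speed, right_encoder_speed,
                        speed_threshold=0.3):
    """Reward defined directly from sensor readings with if-else rules:
    collisions are punished, moving fast without colliding gets the
    highest reward, everything else is neutral."""
    if bumper_pressed:
        return -10.0   # collision -> negative reward
    if min(left_encoder_speed, right_encoder_speed) > speed_threshold:
        return 1.0     # moving above the threshold without colliding
    return 0.0         # otherwise, neutral reward
```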