If we train a robot to walk along a line, we can use the (negative) distance to the line as the reward. The closer the robot stays to the line, the higher its reward, with the maximum of zero achieved when it walks exactly on the line.
The problem with such a reward definition in practice is that, during training, all initial Q-values (for every action) are set to zero, so almost any action taken drives its Q-value negative.
In other words, in each state the robot will try a new action (exploration) instead of taking an action it has already experienced (exploitation).
For example, consider 4 possible actions for each state. Before training, all Q-values are initialized to zero, so in state 1:
Q(1)=0, Q(2)=0, Q(3)=0, Q(4)=0
So the agent takes an action, say action 3, which leads to a negative reward and a negative updated Q-value.
Updated Q-values:
Q(1)=0, Q(2)=0, Q(3)= -1.05, Q(4)=0
Now, when the agent faces this state again, it will take the action with the maximum Q-value, which is one of the actions it has not yet tried: 1, 2, or 4.
Therefore, the agent is effectively always in exploration mode. With millions of states and many actions per state, this makes training very hard to converge.
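To make the issue concrete, here is a minimal sketch of what I mean (a hypothetical single-state toy problem with an illustrative distance-style penalty, and a simple one-step update without the bootstrap term, not my actual setup): a purely greedy agent with zero-initialized Q-values cycles through the untried actions first, because their Q-values are still 0 while every tried action has a negative value.

```python
# Toy illustration only: single state, 4 actions, always-negative reward.
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
q = np.zeros(n_actions)   # Q-values for the single state, all start at 0
alpha = 0.7               # learning rate (illustrative value)

for step in range(6):
    # Greedy selection: argmax prefers the untried actions, whose Q is still 0
    action = int(np.argmax(q))
    # Distance-style penalty: always negative
    reward = -abs(rng.normal(1.0, 0.3))
    # Simplified one-step update (no next-state bootstrap, for brevity)
    q[action] += alpha * (reward - q[action])
    print(f"step {step}: took action {action}, Q = {np.round(q, 2)}")
```

Running this, the agent tries actions 0, 1, 2, 3 once each before it ever revisits an action, which is exactly the "always exploring" pattern described above.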
Would you please share your ideas about how to solve this problem?
(One idea could be to initialize the Q-values to a large negative value so that any taken action improves the Q-value, but then the algorithm ends up doing only exploitation.)
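For reference, here is the same toy sketch with that pessimistic initialization (the -100 value is just an illustrative assumption): after the first action is tried, its Q-value jumps well above the untried ones, so greedy selection locks onto it and never explores again.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 4
q = np.full(n_actions, -100.0)   # large negative initial Q-values
alpha = 0.7

for step in range(6):
    action = int(np.argmax(q))            # greedy selection
    reward = -abs(rng.normal(1.0, 0.3))   # still always negative
    q[action] += alpha * (reward - q[action])
    print(f"step {step}: took action {action}, Q = {np.round(q, 2)}")
```

Here the agent picks action 0 on the first step, its Q-value rises to roughly -31 while the others stay at -100, and from then on it repeats action 0 forever, i.e. pure exploitation.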