On-policy Temporal Difference methods learn the value of the policy that is used to make decisions, so the value functions are updated using the results of executing actions determined by that same policy. These policies are usually "soft", i.e. non-deterministic: a soft policy always retains an element of exploration and is never so strict that it only ever chooses the action with the highest estimated reward. Three soft policies are in common use, ε-soft, ε-greedy and softmax, and they are explained in the section below.
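As a concrete illustration of a soft policy, here is a minimal sketch of ε-greedy action selection in Python (the function name and the default epsilon of 0.1 are assumptions made for the example, not part of any particular library):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action from a list of estimated action values using an
    epsilon-greedy soft policy: with probability epsilon explore a random
    action, otherwise exploit the action with the highest estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit
```

Because epsilon is strictly greater than zero, every action keeps a non-zero probability of being tried, which is exactly what makes the policy "soft".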
Off-policy methods can use different policies for behaviour and estimation: the agent follows one policy to generate experience while learning the value of another. Again, the behaviour policy is usually "soft" so that there is sufficient exploration. Off-policy algorithms can update the estimated value functions using hypothetical actions, i.e. actions that have not actually been tried, whereas on-policy methods update value functions strictly from the actions that were experienced. This means off-policy algorithms can separate exploration from control, and on-policy algorithms cannot. In other words, an agent trained using an off-policy method may end up learning tactics that it did not necessarily exhibit during the learning phase.
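To make the contrast concrete, the sketch below compares the on-policy SARSA update with the off-policy Q-learning update. Here Q is assumed to be a dictionary mapping each state to a dictionary of action values, and alpha and gamma are an assumed step size and discount factor; SARSA bootstraps on the action the behaviour policy actually took next, while Q-learning bootstraps on a hypothetical greedy action that may never be executed.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy TD update (SARSA): uses a_next, the action the behaviour
    policy actually selected in s_next."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD update (Q-learning): uses the hypothetical greedy action
    in s_next, regardless of what the behaviour policy goes on to do."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
```

The only difference between the two updates is the bootstrap target, yet it is what lets Q-learning learn the greedy (control) policy while the agent behaves according to a soft exploration policy.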