What is the difference between value iteration and policy iteration methods in reinforcement learning?

Fabrice Clerot

This presentation does a nice job of laying out the various options for RL and recalling their relative merits in a given application context:

http://www2.econ.iastate.edu/tesfatsi/RLUsersGuide.ICAC2005.pdf

Negin Malekian

Actually, I don't know whether the classification of RL approaches into policy iteration and value iteration applies only to model-based methods or to model-free methods as well.

Dear Fabrice: I found your link very helpful and learned a lot from it, but I couldn't find anything in it related to my question. I was wondering whether you could point me to another suitable resource for this as well.

In a nutshell, at the end of learning:

with a model-free approach, the agent knows how to act but doesn't explicitly know anything about the environment (think of Q-learning);
with a model-based approach, the agent has built a correct model of the environment and can therefore simulate it to find the right decision (think of E3).
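To make the model-free case concrete, here is a minimal tabular Q-learning sketch. Everything in it is an assumption made for illustration: the 4-state chain environment, the names `env_step` and `q_learning`, the discount factor, the learning rate, and the uniformly random behaviour policy (which is allowed because Q-learning is off-policy). The point is only that the agent never reads transition probabilities; it learns from sampled transitions.

```python
import random

N_STATES, GOAL, GAMMA, ALPHA = 4, 3, 0.9, 0.5  # assumed toy settings
ACTIONS = (-1, +1)  # move left / move right

def env_step(s, a):
    """Black-box environment: the agent only sees the sampled (next_state, reward)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def q_learning(episodes=500, max_steps=1000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action_index]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            i = rng.randrange(len(ACTIONS))  # uniformly random behaviour policy
            s2, r = env_step(s, ACTIONS[i])
            # Bootstrap from the best action in s2 (nothing to bootstrap at the goal)
            target = r if s2 == GOAL else r + GAMMA * max(Q[s2])
            Q[s][i] += ALPHA * (target - Q[s][i])
            s = s2
            if s == GOAL:
                break
    return Q

Q = q_learning()
# Greedy policy read off the learned table; the goal state itself is never updated
greedy = [ACTIONS[max(range(len(ACTIONS)), key=lambda i: Q[s][i])]
          for s in range(N_STATES)]
print(greedy[:GOAL])
```

The learned table tells the agent how to act, but no explicit model of the chain is ever stored, which is the distinction drawn above.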

Thanks, dear Fabrice.

Do you know whether the policy iteration and value iteration methods apply only to the model-based approach, or to the model-free approach as well?

Pitipong Chanloha

I am not sure whether my explanation will be clear.

Policy iteration requires two steps:

1. Policy evaluation: evaluate the current (predetermined) policy "pi" by going through all possible state-action pairs with your transition probabilities P(s' | s, a), which gives the value of every state under that policy.

2. Policy improvement: if the policy has not converged, derive a new policy "pi" from those values and use it in the next iteration.

Then repeat both steps until the policy converges.
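The two steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the 3-state MDP, the table layout `P[s][a] = [(probability, next_state, reward), ...]`, the discount factor, and all function names are assumptions chosen for the example.

```python
# Hypothetical 3-state, 2-action MDP (state 2 is absorbing)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(0.8, 2, 1.0), (0.2, 1, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor

def q_value(V, s, a):
    """Expected return of taking action a in state s, then following values V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])

def policy_evaluation(policy, tol=1e-10):
    """Step 1: sweep the states until the values of the fixed policy stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V

def policy_improvement(V):
    """Step 2: act greedily with respect to the evaluated values."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}

def policy_iteration():
    policy = {s: 0 for s in P}  # arbitrary initial policy
    while True:
        V = policy_evaluation(policy)
        new_policy = policy_improvement(V)
        if new_policy == policy:  # policy is stable, so stop
            return policy, V
        policy = new_policy

policy, V = policy_iteration()
print(policy, V)
```

Note that the evaluation step reads the transition table directly, so this particular sketch is model-based.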

Value iteration uses the Bellman equation directly as an update rule: you update the values progressively, state by state, until the solution converges.

I hope it helps.

Anything goes!

You can have model-free or model-based value iteration algorithms, and so on.

By the way, the book is online:

https://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html

(see II.4 and III.9 in particular)

You can even mix the different approaches: see Dyna-Q in the book above.

Sébastien Dourlens

With value iteration, each iteration sweeps over all states of your environment and updates the value of each state from the values of its neighboring states. You repeat the sweeps until the values stop changing.

With policy iteration, each iteration evaluates the actions of your process so that you can improve your control law, or policy. Here you need a model: a discrete Markov decision process.

To understand this, you need to read about dynamic programming and the Bellman equation (the Hamilton-Jacobi-Bellman equation in continuous time).

Angel Martínez-Tenor

My Master's thesis walks through a simple progression from model-based to model-free decision-making processes; I think it will help you.

https://www.researchgate.net/publication/281712242_AMT_MASTER_THESIS_presentation (slide 4)

https://www.researchgate.net/publication/281631089_Reinforcement_Learning_on_the_Lego_Mindstorms_NXT_Robot._Analysis_and_Implementation (section 2.1 pages 11-14 )

For further (and more formal) information, I also recommend the Sutton/Barto book.

Anshul Joshi

The best way to learn about both methods, and their similarities and differences, is the book by Russell & Norvig. Some people might disagree, but the majority of AI researchers consider it "the" reference book for AI topics. It explains both topics very well within a few pages (in chapter 17, "Making Complex Decisions", I believe). A PDF version is also available online for a quick glance, but I highly encourage you to find it in your local library.

Ansir Ilyas

In my view, policy iteration (PI) and value iteration (VI) are two methods for solving the Bellman equation online.

They differ as follows:

1) PI requires an admissible initial policy, whereas VI does not.

2) PI is a full-backup method: each iteration fully evaluates the current policy, which can take significant computation. VI is a partial-backup method and takes less computation per iteration.

3) VI is a recursive method: it uses the result of the previous iteration to compute the new update.

Maysam Toghraee

Hi Negin,

https://faradars.org/courses/fvrml120-reinforcement-learning

Imtithal Saeed

Thanks for the discussion

Sopan Talekar

Algorithms that learn purely from sampled experience, such as Monte Carlo control, SARSA, and Q-learning, are "model-free" RL algorithms.
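As a rough illustration of what "purely from experience" means, the sketch below writes the SARSA and Q-learning updates as functions of a single sampled transition. The function names, the two-state table, and the values of `GAMMA` and `ALPHA` are assumptions for the example; neither update touches a transition model.

```python
GAMMA, ALPHA = 0.9, 0.5  # assumed discount factor and learning rate

def q_learning_update(Q, s, a, r, s2):
    """Off-policy update: bootstrap from the best action available in s2."""
    target = r + GAMMA * max(Q[s2].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2):
    """On-policy update: bootstrap from the action a2 actually chosen in s2."""
    target = r + GAMMA * Q[s2][a2]
    Q[s][a] += ALPHA * (target - Q[s][a])

def fresh_table():
    # Hypothetical Q-table with two states and two actions
    return {"s0": {"left": 0.0, "right": 0.0},
            "s1": {"left": 1.0, "right": 2.0}}

# The same sampled transition (s0, right, reward 0, s1) fed to both rules
Q_ql = fresh_table()
q_learning_update(Q_ql, "s0", "right", 0.0, "s1")

Q_sarsa = fresh_table()
sarsa_update(Q_sarsa, "s0", "right", 0.0, "s1", "left")

print(Q_ql["s0"]["right"], Q_sarsa["s0"]["right"])
```

Both rules need only the tuple observed while acting; they differ in which next-state value they bootstrap from.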

Ramesh Kesh

The main differences are how the policy is defined and how much computation is needed. Which method to choose depends on what you are trying to solve. Previous respondents have cited some good references.

Firoz Mahmud

In policy iteration algorithms, you start with a random policy, then find the value function of that policy (policy evaluation step), then find a new (improved) policy based on the previous value function, and so on. In this process, each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Given a policy, its value function can be obtained using the Bellman operator.

In value iteration, you start with a random value function and then find a new (improved) value function in an iterative process, until reaching the optimal value function. Notice that you can easily derive the optimal policy from the optimal value function. This process is based on the Bellman optimality operator.
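A minimal sketch of that loop, under stated assumptions: a deterministic 4-state chain with a known model, a reward of 1 for entering the last state, and hypothetical names (`model`, `backup`, `value_iteration`). It starts from an all-zero value function rather than a random one, which does not affect the fixed point.

```python
N_STATES, GOAL, GAMMA = 4, 3, 0.9  # assumed toy settings
ACTIONS = (-1, +1)  # move left / move right

def model(s, a):
    """Known deterministic model: returns (next_state, reward)."""
    if s == GOAL:  # the goal is absorbing and pays nothing further
        return s, 0.0
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def backup(V, s, a):
    """One-step lookahead value of action a in state s."""
    s2, r = model(s, a)
    return r + GAMMA * V[s2]

def value_iteration(tol=1e-12):
    V = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            v_new = max(backup(V, s, a) for a in ACTIONS)  # max over actions
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Derive the greedy policy from the converged values
    policy = [max(ACTIONS, key=lambda a: backup(V, s, a)) for s in range(N_STATES)]
    return V, policy

V, policy = value_iteration()
print(V, policy)
```

No explicit policy exists during the loop; it is read off once at the end, as described above.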

In some sense, both algorithms share the same working principle, and they can be seen as two cases of generalized policy iteration. However, the Bellman optimality operator contains a max operator, which is nonlinear, so it behaves differently from the Bellman operator of a fixed policy, which is linear in the value function. In addition, it is possible to use hybrid methods that sit between pure value iteration and pure policy iteration.
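For reference, the two operators can be written out as below. The notation is an assumption, since the answer does not fix one: $P(s' \mid s, a)$ is the transition probability, $R(s, a, s')$ the reward, $\gamma$ the discount factor, and $\pi(a \mid s)$ the policy.

```latex
% Bellman operator for a fixed policy \pi (linear in V)
(T^{\pi} V)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)
                 \left[ R(s, a, s') + \gamma V(s') \right]

% Bellman optimality operator (nonlinear because of the max)
(T^{*} V)(s) = \max_{a} \sum_{s'} P(s' \mid s, a)
               \left[ R(s, a, s') + \gamma V(s') \right]

% Policy iteration: solve V^{\pi_k} = T^{\pi_k} V^{\pi_k}, then improve greedily
\pi_{k+1}(s) \in \arg\max_{a} \sum_{s'} P(s' \mid s, a)
               \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right]

% Value iteration: apply the optimality operator repeatedly
V_{k+1} = T^{*} V_{k}
```

Read this way, policy iteration alternates a linear fixed-point problem with a greedy step, while value iteration folds the max into every update.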

Source: Stack Overflow