
Hi all:

I ran into a problem in an RL project. We use a search method to produce the data and then train a deep neural network as the value function; it is an off-policy method. We keep the data in a large replay buffer, and at each training step we sample a batch from the buffer and feed it into the neural network (experience replay). Although the average reward increases, the loss never decreases and sometimes even increases! Attached is the plot of the loss function. Have you met this problem? How did you solve it? I suspect the reason is a lack of important samples...
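For reference, here is a minimal sketch of the setup described, under my own assumptions: the names `ReplayBuffer` and `train_step` are hypothetical, and a linear value function trained by SGD stands in for the deep network. On a fixed buffer with fixed targets the loss does fall; in the poster's setting the search keeps producing new data, so the target distribution drifts and the loss need not fall even while play improves.

```python
import random
from collections import deque

import numpy as np


class ReplayBuffer:
    """Hypothetical FIFO replay buffer holding (state, target_value) pairs."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest data is evicted first

    def add(self, state, target_value):
        self.buffer.append((state, target_value))

    def sample(self, batch_size):
        # Uniform sampling; prioritized sampling would weight "important"
        # (high-error) transitions more heavily.
        batch = random.sample(self.buffer, batch_size)
        states, targets = zip(*batch)
        return np.array(states), np.array(targets)

    def __len__(self):
        return len(self.buffer)


def train_step(w, states, targets, lr=0.05):
    """One SGD step on mean squared error for a linear value function.

    Stand-in for one 'replay' update of the deep network. Returns the
    updated weights and the batch loss before the update.
    """
    preds = states @ w
    errors = preds - targets
    grad = states.T @ errors / len(targets)
    return w - lr * grad, float(np.mean(errors ** 2))
```

Usage sketch: fill the buffer from self-play or search, then repeatedly call `buffer.sample(...)` and `train_step(...)`; logging the returned loss per step reproduces the kind of curve attached to the question.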

Thank you
