We collected [good]/[bad] feedback from the web page.
We then removed the [bad] feedback data.
Only the [good] feedback data was used to train the text-generation policy model.
The [good] feedback data was merged into the policy model's original dataset.
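
The filter-and-merge steps above can be sketched roughly as follows. This is a minimal illustration and not the actual pipeline: the record layout (the `prompt`, `response`, and `feedback` fields) and the helper names `keep_good` and `merge_into_original` are assumptions made for the example.

```python
from typing import Dict, List

Record = Dict[str, str]


def keep_good(feedback: List[Record]) -> List[Record]:
    """Drop every record whose feedback label is not "good"."""
    return [r for r in feedback if r.get("feedback") == "good"]


def merge_into_original(original: List[Record], good: List[Record]) -> List[Record]:
    """Return a new list: the original data followed by the good feedback data."""
    merged = list(original)  # copy so the original dataset is not mutated
    for r in good:
        # Keep only the fields used for training; the label is no longer needed.
        merged.append({"prompt": r["prompt"], "response": r["response"]})
    return merged


# Hypothetical sample data, for illustration only.
original = [{"prompt": "Hello", "response": "Hi there."}]
feedback = [
    {"prompt": "Tell a joke", "response": "Joke A", "feedback": "good"},
    {"prompt": "Tell a joke", "response": "Joke B", "feedback": "bad"},
]

train_set = merge_into_original(original, keep_good(feedback))
```

In this sketch `train_set` holds the original record plus the single [good] record, and the [bad] record is discarded.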