
For ChatGPT, if you can collect all possible pre-training data, you can simply remove the predictions that received bad feedback from the data used for the reward model.

If you cannot collect all possible pre-training data, you need to correct the predictions that received bad feedback instead.

Either way, you need humans to do the labeling.
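
To make the two options concrete, here is a minimal sketch in Python. It is only an illustration, not how ChatGPT is actually trained: the `Prediction` record, the `bad_feedback` flag, and both helper functions are hypothetical names, and the sketch assumes human labelers have already flagged the bad predictions and, for the second option, written the corrected responses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Prediction:
    """Hypothetical record: one model prediction plus a human feedback label."""
    prompt: str
    response: str
    bad_feedback: bool  # True if a human labeler flagged the response as bad


def remove_bad_feedback(predictions: List[Prediction]) -> List[Prediction]:
    """Option 1: drop every prediction that a human flagged as bad."""
    return [p for p in predictions if not p.bad_feedback]


def correct_bad_feedback(
    predictions: List[Prediction],
    corrections: Dict[str, str],
) -> List[Prediction]:
    """Option 2: replace flagged responses with human-written corrections.

    `corrections` maps a prompt to the corrected response. Flagged
    predictions with no correction yet are left out (an assumption of
    this sketch; they could also be queued for labeling).
    """
    result = []
    for p in predictions:
        if not p.bad_feedback:
            result.append(p)
        elif p.prompt in corrections:
            result.append(Prediction(p.prompt, corrections[p.prompt], False))
    return result


predictions = [
    Prediction("q1", "a1", bad_feedback=False),
    Prediction("q2", "a2", bad_feedback=True),
]
kept = remove_bad_feedback(predictions)
fixed = correct_bad_feedback(predictions, {"q2": "corrected a2"})
```

In both functions the human effort sits in the inputs: someone has to set `bad_feedback`, and for the second option someone also has to write the entries in `corrections`.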
