To create a reward model for reinforcement learning, comparison data had to be collected, consisting of two or more model responses ranked by quality. The data was gathered from conversations that the AI trainers had with the chatbot, following the Reinforcement Learning from Human Feedback (RLHF) approach: a model-written message was randomly selected, several alternative completions were sampled, and the AI trainers were asked to rank them.
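
To make the comparison step concrete, below is a minimal sketch of how a reward model can be trained on such ranked pairs using a Bradley-Terry-style pairwise loss. The `RewardModel` class and the random embeddings are illustrative stand-ins; a real system would score full transformer outputs, and none of these names come from the original description.

```python
# Sketch only: a toy reward model trained on trainer comparisons.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar quality score (illustrative)."""
    def __init__(self, embed_dim: int = 16):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def pairwise_ranking_loss(chosen_score, rejected_score):
    # Bradley-Terry style objective: the trainer-preferred ("chosen")
    # response should receive a higher scalar reward than the rejected one.
    return -torch.nn.functional.logsigmoid(chosen_score - rejected_score).mean()

# Toy usage with random embeddings standing in for encoded responses.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen = torch.randn(8, 16)    # embeddings of higher-ranked responses
rejected = torch.randn(8, 16)  # embeddings of lower-ranked responses
optimizer.zero_grad()
loss = pairwise_ranking_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
```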

Using these reward models, the model can then be fine-tuned with Proximal Policy Optimization (PPO). The procedure builds on models such as GPT-3 and Codex.
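
The sketch below shows what one PPO-style update with a KL penalty against the reference (pre-RL) model might look like. It assumes per-token log-probabilities have already been computed, and the hyperparameter names `beta` and `clip_eps` are illustrative assumptions, not values from the original process.

```python
# Sketch only: one clipped PPO update using a reward-model score
# minus a KL penalty that keeps the policy close to the reference model.
import torch

def ppo_step(logprobs_new, logprobs_old, logprobs_ref,
             reward_model_score, beta=0.1, clip_eps=0.2):
    # Penalize drift from the reference model so the policy keeps
    # fluent language while chasing the learned reward.
    kl_penalty = beta * (logprobs_old - logprobs_ref)
    advantage = reward_model_score - kl_penalty

    # Clipped surrogate objective from the PPO paper.
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random tensors standing in for real log-probs and scores.
lp_new = torch.randn(8, requires_grad=True)
lp_old, lp_ref = torch.randn(8), torch.randn(8)
loss = ppo_step(lp_new, lp_old, lp_ref, reward_model_score=torch.randn(8))
loss.backward()
```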
