Interesting. Behind the successful LLMs, RL with human feedback (RLHF) is what made them work.
Maybe you can look at this news article: 'Exclusive: OpenAI Used Kenyan Workers on Less Than $2 Per Hour to Make ChatGPT Less Toxic': https://time.com/6247678/openai-chatgpt-kenya-workers/
It is quite interesting. Thank you for sharing, Tong Guo.
I too believe that LLMs are working on a feedback mechanism which is not true reinforcement learning. Firstly, responses to mathematical, logical, and fact-based questions can be judged right or wrong, but this is difficult for long-form questions, where users may simply upvote or downvote a response based on what 'they' think is right. Secondly, we don't know how this feedback is used in fine-tuning. Is it applied directly, or does it just guide the overall fine-tuning of the model - for instance, to adjust the temperature or the certainty of results?
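On that second point, the commonly published recipe (e.g. the InstructGPT paper) is that upvotes/downvotes are not applied to the LLM directly: a separate reward model is trained on pairwise human comparisons, and that model then supplies the reward signal during RL fine-tuning (typically PPO). Below is a minimal, hedged sketch of just the reward-model step, using toy PyTorch tensors as stand-ins for real response representations; the names and shapes are illustrative assumptions, not any lab's actual code.

```python
# Toy sketch: turning pairwise human preferences into a reward model.
# (Assumption: fixed-size "response embeddings" stand in for real text.)
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a response embedding to a scalar reward score."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

reward_model = TinyRewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Each pair represents a comparison where a human preferred
# the first response over the second.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

# Bradley-Terry style loss: push reward(chosen) above reward(rejected).
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()
loss.backward()
optimizer.step()

# The trained reward_model then scores newly sampled responses during
# RL fine-tuning (e.g. PPO), so the human vote only reaches the LLM
# indirectly, through this learned reward signal.
```

So the feedback guides fine-tuning through a learned proxy rather than being used as a direct per-answer correction, which is part of why people debate whether it is "true" reinforcement learning.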