Why use preference datasets for DPO training? If the data already records which of two answers to the same question is better, why not simply use the better answer for SFT directly?
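To make the comparison in the question concrete, here is a minimal sketch of the two per-example objectives, assuming the standard DPO loss on sequence log-probabilities. The function names and all the numbers are hypothetical and chosen only for illustration; this is not taken from any particular library.

```python
import math


def sft_loss(logp_chosen: float) -> float:
    """SFT on the better answer: negative log-likelihood of that answer only."""
    return -logp_chosen


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin compares how far the policy has moved from the reference
    model on the chosen answer versus on the rejected answer.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = -beta * margin
    # Numerically stable softplus(z) = log(1 + exp(z)) = -log(sigmoid(-z))
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probabilities under the frozen reference model.
ref_chosen, ref_rejected = -2.0, -2.0

# Case A: the policy raises the chosen AND the rejected answer equally.
loss_a = dpo_loss(-1.0, -1.0, ref_chosen, ref_rejected)

# Case B: the policy raises the chosen answer and lowers the rejected one.
loss_b = dpo_loss(-1.0, -3.0, ref_chosen, ref_rejected)

print("SFT loss (same in both cases):", sft_loss(-1.0))
print("DPO loss, case A:", loss_a)
print("DPO loss, case B:", loss_b)
```

In this sketch the SFT loss is identical in both cases because it never looks at the rejected answer, while the DPO loss is lower in case B, where the rejected answer has become less likely relative to the reference model. That difference is what the question is asking about: whether the extra signal from the worse answer is worth having.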
