06 July 2025

LoRA fine-tuning. Using an SFT-trained model, make predictions on the training dataset; then have human annotators build a preference dataset by labeling, for each example, whether the original training-set answer or the model's predicted answer is better. Finally, train with the DPO algorithm on these preference pairs. Can this approach improve performance?
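The labeling step described above can be sketched as follows. This is a minimal illustration, not an implementation from the question: the function name `build_preference_pairs` and the field names (`prompt`, `answer`, and the labels `"gold"`/`"pred"`) are assumptions chosen for clarity. The key point is that each annotation turns one training example plus one model prediction into a DPO-style (prompt, chosen, rejected) triple.

```python
def build_preference_pairs(examples, predictions, annotations):
    """Build a DPO preference dataset from human comparisons.

    examples:    list of dicts with "prompt" and "answer" (the original
                 training-set answer) -- field names are illustrative.
    predictions: SFT model outputs, aligned with examples.
    annotations: per-example human label, "gold" if the training-set
                 answer is better, "pred" if the model's prediction is.
    """
    pairs = []
    for ex, pred, label in zip(examples, predictions, annotations):
        gold = ex["answer"]
        if label == "gold":
            chosen, rejected = gold, pred
        elif label == "pred":
            chosen, rejected = pred, gold
        else:
            # Tie or unusable annotation: skip it, since DPO needs a
            # clear chosen/rejected ordering for every pair.
            continue
        pairs.append({"prompt": ex["prompt"],
                      "chosen": chosen,
                      "rejected": rejected})
    return pairs
```

The resulting list of triples is exactly the format preference-optimization trainers expect, so it can be fed directly into a DPO training loop.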
