LoRA fine-tuning question: take an SFT-trained model and generate predictions on the training dataset, then have human annotators build a preference dataset by labeling, for each prompt, whether the original training-set answer or the model's predicted answer is better. Then run DPO training on these preference pairs. Can this approach improve performance?
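
For concreteness, here is a minimal sketch of the preference-pair construction being described, assuming an SFT checkpoint and a supervised dataset of (prompt, reference answer) rows. All names below (`my-sft-model`, `build_preference_pairs`, the column names) are just illustrative placeholders, not a definitive recipe; the actual DPO step would typically be handed off to something like TRL's `DPOTrainer`, whose exact arguments vary by version.

```python
# Illustrative sketch of the described setup (hypothetical names throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "my-sft-model"  # placeholder: the SFT / LoRA-merged checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
sft_model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_answer(prompt: str, max_new_tokens: int = 256) -> str:
    """Sample one prediction from the SFT model for a training prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = sft_model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=True, top_p=0.9
        )
    # Strip the prompt tokens, keep only the generated continuation.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

def build_preference_pairs(sft_dataset, human_prefers_reference):
    """Turn (prompt, reference_answer) rows plus per-row human judgments
    into DPO-style (prompt, chosen, rejected) triples."""
    pairs = []
    for row, prefer_ref in zip(sft_dataset, human_prefers_reference):
        predicted = generate_answer(row["prompt"])
        chosen, rejected = (
            (row["reference_answer"], predicted)
            if prefer_ref
            else (predicted, row["reference_answer"])
        )
        pairs.append(
            {"prompt": row["prompt"], "chosen": chosen, "rejected": rejected}
        )
    return pairs

# The resulting list of dicts matches the usual "prompt"/"chosen"/"rejected"
# format that DPO trainers (e.g. TRL's DPOTrainer) consume.
```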