Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math