ɸ-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models

1CVIU Lab, University of Arkansas     2Dept. of Geosciences, University of Arkansas    
3Carnegie Mellon University    

🎉 Accepted to CVPR 2026 🎉

Highlights

  • Introduce a new continual learning paradigm based on DPO, termed ɸ-DPO, that addresses the catastrophic forgetting problem.
  • Present a novel Fairness DPO loss that addresses the fairness problem caused by imbalanced data, motivated by an analysis of the limitations of traditional DPO.
  • Provide a comprehensive theoretical analysis showing that ɸ-DPO addresses both catastrophic forgetting and data imbalance.
  • Construct preference labels for existing continual learning benchmarks to support DPO training in our framework.
  • Achieve state-of-the-art performance, validated through extensive experiments and ablation studies.

Abstract

Fairness in continual learning for Large Multimodal Models (LMMs) is an emerging yet underexplored challenge, particularly in the presence of imbalanced data distributions that can lead to biased model updates and suboptimal performance across tasks. While recent continual learning studies have made progress on catastrophic forgetting, the fairness problem caused by imbalanced data has received far less attention. This paper presents a novel Fairness Direct Preference Optimization (FaiDPO or ɸ-DPO) framework for continual learning in LMMs. In particular, we first propose a new continual learning paradigm based on Direct Preference Optimization (DPO) that mitigates catastrophic forgetting by aligning learning with pairwise preference signals. We then identify the limitations of conventional DPO under imbalanced data and present a new ɸ-DPO loss that explicitly addresses distributional biases. We provide a comprehensive theoretical analysis demonstrating that our approach addresses both forgetting and data imbalance. Additionally, to enable ɸ-DPO-based continual learning, we construct pairwise preference annotations for existing benchmarks in the continual learning setting. Extensive experiments and ablation studies show that the proposed ɸ-DPO achieves state-of-the-art performance across multiple benchmarks, outperforming prior continual learning methods for LMMs.
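For readers unfamiliar with the underlying objective, the sketch below shows the standard pairwise DPO loss together with one way a fairness reweighting could be attached to it. This is a minimal illustration, not the paper's formulation: the exact ɸ-DPO loss is defined in the paper, and `fairness_weighted_dpo_loss` / `group_weights` are hypothetical names for an assumed per-pair reweighting scheme.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO: push the policy's log-ratio for the chosen response
    # above its log-ratio for the rejected one, relative to a frozen
    # reference model that anchors updates to past behavior.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

def fairness_weighted_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                               ref_chosen_logps, ref_rejected_logps,
                               group_weights, beta=0.1):
    # HYPOTHETICAL: up-weight preference pairs from under-represented
    # tasks/groups; the paper's actual phi-DPO loss may take a different form.
    per_pair = dpo_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta)
    return (group_weights * per_pair).mean()

# Toy usage with summed sequence log-probabilities for a batch of 4 pairs.
lp = lambda: torch.randn(4)
loss = fairness_weighted_dpo_loss(lp(), lp(), lp(), lp(),
                                  group_weights=torch.tensor([2.0, 1.0, 1.0, 0.5]))
```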

Preference Data Curation

Example of our DPO data in the Continual Learning Benchmark.
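To make the curation concrete, a single pairwise preference record could look like the sketch below. The field names and contents are hypothetical placeholders and may not match the released annotations.

```python
# Hypothetical schema for one pairwise preference record; field names and
# values are illustrative only, not the released annotation format.
preference_pair = {
    "image": "remote_sensing/scene_0421.jpg",  # input image for the LMM
    "question": "What land-cover type dominates this scene?",
    "chosen": "The scene is dominated by cropland with sparse buildings.",
    "rejected": "This is a photo of a city skyline at night.",
    "task": "Remote Sensing",                  # continual-learning task id
}
```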

Experimental Results

Results on MLLM-CL Domain
| Method | Remote Sensing | Medical | Autonomous Driving | ScienceQA | Finance | MFT↑ | MFN↑ | MAA↑ | BWT↑ |
|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 32.29 | 28.28 | 15.59 | 35.55 | 62.56 | 34.85 | -- | -- | -- |
| LoRA-FT* | 76.54 | 50.27 | 43.01 | 43.32 | 89.85 | 66.32 | 60.60 | 64.72 | -7.15 |
| O-LoRA* | 76.94 | 41.17 | 34.18 | 39.61 | 83.22 | 60.49 | 55.02 | 60.73 | -6.83 |
| MoELoRA* | 77.63 | 49.54 | 39.08 | 41.04 | 89.21 | 66.24 | 59.30 | 64.81 | -8.68 |
| CL-MoE* | 76.58 | 52.31 | 39.65 | 45.64 | 90.21 | 66.65 | 60.88 | 64.95 | -7.22 |
| HiDe* | 74.80 | 42.29 | 34.03 | 38.01 | 79.22 | 60.83 | 53.67 | 61.81 | -8.95 |
| SEFE* | 78.43 | 52.85 | 46.21 | 47.76 | 89.33 | 66.89 | 62.92 | 66.51 | -4.97 |
| DISCO* | 77.78 | 46.25 | 50.45 | 49.51 | 89.71 | 65.27 | 62.74 | 64.92 | -3.17 |
| LoRA-FT | 69.65 | 41.59 | 25.43 | 40.88 | 87.45 | 64.98 | 53.00 | 61.13 | -14.97 |
| O-LoRA | 74.64 | 44.42 | 30.02 | 41.47 | 87.15 | 65.16 | 55.54 | 62.12 | -12.03 |
| MoELoRA | 77.54 | 41.85 | 27.62 | 40.13 | 86.75 | 64.94 | 54.78 | 61.76 | -12.70 |
| CL-MoE | 71.34 | 46.84 | 26.33 | 41.17 | 88.74 | 66.06 | 54.88 | 61.79 | -13.96 |
| HiDe | 74.31 | 48.95 | 33.21 | 38.54 | 81.55 | 60.77 | 55.31 | 60.68 | -6.82 |
| SEFE | 77.26 | 50.37 | 37.21 | 40.87 | 86.82 | 65.01 | 58.51 | 63.63 | -8.13 |
| DISCO | 76.03 | 45.20 | 43.79 | 42.33 | 88.95 | 64.43 | 59.26 | 63.35 | -6.46 |
| MR-LoRA | 80.87 | 65.32 | 54.12 | 56.71 | 91.12 | 69.64 | 69.63 | 71.06 | -0.01 |
| ɸ-DPO | 85.68 | 69.74 | 57.73 | 61.55 | 95.28 | 74.29 | 74.00 | 75.68 | -0.37 |
* denotes methods that use replay data.
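For reference, the aggregate columns in these tables can be computed from a per-task accuracy matrix. The sketch below assumes the standard continual-learning definitions (MFT: mean accuracy of each task right after it is learned; MFN: mean accuracy over all tasks after the final one; MAA: mean average accuracy over the training sequence; BWT: backward transfer); the benchmark's exact formulas may differ in detail.

```python
import numpy as np

def cl_metrics(R):
    """R[t, i]: accuracy on task i after training on task t (T x T,
    lower triangle filled). Standard definitions are assumed here; the
    benchmark's exact MFT/MFN/MAA formulas may differ in detail."""
    T = R.shape[0]
    mft = np.mean([R[i, i] for i in range(T)])                 # accuracy right after learning each task
    mfn = R[-1].mean()                                         # mean accuracy after the final task
    maa = np.mean([R[t, : t + 1].mean() for t in range(T)])    # mean average accuracy over the sequence
    bwt = np.mean([R[-1, i] - R[i, i] for i in range(T - 1)])  # backward transfer (negative = forgetting)
    return mft, mfn, maa, bwt
```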

Results on MLLM-CL Ability
| Method | OCR | Math & Logic | Visual Perception | GUI | MFT↑ | MFN↑ | MAA↑ | BWT↑ |
|---|---|---|---|---|---|---|---|---|
| Zeroshot | 31.20 | 30.20 | 60.79 | 10.00 | 33.05 | -- | -- | -- |
| LoRA-FT* | 21.80 | 32.70 | 58.38 | 28.75 | 40.32 | 35.41 | 36.32 | -6.55 |
| O-LoRA* | 29.60 | 31.30 | 60.79 | 27.50 | 39.96 | 37.30 | 36.34 | -3.55 |
| MoELoRA* | 19.80 | 32.20 | 54.19 | 30.00 | 40.35 | 34.05 | 35.39 | -8.41 |
| CL-MoE* | 25.40 | 31.80 | 60.91 | 30.00 | 41.22 | 37.03 | 37.28 | -5.59 |
| HiDe* | 24.60 | 28.40 | 30.71 | 23.75 | 36.84 | 26.86 | 33.54 | -13.30 |
| SEFE* | 25.60 | 34.80 | 57.61 | 31.39 | 42.25 | 37.35 | 37.93 | -6.53 |
| DISCO* | 34.20 | 35.00 | 61.55 | 27.50 | 40.14 | 39.56 | 37.85 | -0.77 |
| LoRA-FT | 23.60 | 33.70 | 55.84 | 32.50 | 41.28 | 36.41 | 36.58 | -6.49 |
| O-LoRA | 29.60 | 32.90 | 52.41 | 33.75 | 39.72 | 37.16 | 35.42 | -3.41 |
| MoELoRA | 26.70 | 32.80 | 56.85 | 27.22 | 39.45 | 35.89 | 36.07 | -4.75 |
| CL-MoE | 19.90 | 32.70 | 53.43 | 30.69 | 40.50 | 34.18 | 35.65 | -8.43 |
| HiDe | 24.60 | 32.10 | 46.32 | 28.75 | 37.98 | 32.94 | 34.60 | -6.72 |
| SEFE | 26.00 | 33.40 | 57.74 | 33.75 | 40.98 | 37.72 | 36.59 | -4.35 |
| DISCO | 32.90 | 33.10 | 60.15 | 30.14 | 39.02 | 39.07 | 36.57 | 0.07 |
| MR-LoRA | 33.70 | 36.20 | 65.10 | 32.50 | 41.89 | 41.88 | 38.86 | -0.02 |
| ɸ-DPO | 38.40 | 39.20 | 68.65 | 35.00 | 45.55 | 45.31 | 43.03 | -0.31 |
* denotes methods that use replay data.

Results on CoIN Benchmark
| Method | ScienceQA | ImageNet | VizWiz | Grounding | TextVQA | GQA | VQAv2 | OCR | MFN↑ | MAA↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | 69.79 | 9.93 | 45.50 | 58.47 | 57.75 | 60.77 | 66.50 | 64.93 | -- | -- |
| FineTune | 57.43 | 28.90 | 41.88 | 30.05 | 51.39 | 50.76 | 53.28 | 64.78 | 47.31 | 52.86 |
| LwF | 60.71 | 30.58 | 41.49 | 36.01 | 52.80 | 47.07 | 53.43 | 65.12 | 48.40 | 53.22 |
| EWC | 59.75 | 31.88 | 42.26 | 34.96 | 51.06 | 51.84 | 55.30 | 64.55 | 48.95 | 53.30 |
| L2P | 70.21 | 23.31 | 44.21 | 43.76 | 56.25 | 58.46 | 62.32 | 64.11 | 52.83 | 53.96 |
| O-LoRA | 72.56 | 62.84 | 48.43 | 58.97 | 57.66 | 59.14 | 63.21 | 63.31 | 60.77 | 62.60 |
| MoELoRA | 62.02 | 37.21 | 43.32 | 33.22 | 52.05 | 53.12 | 57.92 | 65.75 | 50.58 | 55.24 |
| HiDe | 73.20 | 69.28 | 50.76 | 59.18 | 56.92 | 61.33 | 67.12 | 64.76 | 62.82 | 64.70 |
| ɸ-DPO | 77.84 | 95.61 | 54.55 | 60.74 | 59.17 | 64.32 | 69.99 | 68.69 | 68.86 | 74.94 |

Acknowledgments

This work is partly supported by NSF CAREER (No. 2442295), NSF SCH (No. 2501021), NSF E-RISE (No. 2445877), NSF BIO (No. 2524623), and a USDA/NIFA award. We also acknowledge the Arkansas High-Performance Computing Center (HPC) for providing GPU servers.

BibTex

@inproceedings{truong2026faidpo,
  title={{$\phi$-DPO: Fairness Direct Preference Optimization Approach to Continual Learning in Large Multimodal Models}},
  author={Truong, Thanh-Dat and Tran, Huu-Thien and Cothren, Jackson and Raj, Bhiksha and Luu, Khoa},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  year={2026}
}