OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

1Shanghai Jiao Tong University 2Shanghai AI Laboratory 3Nanjing University 4Fudan University

OmniAlign-V-SFT/DPO Dataset & MM-AlignBench

Abstract

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats, designed to improve MLLMs' alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs' alignment with human values. Experimental results show that fine-tuning MLLMs with OmniAlign-V, using either Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or improving performance on standard VQA benchmarks, thereby preserving their fundamental capabilities.
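As background, Direct Preference Optimization fine-tunes a policy directly on preference pairs without training a separate reward model. The objective below is the standard DPO formulation from the original DPO paper, not a detail specific to OmniAlign-V:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    \;-\;
    \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]
```

Here $y_w$ and $y_l$ are the preferred and rejected responses for prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model (e.g., the SFT checkpoint), and $\beta$ controls how far the policy may drift from the reference.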

Framework

The pipeline of OmniAlign-V. The dataset contains semantically rich natural images and infographic images (charts, diagrams, and posters). We define several tasks and, for each task, propose tailored methods to construct high-quality questions and images. Finally, a post-processing refinement stage further enhances the quality of the dataset.

Dataset & Benchmark

Statistics of OmniAlign-V dataset and samples in MM-AlignBench.

Our OmniAlign-V SFT dataset not only significantly improves the alignment of MLLMs with human preferences, but also boosts their performance on common downstream tasks, particularly on benchmarks such as MMVet and MMMU.

Performance of existing MLLMs on MM-AlignBench. B+/B/T/W/W+ denote Much Better / Better / Tie / Worse / Much Worse. Our LLaVA-Next-OmniAlign(OA)-32B-DPO, fine-tuned on OmniAlign-V and further optimized with DPO on OmniAlign-V-DPO, demonstrates outstanding performance, surpassing a wide range of strong MLLMs, including Qwen2VL-72B.
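To make the B+/B/T/W/W+ judgment counts concrete, here is a minimal sketch of how such pairwise verdicts can be aggregated into a single score and a win rate. The weighting scheme (+100/+50/0/-50/-100) and both function names are illustrative assumptions, not the official MM-AlignBench scoring code:

```python
# Illustrative aggregation of pairwise preference judgments.
# Keys: 'B+' = Much Better, 'B' = Better, 'T' = Tie,
#       'W' = Worse, 'W+' = Much Worse (vs. a reference model).

def alignment_score(counts: dict) -> float:
    """Weighted average of judgments, ranging from -100 to +100.

    The weights here are an assumed convention; benchmarks differ
    in how heavily they reward 'Much Better' over 'Better'.
    """
    weights = {'B+': 100, 'B': 50, 'T': 0, 'W': -50, 'W+': -100}
    total = sum(counts.get(k, 0) for k in weights)
    if total == 0:
        raise ValueError("no judgments provided")
    return sum(weights[k] * counts.get(k, 0) for k in weights) / total

def win_rate(counts: dict) -> float:
    """Fraction of comparisons won (Better or Much Better)."""
    total = sum(counts.values())
    return (counts.get('B+', 0) + counts.get('B', 0)) / total

# Example: a perfectly balanced set of 50 judgments scores 0.
balanced = {'B+': 10, 'B': 10, 'T': 10, 'W': 10, 'W+': 10}
print(alignment_score(balanced))  # 0.0
print(win_rate(balanced))         # 0.4
```

A signed score like this makes it easy to see whether a model is, on balance, preferred over the reference, while the win rate ignores judgment strength entirely.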

Samples


Samples of tasks in the OmniAlign-V dataset.