MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

1Shanghai Jiaotong University 2Shanghai AI Laboratory 3S-Lab, Nanyang Technological University

MG-LLaVA

Abstract

In this work, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with the base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B parameters, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter size, showcasing its remarkable efficacy. The models and code will be available soon.

Framework

Illustration of MG-LLaVA. Top left: the overall framework of MG-LLaVA, which consists of the Multi-Granularity Vision Flow module and an LLM. Right: the Multi-Granularity Vision Flow, which extracts visual features at multiple granularities and integrates these disparate features to ensure seamless interaction. Bottom left: the structure of the Conv-Gate Fusion module.
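To make the fusion step concrete, below is a minimal PyTorch sketch of a gated convolutional fusion block that merges high-resolution features into the base (low-resolution) features. The exact layer arrangement, kernel sizes, and how the high-resolution map is aligned to the base grid are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class ConvGateFusion(nn.Module):
    """Hypothetical sketch of a Conv-Gate fusion block: the high-resolution
    feature map is projected with a convolution, a gate is predicted from the
    concatenated low- and high-resolution features, and the gated
    high-resolution signal is added back onto the base features."""

    def __init__(self, dim: int):
        super().__init__()
        # Project high-resolution features into the base feature space.
        self.hr_proj = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # Gate computed from the concatenation of both feature maps.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low_res: torch.Tensor, high_res: torch.Tensor) -> torch.Tensor:
        # low_res, high_res: (B, C, H, W); high_res is assumed to be resized
        # to the low-resolution spatial grid by the caller.
        hr = self.hr_proj(high_res)
        g = self.gate(torch.cat([low_res, hr], dim=1))
        return low_res + g * hr  # gated residual fusion
```

The fused map would then be flattened into visual tokens and concatenated with the object-level features before being passed to the LLM.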

Quantitative Analysis

Comparison of MG-LLaVA with other methods on image understanding.

Comparison of MG-LLaVA with other methods on video understanding.

BibTeX

@article{mgllava,
  title={MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning},
  author={Zhao, Xiangyu and Li, Xiangtai and Duan, Haodong and Huang, Haian and Li, Yining and Chen, Kai and Yang, Hua},
  journal={arXiv preprint},
  year={2024}
}