Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

1CVIU Lab, University of Arkansas     2University of Science, VNU-HCM     3Carnegie Mellon University    

🎉 Accepted to NeurIPS 2025 🎉


Highlights

  • Shuffling-based Alignment Tasks: We introduce two novel pre-training and fine-tuning tasks—image order reconstruction and text order reconstruction—that strengthen reasoning, visual understanding, and cross-modality alignment in LMMs (see the sketch after this list).
  • Directed-Token Representation: We propose a directed-token mechanism that effectively captures visual and textual knowledge, enabling the model to reconstruct correct visual sequences and enhance robust alignment.
  • Image-to-Response Guided Loss: We design a new loss function that explicitly guides responses with visual understanding, leading to consistent state-of-the-art performance on academic benchmarks for task-oriented and instruction-following LMMs.
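The shuffling-based reconstruction tasks can be pictured with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the head design, the permutation sampling, and the names OrderReconstructionHead and order_reconstruction_loss are hypothetical, and the same routine is assumed to apply to both visual and textual token sequences.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderReconstructionHead(nn.Module):
    # Hypothetical head: classifies each token into one of N original positions.
    def __init__(self, hidden_dim, num_positions):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_positions)

    def forward(self, tokens):
        # tokens: (batch, num_tokens, hidden_dim) -> (batch, num_tokens, num_positions)
        return self.classifier(tokens)

def order_reconstruction_loss(tokens, head):
    # Shuffle each token sequence, then train the head to recover every
    # token's original index with a cross-entropy over positions.
    batch, num_tokens, _ = tokens.shape
    perm = torch.stack([torch.randperm(num_tokens) for _ in range(batch)])
    perm = perm.to(tokens.device)
    shuffled = torch.gather(tokens, 1, perm.unsqueeze(-1).expand_as(tokens))
    logits = head(shuffled)  # (batch, num_tokens, num_tokens)
    # shuffled[b, j] came from position perm[b, j], which is the target label.
    return F.cross_entropy(logits.reshape(-1, num_tokens), perm.reshape(-1))

For example, with tokens = torch.randn(2, 16, 768) and head = OrderReconstructionHead(768, 16), the resulting loss can be added to the standard training objective during both pre-training and fine-tuning, applied once to shuffled image patch tokens and once to shuffled text tokens.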

Abstract

Large multimodal models (LMMs) have achieved impressive performance thanks to their outstanding capability in various understanding tasks. However, these models still suffer from fundamental limitations in robustness and generalization that stem from the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism that improves the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach improves reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks into the LMM's pre-training and fine-tuning phases: reconstructing the image order and reconstructing the text order. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the model to reconstruct the correct order of visual inputs. We then introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
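Reading the abstract, the final objective couples the standard response (language-modeling) loss with the two order-reconstruction tasks and the Image-to-Response Guided term. A minimal sketch of that composition follows; the function name, the weights, and the weighted-sum form itself are assumptions made for illustration, since the paper's exact formulation is not reproduced here.

def total_alignment_loss(l_lm, l_img_order, l_txt_order, l_i2r,
                         w_img=1.0, w_txt=1.0, w_i2r=1.0):
    # Illustrative composition only: response loss plus the two
    # order-reconstruction losses and the Image-to-Response Guided term.
    # The weights are placeholders, not the paper's values.
    return l_lm + w_img * l_img_order + w_txt * l_txt_order + w_i2r * l_i2r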

Qualitative Results

Comparison of Response in Conversation Between Direct-LLaVA-7B and LLaVA-v1.5-7B on In-the-Wild Samples.
Comparison of Response in Multiple-Choice Questions Between Direct-LLaVA-7B and LLaVA-v1.5-7B on MMMU-Val.

Experimental Results

Comparison with Prior Methods on Academic-Task-Oriented Benchmarks

Comparison with Prior Methods on Benchmarks for Instruction-Following LMMs

Acknowledgements

This work is partly supported by NSF CAREER (No. 2442295), NSF SCH (No. 2501021), NSF E-RISE (No. 2445877), NSF SBIR Phase 2 (No. 2247237), and a USDA/NIFA award. We also acknowledge the Arkansas High-Performance Computing Center (HPC) for providing GPU servers.

BibTex

@inproceedings{truong2025directedtokens,
  title={Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models},
  author={Thanh-Dat Truong and Huu-Thien Tran and Thai Son Tran and Bhiksha Raj and Khoa Luu},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}