Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

1CVIU Lab, University of Arkansas     2University of Science, VNU-HCM     3Carnegie Mellon University    

🎉 Accepted to NeurIPS 2025 🎉


Highlights

  • Shuffling-based Alignment Tasks: We introduce two novel pre-training and fine-tuning tasks—image order reconstruction and text order reconstruction—that strengthen reasoning, visual understanding, and cross-modality alignment in LMMs (see the sketch after this list).
  • Directed-Token Representation: We propose a directed-token mechanism that effectively captures visual and textual knowledge, enabling the model to reconstruct correct visual sequences and enhance robust alignment.
  • Image-to-Response Guided Loss: We design a new loss function that explicitly guides responses with visual understanding, leading to consistent state-of-the-art performance on academic benchmarks for task-oriented and instruction-following LMMs.
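The shuffling-based reconstruction tasks can be pictured with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the head design, the permutation sampling, and the names OrderReconstructionHead and order_reconstruction_loss are hypothetical, and the same routine is assumed to apply to both visual and textual token sequences.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OrderReconstructionHead(nn.Module):
    # Hypothetical head: classifies each token into one of N original positions.
    def __init__(self, hidden_dim, num_positions):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_positions)

    def forward(self, tokens):
        # tokens: (batch, num_tokens, hidden_dim) -> (batch, num_tokens, num_positions)
        return self.classifier(tokens)

def order_reconstruction_loss(tokens, head):
    # Shuffle each token sequence, then train the head to recover every
    # token's original index with a cross-entropy over positions.
    batch, num_tokens, _ = tokens.shape
    perm = torch.stack([torch.randperm(num_tokens) for _ in range(batch)])
    perm = perm.to(tokens.device)
    shuffled = torch.gather(tokens, 1, perm.unsqueeze(-1).expand_as(tokens))
    logits = head(shuffled)  # (batch, num_tokens, num_tokens)
    # shuffled[b, j] came from position perm[b, j], which is the target label.
    return F.cross_entropy(logits.reshape(-1, num_tokens), perm.reshape(-1))

For example, with tokens = torch.randn(2, 16, 768) and head = OrderReconstructionHead(768, 16), the resulting loss can be added to the standard training objective during both pre-training and fine-tuning, applied once to shuffled image patch tokens and once to shuffled text tokens.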

Abstract

Large multimodal models (LMMs) have achieved impressive performance thanks to their outstanding capability in various understanding tasks. However, these models still suffer from fundamental limitations in robustness and generalization that stem from the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism that improves the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach improves reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks into the LMM's pre-training and fine-tuning phases: reconstructing the image order and reconstructing the text order. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the model to reconstruct the correct order of visual inputs. We then introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.
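Reading the abstract, the final objective couples the standard response (language-modeling) loss with the two order-reconstruction tasks and the Image-to-Response Guided term. A minimal sketch of that composition follows; the function name, the weights, and the weighted-sum form itself are assumptions made for illustration, since the paper's exact formulation is not reproduced here.

def total_alignment_loss(l_lm, l_img_order, l_txt_order, l_i2r,
                         w_img=1.0, w_txt=1.0, w_i2r=1.0):
    # Illustrative composition only: response loss plus the two
    # order-reconstruction losses and the Image-to-Response Guided term.
    # The weights are placeholders, not the paper's values.
    return l_lm + w_img * l_img_order + w_txt * l_txt_order + w_i2r * l_i2r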

Qualitative Results

Comparison of Response in Conversation Between Direct-LLaVA-7B and LLaVA-v1.5-7B on In-the-Wild Samples.
Comparison of Response in Multiple-Choice Questions Between Direct-LLaVA-7B and LLaVA-v1.5-7B on MMMU-Val.

Experimental Results

Comparison with Prior Methods on Academic-Task-Oriented Benchmarks

Comparison with Prior Methods on Benchmarks for Instruction-Following LMMs

Acknowledgements

This work is partly supported by NSF CAREER (No. 2442295), NSF SCH (No. 2501021), NSF E-RISE (No. 2445877), NSF SBIR Phase 2 (No. 2247237), and a USDA/NIFA award. We also acknowledge the Arkansas High-Performance Computing Center (HPC) for providing GPU servers.

BibTex

@inproceedings{truong2025directedtokens,
  title={Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models},
  author={Thanh-Dat Truong and Huu-Thien Tran and Thai Son Tran and Bhiksha Raj and Khoa Luu},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2025}
}