HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

1 University of Arkansas     2 Ohio State University    

🎉 Accepted to CVPR 2025 🎉

Our Scene HyperGraph-based reasoning approach enhances Multimodal LLMs for Video Scene Graph Generation, enabling higher-order reasoning and multi-way interaction modeling in videos.

Highlights

  • Proposes a unified Scene HyperGraph that integrates spatial relationships (entity scene graphs) and causal transitions (procedural graphs), enabling higher-order reasoning beyond pairwise connections in Multimodal LLMs for improved Video Scene Graph Generation (VidSGG).
  • Introduces the Video Scene Graph Reasoning (VSGR) dataset with 1.9M frames from third-person, egocentric, and drone views, supporting five key tasks (Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning).

Abstract

Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
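
To make the idea above concrete, the sketch below shows one way a unified Scene HyperGraph could be represented and flattened into text for an LLM: per-frame entity scene graphs hold pairwise spatial triplets, a procedural graph records causal transitions between frames, and hyperedges group more than two entities to capture multi-way interactions. This is a minimal illustration under our own naming (Triplet, SceneHyperGraph, and serialize_for_llm are hypothetical and not part of the released code), not the actual HyperGLM implementation.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Triplet:
        subject: str    # e.g., "person_1"
        predicate: str  # e.g., "holding"
        obj: str        # e.g., "cup_1"

    @dataclass
    class EntitySceneGraph:
        """Pairwise spatial relationships observed in a single frame."""
        frame_idx: int
        triplets: List[Triplet] = field(default_factory=list)

    @dataclass
    class HyperEdge:
        """A multi-way interaction connecting several entities at once."""
        entities: List[str]
        interaction: str
        frames: List[int]

    @dataclass
    class SceneHyperGraph:
        """Unified structure: entity scene graphs, causal transitions, and hyperedges."""
        frame_graphs: List[EntitySceneGraph] = field(default_factory=list)
        transitions: List[Tuple[int, int, str]] = field(default_factory=list)  # (frame_i, frame_j, causal label)
        hyperedges: List[HyperEdge] = field(default_factory=list)

    def serialize_for_llm(hg: SceneHyperGraph) -> str:
        """Flatten the hypergraph into text lines that can be placed in an LLM prompt."""
        lines = []
        for g in hg.frame_graphs:
            for t in g.triplets:
                lines.append(f"[frame {g.frame_idx}] {t.subject} {t.predicate} {t.obj}")
        for i, j, label in hg.transitions:
            lines.append(f"[transition] frame {i} -> frame {j}: {label}")
        for e in hg.hyperedges:
            lines.append(f"[hyperedge] {' + '.join(e.entities)}: {e.interaction} over frames {e.frames}")
        return "\n".join(lines)

In the paper's framing, the unified HyperGraph conditions the LLM for both generating current relationships and anticipating future ones; the plain-text serialization above is only a stand-in for that injection step.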

Our Approach

HyperGLM Framework

The VSGR Dataset

VSGR builds on videos and annotations from the ASPIRe and AeroEye datasets for Scene Graph Generation and Scene Graph Anticipation, and adds new annotations for Video Question Answering, Video Captioning, and Relation Reasoning.
3.7K videos · 1.9M frames · 61.1K reasoning tasks
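
For a rough sense of how a single annotated video could cover all five tasks, here is a small hypothetical record; the field names and layout are illustrative assumptions, not the dataset's released annotation format.

    # Hypothetical VSGR-style record; all field names and values are illustrative only.
    example_record = {
        "video_id": "example_0001",
        "view": "drone",  # third-person | egocentric | drone
        "scene_graphs": [  # Scene Graph Generation: observed frames
            {"frame": 0, "triplets": [["person_1", "riding", "bicycle_1"]]},
            {"frame": 30, "triplets": [["person_1", "crossing", "road_1"]]},
        ],
        "anticipated_graphs": [  # Scene Graph Anticipation: future frames
            {"frame": 60, "triplets": [["person_1", "entering", "building_1"]]},
        ],
        "qa_pairs": [  # Video Question Answering
            {"question": "What does person_1 do after crossing the road?",
             "answer": "They enter the building."},
        ],
        "caption": "A cyclist crosses the road and enters a building.",  # Video Captioning
        "relation_reasoning": [  # Relation Reasoning
            {"query": "Why does person_1 stop riding bicycle_1?",
             "answer": "To cross the road on foot before entering the building."},
        ],
    }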

Experimental Results

Comparison (%) on VSGR and PVSG for Scene Graph Generation.
Comparison (%) on VSGR and Action Genome for Scene Graph Anticipation.
Comparison (%) on VSGR for Video Question Answering.
Comparison (%) on VSGR for Video Captioning.
Comparison (%) on VSGR for Relation Reasoning.

Our Related Work


[1] Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, and Khoa Luu. "HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

[2] Trong-Thuan Nguyen, Pha Nguyen, Xin Li, Jackson Cothren, Alper Yilmaz, and Khoa Luu. "CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos." In The Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024.

BibTeX


    @inproceedings{nguyen2024hig,
      title={HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding},
      author={Nguyen, Trong-Thuan and Nguyen, Pha and Luu, Khoa},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
      pages={18384--18394},
      year={2024}
    }
  

    @inproceedings{nguyen2024cyclo,
      title={CYCLO: Cyclic Graph Transformer Approach to Multi-Object Relationship Modeling in Aerial Videos},
      author={Nguyen, Trong-Thuan and Nguyen, Pha and Li, Xin and Cothren, Jackson and Yilmaz, Alper and Luu, Khoa},
      booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
      year={2024}
    }