LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition

Department of EECS, University of Arkansas  
WACV 2025  

Motivation

Figure: Comparison of LiGAR with conventional GAR methods in analyzing group activities.

Abstract

Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LiDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to effectively integrate multi-modal data at different semantic levels. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.
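To make the guidance idea concrete, the sketch below shows a minimal single-head scaled dot-product attention in which LiDAR tokens act as queries pooling over visual tokens. This is an illustrative simplification, not the paper's actual Cross-Modal Guided Attention: the function names, dimensions, and single-head setup are assumptions for exposition only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(lidar_feats, visual_feats):
    """Toy LiDAR-guided attention: LiDAR tokens serve as queries,
    visual tokens as keys/values (hypothetical simplification)."""
    d = lidar_feats.shape[-1]
    scores = lidar_feats @ visual_feats.T / np.sqrt(d)  # (N_lidar, N_vis)
    attn = softmax(scores, axis=-1)                     # rows sum to 1
    return attn @ visual_feats                          # (N_lidar, d)

rng = np.random.default_rng(0)
lidar = rng.normal(size=(4, 8))   # 4 LiDAR tokens, feature dim 8
visual = rng.normal(size=(6, 8))  # 6 visual tokens, feature dim 8
out = guided_attention(lidar, visual)
print(out.shape)  # (4, 8)
```

The output keeps one fused vector per LiDAR token, which is why LiDAR can act as the structural backbone: the spatial layout it encodes determines where visual information is aggregated.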

Highlights

  • We propose a novel hierarchical transformer architecture that leverages LiDAR data as a structural backbone to guide the processing of visual and textual information.
  • We introduce an Adaptive Fusion Module (AFM) that dynamically integrates the LiDAR, visual, and textual modalities while modeling temporal dependencies.
  • Through extensive experiments on diverse benchmark datasets, including JRDB-PAR, Volleyball, and NBA, we demonstrate LiGAR's superiority in multi-modal group activity recognition.
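One common way to realize the dynamic integration the AFM highlight describes is gated fusion: a softmax over per-modality gate scores yields weights that sum to one, and the fused feature is the weighted sum. The sketch below is an assumed, minimal version of that idea; the gate parameterization and dimensions are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(modality_feats, gate_w):
    """Toy adaptive fusion: score each modality feature with a shared
    gate vector, normalize the scores, and take the weighted sum
    (illustrative stand-in for a learned fusion module)."""
    stacked = np.stack(modality_feats)  # (M, d) for M modalities
    scores = stacked @ gate_w           # (M,) per-modality gate scores
    weights = softmax(scores, axis=0)   # (M,) weights summing to 1
    return weights @ stacked, weights   # fused (d,), weights (M,)

rng = np.random.default_rng(1)
d = 8
lidar_f, vis_f, txt_f = (rng.normal(size=d) for _ in range(3))
gate_w = rng.normal(size=d)  # hypothetical learned gate parameters
fused, weights = adaptive_fusion([lidar_f, vis_f, txt_f], gate_w)
```

Because the weights are input-dependent, such a gate can downweight a modality that is missing or uninformative, which is consistent with the observation that LiGAR degrades gracefully when LiDAR is absent at inference.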

Proposed Framework

Figure: Overview of the LiDAR-Guided Hierarchical Transformer (LiGAR) framework.

Experimental Results

Table: Comparison with SOTA on the JRDB-PAR dataset.
Table: Comparison with SOTA on the Volleyball and NBA datasets. Here, WS: Weakly Supervised.

Visualizations

Prior Works

  1. Chappa, Naga VS, Pha Nguyen, Alexander H. Nelson, Han-Seok Seo, Xin Li, Page Daniel Dobbs, and Khoa Luu. "SPARTAN: Self-supervised spatiotemporal transformers approach to group activity recognition." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5158-5168. 2023. [Paper]
  2. Chappa, Naga VS, Pha Nguyen, Alexander H. Nelson, Han-Seok Seo, Xin Li, Page Daniel Dobbs, and Khoa Luu. "SOGAR: Self-supervised spatiotemporal attention-based social group activity recognition." Under review. [Paper]
  3. Chappa, Naga VS Raviteja, Pha Nguyen, Page Daniel Dobbs, and Khoa Luu. "REACT: Recognize Every Action Everywhere All At Once." Machine Vision and Applications 35, no. 4 (2024): 102. [Paper]
  4. Chappa, Naga Venkata Sai Raviteja, Pha Nguyen, Thi Hoang Ngan Le, Page Daniel Dobbs, and Khoa Luu. "HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos." Sensors 24, no. 11 (2024): 3372. [Paper]