LiGAR: LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition

Department of EECS, University of Arkansas  
WACV 2025  

Motivation

Figure: Comparison of LiGAR with conventional GAR methods in analyzing group activities.

Abstract

Group Activity Recognition (GAR) remains challenging in computer vision due to the complex nature of multi-agent interactions. This paper introduces LiGAR, a LiDAR-Guided Hierarchical Transformer for Multi-Modal Group Activity Recognition. LiGAR leverages LiDAR data as a structural backbone to guide the processing of visual and textual information, enabling robust handling of occlusions and complex spatial arrangements. Our framework incorporates a Multi-Scale LiDAR Transformer, Cross-Modal Guided Attention, and an Adaptive Fusion Module to effectively integrate multi-modal data at different semantic levels. LiGAR's hierarchical architecture captures group activities at various granularities, from individual actions to scene-level dynamics. Extensive experiments on the JRDB-PAR, Volleyball, and NBA datasets demonstrate LiGAR's superior performance, achieving state-of-the-art results with improvements of up to 10.6% in F1-score on JRDB-PAR and 5.9% in Mean Per Class Accuracy on the NBA dataset. Notably, LiGAR maintains high performance even when LiDAR data is unavailable during inference, showcasing its adaptability. Our ablation studies highlight the significant contributions of each component and the effectiveness of our multi-modal, multi-scale approach in advancing the field of group activity recognition.
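To make the guidance idea concrete, the sketch below shows a minimal single-head scaled dot-product attention in which LiDAR tokens act as queries pooling over visual tokens. This is an illustrative simplification, not the paper's actual Cross-Modal Guided Attention: the function names, dimensions, and single-head setup are assumptions for exposition only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_attention(lidar_feats, visual_feats):
    """Toy LiDAR-guided attention: LiDAR tokens serve as queries,
    visual tokens as keys/values (hypothetical simplification)."""
    d = lidar_feats.shape[-1]
    scores = lidar_feats @ visual_feats.T / np.sqrt(d)  # (N_lidar, N_vis)
    attn = softmax(scores, axis=-1)                     # rows sum to 1
    return attn @ visual_feats                          # (N_lidar, d)

rng = np.random.default_rng(0)
lidar = rng.normal(size=(4, 8))   # 4 LiDAR tokens, feature dim 8
visual = rng.normal(size=(6, 8))  # 6 visual tokens, feature dim 8
out = guided_attention(lidar, visual)
print(out.shape)  # (4, 8)
```

The output keeps one fused vector per LiDAR token, which is why LiDAR can act as the structural backbone: the spatial layout it encodes determines where visual information is aggregated.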

Highlights

  • We propose a novel hierarchical transformer architecture that leverages LiDAR data as a structural backbone to guide the processing of visual and textual information.
  • We introduce an Adaptive Fusion Module (AFM) that dynamically integrates the LiDAR, visual, and textual modalities while modeling temporal dependencies.
  • Through extensive experiments on diverse benchmark datasets, including JRDB-PAR, Volleyball, and NBA, we demonstrate LiGAR's superiority in multi-modal group activity recognition.
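One common way to realize the dynamic integration the AFM highlight describes is gated fusion: a softmax over per-modality gate scores yields weights that sum to one, and the fused feature is the weighted sum. The sketch below is an assumed, minimal version of that idea; the gate parameterization and dimensions are hypothetical, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(modality_feats, gate_w):
    """Toy adaptive fusion: score each modality feature with a shared
    gate vector, normalize the scores, and take the weighted sum
    (illustrative stand-in for a learned fusion module)."""
    stacked = np.stack(modality_feats)  # (M, d) for M modalities
    scores = stacked @ gate_w           # (M,) per-modality gate scores
    weights = softmax(scores, axis=0)   # (M,) weights summing to 1
    return weights @ stacked, weights   # fused (d,), weights (M,)

rng = np.random.default_rng(1)
d = 8
lidar_f, vis_f, txt_f = (rng.normal(size=d) for _ in range(3))
gate_w = rng.normal(size=d)  # hypothetical learned gate parameters
fused, weights = adaptive_fusion([lidar_f, vis_f, txt_f], gate_w)
```

Because the weights are input-dependent, such a gate can downweight a modality that is missing or uninformative, which is consistent with the observation that LiGAR degrades gracefully when LiDAR is absent at inference.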

Proposed Framework

Figure: Overview of the LiDAR-Guided Hierarchical Transformer (LiGAR) framework.

Experimental Results

Table: Comparison with SOTA on the JRDB-PAR dataset.
Table: Comparison with SOTA on the Volleyball and NBA datasets. Here, WS: Weakly Supervised.

Visualizations

Prior Works

  1. Chappa, Naga VS, Pha Nguyen, Alexander H. Nelson, Han-Seok Seo, Xin Li, Page Daniel Dobbs, and Khoa Luu. "SPARTAN: Self-supervised spatiotemporal transformers approach to group activity recognition." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5158-5168. 2023. [Paper]
  2. Chappa, Naga VS, Pha Nguyen, Alexander H. Nelson, Han-Seok Seo, Xin Li, Page Daniel Dobbs, and Khoa Luu. "SOGAR: Self-supervised spatiotemporal attention-based social group activity recognition." Under review. [Paper]
  3. Chappa, Naga VS Raviteja, Pha Nguyen, Page Daniel Dobbs, and Khoa Luu. "REACT: Recognize Every Action Everywhere All At Once." Machine Vision and Applications 35, no. 4 (2024): 102. [Paper]
  4. Chappa, Naga Venkata Sai Raviteja, Pha Nguyen, Thi Hoang Ngan Le, Page Daniel Dobbs, and Khoa Luu. "HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos." Sensors 24, no. 11 (2024): 3372. [Paper]