Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media

University of Arkansas  
Under review at International Journal of Computer Vision  

Dataset Samples with Annotations

Highlights

  • We introduce the Public Health Advocacy Dataset (PHAD), a new collection of tobacco-related videos sourced from YouTube and TikTok and enriched with detailed metadata such as user engagement metrics, video descriptions, and search keywords.
  • We propose a novel two-stage method incorporating a Vision-Language (VL) Encoder that significantly improves the performance of classifying tobacco-related content by leveraging visual and textual features.
  • Our analysis reveals significant trends in user engagement, particularly the high interaction with vaping and e-cigarette content, providing critical insights for public health interventions.
  • We demonstrate the importance of incorporating contextual features such as metadata to enhance the performance of deep learning models in understanding and classifying video content.

Abstract

The Public Health Advocacy Dataset (PHAD) is a comprehensive collection of videos related to tobacco products sourced from social media. This dataset includes detailed metadata such as user engagement metrics, video descriptions, and search keywords, providing a valuable resource for analyzing tobacco-related content and its impact. Our research employs a two-stage classification approach, incorporating a Vision-Language (VL) Encoder, demonstrating superior performance in accurately categorizing various types of tobacco products and usage scenarios. The analysis reveals significant user engagement trends, particularly with vaping and e-cigarette content, highlighting areas for targeted public health interventions. The PHAD addresses the need for multi-modal data in public health research, offering insights that can inform regulatory policies and public health strategies. This dataset is a crucial step towards understanding and mitigating the impact of tobacco usage, ensuring that public health efforts are more inclusive and effective.

Dataset Statistics

The Public Health Advocacy Dataset (PHAD) includes videos from YouTube and TikTok along with detailed metadata such as user engagement metrics, video descriptions, and search keywords.
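As a rough illustration of how per-video metadata of this kind could be aggregated into engagement statistics per tobacco product, here is a minimal Python sketch. The record layout, field names, and product labels are assumptions made for this example; they are not the actual PHAD format, which is described in the supplementary material. The sample values are made up and are not PHAD data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class VideoRecord:
    # Hypothetical fields; the real PHAD schema may differ.
    video_id: str
    platform: str        # e.g. "youtube" or "tiktok"
    product: str         # hypothetical product label
    search_keyword: str
    description: str
    views: int
    likes: int
    comments: int


def engagement_by_product(records):
    """Sum video counts, views, likes, and comments per product label."""
    totals = defaultdict(
        lambda: {"videos": 0, "views": 0, "likes": 0, "comments": 0}
    )
    for r in records:
        t = totals[r.product]
        t["videos"] += 1
        t["views"] += r.views
        t["likes"] += r.likes
        t["comments"] += r.comments
    return dict(totals)


# Made-up records for illustration only.
sample = [
    VideoRecord("a1", "tiktok", "e-cigarette", "vape tricks", "demo", 1000, 120, 15),
    VideoRecord("b2", "youtube", "e-cigarette", "vape review", "review", 500, 30, 5),
    VideoRecord("c3", "youtube", "cigar", "cigar lounge", "vlog", 200, 10, 2),
]

summary = engagement_by_product(sample)
for product, stats in sorted(summary.items()):
    print(product, stats)
```

Grouping by a product label in this way is one simple route to the kind of per-product engagement comparison the dataset is meant to support; other groupings (by platform or search keyword) would follow the same pattern.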

  • 5.7K videos
  • 4.3M frames
  • 20M engagement metrics

User engagement statistics for each tobacco product in the PHAD dataset
Search keywords word cloud for the PHAD dataset

Download the PHAD Dataset

v1.0:

Notes:

  • Please follow the data format and dataset split ratio described in the supplementary material. The supplementary material also includes the Python scripts to download and organize the dataset. After the paper decision, we will release a GitHub repository with all the corresponding Python scripts.

Licensing:

The PHAD dataset is released under a CC BY-NC-SA 4.0 license. The original videos linked in the dataset remain subject to the copyrights of © YouTube and © TikTok.

Experimental Results

Comparison of different approaches in the second stage of our framework.

References

  1. Dhiraj Murthy, Rachel R Ouellette, Tanvi Anand, Srijith Radhakrishnan, Nikhil C Mohan, Juhan Lee, and Grace Kong. "Computer Vision to Detect E-cigarette Content in TikTok Videos." Nicotine & Tobacco Research, 26(Supplement_1):S36–S42, 2024.
  2. Julia Vassey, Chris J Kennedy, Ho-Chun Herbert Chang, Ashley S Smith, and Jennifer B Unger. "Scalable Surveillance of E-Cigarette Products on Instagram and TikTok Using Computer Vision." Nicotine & Tobacco Research, 26(5):552–560, 2023.
  3. Sepp Hochreiter and Jürgen Schmidhuber. "Long short-term memory." Neural Computation, 9(8):1735–1780, 1997.
  4. Joao Carreira and Andrew Zisserman. "Quo vadis, action recognition? a new model and the kinetics dataset." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

Prior Works

  1. Naga VS Raviteja Chappa, Charlotte McCormick, Susana Rodriguez Gongora, Page Daniel Dobbs and Khoa Luu. "Public Health Advocacy Dataset: A Dataset of Tobacco Usage Videos from Social Media." Under review at International Journal of Computer Vision. [paper]
  2. Naga Venkata Sai Raviteja Chappa, Page Daniel Dobbs, Bhiksha Raj and Khoa Luu. "FLAASH: Flow-Attention Adaptive Semantic Hierarchical Fusion for Multi-Modal Tobacco Content Analysis." Under review at International Journal of Computer Vision. [paper]
  3. Naga VS Raviteja Chappa, Charlotte McCormick, Susana Rodriguez Gongora, Page Daniel Dobbs and Khoa Luu. "Advanced Deep Learning Techniques for Tobacco Usage Assessment in TikTok Videos." In 2024 IEEE Green Technologies Conference (GreenTech), pp. 162-163. IEEE, 2024. [paper]