
URMC-CTSI, Engage Vapor: Strategies from E-cigarette Social Analytics

Authors

  • Weihong Qi
  • Zhaochen(Jason) Wang
  • Jiaqi Zhu
  • Christy Kim

Sponsor

Dr. Zidian Xie

Instructors

Professor Ajay Anand

Professor Cantay Caliskan

Abstract

In the digital age, Twitter has emerged as a central arena for public health issues, particularly those involving e-cigarette use. This study examines how users engage with e-cigarette-related tweets in order to identify patterns in how information about vaping is consumed and shared. Using a dataset spanning May 2021 to March 2023, we analyzed millions of tweets with data mining and deep learning models, including RoBERTa for text classification and BERTopic for topic modeling. We identified key tweet characteristics, such as tweet length, hashtag usage, and the presence of health-related content, that have a significant impact on user interactions, as measured by replies, retweets, and likes. Our findings show that while some characteristics, such as tweet length and links, increase engagement, the use of slang and specific emotional tones can decrease it. These insights can help public health activists and politicians optimize social media strategies for more effective vaping prevention and education campaigns, since they provide a nuanced understanding of digital communication in public health contexts.

Introduction

The growth of social media has drastically changed public debate on health-related issues, with Twitter emerging as a popular place for conversations about e-cigarettes. These platforms not only allow information to spread rapidly, but also serve as important places for monitoring public perceptions and behaviors around health issues. E-cigarettes, which are especially popular among teenagers, present substantial public health risks due to aggressive marketing and conflicting messages regarding their safety. This study examines Twitter data to see how different characteristics of e-cigarette-related tweets influence user engagement, giving useful insights for public health campaigns and policymakers. Using data mining and deep learning techniques, our study identifies tweet features associated with increased interaction, with the goal of providing public health advocates with practical suggestions for improving anti-vaping efforts on social media.

Method

  • Combine advanced data mining techniques with deep learning models
  • Leverage a transformer-based model, RoBERTa, for classification
  • Utilize topic modeling through LDA and BERTopic

Data

Features

Manually labeled + RoBERTa fine-tuned Features

  1. question 
  2. slang/abbr 
  3. first-person language 
  4. health related

Algorithm-labeled Features

  1. text length
  2. hashtags
  3. @mentions
  4. contains a link or not (most links point to an image or video, e.g. https/t.xxxxx)
  5. sentiment of the text (sentiment score)
  6. contains an emoji or not
  7. when it was posted (morning/night => time of day; weekday or weekend => day of week)
  8. number of stopwords in the text
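Most of the algorithm-labeled features above can be computed with simple rules. The following is a minimal sketch of that idea, not the project's actual code: the function name `extract_features`, the tiny stopword list, the emoji ranges, and the hour boundaries for the time-of-day buckets are all assumptions made for illustration. The sentiment score is left out because it requires a separate sentiment model.

```python
import re
from datetime import datetime

# Tiny illustrative stopword list (assumption); the project's list is not given.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "i", "my", "for", "on"}

HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
LINK_RE = re.compile(r"https?://\S+")
# Rough emoji ranges (assumption): common pictographs, symbols, and dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def extract_features(text: str, posted_at: datetime) -> dict:
    """Compute simple rule-based features for a single tweet."""
    # Drop links before tokenizing so URL fragments are not counted as words.
    words = re.findall(r"[a-z']+", LINK_RE.sub(" ", text.lower()))
    hour = posted_at.hour
    if 5 <= hour < 12:          # bucket boundaries are an assumption
        time_of_day = "morning"
    elif 12 <= hour < 18:
        time_of_day = "afternoon"
    else:
        time_of_day = "night"
    return {
        "text_length": len(text),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
        "has_link": bool(LINK_RE.search(text)),
        "has_emoji": bool(EMOJI_RE.search(text)),
        "time_of_day": time_of_day,
        "is_weekend": posted_at.weekday() >= 5,  # Saturday=5, Sunday=6
        "n_stopwords": sum(w in STOPWORDS for w in words),
    }
```

Calling `extract_features(tweet_text, tweet_timestamp)` on each tweet yields one row of features that can be joined with the RoBERTa-labeled features before modeling.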

Model

To prepare our training dataset, we first labeled an initial set of 200 tweets, which yielded a satisfactory kappa score and assured us that our annotations were reliable. We then manually labeled 2200 tweets. Using this labeled data, we fine-tuned four RoBERTa models and applied them to our entire dataset. Each model classifies one of the four manually labeled features, and performance varies by feature.
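The agreement check on the initial 200 tweets can be illustrated with Cohen's kappa for two annotators. This is a sketch under assumptions: the text does not say which kappa variant was used or how many annotators took part, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than percent agreement when one label dominates.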

  • Question: F1 score = 0.98
  • Slang/abbr: F1 score = 0.81
  • First-person language: F1 score = 0.92
  • Health related: F1 score = 0.86
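For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a binary label; treating each feature as a binary task scored on the positive class is an assumption, since the averaging scheme is not stated, and the function name `f1_score_binary` is hypothetical.

```python
def f1_score_binary(y_true, y_pred, positive=1):
    """F1 score for the positive class of a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```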

Results

Number of Replies

The multivariate regression results show that the number of replies is negatively correlated with the amount of slang, whether the tweet is health-related, the sentiment score, and whether the tweet contains an emoji. Tweet length, however, is positively correlated with the number of replies.
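To make the interpretation of signs concrete, here is a minimal ordinary least squares sketch in plain Python. It is an assumption that the regression resembles OLS; the exact model specification is not given here, and the function names are hypothetical. A positive fitted coefficient means the engagement count rises with that feature when the others are held fixed, and a negative one means it falls.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, coef_1, ..., coef_k] minimizing squared error."""
    X = [[1.0] + [float(v) for v in r] for r in rows]  # prepend intercept column
    k = len(X[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)
```

With rows of the form (text length, contains slang) and reply counts as the target, `fit_ols` returns one coefficient per feature plus an intercept; any data passed to it here would be synthetic and not the project's dataset.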

 

 

Number of Retweets

The results show that the number of retweets is negatively correlated with slang, questions, tweet length, hashtags, mentions, sentiment score, and emoji. It is, however, positively correlated with whether the tweet contains links.

 

Number of Favorites

The number of favorites is negatively correlated with all the features.

 

 

Other Results

User engagement is also correlated with the time of day and day of the week. Further analysis indicates that evenings and weekends are associated with increased user engagement.

Acknowledgement

We express our gratitude to Dr. Zidian Xie and the team at the Clinical and Translational Science Institute for their sponsorship and guidance throughout this project. We also extend our thanks to Prof. Ajay Anand and Prof. Cantay Caliskan for their valuable instruction and suggestions during the project’s development.