Authors
- Weihong Qi
- Zhaochen (Jason) Wang
- Jiaqi Zhu
- Christy Kim
Sponsor
Dr. Zidian Xie
Instructors
Professor Ajay Anand
Professor Cantay Caliskan
Abstract
In the digital age, Twitter has emerged as a
central arena for public health issues, particularly those
involving e-cigarette use. This study examines the dynamics of user
involvement with e-cigarette-related tweets in order to identify
patterns in how information regarding vaping is consumed and
shared. Using a dataset spanning May 2021 to March 2023,
we evaluated millions of tweets using advanced data mining
and deep learning models like RoBERTa for text classification
and BERTopic for topic modeling. We discovered key tweet
characteristics—such as tweet length, hashtag usage, and the
presence of health-related content—that have a significant
impact on user interactions, as evaluated by replies, retweets,
and likes. Our findings show that some features are associated
with higher engagement (longer tweets with more replies, links
with more retweets), whereas slang and higher sentiment scores
are associated with lower engagement. These insights can help
public health advocates and policymakers optimize social media
strategies for more effective vaping prevention and education
campaigns, since they provide a nuanced understanding of
digital communication in public health contexts.
Introduction
The growth of social media has drastically changed public debate on health-related issues, with Twitter emerging as a popular venue for conversations about e-cigarettes. Social media platforms not only allow the rapid spread of information but also serve as important places for monitoring public perceptions and behaviors around health issues. E-cigarettes, which are especially popular among teenagers, present substantial public health risks due to aggressive marketing and conflicting messages regarding their safety. This study examines Twitter data to see how different characteristics of e-cigarette-related tweets influence user engagement, giving useful insights to public health campaigners and policymakers. Using advanced data mining and deep learning techniques, our study identifies tweet features associated with increased interaction, with the goal of providing public health advocates with practical suggestions for improving anti-vaping efforts on social media.
Method
- Combine advanced data mining techniques with deep learning models
- Leverage a transformer-based model, RoBERTa, for text classification
- Utilize topic modeling through LDA and BERTopic
Data
Features
Manually labeled + RoBERTa fine-tuned Features
- question
- slang/abbr
- first-person language
- health related
Algorithm-labeled Features
- text length
- hashtag
- @mentions
- contains a link or not (most links point to images or videos, e.g. https://t.co/xxxxx)
- sentiment of the text (sentiment score)
- contains an emoji or not
- posting time (time of day: morning/night; day of week: weekday/weekend)
- number of stopwords in the text
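As an illustration of how the algorithm-labeled features above could be computed, here is a minimal sketch using only the Python standard library. It is not the project's actual pipeline: the function name `extract_features`, the tiny stopword list, the approximate emoji ranges, and the noon cut-off between "morning" and "night" are all assumptions made for this example. The sentiment score is omitted because it depends on a separate sentiment model that the poster does not name.

```python
import re
from datetime import datetime

# Hypothetical, deliberately tiny stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "it", "i", "my"}

# Approximate emoji ranges (an assumption, not a complete Unicode emoji test).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
LINK_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_features(text, posted_at):
    """Return a dict of simple, rule-based features for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "text_length": len(text),
        "n_hashtags": len(HASHTAG_RE.findall(text)),
        "n_mentions": len(MENTION_RE.findall(text)),
        "has_link": bool(LINK_RE.search(text)),
        "has_emoji": bool(EMOJI_RE.search(text)),
        "n_stopwords": sum(w in STOPWORDS for w in words),
        # Assumed split: before noon counts as "morning", otherwise "night".
        "time_of_day": "morning" if posted_at.hour < 12 else "night",
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        "is_weekend": posted_at.weekday() >= 5,
    }


# Made-up example tweet, not taken from the study's dataset.
example_text = "Is vaping safer? My doctor says no 😷 #vape #health @CDCgov https://t.co/xxxxx"
example = extract_features(example_text, datetime(2022, 7, 16, 21, 30))
print(example)
```

Keeping each feature as a plain count or boolean makes the output easy to feed into a regression as one column per feature.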
Model
F1 scores of the fine-tuned RoBERTa classifiers:
- Question: 0.98
- Slang/abbr: 0.81
- First-person language: 0.92
- Health related: 0.86
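For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for a single binary label. The poster does not say whether the reported values are binary or averaged across classes, so the binary form is an assumption, and the helper name `f1_score` and the toy labels are illustrative only.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` class: 2 * P * R / (P + R)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (1 = "health related", 0 = not); not data from the study.
toy_true = [1, 1, 1, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 1, 1]
toy_f1 = f1_score(toy_true, toy_pred)
print(toy_f1)
```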
Results
Number of Replies
The multivariate regression results show that the number of replies is negatively correlated with the use of slang, with whether the tweet is health related, with the sentiment score, and with whether the tweet contains an emoji. Tweet length, by contrast, is positively correlated with the number of replies.
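To make the reading of such results concrete, the sketch below fits a multivariate linear regression by ordinary least squares on a few made-up rows. The poster does not state which regression family was used (count outcomes such as replies are often modeled with Poisson or negative binomial regression), so OLS is an assumption here, and the helper names `solve_linear_system` and `fit_ols` as well as the toy numbers are purely illustrative. A positive coefficient corresponds to a positive association with the outcome, a negative one to a negative association.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes `a` is square and non-singular (no check is performed).
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Return [intercept, coef_1, ..., coef_k] via the normal equations."""
    x = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data: (tweet length, has_emoji). The outcome is generated so that
# length has a positive and emoji a negative effect; not data from the study.
toy_rows = [(50, 0), (100, 1), (150, 0), (200, 1), (250, 0), (120, 1)]
toy_replies = [1.0 + 0.02 * length - 0.5 * emoji for length, emoji in toy_rows]
coefs = fit_ols(toy_rows, toy_replies)
print(coefs)  # intercept, length coefficient, emoji coefficient
```

In practice a statistics library would be used instead of hand-written elimination; the point is only that each feature gets one coefficient whose sign indicates the direction of the association.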
Number of Retweets
The results show that the number of retweets is negatively correlated with slang, questions, tweet length, hashtags, mentions, the sentiment score, and emoji use. It is positively correlated with whether the tweet contains a link.
Number of Favorites
The number of favorites is negatively correlated with all the features.
Other Results
User engagement is also correlated with the time of day and day of the week. Further analysis indicates that evenings and weekends are associated with increased user engagement.
Acknowledgement
We express our gratitude to Dr. Zidian Xie and the team at the Clinical and Translational Science Institute for their sponsorship and guidance throughout this project. We also extend our thanks to Prof. Ajay Anand and Prof. Cantay Caliskan for their valuable instruction and suggestions during the project’s development.