Classifying causes of depression from social media posts using machine learning and NLP
Authors
Thakur, Ayushi
Degree
Master of Applied Technologies (Computing)
Grantor
Unitec, Te Pūkenga – New Zealand Institute of Skills and Technology
Date
2025
Supervisors
Pashen, Mohsen
Keivanmarz, Ali
Type
Masters Thesis
Keyword
depression (psychology)
social media
natural language processing (computer science)
pattern recognition
Citation
Thakur, A. (2025). Classifying causes of depression from social media posts using machine learning and NLP (Unpublished document submitted in partial fulfilment of the requirements for the degree of Master of Applied Technologies (Computing)). Unitec, Te Pūkenga - New Zealand Institute of Skills and Technology
https://hdl.handle.net/10652/6944
Abstract
RESEARCH QUESTIONS
1 How accurately can machine learning models (e.g., SVM, XGBoost) classify causes of depression expressed in social media posts?
2 Which feature representation—TF-IDF or contextual embeddings (e.g., BERT)—yields better performance for cause classification?
3 How does model performance differ when trained on expert-labeled data versus publicly available self-reported depression datasets?
4 What are the advantages and limitations of using expert-annotated data for classifying causes of depression in terms of accuracy and generalizability?
5 Does combining expert and public datasets improve the robustness and reliability of cause classification models?
ABSTRACT
Depression is a serious global mental health challenge, affecting hundreds of millions of people and causing grievous personal, social, and economic consequences. Detecting depression early through social media platforms has attracted recent research interest. However, this work generally fails to distinguish between depression in general and its underlying cause. This is a significant oversight, since a therapeutic intervention oriented towards a specific cause, such as trauma, stress, gender discrimination, or domestic violence, tends to produce far better results. Advances in machine learning (ML) and natural language processing (NLP) offer an excellent opportunity to analyze large-scale social media data to detect mental health indicators. A substantial gap remains, however, in employing these technologies to detect the causes of depression, especially against expert-verified standards. This study proposes a new framework for classifying depression by the causes expressed in social media posts. Two complementary datasets are integrated into the study: (1) a small, expert-classified, high-quality dataset annotated by mental health professionals under DSM-5 guidelines, and (2) a large, publicly available dataset of self-disclosed depressive posts. Features were extracted with TF-IDF and BERT contextual embeddings. Classification used supervised learning with SVM and XGBoost, while latent structures were explored with unsupervised learning using K-Means. The results indicate that models trained on the merged dataset performed better than those trained on either source alone, with XGBoost + BERT embeddings achieving the best accuracy and F1 score. Interestingly, unsupervised clustering highlighted latent patterns compatible with known depression causes. These results confirm the importance of merging expert knowledge with broader social data and provide a scalable and interpretable method for mental health monitoring.
The study contributes by shifting the focus from general depression classification to cause-specific classification, allowing for more targeted and timely support and intervention.
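The TF-IDF plus SVM configuration named in the abstract can be sketched roughly as follows. This is a minimal illustration only, assuming scikit-learn is available; it is not the thesis's implementation. The six posts are invented for the example and do not come from either dataset, the two labels are simply two of the causes the abstract lists, and `LinearSVC` with default settings is an assumption because the abstract does not state the kernel or hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy posts (NOT from the thesis datasets). The labels are two of
# the causes named in the abstract, used here purely for illustration.
posts = [
    "deadlines at work keep piling up and I cannot cope",
    "exam pressure and workload are crushing me",
    "my job workload and deadlines leave me exhausted",
    "I still have nightmares about the accident",
    "flashbacks of the assault keep coming back",
    "since the accident I relive it in nightmares",
]
causes = ["stress", "stress", "stress", "trauma", "trauma", "trauma"]

# TF-IDF turns each post into a sparse weighted term vector; a linear SVM
# then learns a separating boundary between the cause labels.
# Hyperparameters are scikit-learn defaults, chosen as an assumption.
model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(posts, causes)

# Classify two unseen (also invented) posts.
new_posts = [
    "work deadlines are overwhelming",
    "nightmares about the accident again",
]
predicted = model.predict(new_posts)
for text, cause in zip(new_posts, predicted):
    print(f"{cause}: {text}")
```

Replacing the `TfidfVectorizer` step with precomputed BERT sentence embeddings, or the classifier with XGBoost, would give the other configurations the abstract compares. A realistic evaluation would also need a held-out test split and accuracy and F1 scoring, which this toy example omits.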
Copyright holder
Author
Copyright notice
All rights reserved
