Abstract
Depression is a prevalent mental health disorder characterized by persistent sadness, loss of interest, and impaired daily functioning. Wearable monitoring systems have emerged as promising tools for continuous mental health assessment; however, they face challenges such as data privacy concerns, misclassification risks, and a limited ability to capture complex emotional states. To address these limitations, this study proposes a Self-Supervision-Enabled Compounded Multi-Modal Feature-Learning Network (S2-CFL) for depressive state classification from wearable sensor data and psychological self-reports. The framework integrates a Twin-Path Encoder–Decoder Network (TP-EDN), which extracts temporal features from raw signals, and a Densely Connected Convolution Pyramidal Transformer Network (DC2-PTN), which learns spatial representations from signal-to-image transformations. A fusion mechanism combines the multi-modal features to predict depressive states, valence, and arousal levels, while a Fine-Grained Emotion Classification Network (FGECN) categorizes emotional states into multiple classes using supervised learning models. Experimental results demonstrate that the proposed multi-modal approach improves classification performance and provides interpretable insights into emotional and depressive patterns.