1 Introduction
The
World Health Organization (WHO) has released a report stating that the COVID-19 pandemic has disrupted mental health services in 93% of the world's countries.
1 The need for psychiatric treatment has increased as a result of lockdowns implemented in affected areas as a preventive measure. According to [
36], physiological anxiety factors such as worry about becoming ill and concern about the future increase with the duration of a lockdown. Depression is one of the most common mental health disorders, and it significantly affects the daily lives of those affected [
4]. It can cause a variety of mental and physical problems that limit a person's actions and can seriously affect the person's environment, career, or schooling, and even the most fundamental human needs such as sleeping and eating. Although depression is treatable, the condition often goes undiagnosed. This is due to several reasons, including patients' inability to recognise the problem themselves and the social stigma associated with mental disorders. In recent years, the use of social media has increased, which has opened new doors to the diagnosis of depression. Digital media allow users to share and express their thoughts without restriction [
4]. Moreover, people suffering from depression often turn to digital media for information about their condition or to communicate with others about their concerns and symptoms.
In the field of natural language processing,
recurrent neural networks (RNNs) have been widely used. RNNs can achieve a level of performance that is considered state-of-the-art by exploiting contextual information in a “memory” mechanism modelled via hidden/cell states [
3]. Despite its advantages, this memory mechanism makes the model's decisions difficult to interpret: as the hidden states are transmitted over time, different pieces of information are interwoven across timesteps, naturally giving RNN models the appearance of a “black box.” Attention weights can sometimes be revealing, but they are not always understandable; they often amount to a collection of values with little meaningful interpretation. Moreover, analysing attention weights requires an understanding of the theoretical underpinnings of how RNNs work. A first-time user may therefore find such models difficult to understand, which limits their widespread use in the real world. A prototypical technique instead seeks out examples, or prototypes, from the dataset for a given sequence and then derives a decision. This process is similar to the way people, such as doctors and judges, decide a new case by referring to similar previous cases. Such prototypes provide intuitive clues and evidence about how the model reached a conclusion in a way that a layperson can understand, which matters from an interpretability perspective. However, existing prototype-based methods locate prototypes at the paragraph level, which makes it difficult to break the analysis down to the sentence level, for example the connections and flow among the individual sentences of a paragraph.
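The prototype idea can be made concrete with a short sketch. The following is a minimal, hypothetical illustration (the function and variable names are ours, not from any cited implementation): a new input is classified by its nearest stored prototype, and that prototype is returned as the human-readable evidence for the decision.

```python
import numpy as np

def prototype_decision(query_vec, prototypes, labels):
    """Classify an input embedding by its nearest prototype and return
    the prototype index as human-readable evidence for the decision."""
    dists = np.linalg.norm(prototypes - query_vec, axis=1)  # distance to each stored case
    nearest = int(dists.argmin())                           # closest previous "case"
    return labels[nearest], nearest

# Toy usage: three stored prototype embeddings with class labels.
prototypes = np.random.randn(3, 16)
labels = ["depressed", "control", "depressed"]
label, idx = prototype_decision(np.random.randn(16), prototypes, labels)
print(f"predicted '{label}' because the input resembles prototype #{idx}")
```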
Personality traits have been explored in the context of online texts and the Internet in general, and the results show a strong relationship between these traits and online behaviour [
2]. In addition, several predictive models have been presented that extract the psychological background of users from traces of their online behaviour and link it to personality traits. However, relationship features such as attachment orientations and other forms of bonding have been largely ignored in the online context, despite the fact that user activity correlates strongly with aspects of social behaviour. For this reason, examining a relationship profile from an application standpoint is extremely important and provides a wealth of information about individuals' social profiles.
Conventional methods have viewed mental illness as a supervised text classification task [
4]. Depression-based training instances are used to create diverse datasets. Although annotated clinical practice data are of high quality, their development is often expensive and time-consuming. Due to privacy regulations and ethical constraints imposed by institutions, their dissemination is limited in several ways. Data from digital and social media have been used to avoid institutional restrictions and study people in clinical experiments. According to Mukhiya et al. [
25], depression is caused by a combination of current and long-term conditions. Under difficult conditions, it is not always easy to determine the cause [
16]. It is important to recognise early signs and symptoms of depression to get help as soon as possible. Many Internet forums and social media now allow people to contact each other anonymously and talk about distress, grief, and possible treatment options [
17]. People from all over the world can openly express thoughts and feelings, according to Muhleck et al. [
24]. Online monitoring can be a proactive and promising technique for discovering high-risk concerns. It can promote meditation and increase overall well-being [
29].
According to WHO,
2 anxiety is one of the most disabling diseases in the world, affecting approximately 264 million people worldwide [
10]. According to Mazza et al. [
20], untreated depression can worsen and cause lifelong anguish. In the worst cases, anxiety can lead to suicidal thoughts. According to the World Health Organization, more than 800,000 people die by suicide each year, and suicide is the second leading cause of death among 15- to 29-year-olds. One contributing factor is that between 76% and 85% of people with mental disorders in low- and middle-income countries receive no treatment. Lack of financial assistance and support, a shortage of qualified physicians, misperceptions, and social stigma against the mentally ill are all barriers to effective treatment [
10]. The biggest barriers that prevent people from seeking treatment are negative attitudes, shyness, and fear of disclosure. People often feel ashamed, humiliated, and afraid of having their mental anguish thoroughly explored [
26]. For these reasons, people are often reluctant to admit that they are unhappy or to seek psychological care and counselling. The prevention and treatment of mental health problems have become a global priority in health systems.
Text classification is typically treated as a sequence-based methodology and mostly relies on attention [
4]. An attention mechanism lets a model with variable-size inputs focus on the most relevant correlated features; when the attention is computed within a single sequence, it is called self-attention (a minimal sketch is given after this paragraph). An attention mechanism is often used to compute a representation of a given sequence. For text classification and embedding creation, self-attention is advantageous in bidirectional RNNs [
43]. Based on previous studies [
2,
4,
25], when users disclose personal information, they tend to use expressions and words that could indicate psychological problems such as depression. We hypothesise that people with the same illness, as opposed to healthy individuals, have comparable topic interests in personal statements. For example, the following is a personal statement from a user diagnosed with depression: “I am prescribed medication to treat my depression. I cannot calm my anxiety.” Highlighting the importance of such terms through feature selection and phrase weighting could therefore help distinguish depressed individuals from non-depressed individuals.
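To make the self-attention computation above concrete, here is a minimal sketch in plain NumPy. It is an unparameterised version (no learned projections) for illustration only; real implementations add trainable query/key/value matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Unparameterised self-attention over one token sequence.
    X: (seq_len, d) array, one embedding per token."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)       # pairwise relevance between tokens
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ X                  # contextualised token representations

# Toy usage: five tokens with 8-dimensional embeddings.
contextual = self_attention(np.random.randn(5, 8))
print(contextual.shape)  # (5, 8)
```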
Recent years have seen rapid growth in the importance of incorporating digital healthcare and remote health monitoring technology into the overall healthcare industry. However, those who work in healthcare are acutely aware that a positive public perception of integrated healthcare systems results from a complex interaction among a number of factors, and the relationship between users and mobile and wearable health technologies needs strengthening. In light of this, medical experts might find it useful to investigate patients' emotional responses to remotely administered therapies. Finding acceptable solutions will keep e-health applications competitive and lead to new tools and approaches that shape future healthcare applications.
Artificial intelligence (AI) here refers to a collection of algorithms that can identify human emotions by analysing facial expressions; in this sense, AI can rapidly “read” the feelings a person is experiencing. Monitoring a patient's emotional state enables medical professionals to determine swiftly whether a patient is cooperating with the diagnostic plan and to assist patients in emergency situations. Additionally, identifying emotions via mobile health monitoring applications and other forms of intelligent healthcare technology would be a significant advance, because emotionally intelligent and technically advanced algorithms can instantly identify stressed patients and help them maintain healthy lifestyle patterns. Emotionally aware and intelligent systems have become an essential component of modern medical practice because of their role in determining the most effective course of treatment for individual patients. However, significant work remains before the technology can properly capture human emotions in the medical field. It is necessary to understand the risks associated with collecting and using patients' emotional data, particularly in the context of early disease identification, efficient patient communication, and emergency assistance, and to draw on the vast body of research in this field to handle the risks and problems connected with emotion-aware intelligent technology in healthcare.
The need for mental health services has increased significantly as a direct result of the unpredictability of lockdown zones. The probability of severe health problems rises when a person is subjected to physiological stresses such as anxiety about sickness and uncertainty about their health, and isolation and a lack of physical activity can make stress worse. In addition, those who work in healthcare frequently experience anxiety, a lack of protective equipment, and a highly stressful atmosphere, although it can be difficult to establish the exact source of these conditions; this can lead to anxiety and melancholy. Depression is among the most distressing diseases found anywhere in the world, with roughly 264 million people around the globe suffering from depressive illnesses. The majority of mental illnesses are misdiagnosed because there is insufficient verbal interaction and a lack of trust. Depression is the underlying cause of death by suicide for more than 800,000 people each year, and because the issue is not being addressed, suicide is the second leading cause of death among those aged 15 to 29. Between 76 and 85 percent of people with mental disorders in low- and middle-income countries receive no treatment. Aside from personal concerns, other challenges for early detection include a lack of resources, untrained healthcare providers, social stigma, and the need for a speedy response. Because anxiety and shyness can make it difficult to maintain a steady state of mind, some people feel embarrassed about their inability to do so, and this remains a concern even after an exhaustive evaluation of a patient's psychological anguish. As a consequence, people afflicted with depression may be unable to take part in additional treatment to address their existing medical issues.
This study identifies people who show indicators of depression; early diagnosis increases the chances of appropriate treatment. Unlike methods based on word frequency alone, the proposed method focuses on frequent terms used as nodes in user contexts, which enables early diagnosis of this disorder. We propose an attention-based architecture for node classification of graph-structured data, inspired by recent works. The goal is to compute the hidden representation of each node in a graph by attending over its neighbours using a self-attention mechanism (a minimal sketch of this computation is given after the contribution list). The attention architecture has several intriguing properties: (1) the operation is efficient, since it can be parallelised across node-neighbour pairs; (2) different weights can be assigned to neighbours, emphasising the important ones; and (3) the method supports inductive learning (without prior data), so the model can generalise to unseen graphs, which matters in our case because digital media continually introduce new words. In particular, this article makes the following contributions:
(1)
We have proposed an attention-based word weighting scheme that is used to select the significant depression terms.
(2)
We have implemented Graph Attention-based inductive learning for depression symptom classification in mental health interventions.
(3)
We have performed extensive experiments to evaluate the generalisation of the learning system by reducing data annotation, and the results indicated that the designed model outperforms the state-of-the-art models.
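As promised above, the following is a minimal sketch of the graph-attention computation in the standard single-head GAT formulation; it is illustrative rather than our exact implementation, and all tensor shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def gat_layer(h, adj, W, a):
    """Single-head graph-attention layer in the standard GAT form:
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), softmax-normalised over
    each node's neighbourhood, then used to mix neighbour features.
    h: (N, F_in) node features; adj: (N, N) adjacency with self-loops;
    W: (F_in, F_out) projection; a: (2 * F_out,) attention parameters."""
    z = h @ W                                   # project node features
    f_out = z.size(1)
    src = z @ a[:f_out]                         # per-node source term
    dst = z @ a[f_out:]                         # per-node target term
    e = F.leaky_relu(src.unsqueeze(1) + dst.unsqueeze(0))  # (N, N) logits
    e = e.masked_fill(adj == 0, float("-inf"))  # attend to neighbours only
    alpha = torch.softmax(e, dim=1)             # attention coefficients
    return alpha @ z                            # updated node representations

# Toy usage: 4 nodes, 16 input features, 8 output features.
N, f_in, f_out = 4, 16, 8
adj = ((torch.rand(N, N) > 0.5).float() + torch.eye(N)).clamp(max=1)
out = gat_layer(torch.randn(N, f_in), adj,
                torch.randn(f_in, f_out), torch.randn(2 * f_out))
print(out.shape)  # torch.Size([4, 8])
```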
The rest of the article is organised as follows: Section
2 summarises related work. Section
3 describes the primary technique used to conduct the experiment, collect data, and build the model. Section
4 discusses the results and findings. Finally, Section
5 summarises the results and makes suggestions for further research.
2 Related Work
Social media data have been used to analyse issues related to public health. Numerous health topics (including allergies, migraines, and obesity) have been explored using Twitter data [
32]. Methods from computational linguistics have also been used to study mental illnesses such as attention deficit hyperactivity disorder. Research on depression has been conducted in a variety of fields, including psychology, medicine, and linguistics, exploring the reasons, symptoms, and causes underlying the diagnosis of depression. Interactions between users leave traces that reflect their personality traits, and the study of these traces is essential to a variety of disciplines, including social science, psychology, and marketing. Although studies of personality prediction based on online behaviour have increased substantially, the focus has been on individual personality traits rather than relational components. As a direct result, information provided by users on social media platforms has emerged as an interesting area of study for topics related to depression [
14].
Tweets can be analysed to determine how strongly content indicates depression. Research has shown that individuals commonly publish information about their depression in generic terms, such as symptoms of depression, therapy, and other related topics. Researchers have drawn a number of distinctions between Twitter users who suffer from depression and those who do not: people who are depressed are less likely to engage in social activities, are more likely to express themselves negatively, and are more concerned with medical and religious issues. Nadeem et al. collected messages on Twitter, over the course of a year, from individuals who indicated that they had been diagnosed with depression [
27]. They conducted experiments combining bag-of-words features with traditional classifiers.
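As a rough illustration of that setup, the snippet below pairs bag-of-words features with a conventional classifier using scikit-learn; the two example tweets and labels are placeholders, not data from [27].

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpus; in the cited work these would be a year of tweets.
tweets = ["I can't sleep and nothing feels worth doing",
          "great run this morning, feeling energised"]
labels = [1, 0]  # 1 = self-reported depression diagnosis, 0 = control

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["feeling hopeless again today"]))
```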
A new model has been developed known as
Time-delayed Multifaceted Factorizing Personalized Markov Chains (Time-delayed Multifaceted-FPMC) [
15]. This model accounts for many different types of historical user actions, such as clicks, collections, and add-to-carts, in addition to past sales, and also considers the diminishing influence of previous actions. The parameters of the proposed model are learned within a unified framework called Bayesian Sparse Factorization Machines, in which the concepts of classical factorisation machines are applied to a more flexible learning framework and the time-delayed multifaceted FPMC is trained using the Markov Chain Monte Carlo method. Extensive evaluations on a number of real-world datasets have shown the technique to be significantly more effective than a number of existing purchase recommendation systems.
The study used a dataset of 1.7 billion tweets and geolocated each tweet with respect to data collected from 59 countries [
19], then provided a list of 22
Online Values Inquiries (OVI), each containing a unique set of questions from the World Values Survey. These questions cover a variety of topics, including religion, science, and abortion. The method has shown that the solution is quite capable of capturing human values online for a variety of countries and topics of interest, and that certain offline values, particularly those related to religion, are significantly associated (up to c = 0.69, p = 0.05) with the corresponding online values in the digital domain. The approach is general: specialists in the social sciences, such as demographers and sociologists, can use their existing expertise to create their own online values studies, enabling them to study human values in an online environment.
One line of research addresses this deficiency by developing a predictive model for a holistic personality profile in
online social networks (OSNs). The model will combine socio-relational traits (attachment orientations) with traditional personality traits [
11]. Specifically, a feature engineering process has been developed that extracts a variety of features (taking into account behaviour, language, and emotions) from users’ OSN accounts. These features are then used to create a profile of the person. They have developed a
Machine Learning (ML) model that predicts users' trait values based on the features extracted from user profiles. The proposed model architecture is informed by psychological theory; more specifically, it exploits the interrelationships that exist between the different parts of a person's personality and, as a result, achieves higher accuracy than current state-of-the-art methods. To demonstrate the usefulness of this method, the authors applied the model to two different datasets (typical online social network users and social media opinion leaders) and compared the psychological profiles of the two groups. A clear distinction can be made between them by focusing on both Big Five personality traits and attachment orientations. This research points to a potentially fruitful line of inquiry for further studies that categorise and characterise OSN users.
In addition, approaches based on
Deep Learning (DL) have been used to predict depression. The authors examined
Convolutional Neural Networks (CNNs) and
Recurrent Neural Networks (RNNs) to detect Twitter users who are depressed [
30] and obtained competitive results. Written texts containing possible depressive signs are likely to be highly subjective, and some authors have effectively exploited such material to detect depressed users. Using sentiment analysis to determine the polarity of tweets, they found that individuals suffering from depression sent longer emotional tweets. The importance of emotions in posts about despair was also used to diagnose depression on Twitter and Reddit [
5]. Recently, another work developed a one-class classification strategy to detect sadness on Reddit [
1]. Other studies have examined the network topology of interactions to determine the presence of depression [
34]. They tested a probabilistic model that accounted for a variety of variables, including emotional expressiveness, language style, and social network aspects.
Another study used public Twitter data to examine psychological issues [
7]. They quickly collected data on anxiety, bipolar disorder, and seasonal affective disorder, among other mental illnesses. The classifier was developed to distinguish each group from the control group based on textual information [
21]. The classifier also detects correlations and extracts insights from quantitative and meaningful psychological Twitter signals. Lin et al. used a
Deep Neural Network (DNN) to detect stress. The authors used data from four microblogs to evaluate their proposed four-layer DNNs and compared them against traditional ML algorithms such as Random Forest, SVM, and Naive Bayes. Neuman et al. [
28] developed a different approach, “Pedesis,” which uses NLP dependency parsing to decompose web pages containing fears and extract extended conceptual domains from metaphorical connections. The domain knowledge then defines the phrases used as depression metaphors, and this vocabulary is used to independently evaluate the degree of depression in a text.
The predictive power of
Neural Networks (NNs) is determined by the hidden layers of the network as well as the architecture. Network tuning requires careful selection among the various available layers, topologies, and hyperparameters. During training of the optimised network, the input features can yield a higher-order representation of the vector [
4]. This more accurate feature representation is learned so that the network generalises better and improves its predictive ability. In research on modern NNs, the network with both the lowest computational complexity and the highest predictive capacity is often selected. In the past two decades, the number of architectural concepts has increased; the main differences between hidden layers, types of layers, shapes, and links between layers have been identified by Vijayakumar and colleagues [
35].
Using ML approaches, Wainberg et al. [
39] have shown how higher-dimensional features can be extracted from tabular data. CNNs accumulate patterns from visual pixels; the information contained in the pixels and the diversity between pixels enhance the network's ability to learn and predict, which the network achieves by exploiting translation invariance across pixels. In natural language processing, the RNN architecture has been developed and used for sequential data in tasks such as machine translation, speech synthesis, and time series analysis [
41]. The RNN model typically consists of an encoder and a decoder: the encoder compresses the input sequence into a fixed-length vector, from which the decoder generates the output. To process the input features, the model uses multiple gates, each of which depends on the loss function. The alignment of the input and output vectors is another problem that must be overcome when designing an encoder and decoder for an RNN; the order is determined by the values of the immediately adjacent elements. Another extension of the RNN is the attention mechanism [
4]. It uses attention over the input vector to selectively assign weights to the many different inputs, thereby improving accuracy. The decoder can make use of the context vector and the weights linked with it to represent the features accurately, in accordance with the relative relevance and position of the information at hand. The weights of the RNN model can be trained either through the architecture or through the feature representation to make predictions; this learning process takes into account both the attention weights and the context vector. Such an attention mechanism can take several network forms, such as soft, hard, or global attention. Soft attention was proposed to condense contextual information: the expected (weighted-average) hidden state is used to construct the context vector, which clarifies how the input characteristics are embedded and reduces information loss.
Other differences introduced by Luong et al. [
18] are local and global attention. Local attention is a middle ground between soft and hard attention. The model chooses the focus point for each input batch, which helps speed up convergence; in the local attention model, the position of the attention window can be learned with the help of a prediction function, so the technique predicts where attention will be focused. At both the local and global levels, domain-specific data analysis is necessary to achieve computational efficiency in these two distinct forms of attention.
When it comes to the world of medical imaging, the importance of image security cannot be overstated [
8]. A number of research projects have examined medical imaging, where the confidentiality of images must be preserved with no loss of data when encrypted. Due to the complexity, redundancy, and volume of the data, traditional encryption techniques cannot easily be applied to e-health data, especially when patient data are transmitted over open channels. Patients risk having their data compromised, because images differ from text in terms of data loss and confidentiality. Researchers have identified such vulnerabilities and responded by developing a variety of strategies for encrypting images. According to the cited study's findings, currently available methods have application-specific security flaws, and its results provide the healthcare industry with an encryption method that is both lightweight and effective: a lightweight scheme that protects medical images using two different permutation techniques. The security level of the technique and its execution time were evaluated against more conventional encryption methods using test images, and a series of tests showed the proposed image cryptosystem to be more effective.
The demand for devices capable of performing tasks automatically and communicating with each other has increased as the world has progressed [
9]. To meet this demand, the concept of the
Internet of Things (IoT) was developed. The IoT allows connected smart devices to communicate with each other over a network to perform a variety of tasks, including automating work and making intelligent judgements. It has enabled people to delegate some of their tasks to machines, which monitor their surroundings and adjust their behaviour accordingly, saving people time and energy. These machines use sensors to collect important data, which is then analysed at a computing node; based on the results, the node determines how the devices should operate intelligently. IoT security has been enhanced by DL-based techniques to defend against attacks, but IoT networks still pose a significant obstacle in industries with complex infrastructures, since such systems require long training times to process large datasets drawn from the network's past data flow. Traditional ML methods include decision trees, logistic regression, and SVMs. The cited work experimentally characterises cryptographic algorithms as asymmetric or symmetric and studies their attack rates in real-time DL and complex IoT applications. The speed of encryption and decryption of certain algorithms was evaluated using simulations: the tests encrypt and decrypt the same plaintext five times and compare the average time, the maximum key size of each encryption algorithm is given in bytes, and the average time each device takes to compute the algorithm is compared. In the simulation, a collection of plaintexts (a password and a paragraph-sized text) achieves the targeted results compared with existing techniques in real-time DL networks for IoT applications.
Some other recent related works are summarised next. In Reference [
40], the authors use a polymorphic graph attention network for establishing good practice in Chinese
Named Entity Recognition (NER). The methodology can easily be combined with both pre-trained and sequence encodings. The results support multi-head attention and show strong inference speeds. In Reference [
44], the authors propose TrajGAT, an embedded map graph attention network for vehicle trajectory data. Their results indicate strong performance against other state-of-the-art methods. While focused on vehicular technology, their work, as the authors note, may be applicable in other domains. In Reference [
13], the authors propose GANLDA, a graph attention network used for disease prediction. The authors combine heterogeneous data on
long non-coding RNA (lncRNA) to help connect lncRNAs to diseases. Their novel work was shown to outperform many other state-of-the-art methods in a 10-fold cross-validation procedure.
Text-based techniques for diagnosing depression in the context of mental illness can be found in Table
1 [
1,
4,
31,
42]. Multiple datasets (eRisk 2017/2018, Twitter, Amazon Mechanical Turk-based labelling, and Reddit) were used to evaluate the effectiveness of these methods, which discover a wide range of features capturing users' verbal, behavioural, and emotional expressions. The motivation for this approach stems from personality theory in psychology, which holds that personality traits are reflected in a variety of online behaviours and actions. Since the proposed feature engineering method considers a larger number of traits than previous research, it enables psychological profiling that is more broadly applicable. Using a crowdsourcing platform, a labelled dataset was acquired to which the methodology could be applied and tested, creating a baseline dataset tagged with the respective psychological profiles. The Twitter
application programming interface (API) was used extensively to collect tweets from the recruited participants' Twitter accounts and to apply the proposed feature engineering technique. To improve the predictions for individual traits, the methodology uses a large set of the collected (psychologically related) features, carefully selects the subsets with the strongest predictive power for each trait, and exploits correlations between personality and relationship behavioural (i.e., social) features. In this way, the approach not only predicts the social components of a psychological profile along with the personality facets (which are not captured by existing personality prediction models), but also uses the different attributes for more accurate holistic profile prediction. This work shows how node- and edge-level hypergraphs can support the training approaches and extends the trainable instances using a semantic expansion strategy [
3]. The proposed model aims to reduce the data annotation overhead. Thus, the technique contributes to the generalisation of the system. Semantic vectors are classified based on hypergraph information obtained from the context in which they occur. Based on the semantic information, the resulting word embedding selects a subset of the unlabelled text. This method finds instances using unlabelled text extracted from the hypergraph learning process. The approach integrates the additional training points into the model training.
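A hedged sketch of this selection step is shown below: unlabelled sentence embeddings are compared against the centroid of the labelled data, and only sufficiently similar instances are passed to the next training cycle. The function name and the similarity threshold are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_unlabelled(labelled_vecs, unlabelled_vecs, threshold=0.7):
    """Pick unlabelled instances whose embeddings are semantically close
    to the labelled data; both inputs are (n, d) sentence embeddings."""
    centroid = labelled_vecs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    unit = unlabelled_vecs / np.linalg.norm(unlabelled_vecs, axis=1, keepdims=True)
    sims = unit @ centroid                  # cosine similarity to the centroid
    return np.where(sims >= threshold)[0]   # indices to add to the next cycle

picked = select_unlabelled(np.random.randn(20, 300), np.random.randn(50, 300))
print(len(picked), "instances selected")
```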
The vast body of NLP literature contains a variety of techniques proposed for emotion detection. Emotional knowledge-based (EKB) systems, however, have not yet been thoroughly researched. The EKB has two components: a word-sense lexicon and embeddings learned in a variety of contexts. We first propose an embedding that integrates context-dependent word meanings with the word-sense-based depression lexicon and with emotional information from Internet forums. Words that convey context and sentiment are the building blocks of emotional intelligence, and the embedded sentence structure represents what was learned. The retrieved embedding helps capture the semantic composition of the text in a more meaningful way. Co-occurrence frequencies of the vectorised words are computed from linguistic patterns. The learned model represents each unique word of the author as a fixed-length vector, with very similar words lying close together in the vector space. Most pre-trained embeddings target general-purpose communication and therefore cannot be applied directly to emotional analysis. The bespoke mental health model is instead trained with the help of the word-sense model and transfer learning, which lets us expand the corpus. The primary reason is that most embeddings are trained on open-source data, such as Wikipedia texts and Twitter data used for sentiment. For example, the words “sad” and “pleased” both convey feelings, yet they denote very distinct mental states. Because of this, it is necessary to extend the embedding by utilising word sense.
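Training such a domain-specific embedding can be sketched with gensim's FastText, which also handles subword information for unseen terms; the two toy sentences stand in for the Internet-forum corpus, and the hyperparameters are illustrative.

```python
from gensim.models import FastText

# Toy forum posts; in practice these would be the mental health corpus.
corpus = [["prescribed", "medication", "to", "treat", "my", "depression"],
          ["cannot", "calm", "my", "anxiety", "today"]]

model = FastText(vector_size=300, window=5, min_count=1, epochs=10)
model.build_vocab(corpus)
model.train(corpus, total_examples=len(corpus), epochs=model.epochs)

print(model.wv.most_similar("depression", topn=3))
```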
4 Experimental Result and Analysis
In this study, we used the processing settings listed in Table
3. After first performing the preprocessing described in Section
3.2, we created the custom depression embedding and then performed the graph attention-based node feature extraction. The embedding is used to categorise the data from Twitter.
In this research, we used processing configurations as given in Table
5. The results of the
fasttext-base model are shown in Figure
4. The trend plot illustrates that the model performs better as the precision curves for the training, development, and test sets converge towards the upper left corner. To assess the algorithm, we computed the ROC curve, precision, recall, and F-measure. When comparing multiple classifiers, it can be useful to summarise each classifier's performance in a single metric. A typical choice is the area under the ROC curve (AUC): it equals the probability that a randomly chosen positive observation is ranked higher than a randomly chosen negative one, i.e., it is equivalent to the two-sample Wilcoxon rank-sum statistic. A classifier with a high AUC may occasionally perform worse than one with a lower AUC at a particular operating point; nevertheless, AUC is an excellent overall indicator of the accuracy of a predictive model. Because the number of instances for each symptom is not the same, we report both macro and micro averages. The difference is significant: macro averaging weights each class equally, whereas micro averaging weights each sample equally, and the two coincide only when every class has the same number of samples. The model used a graphical word representation regardless of the context in which a word appeared, yet words used consecutively can clearly take on different meanings, and combining words with lexicons into a single representation may distort a term's meaning. To succeed, the embedding approach must accept and capture multiple meanings of the same word, even when the resources occur in an imbalanced training corpus.
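The macro/micro distinction (and the AUC computation) can be reproduced directly with scikit-learn; the label arrays below are illustrative.

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 1, 1, 0, 1, 2, 2, 0]  # symptom classes (illustrative)
y_pred = [0, 1, 0, 0, 1, 2, 1, 0]

print(f1_score(y_true, y_pred, average="macro"))  # every class weighted equally
print(f1_score(y_true, y_pred, average="micro"))  # every sample weighted equally

# AUC needs scores rather than hard labels; shown here for a binary case.
print(roc_auc_score([0, 1, 1, 0], [0.2, 0.9, 0.4, 0.1]))
```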
An emotional lexicon consists of a collection of words with emotional connotations. A prediction based on unreliable data will be less accurate than one based on reliable data; to estimate and reduce this noise, we use a Bi-LSTM network. LSTM networks are generally used to predict a time series of data; Bi-LSTM networks, unlike LSTMs, are bidirectional. The forward LSTM learns from past data, reading the sequence forwards and using the input values at each time, the gate weights of the LSTM cells, and the forward/backward output. The backward LSTM likewise learns from future values, reading the sequence in reverse with the same inputs, gate weights, and outputs. The output gate stores the data related to the two directional passes.
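In PyTorch, the bidirectional behaviour described above comes from a single flag; the sizes below are illustrative, not our tuned configuration.

```python
import torch
import torch.nn as nn

# Forward direction reads left-to-right, backward reads right-to-left,
# and the two hidden states are concatenated at every timestep.
bilstm = nn.LSTM(input_size=300, hidden_size=128,
                 batch_first=True, bidirectional=True)

x = torch.randn(8, 40, 300)   # batch of 8 sequences, 40 tokens each
outputs, (h_n, c_n) = bilstm(x)
print(outputs.shape)          # (8, 40, 256): forward and backward states
```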
A description of the hyper-tune architecture setting can be found in Table
6. Each text tensor has an embedding dimension of 300, and the matrices have the same number of columns regardless of the number of words being processed simultaneously. The max-pooling layer produces output of the same size as its input: the max operation is applied across the outputs created by the various filters of the same size, and all outputs are merged into a single feature. Using the max operation guarantees that the most important aspect of the sentence or document is recorded. The primary benefits of pooling are a significantly reduced number of parameters, and hence a lower computational cost and less overfitting. The final feature map is produced by concatenating the feature vectors independently derived from the text tensors; it contains the most essential and conspicuous characteristics of the text-based feature vector extraction. We used four dense layers to achieve a gentler dimensionality reduction; in addition to the first and second dense layers, there is a third dense layer containing 64 neurons, and the neurons of the final dense layer enable binary classification. A dropout probability of 0.5 is applied to each dense layer. The first three dense layers use the ReLU activation function, while the final dense layer is handled by the Softmax function.
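A sketch of this classification head is given below. Only the 64-neuron third layer, the 0.5 dropout, and the ReLU/Softmax placement are taken from the description above; the remaining layer widths (including the 256-dimensional concatenated input) are assumptions for illustration.

```python
import torch.nn as nn

head = nn.Sequential(
    nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5),  # first two layer widths
    nn.Linear(128, 96), nn.ReLU(), nn.Dropout(0.5),   # are assumed
    nn.Linear(96, 64), nn.ReLU(), nn.Dropout(0.5),    # 64 neurons, as stated
    nn.Linear(64, 2), nn.Softmax(dim=-1),             # softmax output layer
)
```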
The content of online communications defines particular emotion keywords, which is why the filters are applied first to the text part of the communication. Feature vectors derived from both branches, which belong to the same text modality, are then concatenated, and features are extracted from the concatenated vectors. Finally, to converge the framework into a multi-label classification system, multiple dense layers are added with a gradually decreasing number of neurons for a softer reduction in dimensionality, ensuring that all of the distinguishing features used for classification are preserved. By combining a semi-supervised attention neural network with a self-assembling architecture, optimal feature engineering can be performed on both labelled and unlabelled corpora alongside a supervised neural network. In this way, we obtained the dimensions of the neural network layers for the network's final configuration.
In this study, a dense layer was used to generate a vector of selected features provided in parallel with the Bi-LSTM. An activation function and a regulariser complete the normalisation process, with the regulariser helping to reduce overfitting. The next layer merges the vectors to build the network, and the number of hidden neurons in each Bi-LSTM unit must be chosen accordingly. The attention layer has two tasks. First, the output of the upstream layer must be integrated into the downstream layer, which is relatively straightforward: in the parallel structure, a layer assembles the output data in channel order, since the input data for the parallel structure already contain the channel-dependency information. Second, because the different representations come from a number of inputs, the representations essential for detection must be filtered by redistributing their weights according to a scoring mechanism. Given inputs from the multiple channels of the Bi-LSTM networks, the attention mechanism computes attention score vectors from the most recent hidden state. After the score values of the representations are determined with the dot product (dot-product attention is the more time-saving choice here), each score is normalised using a Softmax function. A context vector is then formed by aligning and summing the weighted vectors into the final vector.
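The dot-product attention step described above reduces, in sketch form, to scoring each hidden state against the most recent one, normalising with a softmax, and summing; the shapes are illustrative.

```python
import torch

def attention_pool(hidden_states, query):
    """Dot-product attention over Bi-LSTM hidden states.
    hidden_states: (T, d); query: (d,), e.g. the last hidden state."""
    scores = hidden_states @ query           # (T,) dot-product scores
    weights = torch.softmax(scores, dim=0)   # normalised attention scores
    return weights @ hidden_states           # (d,) context vector

states = torch.randn(40, 256)                # e.g., Bi-LSTM outputs
context = attention_pool(states, states[-1]) # query with the last state
print(context.shape)                         # torch.Size([256])
```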
When using RNNs with GRUs, experiments have shown that long-term memory cells perform well on sequential tasks. The output layer of a bidirectional LSTM architecture can preserve the hidden state of the last step. We used a bidirectional LSTM architecture that also reads input token lists from the end and sets a parameter for the forward unrolled LSTMs. Consequently, each token position has two input states, which are concatenated to produce the output state and can also be passed on to the attention layer. The dropout ratio was set to 0.5 to prevent overfitting of the LSTM layer.
As in Figure
5, the
Bidirectional LSTM achieved a high precision-recall curve value of 0.81. The bidirectional technique makes it possible to read the sequence forwards and backwards while capturing the context at the same time. Through the hidden states, the relational meaning of the lexicon and graph structure is retained and stored; consequently, this method produces very few false positives. The model's embedding feature, which incorporates verbal sentiment, has improved performance; however, the regularisation and averaging of the model are affected by long-range text relations. In some batches the model is unable to differentiate between symptoms and lower classes, and when the text grows longer and sentiments overlap within each sentence, the model does not perform optimally.
The semantic vectors are arranged according to the semantic information obtained from the context in which they occur. Using the available semantic information, the constructed similarity metrics clarify the selection of an unlabelled text subset. The unlabelled text, which was previously separated out and ignored, is filtered with the proposed approach and then included in the subsequent cycle of the active learning process. This strategy improves the training of the model by incorporating the newly obtained training points, and the cycle continues until the best solution is found, at which point all the unlabelled text has been converted into the training set that underlies the algorithm. The expanded lexicon of emotions helps reduce overfitting and improves generalisation. The attention-based system structure supports a vector representation with nodes and an edge-based hypergraph: a node's attention contributes to the context of the text, while an edge's attention contributes to the emotional trigger represented in the text. As an online adaptive intervention, this technology can be used in virtual sessions with mental health patients to facilitate the recovery process.
The
Bidirectional LSTM with attention outperformed
fasttext and
Bidirectional without attention in Figure
6. The chosen batch weights and the attention mechanism make the relevant words easier to learn, and segmenting sentences according to the importance of each word contributes to the improved results. The results illustrate that attention, when equipped with a location prediction function, can capture the temporal fluctuations associated with emotional phrases. Given the lower error on the training and development sets relative to the test set, the model should generalise more effectively. The high true-positive and false-negative rates on the training data indicate that global attention contributes to the expansion of the training dataset. It can therefore be concluded that local word location plays a major role in learning emotional segments, while sentence attention plays a major role in context learning.
A text written by a patient describes the patient as a whole, using characteristic keywords. In a first step, the filters are applied in parallel to the text. The extracted features are then concatenated by early fusion, since the features extracted from both branches correspond to the same text modality. Successive dense layers with a decreasing number of neurons are added for a finer dimensionality reduction, converging the architecture to a categorical classification system in which all specific information used for classification is preserved. A semi-supervised graph attention neural network coupled with a temporal self-assembling architecture enables optimal feature extraction from both labelled and unlabelled corpora.
As can be seen in Figures
7 and
11, the outcomes are improved by the attention method, which uses the predictive capacity of the words' locations to improve accuracy. Incorporating lexicon-based neighbours also contributes to an overall improvement. When the attention technique is applied, uncertainty sampling can help identify unlabelled instances that lie at the decision boundary, and a classification can be made based on the model's level of confidence. Psychiatrists can use the model to determine how a patient's emotional experiences relate to their mental symptoms. To assist in identifying symptoms and classifying phrases, the graph-based approach makes use of attention nodes and lexicons. It can be implemented as a computer-based method for Internet-based therapies and gives the psychiatrist the assistance needed to evaluate the patient's notes during therapy.
In this way, we were able to set up a neural network with the required number and size of layers. Thanks to the proposed semi-supervised architecture, ensemble predictions for unknown labels are as accurate as the actual labels themselves, which is to be expected. Consequently, training the model with a relatively modest number of labelled samples yields good results, and increasing the amount of labelled data does not significantly change the model's performance. Experiments with the proposed framework on our datasets show that the model can categorise text and achieve excellent performance even with a small amount of training data. Although the proposed system has demonstrated its superiority on a number of measures, it still has limitations that need to be addressed. We optimised our recognition algorithm to handle patient-authored text data, but future studies could apply it to text data from other sources. The system does not consider the trustworthiness or authenticity of the source, which can also play a critical role in detecting underlying mental health problems. Our approach uses a multi-category analysis system to classify texts into nine different identification symptoms; the solution could be further improved by classifying text on a rating or scale to make it more robust. To make the system useful in the real world, it could be developed into a standalone application or a browser plugin. Using web scrapes of multiple versions of a patient's story, a deeper understanding of the story's purpose could be gained by analysing the timeline.
Since the dataset is not very large, the preprocessing steps—such as tokenisation, sentence splitting, dependency parsing, and negative example selection—can have a significant impact on the performance of the models. The best F1 score we can obtain on the test set is 90%, as shown in Figures
8 and
9. To apply the model to the preprocessed data version, it must be re-implemented. Since both models use the same method to partition the data, the significant performance differences can be attributed to changes in the preprocessing of the data. This again highlights the importance of evaluating the efficiency of different models on the same preprocessed data.
The effect of lexicon expansion, shown in Figures
8 and
9, is assessed by comparing results on the unlabelled data with and without expansion. The results show how the learning cycle proceeds under semi-supervised conditions and how it is affected by its environment. Given the limited context available and the enormous number of emotion types to be recognised, we found emotion recognition difficult: emotion-bearing phrases usually consist of very long words with specific terminology built into them. It was also difficult to determine the sense of an emotion word, because such a word occurs in a limited number of contexts yet covers a wide range of applications; in more than half of the cases, the term in question refers to a context rather than its literal meaning. This presented quite a challenge in completing the work.
The time elapsed since an event clearly influences the number of error years recorded for events that occurred within a narrower period after the event. Several other factors also matter, notably the intensity and duration of the emotional event itself. In summary, the temporal expressions detected in the texts written by patients were arranged in descending order of the probability score determined for each expression; the lexicon score of an expression determines its position in the text, with a higher score indicating a more prominent placement. By considering the temporal expressions, an estimate can be made of the time required to attend to the document. The methods used to divide the documents affect the precision with which the duration of emotional concentration can be calculated; however, the precision values are not significantly influenced by the interval between the time of the appointment and the time of occurrence. The second evaluation approach, the average error over the years, needs further study; it is a measure calculated from the difference between the estimated and classified emotions.
The diversity and consistency aspects of lexicons, however, help to focus attention on the most informative words by assigning higher relevance values to these words, as mentioned in Table
7 and shown graphically in Figure
10. For this reason, the process of aggregating vector representations can learn more effectively: even when the phrases that trigger an alarm are interwoven with a number of other words, the most informative terms in the emotion lexicon are still highlighted through their higher relevance values. The representation-aggregation approach is thus able to learn additional true hidden vectors.
The largest gain comes from the expansion of the emotion lexicon. The emotion monitoring model assumes that any sentence containing all the essential emotion expressions and the corresponding trigger words is likely to convey the meaning of the event, and there is a high probability that the emotions expressed in that line play a similar role in that event. To begin, we carefully evaluate the accuracy of the automatically labelled data: we randomly selected 500 samples from the automatically labelled information. The selected samples are statements with highlighted triggers, labelled arguments, and the relevant event types and argument role assignments for each statement. Adding automatically generated and labelled data is one way to extend the training sets; we then determine whether the performance of an event extractor trained with these extended training sets improves as a result. In this way, the effectiveness of the proposed method can be demonstrated. In this study, emotion expansion was used to filter out distracting word triggers and to expand nominal triggers. The fact that event identification results are higher when the feature set is used than when it is not suggests that the feature set is also beneficial when expressive neural methods are used to identify emotional information. Given the large differences between arguments, the features outlined for extending the argument language are clearly beneficial as well.
The test achieves a high true-positive rate. These findings indicate that certain key phrases contribute more than others to the classification of depression symptoms and are crucial for understanding them. By focusing on the words that contribute most, the network also reduces the amount of computation required. This is because the model is aware of the target word in the task and has learned both the positive and negative connotations associated with the subject matter. Given the complexity of mental-health data, grammatical and lexical variations may still affect performance.
As shown in Figure 11, the attention-based probability and the triggered word rules in a sentence may assist in treatment. The visualisation lets the psychiatrist see the classification reasoning as words, i.e., depression, heaven, and peace, for the diagnostic decision. The highlighted weights represent two symptoms, S2 (feeling down, depressed, or hopeless) and S3. Notably, the model successfully recognises the triggering rules and highlights the words that help psychiatrists take notes and make diagnoses. The visualisation makes the context behind symptom extraction easier to see, resulting in a more understandable and less complex attention model. The proposed model thus allows a better understanding of the reasoning behind a prediction: when a psychiatrist detects an irregularity, the word-level attention indicates the evidence supporting the model's decision.
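A toy sketch of such word-level highlighting, assuming attention weights are already available per token; the helper and the example sentence are illustrative only.

```python
def highlight_attention(tokens, weights, top_k=2):
    """Mark the top_k tokens by attention weight so a clinician can see
    which words drove the prediction (e.g. 'depression', 'peace')."""
    top = sorted(range(len(tokens)), key=lambda i: weights[i], reverse=True)[:top_k]
    return " ".join(f"[{t}]" if i in top else t for i, t in enumerate(tokens))

tokens = "i feel like depression has taken my peace away".split()
weights = [0.02, 0.05, 0.03, 0.40, 0.04, 0.05, 0.06, 0.30, 0.05]
print(highlight_attention(tokens, weights))
# i feel like [depression] has taken my [peace] away
```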
Table 8 shows the comparison with state-of-the-art models [2, 3, 4]. The goal of this article is to investigate how depression symptoms can be extracted from mental-health interventions using Natural Language Processing (NLP) and attention-based active learning with high-entropy sampling. To this end, we propose an approach based on synonym expansion using semantic vectors. The semantic vectors are clustered according to the semantic information obtained from the contexts in which they occur. This semantic information feeds similarity metrics that select a subset of as-yet-unlabelled text; the selected text is then labelled and fed into the next cycle of the active learning mechanism, which retrains the model on the additional training points. The cycle repeats until all unlabelled text has been moved into the training set or an optimal solution is found. To increase the number of trainable instances, we use a semantic clustering approach. This strategy helps generalise the learning system and reduces the annotation workload during learning. Our model achieves a lexicon-expansion F1 of 0.90 while retaining high explainability, in contrast to other models that achieve relatively high accuracy alone.
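One plausible reading of this cycle is a standard entropy-based pool-sampling loop, sketched below with scikit-learn's LogisticRegression as a stand-in classifier; the actual system layers semantic clustering and similarity metrics on top of this, so treat it as a simplified skeleton.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def entropy(probs):
    """Predictive entropy per example; higher = more uncertain/informative."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def active_learning(X_lab, y_lab, X_pool, y_pool, rounds=5, batch=20):
    """Entropy-based pool sampling: query the most uncertain texts each
    round, move them into the training set, and retrain."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if len(X_pool) == 0:
            break
        idx = np.argsort(-entropy(model.predict_proba(X_pool)))[:batch]
        X_lab = np.vstack([X_lab, X_pool[idx]])
        y_lab = np.concatenate([y_lab, y_pool[idx]])  # y_pool stands in for the human annotator
        keep = np.setdiff1d(np.arange(len(X_pool)), idx)
        X_pool, y_pool = X_pool[keep], y_pool[keep]
    return model
```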
A semantic clustering technique [4] is used to increase the number of trainable cases. For this purpose, the author proposes a strategy based on synonym expansion using semantic vectors. The linguistic vectors are clustered based on contextual information extracted from the environments in which they occur, and the resulting semantic similarity metrics guide the selection of unlabelled texts. Using the linguistic features of real texts written by patients, a further study proposes a fuzzy, deep-attention-based classification model that amplifies emotion lexicons [2]. Over time, active learning approaches can augment the learned dataset and the fuzzy rules.
As a result, the approach may simplify labelling efforts for mental-health applications. The proposed model can address issues related to per-class language length, data sources, and development techniques, and provides a benchmark for each performance level. By presenting weighted terms, this article also provides explainability. Another work constructs graph attention networks (GATs) that use masked self-attention layers to overcome the problem of text classification in the context of depression [3]. These networks assign a weight to each node in a neighbourhood based on the characteristics and emotions of its neighbours, without requiring costly global structures such as similarity matrices or prior architectural knowledge. In that study, hypernyms are used to expand the emotional vocabulary, but the architecture is not compared against competitors. In our experiments, combining an emotion lexicon with an attention network yields an F1 of 0.87 while remaining interpretable and transparent.
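For orientation, a single masked self-attention (GAT) layer can be sketched in a few lines of NumPy. This is a generic single-head formulation, not the cited architecture, and it assumes the adjacency matrix `A` already includes self-loops.

```python
import numpy as np

def gat_layer(H, A, W, a, leaky=0.2):
    """One masked self-attention (GAT) layer: each node attends only to its
    neighbours (mask from adjacency A, self-loops included), with no global
    similarity matrix required."""
    Z = H @ W                                   # (n, d') projected node features
    n = Z.shape[0]
    # pairwise attention logits e_ij = LeakyReLU(a^T [z_i || z_j])
    e = np.array([[np.concatenate([Z[i], Z[j]]) @ a for j in range(n)]
                  for i in range(n)])
    e = np.where(e > 0, e, leaky * e)           # LeakyReLU
    e = np.where(A > 0, e, -1e9)                # mask out non-neighbours
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # row-wise softmax over neighbours
    return np.tanh(alpha @ Z)                   # aggregated node representations
```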
4.1 Discussion
Openness is the tendency to have an open mind and imaginative ideas. When we focused on individual traits and their relationship to emotional attributes, we found that leaders scored higher than normal users. Openness to new experiences is another trait typically associated with individuals who have achieved high levels of social influence, so these results were expected. Despite the highly social environment, individuals in our sample scored relatively poorly on symptoms 8–9, suggesting that the actions associated with these traits are not easily reflected online and therefore not easily detected. For the other traits, we find that ratings are high not only for anxiety but also for avoidance. One possible explanation is that the individuals in our sample do not intend to cultivate meaningful connections with others on the platform, instead using it solely to publicise their own efforts and accomplishments and to interact with users whose content contains interesting cognitive material.
Psychiatrists can create and prescribe an appropriate treatment programme using symptom-based visualisation and symptom-based probabilities. NLP provides an elegant technique for adapting and delivering imagery, while IDPT systems are helpful in computerised exercises for psycho-education. Both the LSTM and the attention model help achieve high accuracy in symptom prediction; the bidirectional LSTM with an output-attention layer performed particularly well at multi-label symptom classification. The active learning model can increase its knowledge level over time. Our model achieved a ROC score of 0.85 and supports attention-based visualisation that helps in recommending symptoms. The presented method adapts IDPT systems to perform psycho-educational activities, learning automatically from texts written by patients; the adapted intervention generates individualised feedback on the recommended activities. In future work, we will incorporate a character-level text classifier; in addition, stricter regularisation could improve performance while reducing overfitting. These results also point to an interesting research direction where our methodology could be valuable, namely in-depth analyses of differences between how people behave in the real world and in a therapeutic setting. Such analyses could reveal groups or clusters of people whose persona differs markedly from their actual personality, as well as the variables that give rise to this phenomenon. While these initial results illustrate the utility of our work in this direction (a more comprehensive study is beyond the scope of this article), we believe they may stimulate further research, for example by considering alternative user groups (such as artists or politicians) or other classification methods.
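A minimal sketch of such a bidirectional LSTM with an output-attention layer, written in PyTorch under our own assumptions (embedding and hidden sizes, and nine symptom labels); it returns per-symptom probabilities together with the attention weights used for visualisation.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Sketch: bidirectional LSTM with an output-attention layer for
    multi-label symptom classification (one sigmoid per symptom)."""
    def __init__(self, vocab_size, emb_dim=128, hidden=64, n_symptoms=9):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(2 * hidden, 1)        # scores each timestep
        self.out = nn.Linear(2 * hidden, n_symptoms)

    def forward(self, x):                          # x: (batch, seq_len) token ids
        h, _ = self.lstm(self.emb(x))              # (batch, seq_len, 2*hidden)
        alpha = torch.softmax(self.att(h), dim=1)  # attention over timesteps
        context = (alpha * h).sum(dim=1)           # attention-weighted summary
        return torch.sigmoid(self.out(context)), alpha.squeeze(-1)

# Usage: per-symptom probabilities plus attention weights for visualisation.
model = BiLSTMAttention(vocab_size=5000)
probs, attn = model(torch.randint(1, 5000, (2, 30)))
```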
4.2 Limitations
Our framework imposes some constraints, which we now address. The probability technique may magnify the instability of the pre-trained network, since re-training large models is itself unstable; the model should therefore be trained multiple times with a variety of initialisations and selected based on its performance on development data. Chunking text also fragments its context: a sentence severed in the middle loses much of its meaning. In other words, sentence boundaries are not taken into account when the text is broken up, so semantic-aware techniques should be used for segmentation.
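A simple sentence-aware alternative to fixed-size chunking is sketched below; the regex-based sentence splitter and the token budget are assumptions for illustration.

```python
import re

def sentence_aware_chunks(text, max_tokens=128):
    """Split text at sentence boundaries and pack whole sentences into
    chunks, so no sentence is severed mid-way (unlike fixed-size slicing)."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], []
    for sent in sentences:
        if current and len(" ".join(current + [sent]).split()) > max_tokens:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```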
4.3 Omitted Design Elements
The model by itself is not interpretable, and it requires a vast amount of data and powerful computational resources. The vocabulary and lexicon expansion should be carefully selected with reference to the problem. In the absence of domain knowledge and domain relevance, probability-based pseudo-labelling did not produce good results. Uncertainty quantification (UQ) methods play a pivotal role in minimising the impact of uncertainty during both optimisation and decision-making, and UQ is crucial for successful model training, validation, and reproducibility. Finally, a semantically aware statistical framework is essential for comparing model output with observed data in a global parameter search.
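As one example of a lightweight UQ method, Monte-Carlo dropout repeats stochastic forward passes and treats the spread as an uncertainty estimate, e.g. for filtering pseudo-labels; the sketch assumes a PyTorch model with dropout layers that returns a probability tensor.

```python
import torch

def mc_dropout_predict(model, x, passes=30):
    """Monte-Carlo dropout: keep dropout active at inference and average
    several stochastic forward passes; the standard deviation serves as an
    uncertainty estimate for each prediction."""
    model.train()                       # keeps dropout layers stochastic
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(passes)])
    return preds.mean(0), preds.std(0)  # predictive mean and uncertainty
```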