My Tweets Bring All the Traits to the Yard: Predicting Personality and Relational Traits in Online Social Networks

Authors: Dimitra Karanatsiou, Pavlos Sermpezis, Dritjon Gruda, Konstantinos Kafetsios, Ilias Dimitriadis, Athena Vakali

ACM Transactions on the Web (TWEB), Volume 16, Issue 2

Article No.: 10, Pages 1 - 26

https://doi.org/10.1145/3523749

Published: 20 May 2022

Abstract

Users in Online Social Networks (OSNs) leave traces that reflect their personality characteristics. The study of these traces is important for several fields, such as social science, psychology, and marketing. Despite a marked increase in research on personality prediction based on online behavior, the focus has been heavily on individual personality traits, which largely neglects relational facets of personality. This study aims to address this gap by providing a prediction model for holistic personality profiling in OSNs that includes socio-relational traits (attachment orientations) in combination with standard personality traits. Specifically, we first designed a feature engineering methodology that extracts a wide range of features (accounting for behavior, language, and emotions) from the OSN accounts of users. Subsequently, we designed a machine learning model that predicts trait scores of users based on the extracted features. The proposed model architecture is inspired by characteristics embedded in psychology; i.e., it utilizes interrelations among personality facets and leads to increased accuracy in comparison with other state-of-the-art approaches. To demonstrate the usefulness of this approach, we applied our model to two datasets, one of regular OSN users and one of opinion leaders on social media, and contrasted the psychological profiles of the two samples. Our findings demonstrate that the two groups can be clearly separated by focusing on both Big Five personality traits and attachment orientations. The presented research provides a promising avenue for future research on OSN user characterization and classification.

1 Introduction

Online Social Networks (OSNs) offer a virtual space in which people connect and interact with others, express themselves, and receive information, in a continuous digital reflection of the real (offline) world. In OSNs, people typically showcase their real self [40] and leave traces in online behavior, which reflect their real-world personality [24]. These traces expose a holistic image of oneself, including both personal characteristics (personality traits) and characteristics that portray their behavior in relation to others (relational traits).

The term personality refers to characteristic combinations or patterns of behaviors, cognitions, and emotional reactions that evolve from biological and environmental factors and form relatively consistent individual differences [13]. The Big Five (BF) or Five Factor model [29] is one of the most distinctive personality theories. It comprises five main traits of human personality that represent individual differences in cognition, emotion, and behavior: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Relational traits, in turn, have also been linked with consistencies in social behavior and interaction patterns, with attachment theory [7] as the most emblematic theoretical framework in that respect [31, 43], capturing how individuals experience close relationships to and interactions with others.

Personality traits have been studied in the context of OSNs and the web overall, as findings show that they are strongly linked to OSN use [57], online friendships [60], and online reviews [52]. Moreover, certain prediction models have been proposed [37, 64] to extract users’ psychological background from their online behavioral residue and map it to personality characteristics. However, relational traits such as attachment orientations (AO) have been overlooked in online environments, even though user activity in OSNs heavily relates to social behavior characteristics. This makes the study of a relational profile critical from an application point of view and provides rich information about individuals’ social profile.

The present research aims to address this limitation in OSN research by studying and predicting both relational traits and personality traits of users. Relational facets of personality are central to explaining social interaction. Given that online social media engagement resembles actual social interactions in many respects [15, 30], the need to study how different personality facets are reflected in online expression is particularly compelling. Attachment orientations, a key individual difference of relational orientation, can be derived on the basis of traces found in micro-blogs. Attachment orientations capture one’s notion of the self in relation to others and interpersonal relationships, with attachment theory being one of the key personality theoretical frames to explain actual social behavior [31]. Considering both traditional Big Five personality traits and relational traits is important for three reasons: (1) it provides holistic profiling of OSN users, since humans have an integrated profile in which self and social aspects interrelate and affect each other, and joint profiling can be essential for understanding the overall human presence on OSNs; (2) uncovering more traits of people’s psychological and social world has been identified as a direction in OSN research (which currently focuses only on personality traits) that could help to better explain, analyze, and predict online user behavior [66], e.g., with applications in customer segmentation [46] or digital advertisement environments [17]; and (3) it sheds light on social interaction phenomena taking place in OSNs that are of great socioeconomic importance, e.g., community formation [32], decision making [42], or information diffusion [12].

To this end, the present article proposes a novel data-driven approach to predict a holistic psychological profile of OSN users, capturing both their personality and relational traits.¹ Building on insights stemming from psychology theory, our approach applies data mining on OSN textual and non-textual data, carefully selects different sets of features for predicting different types of traits, and exploits the inherent correlations in psychological traits, to efficiently predict a complete image of OSN users’ psychological profile. The proposed approach is applied on the Twitter micro-blogging service, which stands as a live, dynamic, and open OSN platform on which people intensively interact, and is largely driven by people’s spontaneous reactions and emotions expressing themselves (personality facet) and interacting with others (relational facet) at the same time.

Specifically, our contributions in detail are as follows:

•

Data mining and feature engineering for psychology traces in OSN. Motivated by psychology theory on personality suggesting that traits are reflected in different types of online behavior and actions, we identify a large set of features that capture language, behavioral, and emotional expressions of users in OSNs. The proposed feature engineering methodology accounts for a larger set of features than those considered in previous works, thus allowing us to target more generic psychological profiling. To apply and test our methodology, we collected a labeled dataset: through a crowdsourcing platform, we recruited 243 individuals who consented to provide information about their psychology profiles. Subsequently, we compiled a ground-truth dataset labeled with their psychology profiles. We used the Twitter API to collect 350,000 tweets from the Twitter accounts of recruited participants and applied the proposed feature engineering methodology.

•

Holistic psychological profiling. We propose a novel machine learning (ML) methodology to predict users’ holistic psychological profile including both Big Five personality and relational traits. The novelty of the proposed methodology is that it (1) uses a large set of the collected (psychological-related) features, (2) carefully selects the subsets of them with the strongest predictive power for each trait, and (3) exploits correlations between personality and relational (i.e., social) behavior traits to enhance individual trait predictions. In this way, our approach not only predicts social facets of a psychology profile (which is not captured by existing personality prediction models) along with personality facets but also leverages the different traits for more accurate holistic profile prediction.

•

New insights and improvement of prediction accuracy. Evaluating our methodology reveals interesting insights for the prediction of psychology traits from OSN traces: (1) using different sets of features performs better in predicting different psychological traits, (2) relational traits can be predicted as efficiently as personality traits, and (3) holistic personality prediction outperforms individual trait prediction models. We believe that our findings can pave the way for future experimentation and studies in psychology profiling in OSNs. Moreover, the accuracy achieved by our approach (across all traits) is higher than that of current state-of-the-art approaches, which are limited to Big Five personality traits and do not cover relational traits. For example, applying the approach of [12] to our data yields a root mean squared error (RMSE) of 0.284, while our prediction model achieves a 29% improvement for personality traits (RMSE = 0.203) and 32% better average performance when accounting for all traits (RMSE = 0.192); this improvement results from using both a psychology-driven feature engineering methodology and a holistic profiling approach.

•

Psychological profiling in the wild. We demonstrate the applicability of the proposed psychological profiling methodology through a use case. We identify a set of Twitter users who seem to be accepted as opinion leaders on social media (i.e., have a large following). We apply our methodology to predict their psychological profiles and analyze the results. We find that the distributions of traits deviate significantly from those of regular users (defined as users included in our ground-truth dataset), and that the set of leaders can be clearly separated by using only their psychological profiles. These findings highlight the usefulness of our approach in characterizing the personalities of different groups of OSN users (e.g., such a group psychological profile could be used to recommend skills/activities/jobs to users based on their profile similarity) and in classifying users based on their profiles.
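The relative improvements quoted in the third contribution follow from the reported RMSE values. The short check below recomputes them; it is an illustrative sketch, and the helper name `relative_improvement` is our own, not part of the original evaluation code.

```python
def relative_improvement(baseline_rmse: float, model_rmse: float) -> float:
    """Fraction by which model_rmse is lower than baseline_rmse."""
    return (baseline_rmse - model_rmse) / baseline_rmse


baseline = 0.284          # RMSE of the approach of [12] on our data (from the text)
personality_only = 0.203  # proposed model, Big Five traits (from the text)
all_traits = 0.192        # proposed model, average over all traits (from the text)

print(f"Big Five traits: {relative_improvement(baseline, personality_only):.1%}")
print(f"All traits:      {relative_improvement(baseline, all_traits):.1%}")
```

Rounded to whole percentages, the two ratios correspond to the 29% and 32% figures stated above.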

In Section 2, we provide an overview of related psychological literature, discuss related work, and highlight several open issues of existing methods. We also highlight the contribution of the present work. Section 3 details the collected dataset and the data mining and feature engineering methodology, and Section 4 presents the design and evaluation of the proposed machine learning predictive model. Finally, we conclude our article and discuss future work in Section 6.

2 Background and Related Work

In this section, we discuss psychological personality profiling principles (Section 2.1), personality traces in OSNs (Section 2.2), and personality prediction based on these traces (Section 2.3). We discuss outcomes of the most relevant preceding work, to identify (1) the main factors and techniques that can be used in personality prediction from user traces in online social media and (2) gaps in prior state of the art that motivate the proposed approach.

2.1 Personality Principles

In this study we consider two major factors of individuals’ psychological profiles: personality traits of oneself and relational traits of social behavior. The characteristics describing someone’s self are captured by the Big Five model, while attachment orientations describe the dynamics of interpersonal relationships, representing the social self.

The Big Five (BF) model [23] is the most studied structural model that defines five major traits and statistically identified facets that are stable over time [9, 18]:

•

Openness to Experience: the degree of intellect, curiosity, and positive attitude toward new experiences

•

Conscientiousness: the tendency to be organized and self-disciplined

•

Extraversion: the tendency to be outgoing, talkative, and social

•

Agreeableness: the tendency to be cooperative and understanding of others

•

Neuroticism (the opposite pole of emotional stability): the degree of psychological stress

Attachment Orientations (AOs) capture one’s notion of the self in relation to others, with attachment theory being one of the key personality theoretical frames to explain social behavior and interactions [31]. AOs are formed on the basis of interactions with others in early socialization experiences and serve as a key framework for future interpersonal relationships [26, 43]. A dominant attachment orientation is conceptualized in terms of two dimensions: anxiety and avoidance. An anxious orientation involves positive views of others and a less positive view of self, leading individuals to typically seek out others for support and approval, a behavior that is reflected in both proximity and emotional intimacy. In contrast, individuals with a dominant avoidant orientation typically hold a negative view of others and therefore distance themselves from relationships with others. Although previous studies have found a correlation between AOs and BF personality traits [62], attachment orientations account for unique variance with regard to interaction and relationships with others [35] and only moderately overlap with the BF personality dimensions [62]. The present study examined attachment orientations in relation to BF personality characteristics, in line with the view that it is the combination of attachment orientations and personality traits that forms a more complete and accurate representation of individuals’ self-expression and their interactions with others online. To the best of our knowledge, this is the first study that accounts for both dimensions (see Table 1 for a comparison).

Table 1. Comparison of This Study against Alternatives

Study	Personality Model: Personality Traits (Big Five)	Personality Model: Relational Traits (Attachment Orientations)	Features: Language	Features: Behavior	Features: Emotion
[12]	✔		✔
[4]	✔		✔
[72]	✔		✔
[36]	✔		✔
[54]	✔			✔
[1]	✔			✔
[71]	✔		✔	✔
[22]	✔		✔	✔	(✔)
[38]	✔		✔	✔	(✔)
[80]	✔		✔		(✔)
[70]	✔		✔	✔
[74]	✔			✔
[67]	(✔)			✔
[48]		(✔)	✔
This study	✔	✔	✔	✔	✔

Therefore, in this article, we study the psychological profiles of social media users in a holistic way, by considering both these major dimensions of psychological profiles.

2.2 Personality Traces in OSNs

Recent years have witnessed an increasing research interest in how personality characteristics are reflected in the behavioral traces individuals leave on social media platforms. The Big Five traits have been repeatedly found to be reflected in self-expression and behavior in OSNs [24, 53]. Recent work linking Big Five traits with language use in social networks has brought about several insights regarding linguistics, grammar and syntactic features, and discussed topics [49, 59]. For example, individuals who score higher on openness express themselves more frequently with emotion-rich and expressive language, and highly neurotic individuals tend to express more negative emotions and stress in their language. As for relational traits and social media behavior, avoidant individuals tend to think more often about removing their social media profile and are more likely to dissolve social ties first, for instance, unfollowing or unfriending someone on social media platforms [21]. Conversely, anxious individuals tend to spend more time on social media, are concerned about their public image, and use the medium to seek comfort [47]. In addition, evidence retrieved from social studies shows that AOs are also linked to linguistic and emotional expressions in written communication, as both anxious and avoidant AOs are correlated with number of expressed words, affective concepts, structural element use, and language patterns in general [10]. For example, the unwillingness of avoidant individuals to speak about themselves and their relationships is reflected in the way they use language, as well as their need to minimize the expression of their feelings [11]. On the other hand, highly anxious individuals talk more and frequently express anger in their discussions [75], using more swear words and emotion words in general, reflecting their tendency to intensify emotion in language use [11].

The evidence portrayed in the literature discussed above suggests that social media are transformed into rich and direct sources of behavioral patterns [33]. In particular, extracted motifs and features related to language, sentiment, and behavioral patterns can reflect individuals’ personality and attachment-related traits, which can be used as a proxy for personality prediction [11, 33, 37, 48]; our approach combines all these dimensions of information to create a large set of features (see Section 3.3) that are then used for the psychological profiling of OSN users.

2.3 Personality Prediction in OSNs

Several studies predict personality traits in OSNs [1, 4, 12, 22, 36, 38, 54, 67, 71, 72]. In Table 1 we summarize the state-of-the-art approaches and present an overview of their characteristics with respect to what personality traits they predict (“personality model”) and the type of data they are based on (“features”). Compared to these works, which we discuss in more detail below, our work provides a holistic personality prediction by targeting personality traits and social behavior traits leveraging a large set of textual and non-textual features.

Use of language in online social media provides information that relates to users’ psychological profile, and thus is used by the majority of existing approaches in personality prediction [4, 12, 36, 72]. For example, language features were used by [72] to predict Big Five traits on social network environments, while [12, 36] and [4] focused on the information contained in the text of the tweets that reflect personality.

Following a different approach, [1, 54] relied on users’ online behavior to predict personality traits. [54] studied personality and Twitter use, including influencers’ number of followers, and demonstrated that publicly available information alone, namely the number of statuses, number of followers, and number of lists reflecting user behavior, can be used to predict personality. In turn, [1] defined six different categories of behavioral features and demonstrated that behavior proxies can be as successful as text and language features for personality prediction. Moreover, [67] followed a different approach based on acquaintances’ evaluations of individuals in order to build a personality classifier based on behavioral cues from Facebook profiles.

Both language and behavior features have been used by [22, 38, 70, 71]. Deep learning architectures are implemented in the approaches of [71, 74, 80] to address personality prediction challenges on Facebook user data, using different text, behavioral, or sentiment features that are fed into artificial neural networks. [22] and [38] also take sentiment scores into consideration, thus including some aspects of emotion; however, their main emphasis remains on language use and behavior.

All of the above approaches consider personality with regard to BF personality traits, while only [48] examined AO. In particular, they consider perceivers’ tracking accuracy in predicting other Twitter users’ personality based on users’ last 10 tweets and language expression. However, the approach of [48] is based on observers’ personality ratings and not on self-reported data. With regard to BF personality traits and other personality traits such as state and trait anxiety [25], we agree that observer ratings can be highly beneficial in both detection and accuracy. AOs, however, are relational traits, which refer to how an individual tends to relate to others and approach relationships with others, and prior research has shown that when it comes to relational traits, individuals oftentimes engage in projection [26]. This likely means that observers would not be detecting others’ relational traits as much as projecting their own relational traits and behaviors onto others. Therefore, we argue that in the case of relational traits, personality prediction on the basis of self-reported measures is superior to observer-rated measures, and our approach is the first to predict relational traits on the basis of self-reported AO scores. To the best of our knowledge, no study has attempted to predict both personality and relational traits of online social media (e.g., Twitter) users. We provide an integrated personality prediction, and apart from predicting both types of personality traits, we exploit the correlations between traits (see Section 3.4) as a key component of our predictive methodology. Moreover, we leverage all available types of features, namely language, behavior, and emotion, by taking a deeper look at the latter compared to previous approaches [22, 38] (see Section 3.3). As we show in Section 4.2, our holistic approach leads to more accurate personality profile predictions.

3 Harvesting Personality Traces in Twitter

In this section we present the data collection and analysis methodology we follow, which is used to feed our personality prediction models (Section 4). We first describe the Twitter dataset we collected, labeled with personality scores by the users themselves (Section 3.1). Then, we pre-process the collected data (Section 3.2), extract meaningful features for our analysis and modeling (Section 3.3), and conduct an initial data exploration analysis to characterize the quality of the collected dataset and obtain insights for the design of the prediction model (Section 3.4). Figure 1 depicts an overview of our research methodology.

Fig. 1. Overview of the research methodology.

3.1 Dataset Building

We collected a dataset by recruiting 243 participants (41.44% female and 58.15% male) on Amazon’s Mechanical Turk, who provided information about their psychological profile (which was used to “label” the dataset) and their public Twitter accounts (which were used to extract the features). Participants were on average 38.96 years old (std = 10.96) with an average work experience of 15.37 years (std = 10.26). To minimize differences due to culture, we restricted our sample to participants from the United States. In addition, to lower the likelihood of MTurk workers clicking through surveys randomly, we (1) included various pre-tested attention check questions, comprising both direct and bogus items according to best practices [16], (2) used time screens to prevent participants from answering items too quickly [76], and (3) admitted only participants who had a high approval rating on the MTurk platform.

Psychological profiles. Participants completed a survey by answering questions related to the BF and AO models, as well as by providing some demographic information. Specifically, the Big Five personality dimensions of the participants were assessed using the 20-item mini-IPIP [18], which measures Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism on a 5-point Likert scale (1 = “Very inaccurate” to 5 = “Very accurate”). This scale has been shown to be consistent across studies and has good test-retest reliability, demonstrating a comparable pattern of convergent, discriminant, and criterion-related validity with other Big Five measures [18]. Attachment orientations (anxiety and avoidance) were assessed with the “Experiences in Close Relationships” scale for adult attachment [63]. This scale has become the norm in attachment-related research [26] and has been applied widely as well [43]. Participants completed a 36-item Likert scale questionnaire assessing their general behavior in close relationships, from which scores on the anxiety and avoidance dimensions were extracted. In summary, we acquired personality labels in order to build a ground-truth dataset, which is later used for designing personality prediction models. We normalized each variable to a common scale [0,1] in order to examine correlations among behavioral characteristics and feed the machine learning models.
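As an illustration of how item responses can be turned into a trait score on the common [0,1] scale, consider the minimal sketch below. It rests on several assumptions that the text does not specify: a trait score is taken as the mean of its items, rescaling uses the bounds of the response scale (rather than the sample minimum and maximum), and the positions of reverse-keyed items are hypothetical. The function names are ours.

```python
from statistics import mean


def rescale(score: float, low: float, high: float) -> float:
    """Map a score from the interval [low, high] onto [0, 1]."""
    return (score - low) / (high - low)


def trait_score(item_responses, low=1, high=5, reverse_keyed=()):
    """Average the item responses of one trait and rescale to [0, 1].

    reverse_keyed holds the 0-based positions of items whose scale is flipped
    before averaging (hypothetical positions in the example below).
    """
    adjusted = [
        (low + high - r) if i in reverse_keyed else r
        for i, r in enumerate(item_responses)
    ]
    return rescale(mean(adjusted), low, high)


# Hypothetical responses of one participant to the four items of one trait
# (the 20-item mini-IPIP has four items per Big Five trait).
extraversion_items = [4, 2, 5, 1]
print(trait_score(extraversion_items, reverse_keyed={1, 3}))
```

The same helper applies to the attachment dimensions by passing the bounds of that questionnaire’s response scale as `low` and `high`.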

Remark: Although the methodology we use to infer the personality traits has been shown to be reliable by previous studies, we performed statistical analyses to further verify its reliability in our data. First, we calculated for each trait the Cronbach’s alpha value from the values of the survey items involved. We present the results in Table 3, where we observe high Cronbach’s alpha values (i.e., high reliability) for all traits. Moreover, we use confirmatory factor analysis (CFA), a statistical technique used to assess the fit between observed and prior conceptualized, theoretically grounded models [65]. Using principal components factor analysis with oblique rotation, which accounts for possible correlation between factors, for all 36 attachment items and 20 Big Five personality items, we found that relevant factor loadings and communalities were all above 0.3, confirming that each item shared some common variance with other items. Except for anxious attachment and neuroticism, there was no high cross-loading between variables, which means that the seven variables are largely distinct from one another. In addition, all items loaded onto the seven identified factors. Given these overall indicators, factor analysis was deemed suitable for all examined items, allowing us to successfully replicate factor loadings comparable to prior research, namely a two-factor attachment structure [56] and a five-factor Big Five personality trait structure [39].
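For reference, Cronbach’s alpha for a trait measured by \(k\) items is \(\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)\), where \(\sigma_i^2\) is the variance of item \(i\) and \(\sigma_T^2\) is the variance of the summed score. The sketch below computes it with the Python standard library; the response matrix is made up for illustration and is not taken from our dataset.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one trait.

    item_scores: list of rows, one row per participant and one column per item.
    """
    k = len(item_scores[0])
    columns = list(zip(*item_scores))
    sum_item_variances = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical responses of five participants to the four items of one trait.
responses = [
    [4, 5, 4, 5],
    [2, 1, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(responses), 3))
```

Values close to 1 indicate that the items of a trait vary together, which is the sense in which high alpha values are read as high reliability.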

Twitter accounts. Participants provided their Twitter account user name, which we used to retrieve their social profile. We manually verified the existence of the accounts and excluded from our dataset participants with private accounts, inactive accounts (i.e., with no tweets history), and false accounts. For each of these accounts, using the Twitter API, we collected user profiles, including both tweets history and profile metadata, such as number of followers, statuses, lists, etc.

3.2 Pre-processing Phase

We conducted an initial pre-processing of the data for (1) removal of spam users and tweets and (2) text cleaning, in order to remove factors not related to real users’ psychological profile.

Spam removal. We adopted the approach of [73] to remove spam users. As spam users we identified those who employ a large number of similar posts and a large number of hashtags. After a manual inspection, we observed that most tweets with five or more hashtags related to repetitive and exclusive content about contests or games. For instance, in cases where an individual uses Twitter for certain reasons only, i.e., participation in contests and giveaways or promoting certain blogs, their tweet history contains similar posts. These users were removed from our dataset. In a similar manner we removed only spam tweets and not the complete user profile, in case a tweet contained five or more hashtags but did not fall into the categories of repetitive and promotional tweets. Moreover, to avoid considering users with a non-typical platform behavior, we removed “ghost users” that have no tweets history at all (number of statuses) or have very few interactions (number of followers). As a result, our final dataset included 229 users with almost 350,000 tweets.
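The filtering rules can be approximated programmatically as in the sketch below. This is only an illustration under stated assumptions: the five-hashtag threshold comes from the text, whereas the near-duplicate test and the `DUPLICATE_SHARE` cutoff are hypothetical stand-ins for the manual inspection described above, and all function names are ours.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")
MAX_HASHTAGS = 5        # threshold stated in the text
DUPLICATE_SHARE = 0.5   # hypothetical cutoff for "a large number of similar posts"


def is_hashtag_heavy(tweet: str) -> bool:
    """True if the tweet contains five or more hashtags."""
    return len(HASHTAG.findall(tweet)) >= MAX_HASHTAGS


def canonical(tweet: str) -> str:
    """Crude canonical form used to spot near-identical posts."""
    stripped = re.sub(r"https?://\S+|@\w+|#\w+", " ", tweet.lower())
    return " ".join(re.findall(r"[a-z]+", stripped))


def filter_user(tweets):
    """Return None for a spam or ghost user, else the tweets without spam tweets."""
    if not tweets:
        return None  # "ghost" account with no tweet history
    heavy = [t for t in tweets if is_hashtag_heavy(t)]
    most_common = Counter(canonical(t) for t in heavy).most_common(1)
    repeated = most_common[0][1] if most_common else 0
    if repeated / len(tweets) >= DUPLICATE_SHARE:
        return None  # repetitive, hashtag-heavy history: drop the whole user
    return [t for t in tweets if not is_hashtag_heavy(t)]
```

A user whose history is dominated by repeated contest posts is dropped entirely, while an isolated hashtag-heavy tweet is removed without discarding the user.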

Text preprocessing. We parsed the collected tweets and filtered out all punctuation, digits, and Unicode characters, as well as user mentions, retweet identifiers, and external links (URLs). The most common words used in language, known as “stopwords” (e.g., “that,” “be,” “to,” etc.), were removed to keep only meaningful word units for analysis. After this step, a word tokenizer was used to split the text corpus into single-word units, which serve as inputs to other tasks such as lemmatization and part-of-speech tagging.
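A minimal sketch of such a cleaning pipeline is given below, using only regular expressions. Lowercasing, the whitespace tokenizer, and the tiny stopword list are assumptions made for the example; a full stopword list and a dedicated tokenizer would be used in practice.

```python
import re

# Tiny illustrative stopword list (an assumption; not the list used in the study).
STOPWORDS = {"that", "be", "to", "the", "a", "is", "and", "of", "in"}


def clean_and_tokenize(tweet: str):
    """Strip links, mentions, retweet markers, and non-letters, then tokenize."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # external links (URLs)
    text = re.sub(r"@\w+", " ", text)          # user mentions
    text = re.sub(r"\brt\b", " ", text)        # retweet identifier
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation, digits, non-ASCII characters
    return [tok for tok in text.split() if tok not in STOPWORDS]


print(clean_and_tokenize("RT @bob: Going to the beach in 2022!! http://t.co/xyz #sun"))
```

Note that the hashtag sign is removed as punctuation while the hashtag word itself is kept as a token.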

3.3 Features Engineering Phase

We extracted three types of features, related to behavior, language, and emotion. These feature sets are utilized in order to feed the machine learning algorithms as representations of online personality and relational psychological traces.

3.3.1 Behavioral Features.

We extract a set of attributes (behavioral features) that relate to users’ profile information and activity on Twitter. Table 2 presents all extracted features that reflect users’ expressivity (eight features), interactivity with other users (four features), and platform use (three features). These 15 features, after being normalized on a common scale [0,1], make up a vector for each user that mirrors users’ overall behavior on Twitter.

Table 2. Behavioral Attributes

Category	Behavioral Attribute
Expressivity	#statuses a user has posted
	#statuses divided by the days the user is in the service
	#external links a user has shared in the platform
	#hashtags a user used to participate in public conversations
	#characters a user uses to describe themselves
	avg character length of user’s tweets
	avg #words a user uses in their tweets
	avg #uppercase words in tweets, reflecting intense emotions
Interactivity	#followers a user has
	#lists a user appears on
	#times a user has mentioned other Twitter users
	#times a user has retweeted other users’ content
Platform use	#days an individual uses the service
	#items a user finds interesting in the platform
	#characters a user uses as displayed name
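Bringing the behavioral attributes onto a common [0,1] scale can be sketched as a column-wise min-max normalization, shown below. Min-max scaling across users is an assumption, since the text states only the target range; the three example users and their values are hypothetical.

```python
def min_max_normalize(rows):
    """Rescale every column of a user-by-feature table to [0, 1].

    rows: list of equal-length feature lists, one per user.
    A column with a single distinct value is mapped to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical users described by three of the 15 attributes:
# (#statuses, #followers, #days in the service)
users = [[120, 30, 400], [5400, 900, 3200], [860, 210, 1800]]
for vector in min_max_normalize(users):
    print([round(v, 3) for v in vector])
```

Each resulting row plays the role of the per-user behavioral vector described in Section 3.3.1.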

3.3.2 Language Features.

To extract the best possible features that could reveal patterns of language use, we employed an open vocabulary approach; in this way, we avoid depending on pre-defined lexicons, which do not account for the unintentional language use present in the actual text of the tweets. For each user we extracted three different language vectors: TF-IDF, N-grams, and part-of-speech (POS) or syntactic vectors.

•

TF-IDF vectors reflect the importance of a word to a document in a collection [58]. Considering the set of all tweets in our dataset and all unique words contained within them (after the pre-processing task), each tweet is represented as a vector whose entries combine the frequency of each term in the document with the relevance of that term across the collection. The more often a word appears in a document, the more significant it is estimated to be [2], while words that occur in many documents are weighted down.

•

N-grams (Bigrams for N = 2 and Trigrams for N = 3) represent the phrasal language expression offering rich and powerful features. N-grams produce the frequency vector of any sequence of \(N\) (contiguous) words in a text, also known as phrases [55].

•

Syntactic features: Considering all the dataset’s unique words after the preprocessing task, we use a function that assigns a POS tag in each word based on the pre-trained model of [6]. In our analysis, we use three different syntactic features: frequency of (1) single POS tag vectors, (2) POS tag bigrams, and (3) trigrams. The motivation for using POS is the important associations between the POS use and personality [41, 78]. In particular, [41] noted that linguistic orientation is more consistently described by its syntactic component, reflected by the different use of POS, while [78] demonstrated the potential of POS N-grams as predictive features for the Big Five traits, based on the fact that syntactic features are not controlled so consciously as the use of individual words.
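The first two language vectors can be illustrated with a small standard-library sketch. It uses one common TF-IDF variant (raw term frequency times \(\log(N/\mathrm{df})\)); the exact weighting scheme used in the study is not specified here, so this choice is an assumption, as are the function names. POS tagging is omitted because it requires a pre-trained tagger, but the same `ngrams` helper applies unchanged to a sequence of POS tags.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences (n=2 gives bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(documents):
    """TF-IDF weights for a list of tokenized documents.

    Returns one {term: weight} dict per document, where
    weight = term count in the document * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return [
        {term: count * math.log(n_docs / doc_freq[term])
         for term, count in Counter(doc).items()}
        for doc in documents
    ]


docs = [["good", "coffee", "good"], ["bad", "coffee"]]
print(tfidf(docs))
print(Counter(ngrams(["i", "love", "good", "coffee"], 2)))
```

In this toy collection, "coffee" occurs in every document and therefore receives a weight of zero, whereas the repeated, document-specific "good" receives the largest weight.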

Feature selection and dimensionality reduction. Different combinations of feature subsets were tested using grid search. Moreover, the vectors of language features are sparse and much larger than the other vectors (i.e., behavior and emotion features) taken into account in our analysis. Hence, to reduce over-fitting and improve the generalization of models, we apply a dimensionality reduction of the language vectors. In particular, for Tfidf, N-grams, and POS vectors, we follow a univariate feature selection approach by computing the correlation between the regressors and the target variables and selecting the 100 best features emerging for each category.
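The univariate selection step can be sketched as follows: score every feature column by the absolute value of its Pearson correlation with the target trait and keep the top-scoring columns. This is a minimal illustration under assumptions; the helper names `pearson` and `select_k_best` and the toy data are hypothetical, and the sketch keeps k = 2 features where the analysis above keeps 100 per category.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sd_x * sd_y)


def select_k_best(X, y, k):
    """Return the indices of the k columns of X with the largest |r| to y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): 4 users, 3 language features, one trait score each.
X = [
    [1.0, 5.0, 0.0],
    [2.0, 3.0, 0.0],
    [3.0, 6.0, 0.0],
    [4.0, 2.0, 0.0],
]
y = [0.1, 0.2, 0.3, 0.4]

selected = select_k_best(X, y, k=2)  # keeps the two most correlated columns
```

Because each feature is scored independently of the others, this kind of selection is cheap even for very wide, sparse language vectors, at the cost of ignoring interactions between features.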

3.3.3 Emotion Features.

As emotion features, we aimed to predict individuals’ emotional expression in terms of certain emotions and not in terms of general sentiment scores. In particular, we use a vector of six items representing the six primary emotions [19]: joy, sadness, anger, disgust, fear, and surprise. We need to extract specific emotions not only to be used as inputs to the personality prediction model but also to study the relationship among personality and emotions and provide new insights. To extract these emotional values from the tweets of the users, we follow the hybrid approach for emotion detection of [14], which predicts emotions in a text based on its textual attributes, use of emojis, and sentiment scores. Specifically, using the SemEval-2018 dataset [44] as ground truth, we trained a Support Vector Machine (SVM) multilabel emotion classifier to annotate the users’ tweets in our dataset. For each user, we aggregate the emotion values, and the emotion feature vector output is used in our methodology.

Textual attributes: We create for each user a vector representing the six primary emotions, based exclusively on the text of the user’s tweets. To extract the values of this vector, we follow a text processing methodology for emotion detection: we use WordNet-Affect, a popular lexical database for emotion detection [69], which assigns a variety of affect labels to a subset of the synonym sets (synsets) in WordNet. We extend the word list of our dataset using the WordNet-Affect synsets of each word included in the representative set of words. After assigning emotions to each word using scores retrieved from WordNet-Affect, we create the emotion frequency vector.

Emoji-based attributes: In addition to the text-based emotional vector, we create another vector (of six items), in which we detect the use frequency of affective emojis. We employ the affective emojis list of [77] that states groups of emojis related to the six primary emotions. We map each emoji found in our text data with the primary emotion it is linked with and we create the affective emoji frequency vectors.

Sentiment scores: Finally, we create a vector of two items corresponding to negative and positive sentiment scores, based on sentiment values extracted from the WordNet-Affect lexicon, which provides sentiment scores for concepts in addition to affective classification. To produce a total sentiment score for each user, we sum word sentiment scores at the tweet level and compute the average at the user level.
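The two-level aggregation (sum over words within a tweet, then average over a user's tweets) can be sketched as below. The word scores in `LEXICON` are hypothetical placeholders rather than values from the actual lexicon, and splitting the signed word scores into separate positive and negative totals is an assumption about how the two-item vector is formed.

```python
from statistics import mean

# Hypothetical word-level sentiment scores; real values would come from the lexicon.
LEXICON = {"love": 0.8, "great": 0.6, "hate": -0.9, "bad": -0.5}


def tweet_scores(tweet, lexicon):
    """Sum positive and negative word scores of one tweet; unknown words score 0."""
    scores = [lexicon.get(word, 0.0) for word in tweet.lower().split()]
    positive = sum(s for s in scores if s > 0)
    negative = sum(-s for s in scores if s < 0)
    return positive, negative


def user_scores(tweets, lexicon):
    """Average the tweet-level (positive, negative) sums over a user's tweets."""
    if not tweets:
        return 0.0, 0.0
    per_tweet = [tweet_scores(t, lexicon) for t in tweets]
    return mean(p for p, _ in per_tweet), mean(n for _, n in per_tweet)


positive, negative = user_scores(["Love great", "hate it"], LEXICON)
```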

We proceed to combine the information in the collected attributes and create a single vector of six emotion features for each user, after normalization to a common scale [0,1]. To infer the emotional vector from the collected attributes, we first train a predictive machine learning model on the SemEval-2018 dataset [44] and then apply it to our users. Specifically, we split the SemEval-2018 dataset into 80%-20% train-test sets and trained an SVM emotion classifier, which achieved a precision score of 0.85. Then, in our dataset, we aggregate the aforementioned features, along with a Tfidf vector, and feed them to the SVM model, which outputs the emotion vector for each user. This hybrid approach has been shown to be more robust compared to text-only or emoji-only emotion detection [14]. Moreover, by using a valid and high-quality dataset as ground truth to build an emotion model, instead of relying directly on emotion signals such as sentiment scores or emojis, we transfer knowledge from another source, which improves the detection of specific emotions.

3.4 Exploratory Data Analysis

In this section we conduct an exploratory analysis of the collected data (self-reported psychological traits and extracted features) to validate to what extent our dataset is representative of and spans different psychological profiles, and obtain insights that can help toward designing an efficient predictive methodology.

Distribution of psychological profiles. We calculated the distributions for each of the (self-reported) psychological traits contained in our dataset and present their statistics in Table 3. As expressed in the min and max values for each trait, our dataset contains samples that span the entire range of psychological profiles (note that, before normalization, relational traits were scored on a scale of 1 to 7, while personality traits were scored on a scale of 1 to 5). The mean value for most traits is around the middle of the respective range of values, with Openness and Agreeableness being considerably skewed toward higher values and Neuroticism toward lower values. Trait variability (std) was higher in the case of AO traits, Extraversion, and Neuroticism; traits with more variability are expected to be more challenging to predict (which is also validated by our results; see Section 4.1).

Correlations between psychological traits (and the motivation for a holistic psychological profiling approach). Figure 2 depicts the correlations (Pearson’s coefficient \(\rho , \rho \in [-1,1]\)) among the various traits in our dataset. There are significant positive/negative correlations between traits. Specifically, 17 out of the 21 trait pairs have a coefficient \(|\rho |\gt 0.2\). Moreover, the highest correlations are between anxious attachment and Neuroticism (\(\rho\) = 0.74), and avoidance and Agreeableness (\(\rho\) = \(-0.53\)); both these pairs contain a BF trait and an AO trait. These observations motivate us to consider a holistic psychology profile prediction approach, which accounts for inter-trait correlations and allows us to predict both types of traits at the same time with a single multi-output model (see Section 4.2).

Fig. 2.

Table 3.

		M	Std	Min	Max	Cronbach’s alpha	1	2	3	4	5	6	7
1.	Anxiety	3.08	1.44	1.00	6.83	.96	(0.96)
2.	Avoidance	3.72	1.25	1.00	6.88	.95		(0.95)
3.	Openness	4.10	0.78	1.00	5.00	.73			(0.73)
4.	Conscientiousness	3.72	0.88	1.00	5.00	.80			0.12	(0.80)
5.	Extraversion	2.69	1.09	1.00	5.00	.88			0.24***	0.25***	(0.88)
6.	Agreeableness	3.88	0.86	1.00	5.00	.83			0.44***	0.04	0.27***	(0.82)
7.	Neuroticism	2.45	1.06	1.00	5.00	.85			\(-\)0.23***	\(-\)0.52***	\(-\)0.41***	\(-\)0.17*	(0.85)

Table 3. Statistics and Correlations among the Main Reported Variables

Reported correlations refer to traits considered for our ground-truth dataset; n = 243. ***p < .001. **p < .01. *p < .05. Cronbach’s alphas are presented in the matrix diagonal.

Correlations between psychological traits and OSN-extracted features. We proceed to examine the correlations between psychological traits and the online behavioral traces of OSN users that are captured by the extracted features (Section 3.3). As for language use, correlations arose from linguistic categories that exist in the Linguistic Inquiry and Word Count [51]. Details about the correlational analysis can be found in the appendix (see Table 9). The main findings show that our sample is representative of specific behavioral, language, and emotional indicators linked with personality from a psychological aspect and expression and OSN use. Moreover, we observe a lack of strong correlations among individual features and traits, since the vast majority of correlations remain below 0.20, while 37.8% of them are below 0.15. This indicates the need for complex features to capture patterns among personality traits and expression in OSN.

In particular, with regard to AO on the basis of online behavior, the language use of anxious individuals seems to be more personal and self-revealing. This is in line with findings that anxious attachment is associated with higher authenticity and higher self-disclosure in social interactions, which fulfills anxious individuals’ need for increased proximity seeking. On the other hand, avoidant individuals tend to express a negative view of others and are more likely to see themselves in a more favorable light, as they have learned not to depend on others. We also found that avoidant persons are more likely to use linguistic features that suggest a social comparison. In addition, and in keeping with previous findings [11], highly avoidant persons seem to be more likely to use objective words and fewer emotion-related words.

As for BF traits with regard to emotion features, we found an expected relationship between Neuroticism and the expression of sadness, while highly extraverted participants were less likely to express fear and highly agreeable participants were more expressive of joy. These associations are in line with previous research on the BF traits and social media use [59, 79]. Finally, with regard to language features, most associations were found with regard to extraverted participants, such as a lower likelihood for such individuals to express authenticity, likely due to their tendency toward expressing positive emotions and feelings [59, 79].

We provide details on the correlations between psychological traits and OSN-extracted features regarding effect sizes and significance as well as psychological interpretation in the appendix.

4 Personality Prediction

In this section, we proceed to design a model that predicts the user personality traits from the extracted features (see Figure 1). In the following, we first provide the overview of our approach. We then present and discuss our design choices step by step, along with evaluation results (i.e., prediction accuracy) and comparison with state-of-the-art approaches. Our approach combines expert knowledge from the psychology domain and advanced statistical analysis techniques.

Machine learning approach. The analysis in Section 3.4 reveals that no individual behavioral, emotion, or language metric has a strong correlation with any of the individual personality traits. This indicates that simple statistical models (e.g., analytic models or linear models with a few features) are not expected to predict the personality profiles accurately; we experimentally verified this intuition as well. Hence, in our approach we select to use advanced statistical models, i.e., ML models, which have proved efficient in capturing complex statistical patterns and relations between features and target variables.

Regression, training strategy, and evaluation metric. The scores for the personality traits retrieved from the survey are continuous values in the interval [0,1].² Hence, we select to use regression models that correspondingly predict a (continuous) value for each trait. Before training the models described below, we first split our dataset into 80%-20% train-test sets. The reported performance of our model refers to the errors on the test sets; however, errors on the training sets are also provided.

To measure model performance, we use the RMSE:

\begin{equation*} RMSE = \sqrt {\frac{1}{n}\sum \limits _{i=1}^{n}{(y_{i} -\widehat{y_{i}})^2}}, \end{equation*}

where \(n\) is the number of samples, \(y_{i}\) is the actual trait value for sample \(i\), and \(\widehat{y_{i}}\) is the corresponding prediction of the model. RMSE is a common metric for regression problems and reflects both the model’s variance and bias.
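The RMSE definition translates directly into code. The following minimal Python sketch implements the formula as written; the function name `rmse` and the example trait scores are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted trait scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


# Hypothetical normalized trait scores in [0, 1] and model predictions.
actual = [0.0, 0.5, 1.0]
predicted = [0.0, 0.5, 0.5]
error = rmse(actual, predicted)  # sqrt(0.25 / 3)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small misses of the same total size.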

Base regression model. We tested different regression models (linear, support vector, Gaussian, random forest, etc.) for predicting individual personality traits from the extracted features. The random forest (RF) regressor³ performed consistently better for the majority of the personality traits, and thus is selected as the base regression model in our analysis. We conducted a grid search for the tuning of the hyperparameters of the RF regressors, which led to 100 trees.

Single personality trait prediction and best-performing features (Section 4.1). Using the extracted features, we trained RF regressors to predict each personality trait individually, similarly to previous approaches. However, as discussed earlier we (1) use a larger set of features and (2) provide predictions both for BF and AO traits (the latter have not been considered before in a self-labeled dataset). We find that using all the extracted features is not optimal for all traits, and that no single set of features performs best for all traits. We selected the (sub)set of features for each trait that leads to higher accuracy, built the respective RF regression models, and studied their performance.

Holistic personality prediction with multi-output regression chain models (Section 4.2). As can be seen in Figure 2, and is supported by related literature [28, 50], there are significant correlations between personality traits. Motivated by this observation, we propose a single multi-output model that predicts all personality traits at the same time. The proposed model uses chains of RF regressors (see Figure 3) and exploits the correlations between traits to achieve higher accuracy compared to individual trait predictive models.

Fig. 3.

4.1 Single Personality Trait Prediction

We first use RF regressors to predict each trait individually, similarly to previous approaches. For each trait, we build and train an RF regressor that receives all features, as well as RF regressors that receive a subset of the extracted features. We do so because using a smaller set of features may help to avoid over-fitting and improve predicting accuracy in relatively small datasets; in a sufficiently large dataset, selecting all features would be optimal.

\(\rightarrow\) “Using different sets of features performs better in predicting different personality traits”: In Table 4 we present the set of features that best predicted each trait. Using all features is not always the best choice; neither does there exist a single (sub)set of features that performs best for all traits. In particular, for the AO traits (anxiety and avoidance; top two rows), the best accuracy is achieved by using only language features, at a phrasal level (i.e., N-grams) for anxiety and at a syntactic level (i.e., POS) for avoidance. With respect to BF traits (bottom five rows), language features perform best for openness and agreeableness (at syntactic and word level, respectively), all features combined provide better predictions for Conscientiousness and Extraversion, and emotion features provide higher predictability capacity in the case of Neuroticism.

Table 4.

Trait	Best-Performing Feature Set	RF Regression RMSE (Test Set)	RF Regression RMSE (Training Set)	Baseline Model RMSE
Anxiety	Language (N-grams)	0.216	0.190	0.471
Avoidance	Language (Syntactic)	0.181	0.180	0.225
Openness	Language (Syntactic)	0.158	0.142	0.179
Conscientiousness	All features	0.176	0.180	0.244
Extraversion	All features	0.221	0.191	0.262
Agreeableness	Language (Tfidf)	0.221	0.200	0.252
Neuroticism	Emotion	0.228	0.200	0.275
Average		0.200	0.183	0.273

Table 4. Single Trait Prediction: Best-Performing Features and Predictive Accuracy (in RMSE) of the Proposed RF Regression Models and a Baseline Model

For completeness, we state in Table 4 the RMSE of a baseline/dummy model that always predicts the mean value of the distribution of each trait. Our model achieves 27% lower RMSE than the baseline model. Table 5 provides the average RMSE of the single trait prediction model (top row) when the same set of features is used for all RF regressors. As we can see, when the feature selection is not tuned per trait, the RMSE is significantly higher (and in some cases close to the baseline accuracy), which highlights the benefits of our feature selection approach.

Table 5.

Model	Best Features (Test/Training Set)	All Features (Test/Training Set)	Language (Test/Training Set)	Emotion (Test/Training Set)	Behavior (Test/Training Set)
IM	0.200/0.182	0.266/0.173	0.273/0.270	0.270/0.214	0.266/0.194
HM	0.192/0.179	0.258/0.170	0.263/0.240	0.270/0.220	0.258/0.200

Table 5. Prediction Accuracy (in RMSE) of the Independent Model (IM) and the Holistic Model (HM) for Different Sets of Features

\(\rightarrow\) “Relational traits can be predicted as efficiently as personality traits”: Apart from different best-performing features, we observe that the prediction accuracy differs among the different traits as well. For instance, Openness is the easiest trait to predict (RMSE = 0.158), while Neuroticism is the most difficult (RMSE = 0.228). These differences in prediction difficulty among traits are in line with previous findings [12, 22, 49, 64]. When it comes to relational traits (AO), we observe that the achieved accuracy is within the range of accuracy for BF traits. This shows that the proposed feature engineering methodology, when combined with a single-trait RF regressor, can be used to efficiently predict both relational and personality traits. To our best knowledge, this is the first time that a common methodology is applied to both types of traits (cf. Section 2.3 and Table 1). In fact, compared to the baseline model, we can see that in our dataset the prediction of AO traits is more challenging: the baseline model achieves an RMSE = 0.348 for AO traits and RMSE = 0.242 for BF traits, and the improvement by our model is 43% (RMSE = 0.199) for AO traits and 17% (RMSE = 0.201) for BF traits.
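The group averages and relative improvements quoted above follow from the per-trait test-set values in Table 4. The short Python sketch below recomputes them; the RMSE values are copied from Table 4, while the helper name `group_summary` is illustrative.

```python
from statistics import mean

# Test-set RMSE values copied from Table 4: (proposed RF model, baseline model).
TABLE4 = {
    "Anxiety": (0.216, 0.471),
    "Avoidance": (0.181, 0.225),
    "Openness": (0.158, 0.179),
    "Conscientiousness": (0.176, 0.244),
    "Extraversion": (0.221, 0.262),
    "Agreeableness": (0.221, 0.252),
    "Neuroticism": (0.228, 0.275),
}
AO_TRAITS = ("Anxiety", "Avoidance")


def group_summary(traits):
    """Return (mean model RMSE, mean baseline RMSE, relative RMSE reduction)."""
    model = mean(TABLE4[t][0] for t in traits)
    baseline = mean(TABLE4[t][1] for t in traits)
    return model, baseline, (baseline - model) / baseline


ao = group_summary(AO_TRAITS)
bf = group_summary([t for t in TABLE4 if t not in AO_TRAITS])
overall = group_summary(list(TABLE4))

for name, (model, baseline, gain) in (("AO", ao), ("BF", bf), ("All", overall)):
    print(f"{name}: model={model:.4f} baseline={baseline:.4f} reduction={gain:.1%}")
```

The relative reduction is computed as (baseline RMSE minus model RMSE) divided by the baseline RMSE, which is the sense in which the percentages in the text are reported.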

4.2 Holistic Personality Prediction with Multi-output Regression Chain Models

Individuals’ personality is not merely a set of independent traits but rather an integrated model of traits that interact with and are related to each other. For example, it is typical that anxious and insecure attachment is positively associated with Neuroticism [3]. This motivates us (from a data science point of view) to (1) exploit correlations between traits and (2) use a multi-label model that provides predictions for all traits at the same time.

To this end, we select to use regression chains, a multi-label model approach that arranges single-label regressor models into a chain. Each single-label model predicts a single trait, as in the single trait approach, but now also takes into account predictions of preceding models in the chain [68]. In total, a single multi-label model is produced that is capable of exploiting correlations among target variables.

For single-label models of the chain, we use the RF regressors we designed in Section 4.1; each RF regressor takes as input the best-performing set of features (see Table 4) and the predictions of the preceding models in the chain, as depicted in Figure 3.

An important parameter in the chain model is the order of the single-label models. For instance, models that appear earlier in the chain can bring significant benefits to predictions of later models (if the respective traits are correlated). Likewise, a poor prediction accuracy in the beginning of the chain may propagate and affect the accuracy of the following models [61]. Hence, in our approach we used an ensemble of regression chains, where 10 differently and randomly ordered chains are built, and the final predictions are derived from the mean result of all chains.
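The ensemble of regression chains can be sketched in plain Python as follows. This is a simplified illustration under several assumptions, not the implementation used in this work: a tiny k-nearest-neighbour regressor stands in for the RF regressor, every link receives the same input features (rather than a trait-specific feature set), the true values of earlier targets are appended during fitting while predictions are appended at inference time, and all class and function names are hypothetical.

```python
import random


class KNNRegressor:
    """Tiny k-nearest-neighbour regressor; a stand-in for the RF regressor."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = [list(row) for row in X], list(y)
        return self

    def predict(self, X):
        preds = []
        for row in X:
            # Squared Euclidean distance to every training sample.
            dists = sorted(
                (sum((a - b) ** 2 for a, b in zip(row, xr)), yr)
                for xr, yr in zip(self.X, self.y)
            )
            nearest = dists[: self.k]
            preds.append(sum(yr for _, yr in nearest) / len(nearest))
        return preds


class RegressionChain:
    """Chain of single-target models; later links see earlier targets."""

    def __init__(self, order, make_model):
        self.order, self.make_model = list(order), make_model

    def fit(self, X, Y):
        augmented = [list(row) for row in X]
        self.models = []
        for t in self.order:
            target = [row[t] for row in Y]
            self.models.append(self.make_model().fit(augmented, target))
            # Assumption: append the true target values while fitting.
            augmented = [r + [v] for r, v in zip(augmented, target)]
        return self

    def predict(self, X):
        augmented = [list(row) for row in X]
        out = [[0.0] * len(self.order) for _ in X]
        for t, model in zip(self.order, self.models):
            pred = model.predict(augmented)
            for i, value in enumerate(pred):
                out[i][t] = value
            # At inference time the link's prediction becomes a feature.
            augmented = [r + [v] for r, v in zip(augmented, pred)]
        return out


def fit_chain_ensemble(X, Y, n_chains=10, seed=0, make_model=KNNRegressor):
    """Fit n_chains chains, each with its own random target order."""
    rng = random.Random(seed)
    chains = []
    for _ in range(n_chains):
        order = list(range(len(Y[0])))
        rng.shuffle(order)
        chains.append(RegressionChain(order, make_model).fit(X, Y))
    return chains


def predict_chain_ensemble(chains, X):
    """Average the per-target predictions of all chains."""
    all_preds = [chain.predict(X) for chain in chains]
    n_targets = len(chains[0].order)
    return [
        [sum(p[i][t] for p in all_preds) / len(chains) for t in range(n_targets)]
        for i in range(len(X))
    ]


# Toy data (hypothetical): one feature, two correlated targets.
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.4], [0.3, 0.6]]
chains = fit_chain_ensemble(X, Y, n_chains=10, make_model=lambda: KNNRegressor(k=1))
prediction = predict_chain_ensemble(chains, [[1.0]])
```

Averaging over several randomly ordered chains reduces the dependence on any single ordering, so an inaccurate early link in one chain has less influence on the final prediction.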

\(\rightarrow\) “Holistic personality prediction outperforms individual trait predicting models”: Applying the proposed holistic model in our dataset provides an RMSE of 0.192 (see Table 5). The prediction accuracy is improved compared to the independent model prediction, which demonstrates that the chain model can efficiently exploit the correlations between personality traits to achieve better personality prediction. In fact, Table 5 demonstrates that the superiority of regression chain models holds for all feature set configurations we tested (compare RMSE values in top vs. bottom rows).⁴ This is a promising finding that indicates that a holistic psychological prediction approach can be generic and bring improvements under different settings where different sets of features are available and/or selected as more appropriate.

Comparison with state of the art. We proceed to compare our approach with previous approaches. Since previous works neither take into account both BF and AO traits nor make their datasets and model implementations publicly available, a direct comparison is not possible. However, we compare our results with the prediction accuracy (1) stated in previous works and (2) achieved in our dataset when considering the configurations (e.g., set of features) of previous approaches.

With respect to the Big Five traits, [12] and [22] achieve an average RMSE of 0.18 and Mean Absolute Error⁵ of 0.14, respectively, which is comparable to our results. However, note that the models of [12] and [22] are trained and evaluated in different datasets, and a direct comparison of the results may not be conclusive. Given that the prediction accuracy typically increases with the volume of the training data, we expect that the results of our model would improve in larger datasets. We validate this with the results in Table 6, where we considered subsets of the dataset (i.e., fewer samples) on which we trained our model and present obtained prediction accuracy. This shows that the RMSE decreases when considering more training data.

Table 6.

Fraction of samples	40%	60%	80%	100%
RMSE	0.211	0.200	0.195	0.190

Table 6. RMSE of the Holistic Model with Different Training Dataset Sizes

Moreover, when considering only the Big Five traits (see Table 4), our single-trait model achieves a narrower range of RMSE (0.158–0.228) in comparison to the corresponding range of [12] (0.15–0.24). Using the holistic model further improves our single-trait model results. Applying a methodology similar to [12] to our dataset, i.e., using only N-grams as features for all traits and single-trait RF regression models, yields an RMSE of 0.284 (average over the BF traits). This indicates that the approach of [12] does not generalize well to our dataset, since it achieves a performance comparable to the baseline model, which is much worse than our approach that achieves an RMSE = 0.203 (by omitting the AO traits), i.e., a 29% improvement compared to [12].

Finally, with respect to the AO traits, to the best of our knowledge, this is the first study that attempts to predict self-reported relational traits and there are no previous studies to which we can compare our results. The approach by [48] relies on the accuracy of perceivers to identify AO in subjects’ tweets; however, the accuracy of the linear mixed effects models used cannot be directly compared with our approach and results.

4.3 Ground Truth Sample Selection Validation

To further validate that the selected ground-truth dataset of MTurkers is representative of real Twitter users, and that the HM built upon this sample is able to make predictions for different groups of users, we collected a large random sample of 1,000 English-speaking Twitter users and their profiles, applied our HM prediction model, and obtained their predicted personality scores. We compare the distributions of personality traits of the two groups, as shown in Figure 4, and observe that for most traits the two distributions are quite similar, with differences for the traits that are most difficult to predict, Agreeableness and Neuroticism. Moreover, these differences may also be explained by the prediction errors. These findings motivate us to apply our model to different groups of Twitter users and study their personality differences based on their roles; we present a use case in the following section.

Fig. 4.

5 A Use Case: Can You Tell If I Am an Opinion Leader?

Dataset and personality profiling annotation. In order to test our approach in a real-world application, we use a dataset with leaders obtained from Crunchbase (crunchbase.com). This dataset contains business and social media information of organizational employees, including CEOs and staff in executive positions, across organizations throughout the United States. Datasets provided by Crunchbase have been used successfully in recent research [27] examining the relationship and interactions between leaders and their followers on Twitter. Importantly, as outlined in [27], individuals within the respective dataset who are highly active on Twitter and are followed by others on Twitter are defined as “leaders.” Therefore, on the basis of this definition, even a regular employee (i.e., someone without managerial responsibilities and team oversight in their usual workplace) can be defined as an opinion leader on social media. The case can be made that opinion leaders on social media are indeed leaders, since they exert social influence on followers’ communication content and patterns and wield large referent and expert power. One example would be a Twitter leader who is known for their expertise in communicating astrophysics and/or whose followers can identify with them [20]. Hence, although these “leaders” can and do use many of the same bases of power as more traditional organizational leaders, they can be regular employees and users. We ranked individuals within this dataset by their Twitter activity and selected a set of 1,063 highly active Twitter accounts (for brevity, we refer to them as “leaders”). For these accounts, we extracted tweets posted within the 6 months preceding the onset of data collection, resulting in 238,665 tweets. We applied our methodology to extract language, emotional, and behavioral features (see Section 3.3) and fed this information into our pre-trained holistic model, which in turn predicts leaders’ personality traits.

Characterization of the leaders’ psychological profiles. After predicting leaders’ personality scores, we perform a basic statistical analysis to (1) identify the main characteristics of the leader and (2) investigate whether our identified leaders are similar or differ in their characteristics from random Twitter users in our collected dataset.

Table 7 presents basic descriptive statistics of the obtained personality scores in the leaders dataset. A more detailed comparison of the distributions is presented in Figure 5 (the x-axis represents each trait value and the y-axis the cumulative distribution function (CDF) of the values). A first observation is that leader personality profiles differ significantly from those of the random users in general; see the very different distributions in Figure 5.

Fig. 5.

Fig. 6.

Table 7.

	Mean	Std	Min	Max
Anxiety	6.16	0.52	1.00	7.00
Avoidance	5.79	0.66	1.00	7.00
Openness	4.50	0.25	1.00	5.00
Conscientiousness	1.83	0.34	1.00	5.00
Extraversion	1.86	0.35	1.00	5.00
Agreeableness	2.89	0.83	1.00	5.00
Neuroticism	2.18	0.51	1.00	5.00

Table 7. Leaders’ Personality Traits Statistics (1,063 Samples)

Focusing on individual traits, and concerning the BF traits, we observe that leaders score higher than random users on Openness, the tendency to be open-minded and imaginative. Openness to experience also includes the facet “intellect,” which marks another expected association with individuals who have gained positions of social influence such as opinion leaders or positions of managerial power such as Chief Executive Officers or Heads of Design and similar positions. These results are expected. However, the individuals in our sample scored quite low on Extraversion and Conscientiousness, which is in contrast with the opinion that leaders are dominant and sociable in their environment. This may indicate that behaviors associated with these traits are not easily reflected, and therefore detected, online. As for the AO traits, we observe that leaders’ scores are high on both anxiety and avoidance. A possible interpretation for this is that the leaders in our sample do not wish to maintain successful relationships with other users online and merely use the Twitter platform to advertise their own efforts and accomplishments, as well as to respond to other users who seem to post interesting cognitive (instead of affective) content.

These initial findings highlight an interesting research direction in which our approach could be useful, namely in-depth investigations of possible differences between the real-world and OSN behavior of different groups of people. This may reveal groups or clusters of individuals who tend to have a persona in OSNs very different from their real-world personality, and possibly the factors causing this phenomenon.

Clustering users based on their psychological profiles. To obtain further insights on how the two groups differ (random users vs. leaders), we cluster these groups based on the personality profiles (inferred by our model) of the users. Since each personality profile consists of a high-dimensional vector, we employ a t-distributed stochastic neighbor embedding (t-SNE) graph that reduces dimensionality by mapping profiles on a two-dimensional space. We present the t-SNE visualization in Figure 7, where users are represented by dots (red for random users, green for leaders) and user coordinates correspond to their psychological profile (as mapped in the two-dimensional space). Two dots that are close to each other denote that the users have similar profiles, since t-SNE calculates a similarity measure based on the distance between points. It can be clearly seen in Figure 7 that our holistic model is able to separate different types of users in distinct clusters; i.e., all users in each cluster are of the same role (color), and the users would be indistinguishable only if the colors in the same cluster were mixed.⁶ This is an interesting finding, which indicates that our holistic psychological profile model (and the seven-dimensional profiles it predicts) can be used toward efficiently clustering/classifying different groups of Twitter users.

Fig. 7.

Moreover, to understand the role of AO, BF, and their combination in the clustering task, we evaluated the clustering results for three different cases: HM (AO+BF chain model), IM (AO+BF independent model), and BF independent model. While Figure 7 shows that there clearly exist four clusters, for a more complete analysis we selected the optimal number of clusters for each task using the elbow method [34], and we evaluated the clustering results using the silhouette coefficient, which measures how similar the objects in a group are to each other compared to objects in other groups. The elbow method analysis in Figure 6 shows that the optimal number of clusters is either three or four, and as a result we use these numbers as a proxy for the optimal number of clusters. We also account for k = 2 as the baseline. Table 8 presents the silhouette scores for each of the three cases. It can be seen that when taking into account the AO information (either in the HM or the IM models), the clustering improves (i.e., higher silhouette value) compared to the model that uses only the BF characteristics, and this finding holds for all numbers of clusters we tested, indicating the robustness of using both AO and BF traits.
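For reference, the silhouette coefficient of a point \(i\) is \(s(i) = (b_i - a_i)/\max (a_i, b_i)\), where \(a_i\) is the mean distance from \(i\) to the other members of its own cluster and \(b_i\) is the smallest mean distance from \(i\) to the members of any other cluster; the score of a clustering is the mean of \(s(i)\) over all points. The minimal Python sketch below implements this definition with Euclidean distance; the function name and the toy points are illustrative, and the distance measure is an assumption, as the text does not state which one was used.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Toy two-dimensional profiles (hypothetical): two well-separated groups.
profiles = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(profiles, [0, 0, 1, 1])  # labels match the groups
poor = silhouette_score(profiles, [0, 1, 0, 1])  # labels cut across the groups
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned points.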

Table 8.

Model	Silhouette Score (k = 4)	Silhouette Score (k = 3)	Silhouette Score (k = 2)
HM (AO + BF chain model)	*0.576*	0.506	0.480
IM (AO + BF independent model)	0.557	0.520	0.501
BF (independent model)	0.434	0.401	0.377

Table 8. Clustering Evaluation for Different Models

Table 9.

	Anxiety	Avoidance	Openness	Conscientiousness	Extraversion	Agreeableness	Neuroticism
Language
authentic	.15*	.14*	\(-\).01	.03	\(-\).21**	\(-\).02	.06
feel	.17*	.06	.09	\(-\).02	\(-\).09	.02	.13
religion	.15*	.05	.10	\(-\).01	\(-\).03	.04	.04
filler	.17*	.02	.02	\(-\).12	\(-\).05	.07	.16*
compare	.05	.14*	.11	.02	\(-\).14*	\(-\).06	.03
number	\(-\).04	.18**	\(-\).18**	.00	\(-\).15*	\(-\).07	\(-\).05
health	.04	.13*	.02	.01	\(-\).10	\(-\).09	.06
space	.06	.13*	\(-\).03	\(-\).04	\(-\).17*	\(-\).03	.03
leisure	\(-\).01	\(-\).13*	.02	.03	.07	.05	.01
Emotion
fear	.08	.09	.10	\(-\).05	\(-\).14*	.06	.09
sadness	.10	.07	\(-\).07	\(-\).05	\(-\).01	.12	.17*
joy	.06	\(-\).09	\(-\).06	.08	.01	.14*	.04
surprise	.05	.12	\(-\).20**	.05	.01	\(-\).02	.06
Behavior
#followers	.07	.06	.05	\(-\).05	.06	.01	.18**
#retweets	.05	\(-\).02	\(-\).08	\(-\).08	\(-\).05	.05	.14*
#user mentions	\(-\).20**	\(-\).17*	.03	.01	.10	.09	\(-\).10
#retweets received	.10	\(-\).05	\(-\).13	\(-\).16*	\(-\).02	.06	.11
screen name length	.03	\(-\).04	.00	.13*	.05	.05	.03
tweet length	.03	\(-\).03	.04	\(-\).16*	\(-\).08	.10	.10

Table 9. Detailed Correlations between Features and Traits

*p < 0.05 level, **p < 0.01 level (both 2-tailed).

While these preliminary results demonstrate the usefulness of our work in this direction (a more detailed analysis would be beyond the scope of this article), we believe that they can motivate further research on the topic, e.g., by considering different groups of users (such as artists, politicians, influencers, etc.) or different classification techniques.

6 Conclusions and Future Work

Anxious and avoidant attachment orientations are important constructs in how people express themselves in interpersonal relationships [43]. In combination with BF personality traits, these relational traits have been associated with several offline and online environments that require individuals’ presence and/or interaction, for example, consumer behavior [45], social media use [57], social network friendships [60], as well as the formation of online communities. These associations highlight the important role of personality traits, both in self-expressions and as relational traits, in interactions with others. However, to the best of our knowledge, no previous studies have provided prediction algorithms to detect relational traits based on self-report measures. The present article addresses this gap by providing a machine learning detection and prediction approach with regard to both relational and Big Five personality traits in one model, providing a holistic image of the social self.

We find that the strong inter-correlations between personality and relational traits favor a multi-output model for personality prediction. By exploiting the best predictors among language, emotion, and behavioral residue combined, such a model better represents the psychological mechanisms behind personality expression and relationship interactions online and, at the same time, achieves higher accuracy than other state-of-the-art models.

The holistic psychological profile can find applicability in a number of studies and fields. We provided an initial demonstration through a use case by applying the model to Twitter users who are organizational leaders in the real world. The initial findings show that groups are clustered efficiently based on similar personality characteristics. Similar approaches can be applied to different groups of persons and contribute to social science studies, investigation of the “pretend persona” phenomenon [5] (i.e., deviations between the offline and online behavior of people, similar to those we found in the leaders dataset), user classification in OSNs, design of recommendation algorithms, information diffusion, and so forth. We believe that these are interesting directions for future research. Furthermore, the proposed approach could be combined with other psychological models reflecting individual differences and applied to different case-specific samples, in order to uncover more psychological patterns reflected in our behavior in OSNs and to create a personality map. As additional future research, we propose investigating personality and attachment profiling in the social networking context: combining information about users in a network could reveal the mechanism by which people form relationships online based on how they do so in real life. Finally, the study of information diffusion in personality networks could reveal which personality types propagate certain kinds of information and emotions more effectively, with applications in the marketing and news sectors.

Footnotes

¹

In this article, we use the term holistic to refer to the two main primary categories of traits and their main models, i.e., the personality traits of the Big Five model and the relational traits of the Attachment Orientations model; we do not refer to the exhaustive list of psychological traits or models that have been proposed.

²

Scores deduced from the personality questionnaires were in the interval [1, 7] for AO traits and [1, 5] for BF traits. Both types of trait measures were normalized to a common interval [0, 1].

³

The RF regressor is an ensemble method, which generates a number of decision tree regressors fitted on various subsamples of the dataset and aggregates their results, thus acting as a meta-estimator that can improve the predictive accuracy and control over-fitting [8].

⁴

To validate the performance difference between the HM and IM, we performed the non-parametric Mann-Whitney U test to compare the squared errors of the two models. The test was selected because of the small sample size of the testing set and the non-normal data distribution. The Mann-Whitney U test showed a significant difference between the HM and the IM (U = 761, n = 46, p = 0.012, alpha = 0.05), supporting the performance improvement of the HM.

⁵

Note that the mean absolute error (MAE) is always less than or equal to RMSE.

⁶

We need to highlight that users’ clusters are not defined explicitly by the role we assigned (leader or random) since more subgroups may exist; that is why in Figure 7 we observe four clusters/groups.

A Detailed Results of Features-traits Relations

We present a detailed analysis and results on the correlations between features and traits in our dataset.

Several associations were found between anxious attachment and language features. Anxious individuals tended to use language that conveys authenticity (\(\rho = 0.15\)), i.e., language that is personal and self-revealing [51], as well as words about feelings (\(\rho = 0.17\); words such as “feels,” “touch,” etc.) and religion (\(\rho = 0.15\); words such as “church,” “altar,” etc.). In addition, individuals who scored high on anxious attachment were more likely to use filler words (\(\rho = 0.17\); words such as “I mean” or “You know”). Avoidant attachment, on the other hand, was significantly and positively associated with authenticity (\(\rho = 0.14\)), health-related words (\(\rho = 0.13\); words such as “clinic,” “flu,” “pill,” etc.), and space-related words (\(\rho = 0.13\); words such as “down,” “in,” “thin,” etc.). We also found that avoidant individuals were less likely to mention leisure-related words (\(\rho = -0.13\); words such as “cook,” “chat,” “movie,” etc.). Avoidant persons also tended to use comparisons (\(\rho = 0.14\); words such as “greater,” “best,” etc.) and numbers (\(\rho = 0.18\); words such as “second,” “thousand,” etc.) frequently, which is consistent with prior work showing that avoidant individuals tend to have a negative view of others and hence would be more likely to see themselves in a more positive light. With regard to behavioral features, both anxious and avoidant AOs were significantly and negatively related to the number of user mentions (\(\rho = -0.20\) and \(\rho = -0.17\), respectively). Hence, individuals who scored higher on either of these traits tend not to refer directly to others within a social media context.

Regarding the BF dimensions, we found several correlations with language and topic use, emotions, and behavioral patterns for each trait. For example, agreeableness was significantly and positively correlated with the expression of joy (\(\rho = 0.14\)), and extraversion was negatively correlated with fear (\(\rho = -0.14\)). We also observed that openness was significantly and negatively correlated with surprise (\(\rho = -0.20\)), while highly neurotic individuals were more likely to express sadness (\(\rho = 0.17\)). In addition, conscientiousness was negatively correlated with the number of retweets received (\(\rho = -0.16\)) as well as with tweet length (\(\rho = -0.16\)), while conscientious individuals were more likely to pick lengthy screen names (\(\rho = 0.13\)). Moreover, we noticed that highly neurotic individuals were more likely to have more followers (\(\rho = 0.18\)) and to retweet other people’s content on a regular basis (\(\rho = 0.14\)).
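To make the reported coefficients concrete, the following is a minimal sketch of how a rank correlation between one feature and one trait score could be computed. This appendix does not state which coefficient \(\rho\) denotes; the sketch assumes Spearman's rank correlation, i.e., the Pearson correlation of the rank-transformed values, with tied values receiving the mean of their ranks. The function names and toy values are hypothetical and are not taken from our dataset.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(feature, trait):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(feature), average_ranks(trait))


if __name__ == "__main__":
    # Hypothetical feature values and normalized trait scores for five users.
    mentions = [12, 3, 7, 0, 5]
    anxiety = [0.20, 0.70, 0.40, 0.90, 0.50]
    print(spearman_rho(mentions, anxiety))
```

The sketch omits the two-tailed significance test used to mark the starred entries in Table 9.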

References

[1]

Sibel Adali and Jennifer Golbeck. 2012. Predicting personality with social behavior. In 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE, 302–309.
