1 Introduction
Online Social Networks (OSNs) offer a virtual space in which people connect and interact with others, express themselves, and receive information, in a continuous digital reflection of the real (offline) world. In OSNs, people typically showcase their real self [
40] and leave traces in online behavior, which reflect their real-world personality [
24]. These traces expose a holistic image of oneself, including both personal characteristics (
personality traits) and characteristics that portray their behavior in relation to others (
relational traits).
The term
personality refers to characteristic combinations or patterns of behaviors, cognitions, and emotional reactions that evolve from biological and environmental factors and form relatively consistent individual differences [
13]. The
Big Five (BF) or Five Factor model [
29] is one of the most distinctive personality theories and comprises five main traits of human personality that represent individual differences in cognition, emotion, and behavior: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. On the other hand,
relational traits have also been linked with consistencies in social behavior and interaction patterns, with attachment theory [
7] as the most emblematic theoretical framework in that respect [
31,
43], capturing how individuals experience close relationships to and interactions with others.
Personality traits have been studied in the context of OSNs and the web overall, as findings show that they are strongly linked to OSN use [
57], online friendships [
60], and online reviews [
52]. Moreover, certain prediction models have been proposed [
37,
64] to extract users’ psychological background from their online behavioral residue and map it to personality characteristics. However,
relational traits such as attachment orientations (AO) have been overlooked in online environments, even though user activity in OSNs heavily relates to social behavior characteristics. This makes the study of a relational profile critical from an application point of view and provides rich information about individuals’ social profile.
The present research aims to address this limitation in OSN research, by studying and predicting both relational traits and personality traits of users. The importance of relational facets of personality for explaining social interaction cannot be overstated. Given that online social media engagement resembles actual social interactions in many respects [
15,
30], the need to study how different personality facets are reflected in online expression is particularly compelling. Attachment orientations, a key individual difference of relational orientation, can be derived on the basis of traces found in micro-blogs. Attachment orientations capture one’s notion of the self in relation to others and interpersonal relationships, with attachment theory being one of the key personality theoretical frames to explain actual social behavior [
31]. Considering both traditional personality Big Five traits and relational traits is important for (1)
providing holistic profiling of OSN users—humans have an integrated profile in which self and social aspects interrelate and affect each other, and joint profiling can be essential for understanding the overall human presence on OSNs; (2)
uncovering more traits of people’s psychological and social world has been identified as a direction in OSN research (which currently focuses only on the personality traits) that could help to better explain, analyze, and predict online user behavior [
66], e.g., with applications on customer segmentation [
46] or digital advertisement environments [
17]; and (3)
shedding light on social interaction phenomena taking place in OSNs is of great socioeconomic importance, e.g., community formation [
32], decision making [
42], or information diffusion [
12].
To this end, the present article proposes a novel data-driven approach to predict a holistic psychological profile of OSN users, capturing both their personality and relational traits.
Building on insights stemming from psychology theory, our approach applies data mining on OSN textual and non-textual data, carefully selects different sets of features for predicting different types of traits, and exploits the inherent correlations in psychological traits, to efficiently predict a complete image of OSN users’ psychological profile. The proposed approach is applied on the Twitter micro-blogging service, a live, dynamic, and open OSN platform on which people interact intensively and which is largely driven by spontaneous reactions and emotions, as people express themselves (personality facet) and interact with others (relational facet) at the same time.
Specifically, our contributions in detail are as follows:
•
Data mining and feature engineering for psychology traces in OSN. Motivated by psychology theory on personality suggesting that traits are reflected in different types of online behavior and actions, we identify a large set of features that capture language, behavioral, and emotional expressions of users in OSNs. The proposed feature engineering methodology accounts for a larger set of features than those considered in previous works, thus allowing us to target more generic psychological profiling. To apply and test our methodology, we collected a labeled dataset: through a crowdsourcing platform, we recruited 243 individuals who consented to provide information about their psychology profiles. Subsequently, we compiled a ground-truth dataset labeled with their psychology profiles. We used the Twitter API to collect 350,000 tweets from the Twitter accounts of recruited participants and applied the proposed feature engineering methodology.
•
Holistic psychological profiling. We propose a novel machine learning (ML) methodology to predict users’ holistic psychological profile including both Big Five personality and relational traits. The novelty of the proposed methodology is that it (1) uses a large set of the collected (psychological-related) features, (2) carefully selects the subsets of them with the strongest predictive power for each trait, and (3) exploits correlations between personality and relational (i.e., social) behavior traits to enhance individual trait predictions. In this way, our approach not only predicts social facets of a psychology profile (which is not captured by existing personality prediction models) along with personality facets but also leverages the different traits for more accurate holistic profile prediction.
•
New insights and improvement of prediction accuracy. Evaluating our methodology reveals interesting insights for the prediction of psychology traits from OSN traces: (1) using different sets of features performs better in predicting different psychological traits, (2) relational traits can be predicted as efficiently as personality traits, and (3) holistic personality prediction outperforms individual trait predicting models. We believe that our findings can pave the ground for future experimentation and studies in psychology profiling in OSNs. Moreover, the accuracy achieved by our approach (across all traits) is higher than current state-of-the-art approaches, which currently are limited to Big Five personality traits instead of relational traits. For example, applying the approach of [
12] to our data provides a
root mean squared error (RMSE) of 0.284, while our prediction model achieves a 29% improvement for personality traits (RMSE = 0.203) and has 32% better average performance when accounting for all traits (0.192 RMSE); this improvement comes as a result of using both a psychology-driven feature engineering methodology and a holistic profiling approach.
•
Psychological profiling in the wild. We demonstrate the applicability of the proposed psychological profiling methodology through a use case. We identify a set of Twitter users who seem to be accepted as opinion leaders on social media (i.e., have a large following). We apply our methodology to predict their psychological profiles and analyze the results. We find that the distributions of traits deviate significantly from those of regular users (defined as users included in our ground-truth dataset), and that the set of leaders can be clearly separated by using only their psychological profiles. These findings highlight the usefulness of our approach in characterizing the personalities of different groups of OSN users (e.g., such a group psychological profile could be used to recommend skills/activities/jobs to users based on their profile similarity) and in classifying users based on their profiles.
In Section 2, we provide an overview of related psychological literature, discuss related work, and highlight several open issues of existing methods. We also highlight the contribution of the present work. Section
3 details the collected dataset and the data mining and feature engineering methodology, and Section
4 presents the design and evaluation of the proposed machine learning predictive model. Finally, we conclude our article and discuss future work in Section
6.
3 Harvesting Personality Traces in Twitter
In this section we present the data collection and analysis methodology we follow, which is used to feed our
personality prediction models (Section
4). We first describe the Twitter dataset we collected, labeled with personality scores by the users themselves (Section
3.1). Then, we
pre-process the collected data (Section
3.2),
extract meaningful features for our analysis and modeling (Section
3.3), and conduct an initial data exploration analysis to characterize the quality of the collected dataset and obtain insights for the design of the prediction model (Section
3.4). Figure
1 depicts an overview of our research methodology.
3.1 Dataset Building
We collected a dataset by recruiting 243 (41.44% female and 58.15% male) participants on Amazon’s Mechanical Turk, who provided information about their psychological profile (which was used to “label” the dataset) and their public Twitter accounts (which were used to extract the features). Participants were on average 38.96 years old (std = 10.96) with an average work experience of 15.37 years (std = 10.26). To minimize differences due to culture, we restricted our sample to participants from the United States. In addition, to lower the likelihood of MTurk workers clicking through surveys randomly, we included (1) various pre-tested attention check questions including both direct and bogus items according to best practices [
16], (2) time screens to prevent participants from answering items too quickly [
76], and (3) only participants who had a high approval on the MTurk platform.
Psychological profiles. Participants completed a survey by answering questions related to the BF and AO models, as well as by providing some demographic information. Specifically, the
Big Five personality dimensions of the participants were assessed using the 20-item mini-IPIP [
18], where participants completed 20 items on a 5-point Likert scale from which personality traits were inferred. This scale has been shown to be consistent across studies and has good test-retest reliability, demonstrating a comparable pattern of convergent, discriminant, and criterion-related validity with other Big Five measures [
18]. The mini-IPIP scale is composed of 20 items in total, measuring Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism on a 5-point (1 = “Very inaccurate” to 5 = “Very accurate”) Likert scale.
Attachment orientations (anxiety and avoidance) were assessed with the “Experiences in Close Relationships” scale for adult attachment [
63]. This scale has become the norm in attachment-related research [
26] and has been applied widely as well [
43]. Participants completed a 36-item Likert scale questionnaire to assess their behavior in general in close relationships, in order to extract scores about anxiety and avoidance dimensions. Summarizing, we acquired personality labels in order to build a ground-truth dataset, which is later used for designing personality prediction models. We normalized each variable to a common scale [0,1] in order to examine correlations among behavioral characteristics and feed the machine learning models.
Remark: Despite the fact that the methodology we use to infer the personality traits has been shown to be reliable by previous studies, we performed statistical analyses to further verify its reliability in our data. First, we calculated for each trait the Cronbach’s alpha value from the values of the survey items involved. We present the results in Table
3, where we observe high Cronbach alpha values (i.e., high reliability) for all traits. Moreover, we use
confirmatory factor analysis (CFA), a statistical technique used to assess the fit between observed and prior conceptualized, theoretically grounded models [
65]. Using principal components factor analysis with oblique rotation, accounting for possible correlation between factors, for all 36 attachment items and 20 Big Five personality items, we found that relevant factor loadings and communalities were all above 0.3, confirming that each item shared some common variance with other items. Except for anxious attachment and neuroticism, there is no high cross-loading between variables. This means that the seven variables are largely distinct from one another. In addition, all factor loadings loaded onto the seven identified factors. Given these overall indicators, factor analysis was deemed to be suitable with all examined items, allowing us to successfully replicate factor loadings comparable to prior research, namely a two-factor attachment structure [
56] and five-factor Big Five personality trait structure [
39].
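As an illustration of the reliability check described in the Remark, the following minimal sketch computes Cronbach's alpha from item-level responses. The toy responses and the function name `cronbach_alpha` are hypothetical and are not taken from the study; the formula is the standard one based on item variances and the variance of the total score.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item responses."""
    k = len(item_scores[0])
    # Sample variance of each item across respondents
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of the summed scale score across respondents
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: 5 respondents answering 4 items on a 1-5 Likert scale
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Values close to 1 indicate that the items of a trait vary together, which is the sense in which high alpha values are read as high reliability.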
Twitter accounts. Participants provided their Twitter account user name, which we used to retrieve their social profile. We manually verified the existence of the accounts and excluded from our dataset participants with private accounts, inactive accounts (i.e., with no tweets history), and false accounts. For each of these accounts, using the Twitter API, we collected user profiles, including both tweets history and profile metadata, such as number of followers, statuses, lists, etc.
3.2 Pre-processing Phase
We conducted an initial pre-processing of the data for (1) removal of spam users and tweets and (2) text cleaning, in order to remove factors not related to real users’ psychological profile.
Spam removal. We adopted the approach of [
73] to remove spam users. As spam users we identified those who employ a large number of similar posts and a large number of hashtags. After a manual inspection, we observed that most tweets with five or more hashtags related to repetitive and exclusive content about contests or games. For instance, in cases where an individual uses Twitter for certain reasons only, i.e., participation in contests and giveaways or promoting certain blogs, their tweet history contains similar posts. These users were removed from our dataset. In a similar manner we removed only spam tweets and not the complete user profile, in case a tweet contained five or more hashtags but did not fall into the categories of repetitive and promotional tweets. Moreover, to avoid considering users with a non-typical platform behavior, we removed “ghost users” that have no tweets history at all (number of statuses) or have very few interactions (number of followers). As a result, our final dataset included 229 users with almost 350,000 tweets.
Text preprocessing. We parsed the collected tweets and filtered out all punctuation, digits, and Unicode characters, as well as user mentions, retweet identifiers, and external links (URLs). The most common words used in language, known as “stopwords” (e.g., “that,” “be,” “to,” etc.), were removed to keep only meaningful word units for analysis. After this step, a word tokenizer was used to split the text corpus into single-word units, in order to use them as inputs in other tasks, for example, lemmatization and part-of-speech tagging.
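The cleaning steps above can be sketched with the Python standard library alone. This is an illustrative approximation, not the pipeline used in the study: the stopword list is a tiny hypothetical sample, "Unicode characters" is interpreted here as non-ASCII characters (an assumption), and whitespace splitting stands in for a proper word tokenizer.

```python
import re
import string

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"that", "be", "to", "the", "a", "is", "and", "of", "in", "it", "at"}

# Translation table that turns every ASCII punctuation mark into a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Return the list of lowercase tokens left after cleaning one tweet."""
    text = re.sub(r"https?://\S+", " ", text)       # external links (URLs)
    text = re.sub(r"\bRT\b", " ", text)              # retweet identifier
    text = re.sub(r"@\w+", " ", text)                # user mentions
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII characters
    text = re.sub(r"\d+", " ", text)                 # digits
    text = text.translate(PUNCT_TO_SPACE)            # punctuation
    tokens = text.lower().split()                    # naive whitespace tokenizer
    return [t for t in tokens if t not in STOPWORDS]


print(clean_tweet("RT @alice: Check this out!! https://t.co/abc 123 that is great"))
```

URLs are removed before punctuation so that link fragments do not survive as spurious tokens.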
3.3 Features Engineering Phase
We extracted three types of features, related to behavior, language, and emotion. These feature sets are utilized in order to feed the machine learning algorithms as representations of online personality and relational psychological traces.
3.3.1 Behavioral Features.
We extract a set of attributes (
behavioral features) that relate to users’ profile information and activity on Twitter. Table
2 presents all extracted features that reflect users’
expressivity (eight features),
interactivity with other users (four features), and
platform use (three features). These 15 features, after being normalized on a common scale [0,1], make up a vector for each user that mirrors users’ overall behavior on Twitter.
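The text does not state which normalization is used to map features onto [0,1]; a common choice is min-max scaling, sketched below under that assumption. The follower counts are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to 0 to avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical follower counts for four users
followers = [120, 45, 3000, 800]
scaled = min_max_scale(followers)
print(scaled)
```

Scaling each behavioral feature separately keeps features with large raw ranges (e.g., number of statuses) from dominating features with small ranges.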
3.3.2 Language Features.
To extract the best possible features that could reveal patterns in language use, we employed an open vocabulary approach; in this way, we avoid depending on pre-defined lexicons, which do not allow for unintentional language use that exists in the real text contained in the tweets. For each user we extracted three different language vectors: Tfidf, N-grams, and part-of-speech (POS) or syntactic vectors.
•
Tfidf vectors reflect the importance of a word to a document in a collection [
58]. Considering the set of all tweets in our dataset and all unique words contained within them (after the pre-processing task), each tweet is represented as a vector that calculates word frequency in the document and the relevance of each term in the document. The more a word appears in a document, the more significant it is estimated to be [
2].
•
N-grams (Bigrams for N = 2 and Trigrams for N = 3) represent the phrasal language expression offering rich and powerful features. N-grams produce the frequency vector of any sequence of
\(N\) (contiguous) words in a text, also known as phrases [
55].
•
Syntactic features: Considering all the dataset’s unique words after the preprocessing task, we use a function that assigns a POS tag to each word based on the pre-trained model of [
6]. In our analysis, we use three different syntactic features: frequency of (1) single POS tag vectors, (2) POS tag bigrams, and (3) trigrams. The motivation for using POS is the important associations between the POS use and personality [
41,
78]. In particular, [
41] noted that linguistic orientation is more consistently described by its syntactic component, reflected by the different use of POS, while [
78] demonstrated the potential of POS N-grams as predictive features for the Big Five traits, based on the fact that syntactic features are not controlled so consciously as the use of individual words.
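To make the first two language representations concrete, the sketch below builds N-grams and Tfidf weights from tokenized documents using only the standard library. It uses the plain textbook weighting (term frequency times log(N/df)), which is an assumption; the exact Tfidf variant used in the study is not specified here. The same `ngrams` helper could be applied to sequences of POS tags instead of words.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token phrases of a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


# Hypothetical pre-processed tweets
docs = [
    ["love", "this", "game"],
    ["love", "this", "song"],
    ["hate", "this", "game"],
]
vecs = tfidf(docs)
bigrams = ngrams(docs[0], 2)
print(bigrams)
print(vecs[1])
```

A term that occurs in every document (here "this") receives weight zero, while a term unique to one document receives the largest weight.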
Feature selection and dimensionality reduction. Different combinations of feature subsets were tested using grid search. Moreover, the vectors of language features are sparse and much larger than the other vectors (i.e., behavior and emotion features) taken into account in our analysis. Hence, to reduce over-fitting and improve the generalization of models, we apply a dimensionality reduction of the language vectors. In particular, for Tfidf, N-grams, and POS vectors, we follow a univariate feature selection approach by computing the correlation between the regressors and the target variables and selecting the 100 best features emerging for each category.
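The univariate selection step can be sketched as ranking each feature by the absolute value of its Pearson correlation with the target trait and keeping the top k. The helper names and the toy matrix are hypothetical; the study keeps the 100 best features per category, whereas the example keeps 2 to stay readable.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def select_k_best(X, y, k):
    """X: rows are users, columns are features. Returns the (sorted) indices of
    the k columns with the largest absolute correlation with the target y."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical data: 5 users, 3 features, one trait score per user
X = [
    [0.1, 5, 1],
    [0.2, 3, 1],
    [0.3, 4, 1],
    [0.4, 1, 1],
    [0.5, 2, 1],
]
y = [0.1, 0.2, 0.3, 0.4, 0.5]
print(select_k_best(X, y, 2))
```

Because each feature is scored independently of the others, this kind of selection is cheap even for very large sparse vocabularies, at the cost of ignoring interactions between features.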
3.3.3 Emotion Features.
As emotion features, we aimed to predict individuals’ emotional expression in terms of certain emotions and not in terms of general sentiment scores. In particular, we use a vector of six items representing the six primary emotions [
19]:
joy,
sadness,
anger,
disgust,
fear, and
surprise. We need to extract specific emotions not only to be used as inputs to the personality prediction model but also to study the relationship among personality and emotions and provide new insights. To extract these emotional values from the tweets of the users, we follow the hybrid approach for emotion detection of [
14], which predicts emotions in a text based on its textual attributes, use of emojis, and sentiment scores. Specifically, using the SemEval-2018 dataset [
44] as ground truth, we trained a
Support Vector Machine (SVM) multilabel emotion classifier to annotate the users’ tweets in our dataset. For each user, we aggregate the emotion values, and the emotion feature vector output is used in our methodology.
Textual attributes: We create for each user a vector representing the six primary emotions, exclusively based on the text in the tweets of the user. In order to extract the values of this vector, we follow a text processing methodology for emotion detection: We use the WordNet-Affect, a popular lexical database for emotion detection [
69], which assigns a variety of affect labels to a subset of
synonym sets (synsets) in WordNet. We extend the word list of our dataset using WordNet-Affect synsets of each word included in the representative set of words. After assigning emotions to each word using scores retrieved from WordNet-Affect, we created the emotion frequency vector.
Emoji-based attributes: In addition to the text-based emotional vector, we create another vector (of six items), in which we detect the use frequency of affective emojis. We employ the affective emojis list of [
77] that states groups of emojis related to the six primary emotions. We map each emoji found in our text data with the primary emotion it is linked with and we create the affective emoji frequency vectors.
Sentiment scores: Finally, we create a vector of two items corresponding to negative and positive sentiment scores based on sentiment values extracted from the WordNet-Affect lexicon, which provides sentiment scores for concepts in addition to affective classification. To produce a total sentiment score for each user, we sum word sentiment scores at the tweet level and compute the average at the user level.
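The two-level aggregation (sum within a tweet, average across a user's tweets) can be sketched as follows. The word scores in `LEXICON` are invented for illustration and do not come from any real lexicon, and splitting signed scores into separate positive and negative totals is an assumption about how the two-item vector is formed.

```python
# Hypothetical word-level sentiment scores; real values would come from the lexicon.
LEXICON = {"great": 0.8, "love": 0.9, "awful": -0.7, "sad": -0.6}


def user_sentiment(tweets):
    """tweets: list of token lists for one user. Returns (positive, negative)."""
    pos_totals, neg_totals = [], []
    for tokens in tweets:
        scores = [LEXICON.get(t, 0.0) for t in tokens]
        # Tweet level: sum positive and negative word scores separately
        pos_totals.append(sum(s for s in scores if s > 0))
        neg_totals.append(sum(-s for s in scores if s < 0))
    n = len(tweets)
    if n == 0:
        return (0.0, 0.0)
    # User level: average the tweet-level totals
    return (sum(pos_totals) / n, sum(neg_totals) / n)


pos, neg = user_sentiment([["love", "great"], ["awful", "day"]])
print(pos, neg)
```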
We proceed to combine the information in the collected attributes and create a single vector of six emotion features for each user, after normalization in a common scale [0,1]. To infer the emotional vector from the collected attributes, we first train a predictive machine learning model on the SemEval-2018 dataset [
44] and then apply it to our users. Specifically, we split the SemEval-2018 dataset into 80%-20% train-test sets and trained an SVM emotion classifier, which achieved a precision score of 0.85. Then, in our dataset, we aggregate the aforementioned features, along with a Tfidf vector, and feed them to the SVM model that outputs the emotion vector for each user. This hybrid approach has been shown to be more robust compared to text-only or emoji-only emotion detection [
14]. Moreover, by using a valid, high-quality dataset as ground truth to build an emotion model, instead of relying directly on emotion signals such as sentiment scores or emojis, we add an extra layer that transfers knowledge from another source, which improves the detection of specific emotions.
3.4 Exploratory Data Analysis
In this section we conduct an exploratory analysis of the collected data (self-reported psychological traits and extracted features) to validate to what extent our dataset is representative of and spans different psychological profiles, and obtain insights that can help toward designing an efficient predictive methodology.
Distribution of psychological profiles. We calculated the distributions for each of the (self-reported) psychological traits contained in our dataset and present their statistics in Table
3. As expressed in the
min and
max values for each trait, our dataset contains samples that span the entire range of psychological profiles (note that, before normalization, relational traits were scored on a scale of 1 to 7, while personality traits were scored on a scale of 1 to 5). The mean value for most traits is around the middle of the respective range of values, with Openness and Agreeableness being considerably skewed toward higher values and Neuroticism toward lower values. Trait variability (std) was higher in the case of AO traits, Extraversion, and Neuroticism; traits with more variability are expected to be more challenging to predict (which is also validated by our results; see Section
4.1).
Correlations between psychological traits (and the motivation for a holistic psychological profiling approach). Figure
2 depicts the correlations (Pearson’s coefficient
\(\rho , \rho \in [-1,1]\)) among the various traits in our dataset. There are significant positive/negative correlations between traits. Specifically, 17 out of the 21 trait pairs have a coefficient
\(|\rho |\gt 0.2\). Moreover, the highest correlations are between anxious attachment and Neuroticism (
\(\rho\) = 0.74), and avoidance and Agreeableness (
\(\rho\) =
\(-0.53\)); both these pairs contain a BF trait and an AO trait.
These observations motivate us to consider a holistic psychology profile prediction approach, which accounts for inter-trait correlations and allows us to predict both types of traits at the same time with a single multi-output model (see Section
4.2).
Correlations between psychological traits and OSN-extracted features. We proceed to examine the correlations between psychological traits and the online behavioral traces of OSN users that are captured by the extracted features (Section
3.3). As for language use, correlations arose from linguistic categories that exist in the Linguistic Inquiry and Word Count [
51]. Details about the correlational analysis can be found in the appendix (see Table
9). The main findings show that our sample is representative of specific behavioral, language, and emotional indicators linked with personality from a psychological aspect and expression and OSN use. Moreover, we observe a lack of strong correlations among individual features and traits, since the vast majority of correlations remain below 0.20, while 37.8% of them are below 0.15. This indicates the need for complex features to capture patterns among personality traits and expression in OSN.
In particular, with regard to AO on the basis of online behavior, the language use of anxious individuals seems to be more personal and self-revealing. This is in line with findings that anxious attachment is associated with higher authenticity and higher self-disclosure in social interactions, thereby fulfilling their need for increased proximity seeking. On the other hand, avoidant individuals tend to express a negative view of others and are more likely to see themselves in a more favorable light, as they have learned not to depend on others. We also found that avoidant persons are more likely to use linguistic features that suggest a social comparison. In addition, and in keeping with previous findings [
11], highly avoidant persons seem to be more likely to use objective words and less emotion-related words.
As for BF traits with regard to emotion features, we found an expected relationship between Neuroticism and the expression of sadness, while highly extraverted participants were less likely to express fear and highly agreeable participants were more expressive of joy. These associations are in line with previous research on the BF traits and social media use [
59,
79]. Finally, with regard to language features, most associations were found with regard to extraverted participants, such as a lower likelihood for such individuals to express authenticity, likely due to their tendency toward expressing positive emotions and feelings [
59,
79].
We provide details on the correlations between psychological traits and OSN-extracted features regarding effect sizes and significance as well as psychological interpretation in the appendix.
4 Personality Prediction
In this section, we proceed to design a model that predicts the user personality traits from the extracted features (see Figure
1). In the following, we first provide the overview of our approach. We then present and discuss our design choices step by step, along with evaluation results (i.e., prediction accuracy) and comparison with state-of-the-art approaches. Our approach combines expert knowledge from the psychology domain and advanced statistical analysis techniques.
Machine learning approach. The analysis in Section
3.4 reveals that no individual behavioral, emotion, or language metric has a significant correlation with any of the individual personality traits. This indicates that simple statistical models (e.g., analytic models or linear models with a few features) are not expected to predict the personality profiles accurately; we experimentally verified this intuition as well. Hence, in our approach we use advanced statistical models, i.e., ML models, which have proven effective in capturing complex statistical patterns and relations between features and target variables.
Regression, training strategy, and evaluation metric. The scores for the personality traits retrieved from the survey are continuous values in the interval [0,1].
Hence, we use regression models that correspondingly predict a (continuous) value for each trait. Before training the models described below, we first split our dataset into 80%-20% train-test sets. The reported performance of our model refers to the errors on the test sets; however, errors on the training sets are also provided.
To measure model performance, we use the RMSE:
\[\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}},\]
where
\(n\) is the number of samples,
\(y_{i}\) is the actual trait value for sample
\(i\), and
\(\hat{y}_{i}\) is the corresponding prediction of the model. RMSE is a widely used metric for regression problems that reflects both the variance and the bias of the model.
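As a small worked example of the metric, the sketch below computes the RMSE of a set of predictions and compares it with a constant predictor that always outputs the mean of the observed values. All numbers are hypothetical and unrelated to the reported results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


# Hypothetical trait scores on the [0,1] scale and model predictions
y_true = [0.2, 0.5, 0.8, 0.5]
y_pred = [0.3, 0.5, 0.6, 0.4]

# Constant predictor: always output the mean of the observed values
mean_value = sum(y_true) / len(y_true)
y_mean = [mean_value] * len(y_true)

model_error = rmse(y_true, y_pred)
mean_error = rmse(y_true, y_mean)
print(round(model_error, 4), round(mean_error, 4))
print("relative improvement:", round(1 - model_error / mean_error, 3))
```

Because errors are squared before averaging, a few large misses raise the RMSE more than many small ones, which is worth keeping in mind when comparing traits with different variability.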
Base regression model. We tested different regression models (linear, support vector, Gaussian, random forest, etc.) for predicting individual personality traits from the extracted features. The
random forest (RF) regressor
performed consistently better for the majority of the personality traits, and thus is selected as the base regression model in our analysis. We conducted a grid search for the tuning of the hyperparameters of the RF regressors, which led to 100 trees.
Single personality trait prediction and best-performing features (Section 4.1). Using the extracted features, we trained RF regressors to predict each personality trait
individually, similarly to previous approaches. However, as discussed earlier, we (1) use a larger set of features and (2) provide predictions both for BF and AO traits (the latter have not been considered before in a self-labeled dataset). We find that using all the extracted features is not optimal for all traits, nor is there a single set of features that performs best for all traits. We selected the (sub)set of features for each trait that leads to higher accuracy, built the respective RF regression models, and studied their performance.
Holistic personality prediction with multi-output regression chain models (Section 4.2). As can be seen in Figure
2, and is supported by related literature [
28,
50], there are significant correlations between personality traits. Motivated by this observation, we propose a single multi-output model that predicts all personality traits at the same time. The proposed model uses chains of RF regressors (see Figure
3) and exploits the correlations between traits to achieve higher accuracy compared to individual trait predictive models.
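The structure of a regressor chain can be illustrated with a dependency-free sketch. Note the substitutions: a tiny k-nearest-neighbour regressor stands in for the random forest purely to keep the example self-contained, and the chain order, class name, and toy data are hypothetical. What the sketch does show is the general chaining idea: each link predicts one trait and passes its output to the next link as an additional feature.

```python
def knn_predict(train_X, train_y, x, k=3):
    """Average the targets of the k training rows closest to x (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), target)
        for row, target in zip(train_X, train_y)
    )
    nearest = dists[:k]
    return sum(t for _, t in nearest) / len(nearest)


class RegressorChain:
    """Multi-output chain: link t sees the features plus traits 0..t-1."""

    def __init__(self, k=3):
        self.k = k
        self.links = []  # one (augmented training matrix, targets) pair per trait

    def fit(self, X, Y):
        """X: users x features; Y: users x traits, columns in chain order."""
        augmented = [list(row) for row in X]
        self.links = []
        for t in range(len(Y[0])):
            targets = [y_row[t] for y_row in Y]
            self.links.append(([list(r) for r in augmented], targets))
            # Later links are trained with the true value of this trait as a feature
            for row, value in zip(augmented, targets):
                row.append(value)
        return self

    def predict_one(self, x):
        x_aug, preds = list(x), []
        for train_X, train_y in self.links:
            p = knn_predict(train_X, train_y, x_aug, self.k)
            preds.append(p)
            x_aug.append(p)  # at prediction time, the prediction is fed forward
        return preds


# Hypothetical data: one feature, two correlated traits
X = [[0.0], [0.1], [0.9], [1.0]]
Y = [[0.1, 0.2], [0.1, 0.2], [0.9, 0.8], [0.9, 0.8]]
chain = RegressorChain(k=2).fit(X, Y)
print(chain.predict_one([0.05]))
print(chain.predict_one([0.95]))
```

For experiments with actual random forests, scikit-learn offers `RegressorChain`, which can wrap a `RandomForestRegressor` and follows the same train-on-true-values, predict-with-predictions scheme.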
4.1 Single Personality Trait Prediction
We first use RF regressors to predict each trait individually, similarly to previous approaches. For each trait, we build and train an RF regressor that receives all features, as well as RF regressors that receive a subset of the extracted features. We do so because using a smaller set of features may help to avoid over-fitting and improve predicting accuracy in relatively small datasets; in a sufficiently large dataset, selecting all features would be optimal.
\(\rightarrow\) “Using different sets of features performs better in predicting different personality traits”: In Table
4 we present the set of features that best predicted each trait. Using all features is not always the best choice, nor is there a single (sub)set of features that performs best for all traits. In particular, for the AO traits (anxiety and avoidance; top two rows), the best accuracy is achieved by using only language features, at a phrasal level (i.e., N-grams) for anxiety and at a syntactic level (i.e., POS) for avoidance. With respect to BF traits (bottom five rows), language features perform best for Openness and Agreeableness (at the syntactic and word level, respectively), all features combined provide better predictions for Conscientiousness and Extraversion, and emotion features provide the highest predictive capacity for Neuroticism.
For completeness, we state in Table
4 the RMSE of a baseline/dummy model that always predicts the mean value of the distribution of each trait. Our model achieves 27% lower RMSE than the baseline model. Table
5 provides the average RMSE of the single trait prediction model (top row) when the same set of features is used for all RF regressors. As we can see, when the feature selection is not tuned per trait, the RMSE is significantly higher (and in some cases close to the baseline accuracy), which highlights the benefits of our feature selection approach.
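The per-trait feature selection and the comparison against the mean-predicting baseline can be illustrated with a minimal sketch. This is not the implementation used in this study: it assumes scikit-learn, 5-fold cross-validation as the evaluation protocol, and synthetic data, and the feature group names and the helpers `best_feature_set` and `baseline_rmse` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def best_feature_set(feature_sets, y, n_trees=100, cv=5, seed=0):
    """Return (name, rmse) of the feature group with the lowest cross-validated RMSE.

    feature_sets maps a (hypothetical) group name to an (n_users, n_features) array.
    """
    scores = {}
    for name, X in feature_sets.items():
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        scores[name] = rmse(y, cross_val_predict(model, X, y, cv=cv))
    best = min(scores, key=scores.get)
    return best, scores[best]


def baseline_rmse(y, cv=5):
    """RMSE of a dummy model that always predicts the mean of the training folds."""
    X_unused = np.zeros((len(y), 1))  # the dummy model ignores its inputs
    dummy = DummyRegressor(strategy="mean")
    return rmse(y, cross_val_predict(dummy, X_unused, y, cv=cv))


# Synthetic stand-in data: one informative feature group and one pure-noise group.
rng = np.random.default_rng(0)
n_users = 120
informative = rng.normal(size=(n_users, 3))
noise = rng.normal(size=(n_users, 3))
trait = 0.5 + 0.1 * informative[:, 0] + 0.01 * rng.normal(size=n_users)

feature_sets = {
    "language": informative,
    "behavior": noise,
    "all": np.hstack([informative, noise]),
}
best_name, best_score = best_feature_set(feature_sets, trait, n_trees=30)
print(best_name, round(best_score, 3), round(baseline_rmse(trait), 3))
```

Repeating `best_feature_set` once per trait yields one selected feature group per trait, which is the kind of per-trait selection summarized in Table 4.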
\(\rightarrow\) “Relational traits can be predicted as efficiently as personality traits”: Apart from different best-performing features, we observe that the prediction accuracy differs among the different traits as well. For instance, Openness is the easiest trait to predict (RMSE = 0.158), while Neuroticism is the most difficult (RMSE = 0.228). These differences in prediction difficulty among traits are in line with previous findings [
12,
22,
49,
64]. When it comes to relational traits (AO), we observe that the achieved accuracy is within the range of accuracy for BF traits. This shows that the proposed feature engineering methodology, when combined with a single-trait RF regressor, can be used to efficiently predict both relational and personality traits. To the best of our knowledge, this is the first time that a common methodology is applied to both types of traits (cf. Section
2.3 and Table
1). In fact, the baseline model shows that in our dataset the prediction of AO traits is more challenging: the baseline model achieves an RMSE of 0.348 for AO traits and an RMSE of 0.242 for BF traits, and the improvement by our model is 43% (RMSE = 0.199) for AO traits and 17% (RMSE = 0.201) for BF traits.
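The improvement percentages follow directly from the RMSE values reported above, as the relative reduction with respect to the baseline. The short check below reproduces them; the function name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


ao = relative_improvement(0.348, 0.199)  # AO traits
bf = relative_improvement(0.242, 0.201)  # BF traits
print(f"AO: {ao:.0f}%  BF: {bf:.0f}%")  # prints "AO: 43%  BF: 17%"
```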
4.2 Holistic Personality Prediction with Multi-output Regression Chain Models
Individuals’ personality is not merely a set of independent traits but rather an integrated model of traits that interact with and relate to each other. For example, anxious and insecure attachment is typically positively associated with Neuroticism [
3]. This motivates us (from a data science point of view) to (1) exploit correlations between traits and (2) use a
multi-label model that provides predictions for all traits at the same time.
To this end, we use
regression chains, a multi-label model approach that arranges single-label regressor models into a chain. Each single-label model predicts a single trait, as in the single trait approach, but now also takes into account predictions of preceding models in the chain [
68]. In total, a single multi-label model is produced that is capable of exploiting correlations among target variables.
For single-label models of the chain, we use the RF regressors we designed in Section
4.1; each RF regressor takes as input the best-performing set of features (see Table
4) and the predictions of the preceding models in the chain, as depicted in Figure
3.
An important parameter in the chain model is the order of the single-label models. For instance, models that appear earlier in the chain can bring significant benefits to predictions of later models (if the respective traits are correlated). Likewise, poor prediction accuracy at the beginning of the chain may propagate and affect the accuracy of the following models [
61]. Hence, in our approach we used an
ensemble of regression chains, where 10 differently and randomly ordered chains are built, and the final predictions are derived from the mean result of all chains.
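An ensemble of randomly ordered regression chains can be sketched as follows. This is an illustrative approximation under stated assumptions rather than the model used in this study: it relies on scikit-learn's `RegressorChain`, which feeds the same feature matrix to every link (whereas our chain uses a trait-specific feature set per link) and which, by default, trains each link on the true values of the preceding targets and uses predicted values only at prediction time. The class name `ChainEnsemble` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import RegressorChain


class ChainEnsemble:
    """Average the predictions of several randomly ordered regression chains."""

    def __init__(self, n_chains=10, n_trees=100, seed=0):
        # Each chain gets its own random target order and its own RF seed.
        self.chains = [
            RegressorChain(
                RandomForestRegressor(n_estimators=n_trees, random_state=seed + i),
                order="random",
                random_state=seed + i,
            )
            for i in range(n_chains)
        ]

    def fit(self, X, Y):
        for chain in self.chains:
            chain.fit(X, Y)
        return self

    def predict(self, X):
        # Every chain returns targets in the original column order,
        # so the per-chain predictions can be averaged element-wise.
        return np.mean([chain.predict(X) for chain in self.chains], axis=0)


# Synthetic stand-in data with two correlated targets and one independent target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
shared = X[:, 0]
Y = np.column_stack([
    shared + 0.1 * rng.normal(size=100),
    -shared + 0.1 * rng.normal(size=100),
    X[:, 1] + 0.1 * rng.normal(size=100),
])

model = ChainEnsemble(n_chains=3, n_trees=20).fit(X, Y)
predictions = model.predict(X)
print(predictions.shape)  # (100, 3)
```

Averaging over several random orders reduces the dependence of the result on any single chain order, which is the motivation for the ensemble stated above.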
\(\rightarrow\) “Holistic personality prediction outperforms individual trait predicting models”: Applying the proposed holistic model in our dataset provides an RMSE of 0.192 (see Table
5). The prediction accuracy is improved compared to the independent model prediction, which demonstrates that the chain model can efficiently exploit the correlations between personality traits to achieve better personality prediction. In fact, Table
5 demonstrates that the superiority of regression chain models holds for all feature set configurations we tested (compare RMSE values in top vs. bottom rows).
4 This is a promising finding that indicates that a holistic psychological prediction approach can be generic and bring improvements under different settings where different sets of features are available and/or selected as more appropriate.
Comparison with state of the art. We proceed to compare our approach with previous approaches. Since previous works neither take into account both BF and AO traits nor make their datasets and model implementations publicly available, a direct comparison is not possible. However, we compare our results with the prediction accuracy (1) stated in previous works and (2) achieved in our dataset when considering the configurations (e.g., set of features) of previous approaches.
With respect to the Big Five traits, [
12] and [
22] achieve an average RMSE of 0.18 and Mean Absolute Error
5 of 0.14, respectively, which is comparable to our results. However, note that the models of [
12] and [
22] are trained and evaluated in different datasets, and a direct comparison of the results may not be conclusive. Given that the prediction accuracy typically increases with the volume of the training data, we expect that the results of our model would improve in larger datasets. We validate this with the results in Table
6, where we considered subsets of the dataset (i.e., fewer samples) on which we trained our model and present the obtained prediction accuracy. This shows that the RMSE decreases when considering more training data.
Moreover, when considering only the Big Five traits (see Table
4), our single-trait model achieves a better range of RMSE (0.156–0.228) in comparison to the corresponding range of [
12] (0.15–0.24). Using the holistic model further improves our single-trait model results. Applying a methodology similar to [
12] to our dataset, i.e., using only N-grams as features for all traits and single-trait RF regression models, yields an RMSE of 0.284 (average over the BF traits). This clearly shows that the approach of [
12] does not generalize to our dataset: its performance is comparable to that of the baseline model and much worse than that of our approach, which achieves an RMSE of 0.203 when the AO traits are omitted, i.e., a 29% improvement compared to [
12].
Finally, with respect to the AO traits, to the best of our knowledge, this is the first study that attempts to predict self-reported relational traits and there are no previous studies to which we can compare our results. The approach by [
48] relies on the accuracy of perceivers to identify AO in subjects’ tweets; however, the accuracy of the linear mixed effects models used cannot be directly compared with our approach and results.
4.3 Ground Truth Sample Selection Validation
In order to further validate that the selected ground-truth dataset of MTurkers is representative of real Twitter users, and that the HM built upon this sample can make predictions for different groups of users, we collected a large random sample of 1,000 English-speaking Twitter users and their profiles, applied our HM prediction model, and obtained their predicted personality scores. We compare the distributions of personality traits of the two groups, as shown in Figure
4, and we observe that for most traits the two distributions are quite similar, with differences for the most difficult traits, Agreeableness and Neuroticism. These differences may also be explained by prediction errors. These findings motivate us to apply our model to different groups of Twitter users and study their personality differences based on their roles; we present a use case in the following section.
5 A Use Case: Can You Tell If I Am an Opinion Leader?
Dataset and personality profiling annotation. In order to test our approach in a real-world application, we use a dataset with leaders obtained from Crunchbase (crunchbase.com). This dataset contains business and social media information of organizational employees, including CEOs and staff in executive positions, across organizations throughout the United States. Datasets provided by Crunchbase have been used successfully in recent research [
27] in examining the relationship and interactions between leaders and their followers on Twitter. Importantly, as outlined in [
27], individuals within the respective dataset who are highly active on Twitter and are followed by others on Twitter are defined as “leaders.” Therefore, on the basis of this definition, even a regular employee (i.e., someone without managerial responsibilities and team oversight in their usual workplace) can be defined as an opinion leader on social media. The case can be made that opinion leaders on social media are indeed leaders because they exert social influence on their followers’ communication content and patterns and wield considerable referent and expert power. One example would be a Twitter leader who is known for their expertise in communicating astrophysics and/or with whom followers can identify [
20]. Hence, although these “leaders” can and do use many of the same bases of power as more traditional organizational leaders, they can be regular employees and users. We ranked individuals within this dataset by their Twitter activity and selected a set of 1,063 highly active Twitter accounts (for brevity, we refer to them as “leaders”). For these respective accounts, we extracted tweets posted within the past 6 months from the onset of data collection, resulting in 238,665 tweets. We applied our methodology to extract language, emotional, and behavioral features (see Section
3.3) and fed this information into our pre-trained holistic model, which in turn predicts leaders’ personality traits.
Characterization of the leaders’ psychological profiles. After predicting leaders’ personality scores, we perform a basic statistical analysis to (1) identify the main characteristics of the leader and (2) investigate whether our identified leaders are similar or differ in their characteristics from random Twitter users in our collected dataset.
Table
7 presents basic descriptive statistics of obtained personality scores in the leaders dataset. A more detailed comparison of the distributions is presented in Figure
5 (the x-axis represents each trait value and the y-axis the
cumulative distribution function (CDF) of the values). A first observation is that leader personality profiles differ significantly from those of the random users in general; see the very different distributions in Figure
6.
Focusing on individual traits, and concerning the BF traits, we observe that leaders score higher than random users on Openness, the tendency to be open-minded and imaginative. Openness to experience also includes the facet “intellect,” which marks another expected association with individuals who have gained positions of social influence such as opinion leaders or positions of managerial power such as Chief Executive Officers or Heads of Design and similar positions. These results are expected. However, the individuals in our sample scored quite low on Extraversion and Conscientiousness, which is in contrast with the opinion that leaders are dominant and sociable in their environment. This may indicate that behaviors associated with these traits are not easily reflected, and therefore detected, online. As for the AO traits, we observe that leaders’ scores are high on both anxiety and avoidance. A possible interpretation for this is that the leaders in our sample do not wish to maintain successful relationships with other users online and merely use the Twitter platform to advertise their own efforts and accomplishments, as well as to respond to other users who seem to post interesting cognitive (instead of affective) content.
These initial findings highlight an interesting research direction in which our approach could be useful, namely in-depth investigations of possible differences between the real-world and OSN behavior of different groups of people. Such investigations may reveal groups or clusters of individuals whose OSN persona differs markedly from their real-world personality, as well as possible causes of this phenomenon.
Clustering users based on their psychological profiles. To obtain further insights on how the two groups differ (random users vs. leaders), we cluster these groups based on the personality profiles (inferred by our model) of the users. Since each personality profile consists of a high-dimensional vector, we employ a
t-distributed stochastic neighbor embedding (t-SNE) graph that reduces dimensionality by mapping profiles on a two-dimensional space. We present the t-SNE visualization in Figure
7, where users are represented by dots (red for random users, green for leaders) and user coordinates correspond to their psychological profile (as mapped in the two-dimensional space). Two dots that are close to each other denote that the users have similar profiles, since t-SNE calculates a similarity measure based on the distance between points. It can be clearly seen in Figure
7 that our holistic model is able to separate different types of users in distinct clusters; i.e., all users in each cluster are of the same role (color), and the users would be indistinguishable only if the colors in the same cluster were mixed.
6 This is an interesting finding, which indicates that our holistic psychological profile model (and the seven dimensional profiles it predicts) can be used toward efficiently clustering/classifying different groups of Twitter users.
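A two-dimensional t-SNE projection of seven-dimensional profiles can be sketched as follows. The example assumes scikit-learn's `TSNE` and uses synthetic profiles as stand-ins for the predicted trait scores; the group means, group sizes, and the perplexity value are arbitrary choices made for illustration and are not taken from our analysis.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-ins for predicted profiles: 2 AO traits + 5 BF traits per user.
rng = np.random.default_rng(0)
random_users = rng.normal(loc=0.4, scale=0.05, size=(40, 7))
leaders = rng.normal(loc=0.6, scale=0.05, size=(40, 7))

profiles = np.vstack([random_users, leaders])
roles = np.array(["random"] * 40 + ["leader"] * 40)  # used only for coloring a plot

# Map each 7-dimensional profile to a point in two dimensions.
embedding = TSNE(
    n_components=2, perplexity=15, init="pca", random_state=0
).fit_transform(profiles)
print(embedding.shape)  # (80, 2)
```

Plotting `embedding[:, 0]` against `embedding[:, 1]` and coloring the points by `roles` gives a visualization of the same kind as the one described above.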
Moreover, to understand the role of AO, BF, and their combination on the clustering task, we evaluated the clustering results for three different cases: HM (AO+BF chain model), IM (AO+BF independent model), and BF independent model. While Figure
7 clearly shows four clusters, for a more complete analysis we selected the optimal number of clusters for each task using the elbow method [
34] and evaluated the clustering results with the silhouette coefficient, which measures how similar the objects in a cluster are to one another compared with objects in other clusters. The elbow method analysis in Figure
6 shows that the optimal number of clusters is either three or four, and we therefore use these values as a proxy for the optimal number of clusters. We also include k = 2 as the baseline. Table
8 presents the different silhouette scores for each of the three cases. It can be seen that when taking into account the AO information (either in the HM or the IM models), the clustering improves (i.e., higher silhouette value) compared to the model that uses only the BF characteristics, and this finding holds for every number of clusters we tested, indicating the robustness of using both AO and BF traits.
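For reference, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster; the score of a clustering is the mean of s over all points. The following self-contained sketch computes it with Euclidean distance on a toy example. The function and the toy points are illustrative assumptions and are not the code or data behind Table 8.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    def mean_dist(i, members):
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes s = 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within the own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy example: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups nearby points together
bad = silhouette_score(points, [0, 1, 0, 1])   # mixes the two groups
print(good > bad)  # True
```

Values close to 1 indicate compact, well-separated clusters, which is why a higher silhouette value is read as a better clustering.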
While these preliminary results demonstrate the usefulness of our work in this direction (a more detailed analysis would be beyond the scope of this article), we believe that they can motivate further research on the topic, e.g., by considering different groups of users (such as artists, politicians, influencers, etc.) or classification techniques.
6 Conclusions and Future Work
Anxious and avoidant attachment orientations are important constructs in how people express themselves in interpersonal relationships [
43]. In combination with BF personality traits, these relational traits have been associated with several offline and online environments that require individuals’ presence and/or interaction, for example, consumer behavior [
45], social media use [
57], social network friendships [
60], as well as the formation of online communities. These associations highlight the important role of personality traits, both in self-expressions and as relational traits, in interactions with others. However, to the best of our knowledge, no previous studies have provided prediction algorithms to detect relational traits based on self-report measures. The present article addresses this gap by providing a machine learning detection and prediction approach with regard to both relational and Big Five personality traits in one model, providing a holistic image of the social self.
We find strong inter-correlations between personality and relational traits. These indicate that a multi-output model for personality prediction, which exploits the best predictors among language, emotion, and behavioral residue, better represents the psychological mechanisms behind personality expression and relationship interactions online, while also achieving higher accuracy on our dataset than the state-of-the-art configurations we tested.
The holistic psychological profile can find applicability in a number of studies and fields. We provided an initial demonstration through a use case by applying the model to Twitter users who are organizational leaders in the real world. The initial findings show that groups are clustered efficiently based on similar personality characteristics. Similar approaches can be applied to different groups of persons and contribute to social science studies, investigation of the “pretend persona” phenomenon [
5] (i.e., deviations between the offline and online behavior of people, similar to those we found in the leaders dataset), user classification in OSNs, design of recommendation algorithms, information diffusion, and so forth. We believe that these are interesting directions for future research. Furthermore, the proposed approach could be combined with other psychological models reflecting individual differences and applied to different case-specific samples in order to uncover more psychological patterns reflected in our behavior in OSNs, creating a personality map. As additional future research, we propose investigating personality and attachment profiling in the social networking context: combining information about users in a network could reveal the mechanism by which people form relationships online based on how they do so in real life. Finally, the study of information diffusion in personality networks could reveal which personality types propagate certain kinds of information and emotions more effectively, with applications in the marketing and news sectors.