Abstract
Currently, monitoring reputation in social media is probably one of the most lucrative applications of information retrieval methods. However, this task poses new challenges due to the dynamic nature of the content and the need for early detection of topics that affect the reputation of companies. Addressing this problem with learning mechanisms based on training data sets is challenging, given that unseen features play a crucial role. However, learning processes are necessary to capture domain features and dependency phenomena. In this work, based on observational information theory, we define a document representation framework that enables the combination of explicit text features with supervised and unsupervised signals in a single representation model. Our theoretical analysis demonstrates that the observation information quantity (OIQ) generalizes the most popular representation methods, in addition to capturing quantitative values, which is required for integrating signals from learning processes. In other words, the OIQ allows us to give the same treatment to features that are currently managed separately. Empirically, our experiments on the reputation-monitoring scenario show that progressively adding features from supervised (in particular, Bayesian inference over annotated data) and unsupervised learning methods (in particular, proximity to clusters) improves similarity estimation performance. This result is verified under various similarity criteria (pointwise mutual information, Jaccard and Lin’s distances, and the information contrast model). According to our formal analysis, the OIQ is the first representation model that captures the informativeness (specificity) of quantitative features in the document representation.
Notes
For instance, considering the occurrence of just a few words as text features is enough to obtain an empty result in a standard web search engine.
Note that considering words as information pieces is not equivalent to considering words as features, because of the effect of repeated words:
$$\begin{aligned} {\mathcal {I}}\big (\{w_1,w_2,w_2\}\big ) = -\log \big (P_{d\in \mathcal {D}}(tf(d,w_1)\ge 1)\cdot P_{d\in \mathcal {D}}(tf(d,w_2)\ge 2)\big ) \ne -\log \big (P(w_1)P(w_2)P(w_2)\big ). \end{aligned}$$

Pairs with equal similarity in the measure count as a half in the probability estimation.
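To make the inequality in the note above concrete, the following is a minimal sketch on a hypothetical toy corpus (the corpus and the words are illustrative assumptions, not data from the paper): it estimates the OIQ of the observation \(\{w_1,w_2,w_2\}\) from message-level probabilities and contrasts it with the naive product of per-occurrence word probabilities.

```python
import math
from collections import Counter

# Hypothetical toy corpus of messages.
corpus = [
    "bank offers new loan",
    "loan loan rates drop at the bank",
    "weather is sunny today",
    "the bank loan loan application",
]
docs = [Counter(doc.split()) for doc in corpus]
N = len(docs)

def p_tf_at_least(word, k):
    """P_{d in D}(tf(d, word) >= k), estimated by relative frequency."""
    return sum(1 for d in docs if d[word] >= k) / N

def p_word(word):
    """Per-occurrence word probability P(word) over the whole corpus."""
    total = sum(sum(d.values()) for d in docs)
    return sum(d[word] for d in docs) / total

w1, w2 = "bank", "loan"

# OIQ of the observation {w1, w2, w2}: w1 at least once, w2 at least twice.
oiq = -math.log(p_tf_at_least(w1, 1) * p_tf_at_least(w2, 2))

# Naive estimate: one probability factor per word occurrence.
naive = -math.log(p_word(w1) * p_word(w2) * p_word(w2))

print(f"OIQ = {oiq:.3f} nats, naive product = {naive:.3f} nats")
```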
References
Adhikari A, Singh S, Mondal D, Dutta B, Dutta A (2016) A novel information theoretic framework for finding semantic similarity in WordNet. arXiv preprint arXiv:1607.05422
Amigó E, Carrillo-de Albornoz J, Chugur I, Corujo A, Gonzalo J, Meij E, de Rijke M, Spina D (2014) Overview of RepLab 2014: author profiling and reputation dimensions for online reputation management. In: Proceedings of information access evaluation. Multilinguality, multimodality, and interaction—5th international conference of the CLEF initiative, CLEF 2014, Sheffield, UK, 15–18 September 2014
Amigó E, de Albornoz JC, Chugur I, Corujo A, Gonzalo J, Martín-Wanton T, Meij E, de Rijke M, Spina D (2013) Overview of RepLab 2013: evaluating online reputation monitoring systems. In: Proceedings of information access evaluation. 4th International conference of the CLEF initiative (CLEF 2013) multilinguality, multimodality, and visualization. Springer, Berlin, Heidelberg
Amigó E, Giner F, Gonzalo J, Verdejo F (2017a) An axiomatic account of similarity. In: Proceedings of the SIGIR’17 workshop on axiomatic thinking for information retrieval and related tasks (ATIR), SIGIR ’20. ACM, New York, NY, USA
Amigó E, Giner F, Gonzalo J, Verdejo F (2017b) A formal and empirical study of unsupervised signal combination for textual similarity tasks. Springer, Cham, pp 369–382
Amigó E, Giner F, Mizzaro S, Spina D (2018) A formal account on effectiveness evaluation and ranking fusion. In: Proceedings of the ACM SIGIR international conference on theory of information retrieval, ICTIR 2018, Tianjin, China, 14–17 September 2018
Arora S, Li Y, Liang Y, Ma T, Risteski A (2016) A latent variable model approach to PMI-based word embeddings. Trans Assoc Comput Linguist (TACL) 4:385–399
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Brigadir I, Greene D, Cunningham P (2014) Adaptive representations for tracking breaking news on Twitter, arXiv preprint arXiv:1403.2923
Bullinaria JA, Levy JP (2007) Extracting semantic representations from word co-occurrence statistics: a computational study. Behav Res Methods 39(3):510–526
Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1:163–190
Cross V (1994) Fuzzy information retrieval. J Intell Inf Syst 3(1):29–56
Dagan I, Pereira F, Lee L (1994) Similarity-based estimation of word cooccurrence probabilities. In: Proceedings of the 32nd annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp 272–278
De Luca A, Termini S (1972) A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory. Inf Control 20(4):301–312
Delgado M, Martín-Bautista M, Sánchez D, Vila M (2001) Aggregating opinions in an information retrieval problem. In: Proceedings of EUROFUSE workshop on preference modelling and applications, Granada, Spain, pp 169–173
Djuric N, Wu H, Radosavljevic V, Grbovic M, Bhamidipati N (2015) Hierarchical neural language models for joint representation of streaming documents and their content. In: Proceedings of the 24th international conference on world wide web, international world wide web conferences steering committee, pp 248–255
Greiff WR, Ponte JM (2000) The maximum entropy approach and probabilistic IR models. ACM Trans Inf Syst (TOIS) 18(3):246–287
Harter SP (1975) A probabilistic approach to automatic keyword indexing. Part II: an algorithm for probabilistic indexing. J Am Soc Inf Sci 26(4):280–289
Herrera F, Herrera-Viedma E, Martínez L (2002) An information retrieval system with unbalanced linguistic information based on the linguistic 2-tuple model. In: 8th International conference on information processing and management of uncertainty in knowledge-based systems (IPMU 2002). Wiley Online Library, Annecy, France, vol 52, pp 23–29
Jiao Y, Cornec M, Jakubowicz J (2015) An entropy-based term weighting scheme and its application in e-commerce search engines. In: International symposium on web algorithms
Kaufmann A (1975) Introduction to the theory of fuzzy subsets, vol 2. Academic Press, Cambridge
Ke W (2013) Information-theoretic term weighting schemes for document clustering. In: Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries, ACM, pp 143–152
Kohlas J (2017) Algebras of information. A new and extended axiomatic foundation, arXiv preprint arXiv:1701.02658
Kohlas J, Pouly M, Schneuwly C (2008) Information algebra. In: Wah B (ed) In formal theories of information. Wiley encyclopedia of computer science and engineering. Wiley, Berlin, pp 95–127
Kosko B (1990) Fuzziness vs. probability. Int J Gen Syst 17(2–3):211–240
Lazo AV, Rathie P (2006) On the entropy of continuous probability distributions (Corresp.). IEEE Trans Inf Theory 24(1):120–122
Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27. Curran Associates, Inc., pp 2177–2185
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, ICML ’98. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 296–304
Ma J, Gao W, Mitra P, Kwon S, Jansen BJ, Wong K-F, Cha M (2016) Detecting rumors from microblogs with recurrent neural networks. In: International joint conferences on artificial intelligence (IJCAI). Elsevier, pp 3818–3824
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. arXiv preprint arXiv:1310.4546
Nießen S, Och FJ, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In: Proceedings of the 2nd international conference on language resources and evaluation (LREC). European Languages Resources Association (ELRA)
Papineni K (2001) Why inverse document frequency? In: Proceedings of the second meeting of the North American chapter of the association for computational linguistics on language technologies, NAACL ’01. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 1–8
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1532–1543
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 1, IJCAI’95. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 448–453
RI (2018) Why RI? Business through data-driven reputation management. https://www.reputationinstitute.com/why-ri. Accessed 21 July 2018
Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503–520
Robertson SE, van Rijsbergen CJ, Porter MF (1981) Probabilistic models of indexing and searching. In: Proceedings of the 3rd annual ACM conference on research and development in information retrieval, SIGIR ’80. Butterworth & Co., Kent, UK, pp 35–56
Rudas IJ, Kaynak MO (1998) Entropy-based operations on fuzzy sets. IEEE Trans Fuzzy Syst 6(1):33–40
Shi Y, Wiggers P, Jonker CM (2012) Towards recurrent neural networks language models with linguistic and contextual features. In: 13th Annual conference of the international speech communication association, ISCA, pp 1664–1667
Shirakawa M, Hara T, Nishio S (2017) IDF for word n-grams. ACM Trans Inf Syst (TOIS) 36(1):5:1–5:38
Tillmann C, Vogel S, Ney H, Zubiaga A, Sawaf H (1997) Accelerated DP based search for statistical translation. In: Proceedings of European conference on speech communication and technology
Toral A, Pecina P, Wang L, van Genabith J (2015) Linguistically-augmented perplexity-based data selection for language models. Hybrid machine translation: integration of linguistics and statistics. Comput Speech Lang 32(1):11–26
Vakulenko S, Nixon L, Lupu M (2017) Character-based neural embeddings for tweet clustering. arXiv preprint arXiv:1703.05123
Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’06. ACM, New York, NY, USA, pp 424–433
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington
Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 233–242
Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23(2):421–427
Zhai C (2008) Statistical language models for information retrieval: a critical review. Found Trends Inf Retr 2(3):137–213
Zhao WX, Jiang J, Weng J, He J, Lim E-P, Yan H, Li X (2011) Comparing Twitter and traditional media using topic models. In: European conference on information retrieval. Springer, Heidelberg, pp 338–349
Acknowledgements
We thank the anonymous reviewers for their very useful comments, which have added value to the manuscript. The work was supported by the Ministerio de Economía y Competitividad, TIN Program (Vemodalen), under Grant Number: TIN2015-71785-R.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: formal proofs
Proposition 3.1
The proof is straightforward. According to the fuzzy set operators:
where
\(\square \)
Property 3.1
From \({\gamma }(d_1) \ge {\gamma }(d_2), \ \ \forall \gamma \in \Gamma \), it follows:
Then,
This implies that
And therefore, according to Definition 3.3:
\(\square \)
Property 3.2
Notice that if we add a feature, the new observation is more restrictive than the initial observation, and thus, the set of messages which verify the new observation is contained in the set of messages which verify the initial observation, \({\mathcal {O}}_{\Gamma \cup \{\gamma '\}}(d) \subseteq {\mathcal {O}}_{\Gamma }(d)\). Then,
And therefore, according to Definition 3.3:
\(\square \)
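The containment argument can also be checked numerically. Below is a minimal sketch with a hypothetical toy corpus and binary occurrence features, assuming (as in the proofs above) that \({\mathcal {O}}_{\Gamma }(d)\) collects the messages whose feature values are at least those of d: adding a feature can only shrink that set, so the OIQ cannot decrease.

```python
import math

# Hypothetical toy corpus of messages.
corpus = [
    "bank offers new loan",
    "rates drop at the bank",
    "weather is sunny today",
    "the bank loan application",
]

def occ(word):
    """Binary occurrence feature: 1 if `word` appears in the message, else 0."""
    return lambda msg: int(word in msg.split())

def oiq(msg, features):
    """-log of the proportion of messages whose feature values are >= those of `msg`."""
    matching = [d for d in corpus if all(g(d) >= g(msg) for g in features)]
    return -math.log(len(matching) / len(corpus))

d = corpus[0]
base = [occ("bank")]                      # Gamma
extended = base + [occ("loan")]           # Gamma union {gamma'}

assert oiq(d, extended) >= oiq(d, base)   # Property 3.2
print(oiq(d, base), oiq(d, extended))     # 0.288... <= 0.693...
```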
Property 3.3
By Proposition 3.1:
Given that
we finally get \( {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\cup {\mathcal {O}}_{\Gamma }(d_2)\big )\ge {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\big )\). The same result follows analogously for \(d_2\). \(\square \)
Property 3.4
By hypothesis,
which is equivalent to
\(\square \)
Property 3.5
Consider two features, \(\gamma _1, \gamma _2 \in \Gamma \). Given a message, \(d \in \mathcal {D}\), it produces an observation under \(\gamma _1\), \({\mathcal {O}}_{\gamma _1}(d)\), whose Observation Information Quantity is:
Given that g is a strict monotonic function,
\(\square \)
Property 3.6
Assume that we have a finite set of messages. The proof of this proposition is a direct consequence of how the features represent the messages: if we have an infinite set of features, then the features describe every message, and each message is unequivocally determined by the values of a set of features. \(\square \)
Property 3.7
Given a fixed message, \(d \in \mathcal {D}\), consider all the messages, \(d' \in \mathcal {D}\), which verify the inequalities:
These inequalities are equivalent to (by definition of \(\gamma ^{-1}\)):
Notice that \({\gamma }(d)\) and \({\gamma }(d')\) are non-negative numbers; therefore, these inequalities imply that: \({\gamma }(d) = {\gamma }(d')\). Then, the Observation Information Quantity is:
which is equivalent to:
\(\square \)
Proposition 5.1
Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), consider the set of features as \(\Gamma = \{occ_{w_1},\ldots ,occ_{w_n}\}\), and given a message from the collection, \(d \in \mathcal {D}\), we are interested in computing the described OIQ, \({\mathcal {I}}_{occ_{w_i}}\big (d\big )\).
Assuming information additivity and considering text words as basic linguistic units, we have
Notice that, if \(w_j \ne w_i\), then \({occ_{w_i}}(w_{j}) = 0\). Thus, \(P\big ({occ_{w_i}}(w') \ge 0 \big ) = 1\), since by definition \({occ_{w_i}}(d) \ge 0\), \(\forall d \in \mathcal {D}\). Therefore, in the last summation all the terms are null, except for \(w_{j} = w_{i}\). In this case, we have that \({occ_{w_i}}(w_i) = 1\), and given that by definition of the function \({occ_{w_i}}(.)\), its maximum value is 1, we can say that \({occ_{w_i}}(w') \ge 1\) is equivalent to \({occ_{w_i}}(w') = 1\). Therefore, the probability \(P\big ({occ_{w_i}}(w') = 1 \big )\) is exactly \(P(w' = w_i) = P(w_i)\). And, \({\mathcal {I}}_{occ_{w_i}}\big (d\big ) \propto - \log \big ( P(w_i) \big )\).
One of the assumptions is that every word is equiprobable, i.e. \(P(w_i) = k\), \(1 \le i \le n\), for an arbitrary k. In order to achieve the result, we can choose k in such a way that \(- \log (k) = 1\). And finally, the summation gives us the \(tf(w_i, d)\). \(\square \)
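As a minimal numeric sketch of this last step (the message, the word and the constant below are illustrative assumptions, not data from the paper): choosing k so that \(- \log (k) = 1\) makes the sum over the occurrences of \(w_i\) in d equal to \(tf(w_i, d)\).

```python
import math

# k is the common word probability of the equiprobability assumption,
# chosen so that -log(k) = 1.
k = math.exp(-1)
d = "loan loan rates drop".split()   # hypothetical message
w_i = "loan"                         # hypothetical word

# One -log(k) term per occurrence of w_i in d.
oiq = sum(-math.log(k) for token in d if token == w_i)
assert abs(oiq - d.count(w_i)) < 1e-12
print(oiq, d.count(w_i))             # 2.0 and 2
```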
Proposition 5.2
Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), considering the set of features as, \(\Gamma = \{occ_{w_1},\ldots ,occ_{w_n}\}\), and given a message from the collection, \(d \in \mathcal {D}\), we are interested in computing the described OIQ, \({\mathcal {I}}_{occ_{w_i}}\big (d\big )\).
Assuming information additivity and considering messages as basic linguistic units, we have
Notice that, if \(w_{j} \ne w_{i}\), then \({occ_{w_i}}(w_j) = 0\). Thus, \(P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') \ge 0 \big ) = 1\), since by definition \({occ_{w_i}}(d') \ge 0\), \(\forall d' \in \mathcal {D}\). Therefore, in the last summation all the terms are null, except for \(w_{j} = w_{i}\). In this case, we have as many terms as the number of times that the word \(w_i\) appears in the message d, i.e. \(tf(w_i, d)\). Moreover, we have that \({occ_{w_i}}(w_j) = 1\), and given that by definition of \({occ_{w_i}}(.)\), its maximum value is 1, we can say that \({occ_{w_i}}(d') \ge 1\) is equivalent to \({occ_{w_i}}(d') = 1\). Therefore, the expression \(-\log \Big (P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') = 1 \big ) \Big )\) is exactly \(-\log \Big ( P_{d' \in \mathcal {D}}(w_i \in d')\Big ) = idf(w_i)\). And thus, \({\mathcal {I}}_{occ_{w_i}}\big (d\big ) = tf(w_i, d) \cdot idf(w_i)\). \(\square \)
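As a concrete check of this reduction, the following minimal sketch estimates \({\mathcal {I}}_{occ_{w_i}}(d)\) on a hypothetical toy corpus (the corpus, message and word are illustrative assumptions) and verifies that it coincides with \(tf(w_i, d) \cdot idf(w_i)\).

```python
import math

# Hypothetical toy corpus; messages are the basic linguistic units.
corpus = [
    "bank offers new loan",
    "loan loan rates drop",
    "weather is sunny today",
    "the bank loan application",
]
N = len(corpus)

def idf(word):
    """idf(w) = -log P_{d' in D}(w in d'), estimated on the toy corpus."""
    df = sum(1 for d in corpus if word in d.split())
    return -math.log(df / N)

def oiq_occ(word, msg):
    """One -log P(occ_w(d') = 1) term per occurrence of `word` in `msg`."""
    return sum(idf(word) for _ in range(msg.split().count(word)))

d, w = corpus[1], "loan"
assert abs(oiq_occ(w, d) - d.split().count(w) * idf(w)) < 1e-12
print(oiq_occ(w, d))  # 2 * (-log(3/4)) = tf * idf
```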
Proposition 5.3
Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), and considering as features:
Assuming feature independence, we have
Let \(\mathcal {S}\) be the set of all possible word sequences that form a message. In the previous formula, we have that:
Notice that \({\gamma _{i,j}}(w_1, \ldots , w_m)\) is equal to zero for all the sequences of the form \((w_1, \ldots , w_m)\) except for the sequences which verify that \(w_j = w_i\). Since by definition, \({\gamma _{i,j}}(.) \ge 0\), in the summation all the terms are null, except for the sequences which verify \(w_j = w_i\). In these cases, we have that \({\gamma _{i,j}}(w_1, \ldots , w_m) = 1\), and given that by definition of \({\gamma _{i,j}}(.)\), its maximum value is 1, we can say that \({\gamma _{i,j}}(w_1', \ldots , w_k') \ge 1\) is equivalent to \({\gamma _{i,j}}(w_1', \ldots , w_k') = 1\). Therefore, we have the next equality on probabilities:
And finally, with trivial algebraic operations, we have:
\(\square \)
Proposition 5.4
Considering the definition of Lin’s distance and assuming information additivity,
Assuming feature independence, it is equivalent to:
\(\square \)
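For reference, Lin’s (1998) similarity theorem, on which Lin’s distance is based, measures similarity as the ratio between the information needed to state the commonality of two objects and the information needed to fully describe them. With the information quantity notation used in this paper, it reads as follows (the exact instantiation of the common and full descriptions used in the proposition follows the omitted equations above):
$$\begin{aligned} sim_{Lin}(d_1,d_2) = \frac{\log P\big (\hbox {common}(d_1,d_2)\big )}{\log P\big (\hbox {description}(d_1,d_2)\big )} = \frac{{\mathcal {I}}\big (\hbox {common}(d_1,d_2)\big )}{{\mathcal {I}}\big (\hbox {description}(d_1,d_2)\big )}. \end{aligned}$$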
Proposition 5.5
We will start from the \(IDF_{N\hbox {-}gram}\) term weighting of an n-gram, g, using the notation \(|\phi (g)|\) for the message frequency of g and \(|\mu (g)|\) for the number of messages containing at least one subsequence of the n-gram g.
Considering both sets of features, \(\Gamma \) and \(\Gamma '\), the OIQ of an n-gram in each set of features is computed by:
Replacing these expressions in the definition of the \(IDF_{N\hbox {-}gram}\):
\(\square \)
Proposition 5.6
We will start from the \(\mu _{d}(\gamma _i)\) term weighting of a message.
Considering two different scenarios relating to the relevance of messages, \(\mathcal {D}_1\) and \(\mathcal {D}_2\), the OIQ of a message is computed by:
Replacing these expressions, we get:
\(\square \)
About this article
Cite this article
Giner, F., Amigó, E. & Verdejo, F. Integrating learned and explicit document features for reputation monitoring in social media. Knowl Inf Syst 62, 951–985 (2020). https://doi.org/10.1007/s10115-019-01383-w