Abstract
Currently, monitoring reputation in social media is probably one of the most lucrative applications of information retrieval methods. However, this task poses new challenges due to the dynamic nature of the content and the need for early detection of topics that affect the reputation of companies. Addressing this problem with learning mechanisms based on training data sets is challenging, given that unseen features play a crucial role. However, learning processes are necessary to capture domain features and dependency phenomena. In this work, based on observational information theory, we define a document representation framework that enables the combination of explicit text features with supervised and unsupervised signals in a single representation model. Our theoretical analysis demonstrates that the observation information quantity (OIQ) generalizes the most popular representation methods, in addition to capturing quantitative values, which is required for integrating signals from learning processes. In other words, the OIQ allows us to give the same treatment to features that are currently managed separately. Empirically, our experiments on the reputation-monitoring scenario show that progressively adding features from supervised (in particular, Bayesian inference over annotated data) and unsupervised learning methods (in particular, proximity to clusters) improves similarity estimation performance. This result is verified under various similarity criteria (pointwise mutual information, Jaccard and Lin’s distances, and the information contrast model). According to our formal analysis, the OIQ is the first representation model that captures the informativeness (specificity) of quantitative features in the document representation.
Notes
For instance, considering the occurrence of just a few words as text features is enough to obtain an empty result in a standard web search engine.
Note that considering words as information pieces is not equivalent to considering words as features, because of the effect of repeated words:
$$\begin{aligned} {\mathcal {I}}\big (\{w_1,w_2,w_2\}\big ) = -\log \big (P_{d\in \mathcal {D}}(tf(d,w_1)\ge 1)\cdot P_{d\in \mathcal {D}}(tf(d,w_2)\ge 2)\big ) \ne -\log \big (P(w_1)P(w_2)P(w_2)\big ). \end{aligned}$$

Pairs with equal similarity in the measure count as a half in the probability estimation.
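To make the inequality in the note above concrete, the following is a minimal sketch on a hypothetical toy corpus (the corpus and the words are illustrative assumptions, not data from the paper): it estimates the OIQ of the observation \(\{w_1,w_2,w_2\}\) from message-level probabilities and contrasts it with the naive product of per-occurrence word probabilities.

```python
import math
from collections import Counter

# Hypothetical toy corpus of messages.
corpus = [
    "bank offers new loan",
    "loan loan rates drop at the bank",
    "weather is sunny today",
    "the bank loan loan application",
]
docs = [Counter(doc.split()) for doc in corpus]
N = len(docs)

def p_tf_at_least(word, k):
    """P_{d in D}(tf(d, word) >= k), estimated by relative frequency."""
    return sum(1 for d in docs if d[word] >= k) / N

def p_word(word):
    """Per-occurrence word probability P(word) over the whole corpus."""
    total = sum(sum(d.values()) for d in docs)
    return sum(d[word] for d in docs) / total

w1, w2 = "bank", "loan"

# OIQ of the observation {w1, w2, w2}: w1 at least once, w2 at least twice.
oiq = -math.log(p_tf_at_least(w1, 1) * p_tf_at_least(w2, 2))

# Naive estimate: one probability factor per word occurrence.
naive = -math.log(p_word(w1) * p_word(w2) * p_word(w2))

print(f"OIQ = {oiq:.3f} nats, naive product = {naive:.3f} nats")
```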
References
Adhikari A, Singh S, Mondal D, Dutta B, Dutta A (2016) A novel information theoretic framework for finding semantic similarity in WordNet. arXiv preprint arXiv:1607.05422
Amigó E, Carrillo-de Albornoz J, Chugur I, Corujo A, Gonzalo J, Meij E, de Rijke M, Spina D (2014) Overview of RepLab 2014: author profiling and reputation dimensions for online reputation management. In: Proceedings of information access evaluation. Multilinguality, multimodality, and interaction—5th international conference of the CLEF initiative, CLEF 2014, Sheffield, UK, 15–18 September 2014
Amigó E, de Albornoz JC, Chugur I, Corujo A, Gonzalo J, Martín-Wanton T, Meij E, de Rijke M, Spina D (2013) Overview of RepLab 2013: evaluating online reputation monitoring systems. In: Proceedings of information access evaluation. 4th International conference of the CLEF initiative (CLEF 2013) multilinguality, multimodality, and visualization. Springer, Berlin, Heidelberg
Amigó E, Giner F, Gonzalo J, Verdejo F (2017a) An axiomatic account of similarity. In: Proceedings of the SIGIR’17 workshop on axiomatic thinking for information retrieval and related tasks (ATIR), SIGIR ’20. ACM, New York, NY, USA
Amigó E, Giner F, Gonzalo J, Verdejo F (2017b) A formal and empirical study of unsupervised signal combination for textual similarity tasks. Springer, Cham, pp 369–382
Amigó E, Giner F, Mizzaro S, Spina D (2018) A formal account on effectiveness evaluation and ranking fusion. In: Proceedings of the ACM SIGIR international conference on theory of information retrieval, ICTIR 2018, Tianjin, China, 14–17 September 2018
Arora S, Li Y, Liang Y, Ma T, Risteski A (2016) A latent variable model approach to PMI-based word embeddings. Trans Assoc Comput Linguist (TACL) 4:385–399
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Brigadir I, Greene D, Cunningham P (2014) Adaptive representations for tracking breaking news on Twitter, arXiv preprint arXiv:1403.2923
Bullinaria JA, Levy JP (2007) Extracting semantic representations from word co-occurrence statistics: a computational study. Behav Res Methods 39(3):510–526
Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1:163–190
Cross V (1994) Fuzzy information retrieval. J Intell Inf Syst 3(1):29–56
Dagan I, Pereira F, Lee L (1994) Similarity-based estimation of word cooccurrence probabilities. In: Proceedings of the 32nd annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pp 272–278
De Luca A, Termini S (1972) A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory. Inf Control 20(4):301–312
Delgado M, Martín-Bautista M, Sánchez D, Vila M (2001) Aggregating opinions in an information retrieval problem. In: Proceedings of EUROFUSE workshop on preference modelling and applications, Granada, Spain, pp 169–173
Djuric N, Wu H, Radosavljevic V, Grbovic M, Bhamidipati N (2015) Hierarchical neural language models for joint representation of streaming documents and their content. In: Proceedings of the 24th international conference on world wide web, international world wide web conferences steering committee, pp 248–255
Greiff WR, Ponte JM (2000) The maximum entropy approach and probabilistic IR models. ACM Trans Inf Syst (TOIS) 18(3):246–287
Harter SP (1975) A probabilistic approach to automatic keyword indexing. Part II: an algorithm for probabilistic indexing. J Am Soc Inf Sci 26(4):280–289
Herrera F, Herrera-Viedma E, Martínez L (2002) An information retrieval system with unbalanced linguistic information based on the linguistic 2-tuple model. In: 8th International conference on information processing and management of uncertainty in knowledge-based systems (IPMU 2002). Wiley Online Library, Annecy, France, vol 52, pp 23–29
Jiao Y, Cornec M, Jakubowicz J (2015) An entropy-based term weighting scheme and its application in e-commerce search engines. In: International symposium on web algorithms
Kaufmann A (1975) Introduction to the theory of fuzzy subsets, vol 2. Academic Press, Cambridge
Ke W (2013) Information-theoretic term weighting schemes for document clustering. In: Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries, ACM, pp 143–152
Kohlas J (2017) Algebras of information. A new and extended axiomatic foundation, arXiv preprint arXiv:1701.02658
Kohlas J, Pouly M, Schneuwly C (2008) Information algebra. In: Wah B (ed) In formal theories of information. Wiley encyclopedia of computer science and engineering. Wiley, Berlin, pp 95–127
Kosko B (1990) Fuzziness vs. probability. Int J Gen Syst 17(2–3):211–240
Lazo AV, Rathie P (2006) On the entropy of continuous probability distributions (Corresp.). IEEE Trans Inf Theory 24(1):120–122
Levy O, Goldberg Y (2014) Neural word embedding as implicit matrix factorization. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems 27. Curran Associates, Inc., pp 2177–2185
Lin D (1998) An information-theoretic definition of similarity. In: Proceedings of the 15th international conference on machine learning, ICML ’98. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 296–304
Ma J, Gao W, Mitra P, Kwon S, Jansen BJ, Wong K-F, Cha M (2016) Detecting rumors from microblogs with recurrent neural networks. In: International joint conferences on artificial intelligence (IJCAI). Elsevier, pp 3818–3824
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. arXiv preprint arXiv:1310.4546
Nießen S, Och FJ, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In: Proceedings of the 2nd international conference on language resources and evaluation (LREC). European Languages Resources Association (ELRA)
Papineni K (2001) Why inverse document frequency? In: Proceedings of the second meeting of the North American chapter of the association for computational linguistics on language technologies, NAACL ’01. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 1–8
Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1532–1543
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. In: Proceedings of the 14th international joint conference on artificial intelligence, vol 1, IJCAI’95. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 448–453
RI (2018) Why RI? Business through data-driven reputation management. https://www.reputationinstitute.com/why-ri. Accessed 21 July 2018
Robertson S (2004) Understanding inverse document frequency: on theoretical arguments for IDF. J Doc 60(5):503–520
Robertson SE, van Rijsbergen CJ, Porter MF (1981) Probabilistic models of indexing and searching. In: Proceedings of the 3rd annual ACM conference on research and development in information retrieval, SIGIR ’80. Butterworth & Co., Kent, UK, pp 35–56
Rudas IJ, Kaynak MO (1998) Entropy-based operations on fuzzy sets. IEEE Trans Fuzzy Syst 6(1):33–40
Shi Y, Wiggers P, Jonker CM (2012) Towards recurrent neural networks language models with linguistic and contextual features. In: 13th Annual conference of the international speech communication association, ISCA, pp 1664–1667
Shirakawa M, Hara T, Nishio S (2017) IDF for word n-grams. ACM Trans Inf Syst (TOIS) 36(1):5:1–5:38
Tillmann C, Vogel S, Ney H, Zubiaga A, Sawaf H (1997) Accelerated DP based search for statistical translation. In: Proceedings of European conference on speech communication and technology
Toral A, Pecina P, Wang L, van Genabith J (2015) Linguistically-augmented perplexity-based data selection for language models. Hybrid machine translation: integration of linguistics and statistics. Comput Speech Lang 32(1):11–26
Vakulenko S, Nixon L, Lupu M (2017) Character-based neural embeddings for tweet clustering. arXiv preprint arXiv:1703.05123
Wang X, McCallum A (2006) Topics over time: a non-Markov continuous-time model of topical trends. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’06. ACM, New York, NY, USA, pp 424–433
Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, Burlington
Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, pp 233–242
Zadeh LA (1968) Probability measures of fuzzy events. J Math Anal Appl 23(2):421–427
Zhai C (2008) Statistical language models for information retrieval: a critical review. Found Trends Inf Retr 2(3):137–213
Zhao WX, Jiang J, Weng J, He J, Lim E-P, Yan H, Li X (2011) Comparing Twitter and traditional media using topic models. In: European conference on information retrieval. Springer, Heidelberg, pp 338–349
Acknowledgements
We thank the anonymous reviewers for their very useful comments, which have added value to the manuscript. The work was supported by the Ministerio de Economía y Competitividad, TIN Program (Vemodalen), under Grant Number: TIN2015-71785-R.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: formal proofs
Proposition 3.1
The proof is straightforward. According to the fuzzy set operators:
where
\(\square \)
Property 3.1
From \({\gamma }(d_1) \ge {\gamma }(d_2), \ \ \forall \gamma \in \Gamma \), it follows:
Then,
This implies that
And therefore, according to Definition 3.3:
\(\square \)
Property 3.2
Notice that if we add a feature, the new observation is more restrictive than the initial observation, and thus, the set of messages which verify the new observation is contained in the set of messages which verify the initial observation, \({\mathcal {O}}_{\Gamma \cup \{\gamma '\}}(d) \subseteq {\mathcal {O}}_{\Gamma }(d)\). Then,
And therefore, according to Definition 3.3:
\(\square \)
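The containment argument can also be checked numerically. Below is a minimal sketch with a hypothetical toy corpus and binary occurrence features, assuming (as in the proofs above) that \({\mathcal {O}}_{\Gamma }(d)\) collects the messages whose feature values are at least those of d: adding a feature can only shrink that set, so the OIQ cannot decrease.

```python
import math

# Hypothetical toy corpus of messages.
corpus = [
    "bank offers new loan",
    "rates drop at the bank",
    "weather is sunny today",
    "the bank loan application",
]

def occ(word):
    """Binary occurrence feature: 1 if `word` appears in the message, else 0."""
    return lambda msg: int(word in msg.split())

def oiq(msg, features):
    """-log of the proportion of messages whose feature values are >= those of `msg`."""
    matching = [d for d in corpus if all(g(d) >= g(msg) for g in features)]
    return -math.log(len(matching) / len(corpus))

d = corpus[0]
base = [occ("bank")]                      # Gamma
extended = base + [occ("loan")]           # Gamma union {gamma'}

assert oiq(d, extended) >= oiq(d, base)   # Property 3.2
print(oiq(d, base), oiq(d, extended))     # 0.288... <= 0.693...
```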
Property 3.3
By Proposition 3.1:
Given that
we finally get \( {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\cup {\mathcal {O}}_{\Gamma }(d_2)\big )\ge {\mathcal {I}}\big ({\mathcal {O}}_{\Gamma }(d_1)\big )\). The same result follows analogously for \(d_2\). \(\square \)
Property 3.4
By hypothesis,
which is equivalent to
\(\square \)
Property 3.5
Consider two features, \(\gamma _1, \gamma _2 \in \Gamma \). Given a message, \(d \in \mathcal {D}\), it produces an observation under \(\gamma _1\), \({\mathcal {O}}_{\gamma _1}(d)\), whose Observation Information Quantity is:
Given that g is a strict monotonic function,
\(\square \)
Property 3.6
Assume that we have a finite set of messages. The proof of this proposition is a direct consequence of how the features represent the messages: if we have an infinite set of features, then the features describe every message, and each message is unequivocally determined by the values of a set of features. \(\square \)
Property 3.7
Given a fixed message, \(d \in \mathcal {D}\), consider all the messages, \(d' \in \mathcal {D}\), which verify the inequalities:
These inequalities are equivalent to (by definition of \(\gamma ^{-1}\)):
Notice that \({\gamma }(d)\) and \({\gamma }(d')\) are non-negative numbers; therefore, these inequalities imply that: \({\gamma }(d) = {\gamma }(d')\). Then, the Observation Information Quantity is:
which is equivalent to:
\(\square \)
Proposition 5.1
Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), consider the set of features as \(\Gamma = \{occ_{w_1},\ldots ,occ_{w_n}\}\), and given a message from the collection, \(d \in \mathcal {D}\), we are interested in computing the described OIQ, \({\mathcal {I}}_{occ_{w_i}}\big (d\big )\).
Assuming information additivity and considering text words as basic linguistic units, we have
Notice that, if \(w_j \ne w_i\), then \({occ_{w_i}}(w_{j}) = 0\). Thus, \(P\big ({occ_{w_i}}(w') \ge 0 \big ) = 1\), since by definition \({occ_{w_i}}(d) \ge 0\), \(\forall d \in \mathcal {D}\). Therefore, in the last summation all the terms are null, except for \(w_{j} = w_{i}\). In this case, we have that \({occ_{w_i}}(w_i) = 1\), and given that by definition of the function \({occ_{w_i}}(.)\), its maximum value is 1, we can say that \({occ_{w_i}}(w') \ge 1\) is equivalent to \({occ_{w_i}}(w') = 1\). Therefore, the probability \(P\big ({occ_{w_i}}(w') = 1 \big )\) is exactly \(P(w' = w_i) = P(w_i)\). And, \({\mathcal {I}}_{occ_{w_i}}\big (d\big ) \propto - \log \big ( P(w_i) \big )\).
One of the assumptions is that every word is equiprobable, i.e. \(P(w_i) = k\), \(1 \le i \le n\), for an arbitrary k. In order to achieve the result, we can choose k in such a way that \(- \log (k) = 1\). And finally, the summation gives us the \(tf(w_i, d)\). \(\square \)
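As a minimal numeric sketch of this last step (the message, the word and the constant below are illustrative assumptions, not data from the paper): choosing k so that \(- \log (k) = 1\) makes the sum over the occurrences of \(w_i\) in d equal to \(tf(w_i, d)\).

```python
import math

# k is the common word probability of the equiprobability assumption,
# chosen so that -log(k) = 1.
k = math.exp(-1)
d = "loan loan rates drop".split()   # hypothetical message
w_i = "loan"                         # hypothetical word

# One -log(k) term per occurrence of w_i in d.
oiq = sum(-math.log(k) for token in d if token == w_i)
assert abs(oiq - d.count(w_i)) < 1e-12
print(oiq, d.count(w_i))             # 2.0 and 2
```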
Proposition 5.2
Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), considering the set of features as, \(\Gamma = \{occ_{w_1},\ldots ,occ_{w_n}\}\), and given a message from the collection, \(d \in \mathcal {D}\), we are interested in computing the described OIQ, \({\mathcal {I}}_{occ_{w_i}}\big (d\big )\).
Assuming information additivity and considering messages as basic linguistic units, we have
Notice that, if \(w_{j} \ne w_{i}\), then \({occ_{w_i}}(w_j) = 0\). Thus, \(P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') \ge 0 \big ) = 1\), since by definition \({occ_{w_i}}(d') \ge 0\), \(\forall d' \in \mathcal {D}\). Therefore, in the last summation all the terms are null, except for \(w_{j} = w_{i}\). In this case, we have as many terms as the number of times that the word \(w_i\) appears in the message d, i.e. \(tf(w_i, d)\). Moreover, we have that \({occ_{w_i}}(w_j) = 1\), and given that by definition of \({occ_{w_i}}(.)\), its maximum value is 1, we can say that \({occ_{w_i}}(d') \ge 1\) is equivalent to \({occ_{w_i}}(d') = 1\). Therefore, the expression \(-\log \Big (P_{d' \in \mathcal {D}}\big ({occ_{w_i}}(d') = 1 \big ) \Big )\) is exactly \(-\log \Big ( P_{d' \in \mathcal {D}}(w_i \in d')\Big ) = idf(w_i)\). And thus, \({\mathcal {I}}_{occ_{w_i}}\big (d\big ) = tf(w_i, d) \cdot idf(w_i)\). \(\square \)
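As a concrete check of this reduction, the following minimal sketch estimates \({\mathcal {I}}_{occ_{w_i}}(d)\) on a hypothetical toy corpus (the corpus, message and word are illustrative assumptions) and verifies that it coincides with \(tf(w_i, d) \cdot idf(w_i)\).

```python
import math

# Hypothetical toy corpus; messages are the basic linguistic units.
corpus = [
    "bank offers new loan",
    "loan loan rates drop",
    "weather is sunny today",
    "the bank loan application",
]
N = len(corpus)

def idf(word):
    """idf(w) = -log P_{d' in D}(w in d'), estimated on the toy corpus."""
    df = sum(1 for d in corpus if word in d.split())
    return -math.log(df / N)

def oiq_occ(word, msg):
    """One -log P(occ_w(d') = 1) term per occurrence of `word` in `msg`."""
    return sum(idf(word) for _ in range(msg.split().count(word)))

d, w = corpus[1], "loan"
assert abs(oiq_occ(w, d) - d.split().count(w) * idf(w)) < 1e-12
print(oiq_occ(w, d))  # 2 * (-log(3/4)) = tf * idf
```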
Proposition 5.3
Given the vocabulary, \(\mathcal {V} = \{w_1, \ldots , w_n\}\), and considering as features:
Assuming feature independence, we have
Let \(\mathcal {S}\) be the set of all possible word sequences that form a message. In the previous formula, we have that:
Notice that \({\gamma _{i,j}}(w_1, \ldots , w_m)\) is equal to zero for all the sequences of the form \((w_1, \ldots , w_m)\) except for the sequences which verify that \(w_j = w_i\). Since by definition, \({\gamma _{i,j}}(.) \ge 0\), in the summation all the terms are null, except for the sequences which verify \(w_j = w_i\). In these cases, we have that \({\gamma _{i,j}}(w_1, \ldots , w_m) = 1\), and given that by definition of \({\gamma _{i,j}}(.)\), its maximum value is 1, we can say that \({\gamma _{i,j}}(w_1', \ldots , w_k') \ge 1\) is equivalent to \({\gamma _{i,j}}(w_1', \ldots , w_k') = 1\). Therefore, we have the next equality on probabilities:
And finally, with trivial algebraic operations, we have:
\(\square \)
Proposition 5.4
Considering the definition of Lin’s distance and assuming information additivity,
Assuming feature independence, it is equivalent to:
\(\square \)
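For reference, Lin’s (1998) similarity theorem, on which Lin’s distance is based, measures similarity as the ratio between the information needed to state the commonality of two objects and the information needed to fully describe them. With the information quantity notation used in this paper, it reads as follows (the exact instantiation of the common and full descriptions used in the proposition follows the omitted equations above):
$$\begin{aligned} sim_{Lin}(d_1,d_2) = \frac{\log P\big (\hbox {common}(d_1,d_2)\big )}{\log P\big (\hbox {description}(d_1,d_2)\big )} = \frac{{\mathcal {I}}\big (\hbox {common}(d_1,d_2)\big )}{{\mathcal {I}}\big (\hbox {description}(d_1,d_2)\big )}. \end{aligned}$$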
Proposition 5.5
We will start from the \(IDF_{N\hbox {-}gram}\) term weighting of an n-gram, g, using the notation \(|\phi (g)|\) for the message frequency of g and \(|\mu (g)|\) for the number of messages containing at least one subsequence of the n-gram g.
Considering both sets of features, \(\Gamma \) and \(\Gamma '\), the OIQ of an n-gram in each set of features is computed by:
Replacing these expressions in the definition of the \(IDF_{N\hbox {-}gram}\):
\(\square \)
Proposition 5.6
We will start from the \(\mu _{d}(\gamma _i)\) term weighting of a message.
Considering two different scenarios relating to the relevance of messages, \(\mathcal {D}_1\) and \(\mathcal {D}_2\), the OIQ of a message is computed by:
Replacing these expressions, we get:
\(\square \)
About this article
Cite this article
Giner, F., Amigó, E. & Verdejo, F. Integrating learned and explicit document features for reputation monitoring in social media. Knowl Inf Syst 62, 951–985 (2020). https://doi.org/10.1007/s10115-019-01383-w