Abstract
This study aims to visualize financial documents in such a way that even nonexperts can understand the sentiments contained therein. To achieve this, we propose a novel text visualization method using an interpretable neural network (NN) architecture, called a gradient interpretable NN (GINN). A GINN can visualize a market sentiment score from an entire financial document and the sentiment gradient scores in both word and concept units. Moreover, the GINN can visualize important concepts given in various sentence contexts. Such visualization helps nonexperts easily understand financial documents. We theoretically analyze the validity of the GINN and experimentally demonstrate the validity of its text visualization using real financial texts.
References
Ravi, K., Ravi, V.: A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl. Based Syst. 89(C), 14–46 (2015)
Hechtlinger, Y.: Interpretation of prediction models using the input gradient. In: NIPS 2016 Workshop on Interpretable Machine Learning in Complex Systems (2016)
Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE 10(7), 1–46 (2015)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS 2013, pp. 3111–3119 (2013)
Hornik, K., Feinerer, I., Kober, M., Buchta, C.: Spherical k-means clustering. J. Stat. Softw. 50(10), 1–22 (2012)
Yuan, Y., He, L., Peng, L., Huang, Z.: A new study based on word2vec and cluster for document categorization. J. Comput. Inf. Syst. 10(21), 9301–9308 (2014)
Zhao, P., Zhang, T.: Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv:1405.3080 (2014)
Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: ICLR (2015)
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. JMLR 15(1), 1929–1958 (2014)
Kudo, T., Yamamoto, K., Matsumoto, Y.: Applying conditional random fields to Japanese morphological analysis. In: EMNLP 2004 (2004)
Fang, A., Macdonald, C., Ounis, I., Habel, P.: Using word embedding to evaluate the coherence of topics from twitter data. In: SIGIR 2016 (2016)
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: LREC 2010 Workshop (2010)
Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. In: ICML 2017 (2017)
Xu, Q., Zhao, Q., Pei, W., Yang, L., He, Z.: Design interpretable neural network trees through self-organized learning of features. In: IJCNN 2004 (2004)
Zhang, Q., Wu, Y.N., Zhu, S.: Interpretable convolutional neural networks. In: CVPR 2018 (2018)
Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. In: NIPS 2014, pp. 2204–2212 (2014)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: ICML 2015, pp. 77–81 (2015)
Dong, Y., Su, H., Zhu, J., Zhang, B.: Improving interpretability of deep neural networks with semantic information. In: CVPR 2017 (2017)
Patrik, E.K., Liu, Y.: A survey on interactivity in topic models. IJACSA 7(4), 456–461 (2016)
Lund, J., Cook, C., Seppi, K., Boyd-Graber, J.: Tandem anchoring: a multiword anchor approach for interactive topic modeling. In: ACL 2017, pp. 896–905 (2017)
Hu, L., Jian, S., Cao, L., Chen, Q.: Interpretable recommendation via attraction modeling: learning multilevel attractiveness over multimodal movie contents. In: IJCAI 2018 (2018)
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: NAACL 2016 (2016)
Rahman, M.K.M., Chow, T.W.S.: Content-based hierarchical document organization using multi-layer hybrid network and tree-structured features. Expert Syst. Appl. 37(4), 2874–2881 (2010)
Zhao, H., Du, L., Buntine, W., Zhou, M.: Inter and intra topic structure learning with word embeddings. In: ICML 2018 (2018)
Hasan, M., Rundensteiner, E., Agu, E.: Automatic emotion detection in text streams by analyzing Twitter data. Int. J. Data Sci. Anal. (2018). https://doi.org/10.1007/s41060-018-0096-z
Barranco, R.C., Boedihardjo, A.P., Hossain, M.S.: Analyzing evolving stories in news articles. Int. J. Data Sci. Anal. (2017). https://doi.org/10.1007/s41060-017-0091-9
Ito, T., Sakaji, H., Tsubouchi, K., Izumi, K., Yamashita, T.: Text-visualizing neural network model: understanding online financial textual data. In: PAKDD 2018 (2018)
Acknowledgements
This work was supported in part by JSPS KAKENHI Grant No. JP17J04768.
This paper is an extended version of the PAKDD 2018 long presentation paper "Text-Visualizing Neural Network Model: Understanding Online Financial Textual Data" [27].
A Appendix
A.1 Theoretical analysis of the II algorithm
This section theoretically explains the validity of the II algorithm. Let \(\varOmega _{dw}^{(k)}\) be a set of words in the polarity dictionary included in the kth cluster. Then, Propositions 1–3 are established.
Proposition 1
If Update is used for the parameter updates, then
Proposition 1 indicates that if Cond 1: the values of \({t^+}\) and \({t^-}\) are sufficiently large and Cond 2: for every word \(w_{k,i^{+}} \in \varOmega _{dw}^{(k)} \cap \varOmega _{pw}^{(k)}\) and \(w_{k,i^{-}} \in \varOmega _{dw}^{(k)} \cap \varOmega _{nw}^{(k)}\), the initial values of \(w^{(2)}_{k,i^{+}}\) and \(w^{(2)}_{k,i^{-}}\) given by Init are positive and sufficiently large, and negative and sufficiently small, respectively, are met for every k, then the II algorithm is expected to award each positive word \(\in \varOmega _{pw}^{(k)}\) (negative word \(\in \varOmega _{nw}^{(k)}\)) a positive (negative) sentiment score. Let \({\varvec{H}^{d}}^{(j, t)}\) be \(\varvec{H}^{(j, t)} - {\varvec{H}^{*}}^{(j, t)}\). Then, the following propositions, which are important for explaining the market mood predictability of the GINN, are established.
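Informally, the sign pattern that Proposition 1 asserts can be written out as follows (our paraphrase of the statement above, reading \(w^{(2)}_{k,i}\) as the sentiment score of word \(w_{k,i}\)):

\[ w^{(2)}_{k,i^{+}} > 0 \quad \left( w_{k,i^{+}} \in \varOmega _{pw}^{(k)} \right), \qquad w^{(2)}_{k,i^{-}} < 0 \quad \left( w_{k,i^{-}} \in \varOmega _{nw}^{(k)} \right) \]

for sufficiently large \(t^{+}\) and \(t^{-}\).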
Proposition 2
If the initial values of \(|\varvec{W}^{(3)}|\) and \(|\varvec{W}^{(4)}|\) are sufficiently small (Cond 3), and for every \(j \in \varOmega ^{(t)}_m\), the values of \(\varvec{z}^{(2)}_{j}\) are \( \left\{ \begin{array}{ll} \mathrm{positive} & (j \in D^{(p)}) \\ \mathrm{negative} & (j \in D^{(n)}) \end{array} \right. \), then the first and second row vector values of \(\partial \varvec{H}^{(j, t)}\) are positive and negative, respectively, and
Proposition 3
If, for every k, Conds. 1–3 are established, and the values \(|\varOmega _{pw}^{(k, t^+)}|\), \(|\varOmega _{nw}^{(k, t^-)}|\), and \(|\varOmega _m|\) are sufficiently large, then \(\lim _{t \rightarrow \infty } \frac{\sum _{j \in \varOmega ^{(t)}_m} \Vert {\varvec{H}^{d}}^{(j, t)} \Vert _{1} }{\sum _{j \in \varOmega ^{(t)}_m} \Vert \varvec{H}^{(j, t)}\Vert _{1}} = 0\).
Propositions 2 and 3 indicate that we can obtain a locally optimal solution using the II algorithm in an ideal case because the influence of Update disappears over time. From these propositions, we can also see that Init maintains model predictability because Init helps satisfy Cond 2.
Proposition 1 explains the interpretability of the GINN, and Propositions 2 and 3 confirm the predictability of the GINN in an ideal case.
A.1.1 Proof of Proposition 1
Proof
Here, for every \(k (\le K)\), if \(j \in D^{(p)}\), then \({\varDelta }^{(2)*}_{k, j} \le 0\), and if \(j \in D^{(n)}\), then \({\varDelta }^{(2)*}_{k, j} \ge 0\). Thus,
Therefore, Proposition 1 can be established. \(\square \)
A.1.2 Proof of Proposition 2
Proof
Let us denote \(\varvec{Z}^{(2)} := [\varvec{v}^{(CS)}_{m(1)}, \ldots , \varvec{v}^{(CS)}_{m(N)}] (\in {\mathbb {R}}^{K \times N})\), \(\varvec{U}^{(2)} := \tanh ^{-1}(\varvec{Z}^{(2)})\), and \(\varvec{U}^{(3)} := \varvec{W}^{(3)}\varvec{Z}^{(2)}\), and let \(\varvec{u}^{(l)}_j\) be the jth column of \(\varvec{U}^{(l)}\) (\(l = 2, 3\)), and \(\varvec{z}^{(l)}_j\) and \(z^{(l)}_{i,j}\) be the jth column and the (i, j) component of \(\varvec{Z}^{(l)}\) (\(l = 2\)), respectively. We approximate \(\partial {\varvec{H}^{(j, t)}}\) as follows:
First, we confirm that if, for every \(j \in \varOmega ^{(t)}_m\), the values of \(\varvec{z}^{(2)}_{j}\) are \( \left\{ \begin{array}{ll} \mathrm{positive} & (j \in D^{(p)}) \\ \mathrm{negative} & (j \in D^{(n)}) \end{array} \right. \), then the following three lemmas are established.
Lemma 1
The first and second row vector values of \({\varDelta ^{(4)}_j}{\varvec{z}^{(2)}_j}^\mathrm{T}\) are positive and negative, respectively.
Lemma 2
The first and second rows of
are positive and negative, respectively.
Lemma 3
The first and second rows of
are positive and negative, respectively.
A.1.3 Proof of Lemma 1
From the condition,
Moreover, Eq. (4) is established.
Thus, from Eqs. (2) and (4), Lemma 1 is established.
A.1.4 Proof of Lemma 2
Here, \( \partial \varvec{w}^{(4)}_1 = - \partial \varvec{w}^{(4)}_2 \) because \(\partial \varvec{W}^{(4)} = {\varDelta }^{(4)}{\varvec{Z}^{(3)}}^\mathrm{T}\) and \({\varDelta }^{(4)}_{1, j} = - {\varDelta }^{(4)}_{2, j}\) for every j. Considering that \(\varvec{W}^{(4)}\) is the sum of the values of \(\partial \varvec{W}^{(4)}\) in the previous updates, if the initial value of \(|\varvec{W}^{(4)}|\) is sufficiently small, then we can approximate it as
Let us denote \(A^{l}\) as
We define \({v}^{{\varDelta }^{(4)}{\varvec{Z}^{(3)}}}_{1, i}\) as the ith component of \(\varvec{w}^{(4)}_1\) and \(F_{i,j}\) as the (i, i) component of \(\mathrm{diag} \left( f_3'(\varvec{u}^{(3)}_j)\right) \mathrm{diag} \left( f_3'(\varvec{u}^{(3)}_i)\right) \). Then, \( A^{l} = \left( \begin{array}{ll} \sum _{i = 1}^{K2} F_{i,j} |{{v}^{{\varDelta }^{(4)}{\varvec{Z}^{(3)}}}_{1, i}}|^2 & - \sum _{i = 1}^{K2} F_{i,j} |{{v}^{{\varDelta }^{(4)}{\varvec{Z}^{(3)}}}_{1, i}}|^2 \\ - \sum _{i = 1}^{K2} F_{i,j} |{{v}^{{\varDelta }^{(4)}{\varvec{Z}^{(3)}}}_{1, i}}|^2 & \sum _{i = 1}^{K2} F_{i,j} |{{v}^{{\varDelta }^{(4)}{\varvec{Z}^{(3)}}}_{1, i}}|^2 \end{array}\right) . \)
Thus, from Lemma 1, if the initial value of \(|\varvec{W}^{(4)}|\) is sufficiently small, then Lemma 2 is established.
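As a sanity check on Lemma 2, the 2 × 2 sign structure of \(A^{l}\) derived above can be verified numerically. The following sketch is ours, not part of the original proof; the toy sizes and random draws are assumptions, so it is a minimal illustration rather than a definitive implementation:

import numpy as np

rng = np.random.default_rng(0)
K2, K = 8, 5                                  # toy hidden width and number of concepts
F = rng.uniform(0.1, 1.0, size=K2)            # positive factors F_{i,j} from the proof
v = rng.normal(size=K2)                       # components v_{1,i} of w^(4)_1
s = np.sum(F * v ** 2)                        # the common (positive) entry of A^l
A_l = np.array([[s, -s], [-s, s]])            # the 2 x 2 structure derived above
delta = rng.uniform(0.1, 1.0)                 # Delta^(4)_{1,j} > 0 for j in D^(p)
Delta4_j = np.array([[delta], [-delta]])      # Delta^(4)_{2,j} = -Delta^(4)_{1,j}
z2_j = rng.uniform(0.1, 1.0, size=(1, K))     # z^(2)_j positive for j in D^(p)
G = A_l @ Delta4_j @ z2_j                     # the A^l Delta^(4)_j z^(2)_j^T term
assert (G[0] > 0).all() and (G[1] < 0).all()  # first row positive, second negative

An analogous check applies for \(j \in D^{(n)}\).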
A.1.5 Proof of Lemma 3
Let us define the matrix \(\varvec{M}^{i}\) as \( \varvec{M}^{i} := \mathrm{diag} \left( \frac{f_3(u_i)}{u_i}\right) \) and \(\varvec{A}^{r}\) as \( \varvec{A}^{r} := {\varvec{W}^{(3)}}^\mathrm{T} \varvec{M}^{i} \mathrm{diag} \left( f_3'(\varvec{u}^{(3)}_j)\right) {\varvec{W}^{(3)}}. \) Here, \( \partial \varvec{W}^{(3)} = \frac{1}{N} \sum _{i} \mathrm{diag}\left( f_3'(\varvec{u}^{(3)}_i)\right) {\varvec{W}^{(4)}}^\mathrm{T} {\varDelta ^{(4)}_i}{\varvec{z}^{(2)}_i}^\mathrm{T}. \)
Thus,
where we denote
as \(D^{r}_{i, j, l, m}\).
Considering that \( \varvec{w}^{(4)}_1 \approx - \varvec{w}^{(4)}_2\) (Eq. (5)),
where \(k^{(4)} := {\varvec{w}^{(4)}_1} D^{r}_{i, j, l, m} {\varvec{w}^{(4)}_1}^\mathrm{T} > 0\) because \(D^{r}_{i, j, l, m}\) is a diagonal matrix and each of its diagonal elements is positive.
Moreover, from Eq. (4),
Therefore, from Eq. (2), each element value of
is positive. Thus, each element value of
\( \partial {\varvec{W}^{(3)}}^\mathrm{T} \varvec{M}^{i} \mathrm{diag} \left( f_3'(\varvec{u}^{(3)}_j)\right) \partial {\varvec{W}^{(3)}} \) is positive. Considering that \(\varvec{W}^{(3)}\) is the sum of the values of \(\partial \varvec{W}^{(3)}\) in the previous updates, if the initial value of \(\varvec{W}^{(3)}\) is sufficiently small and N is sufficiently large, then each element of \(\varvec{A}^{r}\) is positive. Thus, from Lemma 1 and the above, the first and second row values of \({\varDelta ^{(4)}_i}{\varvec{z}^{(2)}_i}^\mathrm{T}A^{r}\) are positive and negative, respectively. Thus, if N is sufficiently large, then Lemma 3 is established.
A.1.6 Summary
From \( \partial {\varvec{H}^{(j, t)}} = \frac{1}{N} \sum _{i = 1}^{N} \left( A^{l} {\varDelta ^{(4)}_i}{\varvec{z}^{(2)}_i}^\mathrm{T} + {\varDelta ^{(4)}_i}{\varvec{z}^{(2)}_i}^\mathrm{T} A^{r} \right) \) and Lemmas 2 and 3, the first and second row values of \(E[\partial {\varvec{H}^{(j, t)}}]\) are positive and negative, respectively, for every j. Thus, Proposition 2 is established.
A.1.7 Proof of Proposition 3
Proof
If the following conditions are met for every k:
Cond 1: the values of \({t^+}\) and \({t^-}\) are sufficiently large,
Cond 2: for every word \(w_{k,i^{+}} \in \varOmega _{dw}^{(k)} \cap \varOmega _{pw}^{(k)}\) and \(w_{k,i^{-}} \in \varOmega _{dw}^{(k)} \cap \varOmega _{nw}^{(k)}\), the initial values of \(w^{(2)}_{k,i^{+}}\) and \(w^{(2)}_{k,i^{-}}\) given by Init are positive and sufficiently large, and negative and sufficiently small, respectively,
Cond 3: the initial values of \(|\varvec{W^{(3)}}|\) and \(|\varvec{W^{(4)}}|\) are sufficiently small, and
Cond 4: the values \(|\varOmega _{pw}^{(k, t^+)}|\), \(|\varOmega _{nw}^{(k, t^-)}|\), and \(|\varOmega _m|\) are sufficiently large,
then, from Cond 1, Cond 2, Cond 4, and Proposition 1, Eq. (2) is established. Thus, from Proposition 2 and Cond 3, Proposition 3 is established. \(\square \)
A.1.8 Experimental examples of the influence of Update
Figure 8 shows examples of the mean value of
in the fivefold cross-validation using real datasets. The upper part of Fig. 8 shows the result for the Yahoo dataset, where \(T = 0.02\) and \(K2 = K = 500\), and the lower part shows the result for the News article dataset, where \(K2 = K = 500\). The results demonstrate that the influence of Update converges to zero in accordance with Proposition 3, even when real datasets are used.
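For reference, the plotted quantity, the relative influence of Update in Proposition 3, could be computed per epoch roughly as follows (a minimal sketch of ours, assuming the matrices \(\varvec{H}^{(j, t)}\) and \({\varvec{H}^{*}}^{(j, t)}\) are available as NumPy arrays and \(\Vert \cdot \Vert _{1}\) is taken entrywise):

import numpy as np

def update_influence_ratio(H_list, H_star_list):
    # Proposition 3 ratio at epoch t:
    #   sum_j ||H^(j,t) - H*^(j,t)||_1 / sum_j ||H^(j,t)||_1
    num = sum(np.abs(H - Hs).sum() for H, Hs in zip(H_list, H_star_list))
    den = sum(np.abs(H).sum() for H in H_list)
    return num / den

Plotting this ratio over training epochs yields curves like those in Fig. 8, which should decay toward zero when Proposition 3 applies.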
A.2 Gradient method for assigning polarity scores to terms using a fully connected MLP
We assign sentiment scores to words using the gradient method [2] and a fully connected MLP as follows. Let the output value of the MLP, \(\varvec{y}^{mlp}_{j}\), be \( f^{MLP}(\varvec{v}^{\mathrm{(BOW)}}_j) \in {\mathbb {R}}^{2} \), and let \(D_\mathrm{train}\) be the set of training documents. The sentiment value of word \(w_{k, i}\), \(Gr(w_{k, i})\), is calculated as
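A minimal sketch of this kind of input-gradient scoring (ours, written with PyTorch autograd; the function name, the positive-minus-negative output convention, and the averaging over \(D_\mathrm{train}\) are assumptions rather than the paper's exact formula):

import torch

def word_polarity_scores(mlp, bow_docs):
    # bow_docs: (N, V) float tensor of bag-of-words vectors for documents in D_train.
    # Returns a (V,) tensor: the input gradient of the positive-minus-negative
    # output, averaged over the training documents, in the spirit of [2].
    x = bow_docs.clone().requires_grad_(True)
    y = mlp(x)                            # (N, 2): positive and negative outputs
    (y[:, 0] - y[:, 1]).sum().backward()  # fills x.grad with per-document gradients
    return x.grad.mean(dim=0)             # average sentiment gradient per word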
A.3 Experimental result details
A.3.1 Interpretability evaluation
Table 7 summarizes the interpretability evaluation results for different parameter settings: the mean \(F_1\) scores and their standard deviations.
A.3.2 Market mood predictability evaluation
Tables 8 and 9 show the market mood predictability results of the fivefold cross-validation.
Table 10 shows the mean and standard deviation scores for different parameter settings, where market mood predictability is evaluated in terms of the mean score over the fivefold cross-validation.
A.4 Detailed text visualization results for other initialization settings in Init
Tables 11, 12, and 13 present the detailed text visualization results for other initialization settings in Init.
Cite this article
Ito, T., Sakaji, H., Izumi, K. et al. GINN: gradient interpretable neural networks for visualizing financial texts. Int J Data Sci Anal 9, 431–445 (2020). https://doi.org/10.1007/s41060-018-0160-8