Overlapping thematic structures extraction with mixed-membership stochastic blockmodel

Shuo Xu¹,² (ORCID: orcid.org/0000-0002-8602-1819),
Junwan Liu¹,
Dongsheng Zhai¹,
Xin An³,
Zheng Wang⁴ &
Hongshen Pang⁵

Abstract

It is increasingly important to identify thematic structures automatically in the massive scientific literature. Because of interdisciplinarity, thematic structures have no natural boundaries. In this work, the identification of thematic structures is treated as an overlapping community detection problem on a large-scale citation-link network. A mixed-membership stochastic blockmodel, fitted with a stochastic variational inference algorithm, is used to detect the overlapping thematic structures. To enhance readability, each theme is labeled with several topical terms by a soft mutual information based method. Extensive experimental results on the astro dataset indicate that the mixed-membership stochastic blockmodel primarily uses local information and allows for pervasive overlaps, but it favors similar-sized themes, which disqualifies this approach from being used to extract thematic structures from scientific literature. In addition, the thematic structures from the bibliographic coupling network are similar to those from the co-citation network.
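As a rough illustration of the model named in the abstract, the sketch below samples a small directed network from the generic mixed-membership stochastic blockmodel generative process described by Airoldi et al. (2008): every node draws a membership vector over groups, each ordered pair of nodes draws one group indicator per endpoint, and the link is a Bernoulli draw governed by the corresponding entry of a block matrix. The function names (`sample_dirichlet`, `sample_mmsb`), the hyperparameter values, and the choice of plain Python are assumptions made for illustration and are not taken from the article. The sketch covers only the generative side, not the stochastic variational inference used to fit the model, and the exact parameterisation used in the article is not shown in this excerpt.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mmsb(num_nodes, alpha, block_matrix, seed=0):
    """Sample membership vectors and a directed 0/1 adjacency matrix.

    alpha        -- Dirichlet hyperparameters, one per group (assumed values)
    block_matrix -- block_matrix[g][h] is the link probability when the
                    sender acts as group g and the receiver as group h
    """
    rng = random.Random(seed)
    groups = list(range(len(alpha)))
    # Each node gets its own mixed-membership vector over the groups.
    memberships = [sample_dirichlet(alpha, rng) for _ in range(num_nodes)]
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for p in range(num_nodes):
        for q in range(num_nodes):
            if p == q:
                continue  # no self-links in this sketch
            # Sender and receiver each pick a group for this one interaction.
            z_send = rng.choices(groups, weights=memberships[p])[0]
            z_recv = rng.choices(groups, weights=memberships[q])[0]
            # The link is a Bernoulli draw with the block-pair probability.
            if rng.random() < block_matrix[z_send][z_recv]:
                adjacency[p][q] = 1
    return memberships, adjacency


if __name__ == "__main__":
    # Three groups with dense within-group and sparse between-group linking.
    blocks = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]]
    pi, links = sample_mmsb(8, [0.5, 0.5, 0.5], blocks, seed=42)
    for node, vector in enumerate(pi):
        print(node, [round(v, 2) for v in vector])
```

Because a node's membership vector can place weight on several groups at once, the same paper can act as a member of different themes in different citation links, which is what allows overlapping themes to emerge.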

Notes

http://www.topic-challenge.info.
https://github.com/premgopalan/svinet.
https://gephi.org/.
https://github.com/ziqizhang/jate.
The clustering result hd is obtained from http://www.topic-challenge.info/solutions.html.
http://scikit-learn.org.

References

Abbe, E., & Sandon, C. (2015). Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In Proceedings of the 56th IEEE annual symposium on foundations of computer science (pp. 670–688). Washington, DC: IEEE Computer Society. https://doi.org/10.1109/FOCS.2015.47.
Ahlgren, P., & Colliander, C. (2009). Document–document similarity approaches and science mapping: Experimental comparison of five approaches. Journal of Informetrics, 3(1), 49–63. https://doi.org/10.1016/j.joi.2008.11.003.
Airoldi, E. M., Blei, D. M., Fienberg, S. E., & Xing, E. P. (2008). Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9(Sep), 1981–2014.
Amelio, A., & Pizzuti, C. (2014). Overlapping community discovery methods: A survey (pp. 105–125). Vienna: Springer. https://doi.org/10.1007/978-3-7091-1797-2_6.
An, X., Xu, S., Wen, Y., & Hu, M. (2014). A shared interest discovery model for co-author relationship in SNS. International Journal of Distributed Sensor Networks, 2014, 1–9. https://doi.org/10.1155/2014/820715.
Ananiadou, S. (1994). A methodology for automatic term recognition. In Proceedings of the 15th international conference on computational linguistics (pp. 1034–1038). Stroudsburg, PA: Association for Computational Linguistics. https://doi.org/10.3115/991250.991317.
Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine Learning, 50(1–2), 5–43. https://doi.org/10.1023/A:1020281327116.
Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An open source software for exploring and manipulating networks. In Proceedings of the 3rd international AAAI conference on weblogs and social media (pp. 361–362).
Bennett, C. L., Halpern, M., Hinshaw, G., Jarosik, N., Kogut, A., Limon, M., et al. (2003). First-year Wilkinson Microwave Anisotropy Probe (WMAP) observations: Preliminary maps and basic results. The Astrophysical Journal Supplement Series, 148(1), 1–27.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993–1022.
Boyack, K. W. (2017). Thesaurus-based methods for mapping contents of publication sets. Scientometrics, 111(2), 1141–1155. https://doi.org/10.1007/s11192-017-2304-3.
Chen, P.-Y., & Hero, A. O., III. (2015). Universal phase transition in community detectability under a stochastic block model. Physical Review E: Statistical Physics, Plasmas, Fluids, and Related Interdisciplinary Topics, 91(3), 032804. https://doi.org/10.1103/PhysRevE.91.032804.
Conroy, C., & Gunn, J. E. (2010). The propagation of uncertainties in stellar population synthesis modeling. III. Model calibration, comparison, and evaluation. The Astrophysical Journal, 712(2), 833–857. https://doi.org/10.1088/0004-637X/712/2/833.
Dave, R. N. (1996). Validating fuzzy partitions obtained through c-shells clustering. Pattern Recognition Letters, 17(6), 613–623. https://doi.org/10.1016/0167-8655(96)00026-8.
Dhillon, I. S. (2001). Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 269–274). New York, NY: ACM. https://doi.org/10.1145/502512.502550.
Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms: The C-value/NC-value method. International Journal on Digital Libraries, 3(2), 115–130. https://doi.org/10.1007/s007999900023.
Ginsparg, P. (2011). ArXiv at 20. Nature, 476, 145–147. https://doi.org/10.1038/476145a.
Glänzel, W., & Thijs, B. (2011). Using ‘core documents’ for the representation of clusters and topics. Scientometrics, 88(1), 297–309. https://doi.org/10.1007/s11192-011-0347-4.
Glänzel, W., & Thijs, B. (2017). Using hybrid methods and ‘core documents’ for the representation of clusters and topics: The astronomy dataset. Scientometrics, 111(2), 1071–1087. https://doi.org/10.1007/s11192-017-2301-6.
Gläser, J., Glänzel, W., & Scharnhorst, A. (2017). Same data-different results? Towards a comparative approach to the identification of thematic structures in science. Scientometrics, 111(2), 981–998. https://doi.org/10.1007/s11192-017-2296-z.
Gopalan, P. K., & Blei, D. M. (2013). Efficient discovery of overlapping communities in massive networks. Proceedings of the National Academy of Sciences of the United States of America, 110(36), 14534–14539. https://doi.org/10.1073/pnas.1221839110.
Goswami, S., Murthy, C. A., & Das, A. K. (2016). Sparsity measure of a network graph: Gini index. eprint arXiv:1612.07074.
Havemann, F., Gläser, J., & Heinz, M. (2017). Memetic search for overlapping topics based on a local evaluation of link communities. Scientometrics, 111(2), 1089–1118. https://doi.org/10.1007/s11192-017-2302-5.
Havemann, F., Gläser, J., Heinz, M., & Struck, A. (2012). Identifying overlapping and hierarchical thematic structures in networks of scholarly papers: A comparison of three approaches. PLoS ONE, 7(3), e33255. https://doi.org/10.1371/journal.pone.0033255.
Healey, P., Rothman, H., & Hoch, P. K. (1986). An experiment in science mapping for research planning. Research Policy, 15(5), 233–251. https://doi.org/10.1016/0048-7333(86)90024-7.
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14(May), 1303–1347.
Hurley, N., & Rickard, S. (2009). Comparing measures of sparsity. IEEE Transactions on Information Theory, 55(10), 4723–4741. https://doi.org/10.1109/TIT.2009.2027527.
Janssens, F., Glänzel, W., & de Moor, B. (2008). A hybrid mapping of information science. Scientometrics, 75(3), 607–631. https://doi.org/10.1007/s11192-007-2002-7.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233. https://doi.org/10.1023/A:1007665907178.
Klavans, R., & Boyack, K. W. (2011). Using global mapping to create more accurate document-level maps of research fields. Journal of the Association for Information Science and Technology, 62(1), 1–18. https://doi.org/10.1002/asi.21444.
Koopman, R., & Wang, S. (2017). Mutual information based labelling and comparing clusters. Scientometrics, 111(2), 1157–1167. https://doi.org/10.1007/s11192-017-2305-2.
Leydesdorff, L., & Welbers, K. (2011). The semantic mapping of words and co-words in contexts. Journal of Informetrics, 5(3), 469–475. https://doi.org/10.1016/j.joi.2011.01.008.
Lorenz, M. O. (1905). Methods of measuring the concentration of wealth. Publications of the American Statistical Association, 9(70), 209–219.
Manning, C. D., Raghavan, P., & Schütze, H. (Eds.). (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.
Matsuo, Y., & Ishizuka, M. (2004). Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13(01), 157–169. https://doi.org/10.1142/S0218213004001466.
Mei, Q., Shen, X., & Zhai, C. (2007). Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 490–499). https://doi.org/10.1145/1281192.1281246.
Nepusz, T., Petróczi, A., Négyessy, L., & Bazsó, F. (2008). Fuzzy communities and the concept of bridgeness in complex networks. Physical Review E, 77(1), 016107. https://doi.org/10.1103/PhysRevE.77.016107.
Park, Y., Byrd, R. J., & Boguraev, B. K. (2002). Automatic glossary extraction: Beyond terminology identification. In Proceedings of the 19th international conference on computational linguistics, Taipei, Taiwan (pp. 1–7).
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.
Role, F., & Nadif, M. (2014). Beyond cluster labeling: Semantic interpretation of clusters’ contents using a graph representation. Knowledge-Based Systems, 56, 141–155. https://doi.org/10.1016/j.knosys.2013.11.005.
Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic keyword extraction from individual documents. In M. W. Berry & J. Kogan (Eds.), Text mining: Applications and theory (pp. 1–20). Hoboken: Wiley.
Sclano, F., & Velardi, P. (2007). Termextractor: A web application to learn the common terminology of interest groups and research communities. In Proceedings of the 3rd international conference on interoperability for enterprise software and applications.
Shi, Q., Qiao, X., Xu, S., & Nong, G. (2013). Author-topic evolution model and its application in analysis of research interests evolution. Journal of the China Society for Scientific and Technical Information, 32(9), 912–919.
Shibata, N., Kajikawa, Y., Takeda, Y., & Matsushima, K. (2009). Comparative study on methods of detecting research fronts using different types of citation. Journal of the Association for Information Science and Technology, 60(3), 571–580. https://doi.org/10.1002/asi.20994.
Skrutskie, M. F., Cutri, R. M., Stiening, R., Weinberg, M. D., Schneider, S., Carpenter, J. M., et al. (2006). The Two Micron All Sky Survey (2MASS). The Astronomical Journal, 131(2), 1163–1183.
van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the Association for Information Science and Technology, 60(8), 1635–1651. https://doi.org/10.1002/asi.21075.
van Eck, N. J., & Waltman, L. (2010). Software survey: VOSviewer, a computer program for bibliometric mapping. Scientometrics, 84(2), 523–538. https://doi.org/10.1007/s11192-009-0146-3.
van Eck, N. J., & Waltman, L. (2017). Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics, 111(2), 1053–1070. https://doi.org/10.1007/s11192-017-2300-7.
van Raan, A. F. J. (1996). Advanced bibliometric methods as quantitative core of peer review based evaluation and foresight exercises. Scientometrics, 36(3), 397–420. https://doi.org/10.1007/BF02129602.
Velden, T., Boyack, K. W., Gläser, J., Koopman, R., Scharnhorst, A., & Wang, S. (2017). Comparison of topic extraction approaches and their results. Scientometrics, 111(2), 1169–1221. https://doi.org/10.1007/s11192-017-2306-1.
Vinh, N. X., Epps, J., & Bailey, J. (2010). Information theoretic measures for clustering comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(Oct), 2837–2854.
Waltman, L., & van Eck, N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the Association for Information Science and Technology, 63(12), 2378–2392. https://doi.org/10.1002/asi.22748.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55(1), 1–17. https://doi.org/10.1093/biomet/55.1.1.
Xie, J., Kelley, S., & Szymanski, B. K. (2013). Overlapping community detection in networks: The state-of-the-art and comparative study. ACM Computing Surveys, 45(4), 43:1–43:35. https://doi.org/10.1145/2501654.2501657.
Xu, S., Liu, J., & Wang, Z. (2017). Overlapping thematic structures extraction with mixed-membership stochastic blockmodel. In Proceedings of ISSI 2017—the 16th international conference on scientometrics & informetrics (pp. 1007–1012).
Xu, S., Qiao, X., Zhu, L., Zhang, Y., Xue, C., & Li, L. (2016). Reviews on determining the number of clusters. Applied Mathematics & Information Sciences, 10(4), 1493–1512.
Xu, S., Shi, Q., Qiao, X., Zhu, L., Zhang, H., Jung, H., et al. (2014). A dynamic users’ interest discovery model with distributed inference algorithm. International Journal of Distributed Sensor Networks, 2014, 1–11. https://doi.org/10.1155/2014/280892.
Yau, C.-K., Porter, A., Newman, N., & Suominen, A. (2014). Clustering scientific documents with topic modeling. Scientometrics, 100(3), 767–786. https://doi.org/10.1007/s11192-014-1321-8.
Zhang, Z., Gao, J., & Ciravegna, F. (2016). JATE 2.0: Java automatic term extraction with Apache Solr. In Proceedings of the 10th language resources and evaluation conference (pp. 2262–2269).
Zhang, Z., Iria, J., Brewster, C., & Ciravegna, F. (2008). A comparative evaluation of term recognition algorithms. In Proceedings of the 6th international conference on language resources and evaluation, Marrakech, Morocco (pp. 2108–2113).
Zhu, G., Blanton, M. R., & Moustakas, J. (2010). Stellar populations of elliptical galaxies in the local universe. The Astrophysical Journal, 722(1), 491–519. https://doi.org/10.1088/0004-637X/722/1/491.
Zitt, M., Ramanana-Rahary, S., & Bassecoulard, E. (2005). Relativity of citation performance and excellence measures: From cross-field to cross-scale effects of field-normalisation. Scientometrics, 63(2), 373–401. https://doi.org/10.1007/s11192-005-0218-y.

Acknowledgements

The present study is an extended version of an article (Xu et al. 2017) presented at the 16th International Conference on Scientometrics and Informetrics, Wuhan (China), 16–20 October 2017. The clustering results from this work have been deposited with the other astro-dataset results. Our gratitude also goes to the anonymous reviewers and the editor for their valuable comments. This work was supported partially by the Social Science Foundation of Beijing (Grant No. 17GLB074), Science and Technology Project of Guangdong Province (Grant No. 2017A030303065), and National Natural Science Foundation of China (Grant Nos. 71403255 and 71473237).

Author information

Authors and Affiliations

School of Economics and Management, Beijing University of Technology, Beijing, People’s Republic of China
Shuo Xu, Junwan Liu & Dongsheng Zhai
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, People’s Republic of China
Shuo Xu
School of Economics and Management, Beijing Forestry University, Beijing, People’s Republic of China
Xin An
Institute of Scientific and Technical Information of China, Beijing, People’s Republic of China
Zheng Wang
Library, Shenzhen University, Shenzhen, People’s Republic of China
Hongshen Pang

Corresponding author

Correspondence to Shuo Xu.

About this article

Cite this article

Xu, S., Liu, J., Zhai, D. et al. Overlapping thematic structures extraction with mixed-membership stochastic blockmodel. Scientometrics 117, 61–84 (2018). https://doi.org/10.1007/s11192-018-2841-4

Received: 16 October 2017
Published: 13 July 2018
Issue Date: October 2018
DOI: https://doi.org/10.1007/s11192-018-2841-4
