Abstract
The paper presents a probabilistic topic model (PTM) application to citation network collection. Snowball sampling method is moderated with the selection of the most relevant papers by means of the PTM. The PTM used in the paper is modified to treat collections of short texts. It is constructed from the titles of seed papers collection united with the papers obtained through unrestricted snowball sampling. The objective of the research is to propose and to experimentally verify the approach of application of PTM of short text documents for improvement of a citation network collection. The preliminary analysis has shown that the method is robust: seed paper collection variations do not affect the most influencing papers subset in the collected citation network.
H. Dobrovolskyi and N. Keberle—The work has been partially done in frame of the EU FP7 Marie Curie IRSES SemData project (http://www.semdata-project.eu/), grant agreement No PIRSESGA-2013-612551.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
AAAI Digital Library Conference Proceedings. https://aaai.org/Library/conferences-library.php.
- 2.
Journals: Free Texts: Download & Streaming: Internet Archive. https://archive.org/details/journals.
- 3.
Stanford Large Network Dataset Collection. https://snap.stanford.edu/data/.
- 4.
Google Scholar https://scholar.google.com.ua/.
- 5.
Microsoft Academic https://academic.microsoft.com/.
- 6.
Semantic Scholar https://www.semanticscholar.org/.
- 7.
See NLTK Stemmers http://www.nltk.org/howto/stem.html.
- 8.
See NLTK part-of-speech tagger http://www.nltk.org/book/ch05.html.
- 9.
For instance list of English stop words is available at Snowball stemmer site http://snowball.tartarus.org/algorithms/english/stop.txt.
- 10.
References
Ahad, A., Fayaz, M., Shah, A.S.: Navigation through citation network based on content similarity using cosine similarity algorithm. Int. J. Database Theory Appl. 9(5), 9–20 (2016)
Aletras, N., Stevenson, M.: Evaluating topic coherence using distributional semantics. IWCS 13, 13–22 (2013)
Barabási, A.L.: Scale-free networks: a decade and beyond. Science 325(5939), 412–413 (2009)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontologies of time: Review and trends. Int. J. Comput. Sci. Appl. 11(3), 57–115 (2014)
Fouz-González, J.: Trends and directions in computer-assisted pronunciation training. In: Mompean, J.A., Fouz-González, J. (eds.) Investigating English Pronunciation, pp. 314–342. Palgrave Macmillan UK, London (2015). doi:10.1057/9781137509437_14
Garfield, E.: From computational linguistics to algorithmic historiography. In: Symposium in Honor of Casimir Borkowski at the University of Pittsburgh School of Information Sciences (2001)
Garfield, E., Merton, R.K.: Citation Indexing: Its Theory and Application in Science, Technology, and Humanities, vol. 8. Wiley, New York (1979)
Gillis, N.: Introduction to nonnegative matrix factorization. arXiv preprint arXiv:1703.00663 (2017)
Harris, J.K., Beatty, K.E., Lecy, J.D., Cyr, J.M., Shapiro, R.M.: Mapping the multidisciplinary field of public health services and systems research. Am. J. Prev. Med. 41(1), 105–111 (2011)
Hoyer, P.O.: Non-negative sparse coding. In: Proceedings of the 2002 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 557–565. IEEE (2002)
Islam, A., Inkpen, D.: Semantic text similarity using corpus-based word similarity and string similarity. ACM Trans. Knowl. Discov. Data (TKDD) 2(2), 10 (2008)
Jijkoun, V., de Rijke, M.: Recognizing textual entailment: is word similarity enough? In: Quiñonero-Candela, J., Dagan, I., Magnini, B., d’Alché-Buc, F. (eds.) MLCW 2005. LNCS, vol. 3944, pp. 449–460. Springer, Heidelberg (2006). doi:10.1007/11736790_25
Jolliffe, I.T.: Principal component analysis and factor analysis. Principal component analysis. Springer Series in Statistics, pp. 115–128. Springer, New York (1986). doi:10.1007/978-1-4757-1904-8_7
Kajikawa, Y., Ohno, J., Takeda, Y., Matsushima, K., Komiyama, H.: Creating an academic landscape of sustainability science: an analysis of the citation network. Sustain. Sci. 2(2), 221 (2007)
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1188–1196 (2014)
Lecy, J.D., Beatty, K.E.: Representative literature reviews using constrained snowball sampling and citation network analysis (2012)
Lee, A., et al.: Language-independent methods for computer-assisted pronunciation training. Ph.D. thesis, Massachusetts Institute of Technology (2016)
Lin, C.J.: Projected gradient methods for nonnegative matrix factorization. Neural Comput. 19(10), 2756–2779 (2007)
Liu, J.S., Lu, L.Y., Lu, W.M., Lin, B.J.: Data envelopment analysis 1978–2010: a citation-based literature survey. Omega 41(1), 3–15 (2013)
López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 250, 113–141 (2013)
Lu, Z., Li, H.: A deep architecture for matching short texts. In: Advances in Neural Information Processing Systems, pp. 1367–1375 (2013)
MacKay, D.J.: Information Theory. Inference and Learning Algorithms. Cambridge University Press, Cambridge (2003)
Meho, L.I.: The rise and rise of citation analysis. Phys. World 20(1), 32 (2007)
Mihalcea, R., Corley, C., Strapparava, C., et al.: Corpus-based and knowledge-based measures of text semantic similarity. AAAI 6, 775–780 (2006)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
Moya-Anegón, F., Vargas-Quesada, B., Herrero-Solana, V., Chinchilla-Rodríguez, Z., Corera-Álvarez, E., Munoz-Fernández, F.: A new technique for building maps of large scientific domains based on the cocitation of classes and categories. Scientometrics 61(1), 129–145 (2004)
Newman, M.E.: The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. 98(2), 404–409 (2001)
Newman, M.E.: Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. Sci. 101(suppl 1), 5200–5205 (2004)
Pang, J., Li, X., Xie, H., Rao, Y.: SBTM: topic modeling over short texts. In: Gao, H., Kim, J., Sakurai, Y. (eds.) DASFAA 2016. LNCS, vol. 9645, pp. 43–56. Springer, Cham (2016). doi:10.1007/978-3-319-32055-7_4
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. EMNLP 14, 1532–1543 (2014)
Petticrew, M., Gilbody, S.: Planning and conducting systematic reviews. In: Health Psychology in Practice, pp. 150–179 (2004)
Popova, S., Khodyrev, I., Egorov, A., Logvin, S., Gulyaev, S., Karpova, M., Mouromtsev, D.: Sci-search: academic search and analysis system based on keyphrases. In: Klinov, P., Mouromtsev, D. (eds.) KESW 2013. CCIS, vol. 394, pp. 281–288. Springer, Heidelberg (2013). doi:10.1007/978-3-642-41360-5_24
Price, D.: Citation measures of hard science, soft science, technology, and nonscience. In: Nelson, C.E., Pollack, D.K. (eds.) Communication Among Scientists and Engineers. Heath Lexington Books Massachusetts (1970)
Ramage, D., Rafferty, A.N., Manning, C.D.: Random walks for text semantic similarity. In: Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing, pp. 23–31. Association for Computational Linguistics (2009)
Small, H.: Visualizing science by citation mapping. J. Associat. Inf. Sci. Technol. 50(9), 799 (1999)
Socher, R., Huang, E.H., Pennin, J., Manning, C.D., Ng, A.Y.: Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in Neural Information Processing Systems, pp. 801–809 (2011)
de Solla Price, D.J.: Networks of scientific papers. Science 149(3683), 510–515 (1965)
Vorontsov, K., Potapenko, A.: Tutorial on probabilistic topic modeling: additive regularization for stochastic matrix factorization. In: Ignatov, D.I., Khachay, M.Y., Panchenko, A., Konstantinova, N., Yavorskiy, R.E. (eds.) AIST 2014. CCIS, vol. 436, pp. 29–46. Springer, Cham (2014). doi:10.1007/978-3-319-12580-0_3
Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)
Yan, X., Guo, J., Liu, S., Cheng, X., Wang, Y.: Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 749–757. SIAM (2013)
Yang, K., Meho, L.I.: Citation analysis: a comparison of Google Scholar, Scopus, and web of science. Proc. Am. Soc. Inf. Sci. Technol. 43(1), 1–15 (2006). http://dx.doi.org/10.1002/meet.14504301185
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A.J., Hovy, E.H.: Hierarchical attention networks for document classification. In: HLT-NAACL, pp. 1480–1489 (2016)
Zuo, Y., Zhao, J., Xu, K.: Word network topic model: a simple but general solution for short and imbalanced texts. Knowl. Inf. Syst. 48(2), 379–398 (2016)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Dobrovolskyi, H., Keberle, N., Todoriko, O. (2017). Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection. In: Różewski, P., Lange, C. (eds) Knowledge Engineering and Semantic Web. KESW 2017. Communications in Computer and Information Science, vol 786. Springer, Cham. https://doi.org/10.1007/978-3-319-69548-8_7
Download citation
DOI: https://doi.org/10.1007/978-3-319-69548-8_7
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69547-1
Online ISBN: 978-3-319-69548-8
eBook Packages: Computer ScienceComputer Science (R0)