Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection

Hennadii Dobrovolskyi,
Nataliya Keberle &
Olga Todoriko

Part of the book series: Communications in Computer and Information Science (CCIS, volume 786)

Included in the following conference series:

International Conference on Knowledge Engineering and the Semantic Web

Abstract

The paper applies a probabilistic topic model (PTM) to citation network collection. Snowball sampling is controlled by using the PTM to select only the most relevant papers. The PTM is modified to handle collections of short texts and is built from the titles of the seed papers combined with those of the papers obtained through unrestricted snowball sampling. The objective of the research is to propose and experimentally verify the use of a short-text PTM to improve citation network collection. Preliminary analysis shows that the method is robust: variations in the seed paper collection do not affect the subset of the most influential papers in the collected citation network.
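
The controlled sampling loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy citation graph, the threshold, and the depth limit are all hypothetical, and a simple keyword-overlap score stands in for the short-text PTM, whose details are not given here. The sketch only shows the control flow: a cited paper is followed only if its title scores as relevant.

```python
from collections import deque


def make_keyword_relevance(topic_words):
    """Return a scoring function: fraction of title words found in topic_words.

    Placeholder for a PTM-based relevance score (an assumption, not the
    model used in the paper).
    """
    vocabulary = {w.lower() for w in topic_words}

    def relevance(title):
        words = title.lower().split()
        if not words:
            return 0.0
        return sum(w in vocabulary for w in words) / len(words)

    return relevance


def controlled_snowball(seeds, cites, titles, relevance, threshold, max_depth=2):
    """Breadth-first snowball sampling that follows only relevant citations.

    seeds     -- iterable of starting paper ids
    cites     -- dict: paper id -> list of cited paper ids
    titles    -- dict: paper id -> title string
    relevance -- function mapping a title to a score in [0, 1]
    threshold -- minimum score a cited paper needs to be collected
    max_depth -- papers at this distance from the seeds are not expanded
    """
    collected = set(seeds)
    edges = set()
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        paper, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for cited in cites.get(paper, ()):
            # Irrelevant papers are dropped and never expanded further.
            if relevance(titles[cited]) < threshold:
                continue
            edges.add((paper, cited))
            if cited not in collected:
                collected.add(cited)
                frontier.append((cited, depth + 1))
    return collected, edges


# Toy data, invented for illustration only.
titles = {
    "A": "topic model for short texts",
    "B": "citation network analysis",
    "C": "protein folding dynamics",
    "D": "short texts topic discovery",
    "E": "citation topic model",
}
cites = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": []}
score = make_keyword_relevance(
    ["topic", "model", "citation", "network", "texts", "short"]
)

collected, edges = controlled_snowball(["A"], cites, titles, score, threshold=0.5)
unrestricted, _ = controlled_snowball(["A"], cites, titles, score, threshold=0.0)
```

With these toy inputs, the controlled run never reaches paper E, because E is only reachable through the off-topic paper C, while the run with a zero threshold behaves like unrestricted snowball sampling and collects all five papers.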

H. Dobrovolskyi and N. Keberle: the work was partially done within the framework of the EU FP7 Marie Curie IRSES SemData project (http://www.semdata-project.eu/), grant agreement No PIRSESGA-2013-612551.

Notes

1. AAAI Digital Library Conference Proceedings. https://aaai.org/Library/conferences-library.php.
2. Journals: Free Texts: Download & Streaming: Internet Archive. https://archive.org/details/journals.
3. Stanford Large Network Dataset Collection. https://snap.stanford.edu/data/.
4. Google Scholar https://scholar.google.com.ua/.
5. Microsoft Academic https://academic.microsoft.com/.
6. Semantic Scholar https://www.semanticscholar.org/.
7. See NLTK Stemmers http://www.nltk.org/howto/stem.html.
8. See NLTK part-of-speech tagger http://www.nltk.org/book/ch05.html.
9. For instance, a list of English stop words is available at the Snowball stemmer site http://snowball.tartarus.org/algorithms/english/stop.txt.
10. https://azure.microsoft.com/en-us/services/cognitive-services/academic-knowledge/.

Author information

Authors and Affiliations

Department of Computer Science, Zaporizhzhya National University, Zhukovskogo St. 66, Zaporizhzhya, 69600, Ukraine
Hennadii Dobrovolskyi, Nataliya Keberle & Olga Todoriko

Corresponding author

Correspondence to Hennadii Dobrovolskyi.

Editor information

Editors and Affiliations

West Pomeranian University of Technology in Szczecin, Szczecin, Poland
Przemysław Różewski
University of Bonn, Bonn, Germany
Christoph Lange

About this paper

Cite this paper

Dobrovolskyi, H., Keberle, N., Todoriko, O. (2017). Probabilistic Topic Modelling for Controlled Snowball Sampling in Citation Network Collection. In: Różewski, P., Lange, C. (eds) Knowledge Engineering and Semantic Web. KESW 2017. Communications in Computer and Information Science, vol 786. Springer, Cham. https://doi.org/10.1007/978-3-319-69548-8_7

DOI: https://doi.org/10.1007/978-3-319-69548-8_7
Published: 18 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69547-1
Online ISBN: 978-3-319-69548-8
eBook Packages: Computer Science, Computer Science (R0)
