DOI: 10.1145/2928294.2928306
research-article
Public Access

CiteSeerX data: semanticizing scholarly papers

Published: 26 June 2016

Abstract

Scholarly big data is, for many, an important instance of Big Data. Digital library search engines have been built to acquire, extract, and ingest large volumes of scholarly papers. This paper provides an overview of the scholarly big data released by CiteSeerX as of the end of 2015 and discusses how the data is acquired, its size, general quality, data management, and accessibility. Preliminary results on extracting semantic entities from the body text of scholarly papers with Wikifier show a bias towards general terms that appear in Wikipedia and against domain-specific terms. We argue that the latter will play a more important role in extracting important facts from scholarly papers.

Published In

SBD '16: Proceedings of the International Workshop on Semantic Big Data
June 2016
83 pages
ISBN: 9781450342995
DOI: 10.1145/2928294
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • GRAPHIQ: Graphiq Inc.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CiteSeerX
  2. citation graph
  3. digital library search engine
  4. scholarly big data
  5. semantic entity extraction

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'16
Sponsor:
  • GRAPHIQ
SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California

Acceptance Rates

Overall Acceptance Rate 30 of 54 submissions, 56%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 73
  • Downloads (Last 6 weeks): 9
Reflects downloads up to 10 Nov 2024

Citations

Cited By

  • (2020) Keyphrase Extraction in Scholarly Digital Library Search Engines. Web Services – ICWS 2020, pages 179-196. DOI: 10.1007/978-3-030-59618-7_12. Online publication date: 19-Sep-2020
  • (2019) On The Current State of Scholarly Retrieval Systems. Engineering, Technology & Applied Science Research, 9:1, pages 3863-3870. DOI: 10.48084/etasr.2448. Online publication date: 16-Feb-2019
  • (2019) Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases. Scientometrics, 118:1, pages 177-214. DOI: 10.1007/s11192-018-2958-5. Online publication date: 1-Jan-2019
  • (2018) Global citation recommendation using knowledge graphs. Journal of Intelligent & Fuzzy Systems, 34:5, pages 3089-3100. DOI: 10.3233/JIFS-169493. Online publication date: 24-May-2018
  • (2018) CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset. 2018 IEEE International Conference on Big Data (Big Data), pages 5465-5467. DOI: 10.1109/BigData.2018.8622114. Online publication date: Dec-2018
