“Introducing Capisco: a semantically-enhanced search and discovery system for large-scale text corpora”

Published: 03 November 2015

Abstract

This article discusses a new approach to scholarly search and discovery in large-scale text corpora. While lexicographic search is at present the predominant means of accessing large document corpora, it cannot directly address the inherent ambiguity of natural language. As a pragmatic solution, many scholars manually build their own lists of suitable search terms to be used in repeated searches in digital libraries and other online resources; however, they then have to resolve, on a case-by-case basis, issues caused by synonyms, homonyms and OCR errors. Our approach differs: it supports scholars in developing and refining a set of relevant concepts, searches a large document collection using semantic concepts, and categorizes the potentially relevant documents from the search results into worksets. The developed technique revisits the notion of semantic search and redesigns both the underlying data representation and the interface support. This is achieved through an end-to-end design that relies centrally on a Concept-in-Context network sourced through the link structure of Wikipedia. We discuss here the principles of our approach, its implementation in the Capisco prototype, and the relationship between established search techniques and our approach.


Cited By

  • (2018) Capisco: low-cost concept-based access to digital libraries. International Journal on Digital Libraries 20:4 (307-334). DOI: 10.1007/s00799-018-0232-3. Online publication date: 14-Mar-2018
  • (2017) Writers of the Lost Paper: A Case Study on Barriers to (Re-) Finding Publications. Digital Libraries: Data, Information, and Knowledge for Digital Lives (212-224). DOI: 10.1007/978-3-319-70232-2_18. Online publication date: 3-Nov-2017

Published In

ACM SIGWEB Newsletter  Volume 2015, Issue Autumn
Autumn 2015
33 pages
ISSN:1931-1745
EISSN:1931-1435
DOI:10.1145/2833219
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
