Building and querying semantic layers for web archives (extended version)

Published: 01 June 2020

Abstract

Web archiving is the process of collecting portions of the Web to ensure that the information is preserved for future exploitation. However, despite the increasing number of web archives worldwide, the absence of efficient and meaningful exploration methods remains a major hurdle to turning them into a usable and useful information source. In this paper, we focus on this problem and propose an RDF/S model and a distributed framework for building semantic profiles (“layers”) that describe semantic information about the contents of web archives. A semantic layer allows describing metadata about the archived documents, annotating them with useful semantic information (such as entities, concepts, and events), and publishing all these data on the Web as Linked Data. Such structured repositories offer advanced query and integration capabilities, and make web archives directly exploitable by other systems and tools. To demonstrate their query capabilities, we build and query semantic layers for three different types of web archives. An experimental evaluation showed that a semantic layer can answer information needs that existing keyword-based systems cannot sufficiently satisfy.
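To make the idea of a semantic layer more concrete, the following is a minimal sketch, not the paper's actual model or framework. It represents a toy layer as a set of subject-predicate-object triples and answers an entity-plus-time question that a plain keyword index would struggle with. The `http://example.org/layer/` vocabulary, the property names `captureDate` and `mentionsEntity`, and the helper functions are all hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace and property names (assumptions, not the paper's schema).
EX = "http://example.org/layer/"
DBR = "http://dbpedia.org/resource/"


def build_layer() -> Set[Triple]:
    """Return a toy semantic layer: document metadata plus entity annotations."""
    doc1 = EX + "doc/1"
    doc2 = EX + "doc/2"
    return {
        (doc1, EX + "captureDate", "2012-11-07"),
        (doc1, EX + "mentionsEntity", DBR + "Barack_Obama"),
        (doc2, EX + "captureDate", "2014-03-02"),
        (doc2, EX + "mentionsEntity", DBR + "Angela_Merkel"),
        (doc2, EX + "mentionsEntity", DBR + "Barack_Obama"),
    }


def match(layer: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching a pattern; None acts as a wildcard."""
    for triple in layer:
        if ((s is None or triple[0] == s)
                and (p is None or triple[1] == p)
                and (o is None or triple[2] == o)):
            yield triple


def documents_mentioning(layer: Set[Triple], entity: str, year: str) -> List[str]:
    """Find documents that mention `entity` and were captured in `year`."""
    docs = []
    for doc, _, _ in match(layer, p=EX + "mentionsEntity", o=entity):
        dates = [d for _, _, d in match(layer, s=doc, p=EX + "captureDate")]
        if any(d.startswith(year) for d in dates):
            docs.append(doc)
    return sorted(docs)


if __name__ == "__main__":
    layer = build_layer()
    print(documents_mentioning(layer, DBR + "Barack_Obama", "2012"))
```

In a real deployment the triples would live in an RDF store and the question would be posed as a SPARQL query; the in-memory pattern matching here only mimics that join between an annotation and a metadata property.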

    Published In

    International Journal on Digital Libraries  Volume 21, Issue 2
    Jun 2020
    135 pages
    ISSN:1432-5012
    EISSN:1432-1300

    Publisher

    Springer-Verlag

    Berlin, Heidelberg

    Publication History

    Published: 01 June 2020
    Accepted: 28 June 2018
    Revision received: 09 December 2017
    Received: 15 September 2017

    Author Tags

    1. Web archives
    2. Semantic layer
    3. Profiling
    4. Linked data
    5. Exploratory search

    Qualifiers

    • Research-article
