ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL

Research Article

Year 2020, Volume: 21 Issue: 1, 182 - 198, 31.03.2020

Ahmet Arslan

https://doi.org/10.18038/estubtda.615103

Cited By: 1

Abstract

Web retrieval studies have mostly represented Web documents with the URL, title, body, and anchor text fields. HTML standards, however, provide a rich set of elements for defining different parts of a Web page. For example, meta elements provide structured metadata about a Web page, intended not for end users but for browsers and crawlers. Because most previous studies relied on the URL, title, body, and anchor text fields, it remains unclear whether meta tags are useful for Web retrieval. In this work, we examine the usefulness of two meta tags, namely keywords and description, on the ad-hoc tasks of previous TREC studies. In experiments on the standard TREC Web datasets and several query sets with state-of-the-art term-weighting models, using the description field systematically increases retrieval effectiveness, and the increase is statistically significant most of the time. By contrast, using the keywords field may cause a significant deterioration in retrieval effectiveness for certain term-weighting models.
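To make the two meta fields concrete, the following is a minimal sketch of how the keywords and description values might be pulled out of a page alongside the title, so that each can be indexed as a separate field. It is not the pipeline used in the article; the class name `MetaFieldParser` and the function `extract_fields` are hypothetical, and only Python's standard `html.parser` module is assumed.

```python
from html.parser import HTMLParser


class MetaFieldParser(HTMLParser):
    """Collects the title text plus the keywords and description meta tags.

    Hypothetical helper for illustration; not taken from the article.
    """

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # The name attribute is matched case-insensitively.
            name = (attr.get("name") or "").strip().lower()
            if name in ("keywords", "description"):
                self.fields[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>...</title>.
        if self._in_title:
            self.fields["title"] += data


def extract_fields(html):
    """Return a dict with the title, keywords, and description of a page."""
    parser = MetaFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Calling `extract_fields` on a page returns one string per field; a retrieval system could then index `description` and `keywords` separately from the title and body and weight them independently in a fielded ranking model.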

Keywords

Information Retrieval, Web Retrieval, Meta Tags, ClueWeb, HTML

There are 59 citations in total.

Details

Primary Language	English
Subjects	Engineering
Journal Section	Articles
Authors	Ahmet Arslan 0000-0003-4376-2278
Publication Date	March 31, 2020
Published in Issue	Year 2020 Volume: 21 Issue: 1

Cite

AMA	Arslan A. ON THE USEFULNESS OF HTML META ELEMENTS FOR WEB RETRIEVAL. Eskişehir Technical University Journal of Science and Technology A - Applied Sciences and Engineering. March 2020;21(1):182-198. doi:10.18038/estubtda.615103

Cited By

Towards adaptive structured Dirichlet smoothing model for digital resource objects

Multimedia Tools and Applications

Wafa’ Za’al Alma’aitah

https://doi.org/10.1007/s11042-020-10305-w
