DOI: 10.1145/2806416.2806613
short-paper

The Influence of Pre-processing on the Estimation of Readability of Web Documents

Published: 17 October 2015

Abstract

This paper investigates the effect that text pre-processing approaches have on the estimation of the readability of web pages. Previous work has highlighted readability as an important aspect of web search result personalisation. The most widely used text readability measures rely on surface-level characteristics of text, such as the length of words and sentences. We demonstrate that different tools for extracting text from web pages lead to very different estimations of readability. This has an important implication for search engines: personalisation strategies that consider users' reading ability may fail if they are computed from incorrect readability estimates.
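
To make the dependence on surface features concrete, the sketch below computes the Automated Readability Index, one widely known measure built only from word and sentence lengths, on a short passage with and without navigation text prepended. It is an illustration, not the paper's method or data: the tokeniser, the naive sentence splitter, and both sample strings are assumptions made for this example.

```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Assumption: a "word" is a run of ASCII letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Assumption: sentences end at '.', '!' or '?'. Real extractors differ here.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43


# Hypothetical page content: the same body text, with and without menu residue.
main_text = "The cat sat on the mat. It was warm. The cat slept."
boilerplate = "Home About Contact Login Share Subscribe Privacy Terms Sitemap Help"

clean_score = automated_readability_index(main_text)
noisy_score = automated_readability_index(boilerplate + " " + main_text)

print(f"extracted body only: {clean_score:.2f}")
print(f"body plus menu text: {noisy_score:.2f}")
```

On this toy input the menu words carry no sentence punctuation, so they merge into the first sentence and raise both the average word length and the average sentence length, which pushes the score up. An extraction tool that keeps such residue would therefore rate the same page as harder to read than one that removes it.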


Published In

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
October 2015
1998 pages
ISBN:9781450337946
DOI:10.1145/2806416

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. readability
  2. text pre-processing

Conference

CIKM'15

Acceptance Rates

CIKM '15 Paper Acceptance Rate 165 of 646 submissions, 26%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%
