DOI: 10.1145/2124295.2124323

Characterizing web content, user interests, and search behavior by reading level and topic

Published: 08 February 2012

Abstract

A user's expertise, or ability to understand a document on a given topic, is an important aspect of that document's relevance. However, this aspect has not been well explored in information retrieval systems, particularly those at Web scale, where the great diversity of content, users, and tasks presents an especially challenging search problem. To help improve our modeling and understanding of this diversity, we apply automatic text classifiers, based on reading difficulty and topic prediction, to estimate a novel type of profile for important entities in Web search: users, websites, and queries. These profiles capture topic and reading level distributions, which we then use in conjunction with search log data to characterize and compare different entities.
We find that reading level and topic distributions provide an important new representation of Web content and user interests, and that using both together is more effective than using either one separately. In particular, we find that: 1) the reading level of Web content and the diversity of visitors to a website can vary greatly by topic; 2) the degree to which a user's profile matches a site's profile is closely correlated with the user's preference for that website in search results; and 3) site or URL profiles can be used to predict 'expertness', that is, whether a given site or URL is oriented toward expert or non-expert users. Our findings provide strong evidence in favor of jointly incorporating reading level and topic distribution metadata into a variety of critical tasks in Web information systems.
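
To make the idea of a profile concrete, the following is a minimal sketch, not the authors' implementation. It assumes each page has already been labeled by classifiers with a (reading level, topic) pair, builds a joint distribution from those labels, and compares a user profile with site profiles. The abstract does not name the matching measure, so the Jensen-Shannon divergence used here is an assumption, and the function names, grade levels, and topic labels are all hypothetical.

```python
from collections import Counter
from math import log2


def build_profile(labels):
    """Turn a list of (reading_level, topic) labels into a joint distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def marginal(profile, axis):
    """Collapse a joint profile onto reading level (axis=0) or topic (axis=1)."""
    out = Counter()
    for key, p in profile.items():
        out[key[axis]] += p
    return dict(out)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint profiles.

    Assumption: the source does not specify this measure; it is used here only
    as one reasonable way to score how well two profiles match.
    """
    keys = set(p) | set(q)
    mix = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mix(a):
        return sum(a[k] * log2(a[k] / mix[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mix(p) + 0.5 * kl_to_mix(q)


# Hypothetical labels: (school grade level, topic) for pages a user clicked
# and for pages hosted on two sites.
user = build_profile([(5, "Health"), (5, "Health"), (6, "Sports"), (5, "Sports")])
site_general = build_profile([(5, "Health"), (6, "Health"), (5, "Sports")])
site_expert = build_profile([(12, "Health"), (11, "Health"), (12, "Science")])

for name, site in [("general", site_general), ("expert", site_expert)]:
    joint = js_divergence(user, site)
    by_topic = js_divergence(marginal(user, 1), marginal(site, 1))
    by_level = js_divergence(marginal(user, 0), marginal(site, 0))
    print(f"{name}: joint={joint:.3f} topic={by_topic:.3f} level={by_level:.3f}")
```

In this toy data the user and the expert site share the Health topic, so a topic-only comparison sees some overlap, while the joint profile shows that the two never coincide on reading level and topic together. A lower divergence means a closer match between the user and the site.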

Supplementary Material

JPG File (wsdm_day1_session2_2.jpg)
MP4 File (wsdm_day1_session2_2.mp4)


    Published In

    WSDM '12: Proceedings of the fifth ACM international conference on Web search and data mining
    February 2012
    792 pages
    ISBN:9781450307475
    DOI:10.1145/2124295
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. domain expertise
    2. log analysis
    3. reading level prediction
    4. topic prediction
    5. web search

    Qualifiers

    • Research-article

    Conference

    Acceptance Rates

    Overall Acceptance Rate 498 of 2,863 submissions, 17%

    Article Metrics

    • Downloads (last 12 months): 38
    • Downloads (last 6 weeks): 10
    Reflects downloads up to 13 Nov 2024

    Cited By

    • (2024) Readability prediction: How many features are necessary? The Annals of Applied Statistics 18:2. DOI: 10.1214/23-AOAS1820. Online publication date: 1-Jun-2024
    • (2023) Cross-Corpus Readability Compatibility Assessment for English Texts. IEEE Access 11 (101985-101997). DOI: 10.1109/ACCESS.2023.3315834. Online publication date: 2023
    • (2022) Reducing the dependency of having prior domain knowledge for effective online information retrieval. Expert Systems 40:4. DOI: 10.1111/exsy.13014. Online publication date: Jun-2022
    • (2022) The effects of topic familiarity on college students' learning search process. Aslib Journal of Information Management 74:6 (1151-1173). DOI: 10.1108/AJIM-09-2021-0252. Online publication date: 16-May-2022
    • (2022) Fuzzy ontology as a basis for recommendation Systems for Traveler’s preference. Multimedia Tools and Applications. DOI: 10.1007/s11042-021-11780-5. Online publication date: 15-Jan-2022
    • (2021) Are Topics Interesting or Not? An LDA-based Topic-graph Probabilistic Model for Web Search Personalization. ACM Transactions on Information Systems 40:3 (1-24). DOI: 10.1145/3476106. Online publication date: 30-Dec-2021
    • (2020) Smart Technique for Cache-Assisted Device to Device Communications. IEEE Access 8 (181485-181499). DOI: 10.1109/ACCESS.2020.3028565. Online publication date: 2020
    • (2020) Web behavior analysis in social life logging. The Journal of Supercomputing 77:2 (1301-1320). DOI: 10.1007/s11227-020-03304-z. Online publication date: 14-May-2020
    • (2020) Exposing Students to New Terminologies While Collecting Browsing Search Data (Best Technical Paper). Artificial Intelligence XXXVII (3-17). DOI: 10.1007/978-3-030-63799-6_1. Online publication date: 8-Dec-2020
    • (2019) Collaborative Search Engine for Enhancing Personalized User Search Based on Domain Knowledge. Journal of Medical Systems 43:8 (1-9). DOI: 10.1007/s10916-019-1350-1. Online publication date: 1-Aug-2019
