DOI: 10.1145/2766462.2767699

Assessor Differences and User Preferences in Tweet Timeline Generation

Published: 09 August 2015

Abstract

In information retrieval evaluation, when presented with an effectiveness difference between two systems, there are three relevant questions one might ask. First, is the difference statistically significant? Second, is the comparison stable with respect to assessor differences? Finally, is the difference actually meaningful to a user? This paper tackles the last two questions, on assessor differences and user preferences, in the context of the newly introduced tweet timeline generation task in the TREC 2014 Microblog track, where the system's goal is to construct an informative summary of non-redundant tweets that addresses the user's information need. Central to the evaluation methodology are human-generated semantic clusters of tweets that contain substantively similar information. We show that the evaluation is stable with respect to assessor differences in clustering and that user preferences generally correlate with effectiveness metrics, even though users are not explicitly aware of the semantic clustering performed by the systems. Although our analyses are limited to this particular task, we believe the lessons learned could generalize to other evaluations based on establishing semantic equivalence between information units, such as nugget-based evaluations in question answering and temporal summarization.
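
The evaluation described here scores a system's timeline against human-generated semantic clusters: covering more distinct clusters is better, while returning several tweets from the same cluster is redundant. Below is a minimal Python sketch of this style of cluster-based scoring; the function names, the simple recall and precision definitions, and the toy data are illustrative assumptions, not the official TREC 2014 Microblog track scoring script.

    # Minimal sketch of cluster-based timeline scoring (illustrative only, not the
    # official TREC 2014 Microblog track evaluation script). Systems are rewarded
    # for covering distinct semantic clusters and penalized for redundant tweets.

    from typing import Dict, List, Set

    def cluster_recall(returned: List[str], clusters: Dict[str, Set[str]]) -> float:
        """Fraction of gold clusters covered by at least one returned tweet."""
        if not clusters:
            return 0.0
        covered = {cid for cid, members in clusters.items()
                   if any(tweet in members for tweet in returned)}
        return len(covered) / len(clusters)

    def cluster_precision(returned: List[str], clusters: Dict[str, Set[str]]) -> float:
        """Fraction of returned tweets that are relevant and not redundant with
        an earlier tweet from the same cluster."""
        if not returned:
            return 0.0
        seen: Set[str] = set()
        useful = 0
        for tweet in returned:
            cid = next((c for c, members in clusters.items() if tweet in members), None)
            if cid is not None and cid not in seen:
                useful += 1
                seen.add(cid)
        return useful / len(returned)

    if __name__ == "__main__":
        # Hypothetical gold clusters: tweet ids grouped by substantively similar content.
        gold = {"c1": {"t1", "t2"}, "c2": {"t3"}, "c3": {"t4", "t5"}}
        run = ["t1", "t2", "t4", "t9"]       # t2 duplicates cluster c1; t9 is non-relevant
        print(cluster_recall(run, gold))     # -> 2/3 of clusters covered
        print(cluster_precision(run, gold))  # -> 2 of the 4 returned tweets are useful (0.5)

A scheme along these lines rewards coverage of many distinct clusters while penalizing redundant tweets, which matches the task's goal of an informative summary of non-redundant tweets.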





    Published In

    SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
    August 2015, 1198 pages
    ISBN: 9781450336215
    DOI: 10.1145/2766462
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


    Publisher

    Association for Computing Machinery, New York, NY, United States



    Author Tags

    1. microblog search
    2. trec evaluation
    3. user study

    Qualifiers

    • Research-article


    Cited By

    • (2018) Towards Automatic Evaluation of Customer-Helpdesk Dialogues. Journal of Information Processing, 26:768-778. DOI: 10.2197/ipsjjip.26.768. Online publication date: 2018.
    • (2018) A Personal Privacy Preserving Framework. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 295-304. DOI: 10.1145/3209978.3209995. Online publication date: 27-Jun-2018.
    • (2017) A Comparison of Nuggets and Clusters for Evaluating Timeline Summaries. Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 67-76. DOI: 10.1145/3132847.3133000. Online publication date: 6-Nov-2017.
    • (2017) Automatic Generation of Event Timelines from Social Data. Proceedings of the 2017 ACM on Web Science Conference, 207-211. DOI: 10.1145/3091478.3091519. Online publication date: 25-Jun-2017.
    • (2017) Online In-Situ Interleaved Evaluation of Real-Time Push Notification Systems. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 415-424. DOI: 10.1145/3077136.3080808. Online publication date: 7-Aug-2017.
    • (2017) EveTAR: building a large-scale multi-task test collection over Arabic tweets. Information Retrieval Journal, 21(4):307-336. DOI: 10.1007/s10791-017-9325-7. Online publication date: 21-Dec-2017.
    • (2016) A Study of Realtime Summarization Metrics. Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2125-2130. DOI: 10.1145/2983323.2983653. Online publication date: 24-Oct-2016.
    • (2016) Simple Dynamic Emission Strategies for Microblog Filtering. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 1009-1012. DOI: 10.1145/2911451.2914704. Online publication date: 7-Jul-2016.
    • (2016) An Exploration of Evaluation Metrics for Mobile Push Notifications. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 741-744. DOI: 10.1145/2911451.2914694. Online publication date: 7-Jul-2016.
    • (2016) Interleaved Evaluation for Retrospective Summarization and Prospective Notification on Document Streams. Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 175-184. DOI: 10.1145/2911451.2911494. Online publication date: 7-Jul-2016.
