DOI: 10.1145/3322640.3326713
research-article

A Regularization Approach to Combining Keywords and Training Data in Technology-Assisted Review

Published: 17 June 2019

Abstract

Manual keyword queries and supervised learning (technology-assisted review) have been viewed as conflicting approaches to high-recall retrieval tasks (such as civil discovery and sunshine law requests) in the law. We propose a synthesis that uses a keyword list as a regularizer when learning a logistic regression model from labeled examples. Balancing keywords against training data requires knowing how the regularization penalty should scale with training set size. We show, however, that advice on scaling from theory is contradictory, software defaults are inconsistent, and standard practice (validation-based tuning) is impractical in many high-recall retrieval settings. Through experiments on simulated e-discovery data sets, we show that the penalization scheme suggested by a Bayesian interpretation is substantially safer than alternatives from stochastic optimization and computational learning theory. Combining keywords and training data provides better effectiveness on our data sets than using either alone, showing that both approaches bring value.
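As a rough illustration of the idea in the abstract, the sketch below fits a logistic regression whose penalty pulls each weight toward a keyword-derived prior mean rather than toward zero. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fit_keyword_prior_lr`, the toy vocabulary and documents, the prior mean of 2.0 for keyword terms, and the use of plain gradient descent are all hypothetical choices made for illustration. The objective is the summed log loss plus (penalty / 2) * ||w - prior_mean||^2, which is one common way to write a Gaussian-prior (MAP) estimate; because the loss is summed rather than averaged, a fixed penalty weighs less as the training set grows.

```python
import numpy as np

def fit_keyword_prior_lr(X, y, prior_mean, penalty, n_iter=2000):
    """Minimize sum_i logloss_i(w) + (penalty / 2) * ||w - prior_mean||^2.

    Plain gradient descent, started at the prior mean. The loss is summed
    (not averaged) over examples, so a fixed `penalty` matters less as the
    training set grows.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    prior_mean = np.asarray(prior_mean, dtype=float)
    # Safe step size: 1 / (upper bound on the curvature of the objective).
    step = 1.0 / (0.25 * np.linalg.norm(X, 2) ** 2 + penalty)
    w = prior_mean.copy()
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted P(responsive)
        grad = X.T @ (p - y) + penalty * (w - prior_mean)
        w -= step * grad
    return w

# Hypothetical toy data: columns are binary term indicators.
vocab = ["merger", "lunch", "contract", "golf"]
keywords = {"merger", "contract"}
# Assumption: each keyword gets prior mean 2.0, every other term 0.0.
prior_mean = np.array([2.0 if t in keywords else 0.0 for t in vocab])

X = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 0]])
y = np.array([1, 0, 0, 1])  # 1 = responsive, 0 = not responsive

w = fit_keyword_prior_lr(X, y, prior_mean, penalty=1.0)
for term, weight in zip(vocab, w):
    print(f"{term:10s} {weight:+.3f}")
```

In this toy setup the term "contract" never occurs in the labeled documents, so its weight stays at the keyword prior, while "golf" occurs only in a non-responsive document and is pushed below zero by the data. Raising `penalty` moves every weight back toward the keyword prior; lowering it lets the labeled examples dominate.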



Published In

ICAIL '19: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law
June 2019
312 pages
ISBN: 9781450367547
DOI: 10.1145/3322640

Sponsors

In-Cooperation

  • University of Montreal
  • AAAI
  • IAAIL: International Association for Artificial Intelligence and Law

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Bayesian priors
  2. informative priors
  3. keywords
  4. logistic regression
  5. regularization
  6. technology-assisted review
  7. text categorization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICAIL '19

Acceptance Rates

Overall Acceptance Rate 69 of 169 submissions, 41%

Article Metrics

  • Downloads (Last 12 months): 8
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 12 Nov 2024

Cited By

  • (2024) Contextualization with SPLADE for High Recall Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2337-2341. DOI: 10.1145/3626772.3657919. Online publication date: 10-Jul-2024
  • (2023) Explainable e-Discovery (XeD) Using an Interpretable Fuzzy ARTMAP Neural Network for Technology-Assisted Review. 2023 IEEE International Conference on Big Data (BigData), 2761-2766. DOI: 10.1109/BigData59044.2023.10386391. Online publication date: 15-Dec-2023
  • (2022) TARexp. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3256-3261. DOI: 10.1145/3477495.3531663. Online publication date: 6-Jul-2022
  • (2022) Fuzzy Law: Towards Creating a Novel Explainable Technology-Assisted Review System for e-Discovery. 2022 IEEE International Conference on Big Data (Big Data), 1218-1223. DOI: 10.1109/BigData55660.2022.10020503. Online publication date: 17-Dec-2022
  • (2022) Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review. Advances in Information Retrieval, 502-517. DOI: 10.1007/978-3-030-99736-6_34. Online publication date: 5-Apr-2022
  • (2019) Text Retrieval Priors for Bayesian Logistic Regression. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1045-1048. DOI: 10.1145/3331184.3331299. Online publication date: 18-Jul-2019
