A Regularization Approach to Combining Keywords and Training Data in Technology-Assisted Review

Authors: David D. Lewis, Ophir Frieder

ICAIL '19: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law

Pages 153 - 162

https://doi.org/10.1145/3322640.3326713

Published: 17 June 2019

Abstract

Manual keyword queries and supervised learning (technology-assisted review) have been viewed as conflicting approaches to high-recall retrieval tasks in the law, such as civil discovery and sunshine law requests. We propose a synthesis that uses a keyword list as a regularizer when learning a logistic regression model from labeled examples. Balancing keywords against training data requires knowing how the regularization penalty should scale with training set size. We show, however, that advice on scaling from theory is contradictory, software defaults are inconsistent, and standard practice (validation-based tuning) is impractical in many high-recall retrieval settings. Through experiments on simulated e-discovery datasets, we show that the penalization scheme suggested by a Bayesian interpretation is substantially safer than alternatives from stochastic optimization and computational learning theory. Combining keywords and training data provides better effectiveness on our datasets than using either alone, showing that both approaches bring value.
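
The synthesis described in the abstract can be illustrated with a minimal sketch. Since the abstract gives no formulas, the sketch assumes that the keyword list enters as a nonzero prior mean on the coefficients of keyword terms, and that the penalty is a squared distance from that prior mean added to the summed log loss. The function name, the toy vocabulary, the prior weight of 2.0, and the penalty values are all hypothetical, and plain gradient descent stands in for whatever solver the authors used.

```python
import numpy as np


def fit_keyword_regularized_logreg(X, y, prior_mean, penalty=1.0, steps=2000):
    """Minimize sum_i logloss(y_i, x_i . w) + (penalty / 2) * ||w - prior_mean||^2.

    Assumption: the loss is summed (not averaged) over examples and the
    penalty is held fixed, so the data term grows with training set size.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)            # labels in {0, 1}
    prior_mean = np.asarray(prior_mean, dtype=float)
    w = prior_mean.copy()                     # start from the keyword prior
    # Upper bound on the gradient's Lipschitz constant gives a safe step size.
    lipschitz = 0.25 * np.linalg.norm(X, 2) ** 2 + penalty
    step = 1.0 / lipschitz
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))    # predicted probabilities
        grad = X.T @ (p - y) + penalty * (w - prior_mean)
        w = w - step * grad
    return w


# Hypothetical vocabulary: columns are "fraud", "invoice", "lunch".
X = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y = np.array([1, 1, 0, 0])
# Hypothetical keyword list {"fraud"} encoded as a positive prior mean.
keyword_prior = np.array([2.0, 0.0, 0.0])

w_prior_heavy = fit_keyword_regularized_logreg(X, y, keyword_prior, penalty=1e6)
w_data_heavy = fit_keyword_regularized_logreg(X, y, keyword_prior, penalty=1e-3)
print("large penalty:", w_prior_heavy)
print("small penalty:", w_data_heavy)
```

With a very large penalty the fitted weights stay at the keyword prior; with a small one the labeled examples take over. Because the loss is summed while the penalty stays fixed, the keyword prior dominates when labeled examples are few and fades as they accumulate. This is one reading of the Bayesian scaling mentioned in the abstract, not a reproduction of the paper's experiments.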

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning algorithms
      1. Regularization
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification

Published In

ICAIL '19: Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law

June 2019

312 pages

ISBN:9781450367547

DOI:10.1145/3322640

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAI: ACM Special Interest Group on Artificial Intelligence

In-Cooperation

Univ. of Montreal: University of Montreal
AAAI
IAAIL: International Association for Artificial Intelligence and Law

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICAIL '19

Sponsor:

SIGAI

ICAIL '19: Seventeenth International Conference on Artificial Intelligence and Law

June 17 - 21, 2019

Montreal, QC, Canada

Acceptance Rates

Overall Acceptance Rate 69 of 169 submissions, 41%

Bibliometrics

Total Citations: 6
Total Downloads: 160
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 2

Reflects downloads up to 12 Nov 2024

Cited By

Yang E, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Contextualization with SPLADE for High Recall Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2337-2341. doi:10.1145/3626772.3657919. Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3626772.3657919
Courchaine C, Tabassum T, Wade C, Sethi R (2023). Explainable e-Discovery (XeD) Using an Interpretable Fuzzy ARTMAP Neural Network for Technology-Assisted Review. 2023 IEEE International Conference on Big Data (BigData), 2761-2766. doi:10.1109/BigData59044.2023.10386391. Online publication date: 15-Dec-2023
https://doi.org/10.1109/BigData59044.2023.10386391
Yang E, Lewis D, Amigo E, Castells P, Gonzalo J, Carterette B, Culpepper J, Kazai G (2022). TARexp. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 3256-3261. doi:10.1145/3477495.3531663. Online publication date: 6-Jul-2022
https://dl.acm.org/doi/10.1145/3477495.3531663
Courchaine C, Sethi R (2022). Fuzzy Law: Towards Creating a Novel Explainable Technology-Assisted Review System for e-Discovery. 2022 IEEE International Conference on Big Data (Big Data), 1218-1223. doi:10.1109/BigData55660.2022.10020503. Online publication date: 17-Dec-2022
https://doi.org/10.1109/BigData55660.2022.10020503
Yang E, MacAvaney S, Lewis D, Frieder O (2022). Goldilocks: Just-Right Tuning of BERT for Technology-Assisted Review. Advances in Information Retrieval, 502-517. doi:10.1007/978-3-030-99736-6_34. Online publication date: 5-Apr-2022
https://doi.org/10.1007/978-3-030-99736-6_34
Yang E, Lewis D, Frieder O, Piwowarski B, Chevalier M, Gaussier E, Maarek Y, Nie J, Scholer F (2019). Text Retrieval Priors for Bayesian Logistic Regression. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1045-1048. doi:10.1145/3331184.3331299. Online publication date: 18-Jul-2019
https://dl.acm.org/doi/10.1145/3331184.3331299
