Abstract
In this paper we take a fresh look at the information retrieval (IR) problem of balancing recall with precision in electronic document extraction. We examine the IR constructs of uncertainty, context, and relevance; propose a new process model for context learning; and introduce a new IT artifact designed to support user-driven learning by leveraging explicit knowledge to discover implicit knowledge within a corpus of documents. The IT artifact is a prototype that presents a small set of documents extracted from a targeted corpus based on user-supplied criteria. The prototype lets the user balance exploration and exploitation through iterative relevance feedback, addressing the imprecision that results from uncertainty. We model the problem as an exploration–exploitation dilemma and apply it to a specific case of IR called eDiscovery. We conduct a series of behavioral experiments to evaluate the model and the artifact. Our initial findings indicate that the proposed model and the artifact improve IR performance.
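To make the exploration–exploitation framing concrete, the following is a minimal sketch of one common way to balance the two in a relevance feedback loop, an epsilon-greedy rule. It is an illustration only and is not the prototype described in this paper: the function name `review_loop`, the `oracle` callable standing in for the user's relevance judgments, the additive term weights, and the default `epsilon` are all assumptions introduced for this example.

```python
import random


def review_loop(corpus, oracle, rounds, epsilon=0.2, seed=0):
    """Epsilon-greedy document selection with relevance feedback (illustrative).

    corpus: dict mapping a document id to the set of terms it contains
    oracle: callable(doc_id) -> bool, a stand-in for the user's judgment
    Returns a list of (doc_id, judged_relevant) pairs in review order.
    """
    rng = random.Random(seed)
    weights = {}               # term -> weight learned from feedback so far
    unreviewed = set(corpus)
    history = []

    def score(doc_id):
        # Sum of learned weights for the terms in this document.
        return sum(weights.get(term, 0.0) for term in corpus[doc_id])

    for _ in range(min(rounds, len(corpus))):
        candidates = sorted(unreviewed)
        if rng.random() < epsilon:
            doc = rng.choice(candidates)      # explore: look somewhere new
        else:
            doc = max(candidates, key=score)  # exploit: best guess so far

        relevant = oracle(doc)
        delta = 1.0 if relevant else -1.0
        for term in corpus[doc]:              # feedback updates term weights
            weights[term] = weights.get(term, 0.0) + delta

        unreviewed.remove(doc)
        history.append((doc, relevant))
    return history


if __name__ == "__main__":
    # Hypothetical toy corpus; a document is "relevant" if it mentions enrononline.
    toy = {
        "a": {"enrononline", "trading"},
        "b": {"lunch"},
        "c": {"enrononline", "swap"},
        "d": {"lunch", "golf"},
    }
    print(review_loop(toy, lambda d: "enrononline" in toy[d], rounds=4))
```

A larger `epsilon` spends more of the review budget sampling unfamiliar documents, which may surface relevant material the learned weights would miss; a smaller value concentrates on documents that resemble those already judged relevant.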
Notes
Recall and Precision are measures of IR performance explained later in this paper.
We found in our early user interviews that many eDiscovery practitioners would like to use a tool that offered an easy way to “take a quick peek” inside a collection, without having to use a heavy processing application.
For more information on Zubulake and its effects, the reader can consult the 2012 book written by the plaintiff in the case.
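The first note above mentions recall and precision. As a brief illustration using the standard set-based definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved), the sketch below computes both. The function name and the example document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # 4 documents retrieved, 6 truly relevant, 2 in common.
    p, r = precision_recall([1, 2, 3, 4], [2, 4, 5, 6, 7, 8])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more documents tends to raise recall while lowering precision, which is the balance the abstract refers to.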
Appendix: eDiscovery IR request adapted from the TREC legal track 2011 conference problem set #401
The purpose of this task is to retrieve documents that match the request for production below. The company in this case is Enron, a now-defunct energy trading company that was the subject of a large body of litigation, both civil and criminal.
The following is the request for production:
You are requested to produce all documents or communications that describe, discuss, refer to, report on, or relate to the design, development, operation, or marketing of enrononline, or any other online service offered, provided, or used by the Company (or any of its subsidiaries, predecessors, or successors-in-interest), for the purchase, sale, trading, or exchange of financial or other instruments or products, including but not limited to, derivative instruments, commodities, futures, and swaps.
1.1 Additional guidance for relevance
The above request broadly seeks documents concerning Enron online, the Company’s general-purpose trading system, or any other online financial or commodities services offered, provided, or used by the Company and its agents.
In this case, attorney-client communication or otherwise privileged information is not an issue.
This request is seeking information specifically about an online system for trading financial instruments. A document is not relevant if it refers to the purchase, sale, trading, or exchange of a financial instrument or product, but does not involve the use of an online system.
A document is relevant if it describes, discusses, refers to, reports on, or relates to the design, development, operation, or marketing of “enrononline,” or any other online services offered, provided, or used. This includes how the system was set up, how the system worked on a day-to-day basis, how the Company developed or modified the system, how the Company marketed or advertised the system, and the actual use of the system by the Company, its subsidiaries, predecessors, or successors in interest.
A relevant document can concern the purchase, sale, trading, or exchange of financial instruments or financial products, including derivative instruments, commodities, futures, or swaps. These instruments and products are distinguished from other goods and services by the fact that their value depends on future events and their purchase incurs financial risk.
A document is relevant even if it makes only implicit reference to these parameters. No particular transaction (i.e., purchase or sale) need be cited specifically. If the document generally references such activities, transactions, or a system whose function is to execute such transactions, and it otherwise meets the criteria, it is relevant.
Examples of responsive documents include correspondence, policy statements, press releases, contact lists, and enrononline guest access emails.
1.2 Additional guidance for non-relevance
Examples of non-relevant documents include documents about the purchase, sale, trading, or exchange of products or services other than financial instruments or products, and any documents referring to employee stock options or stock purchase plans offered as incentives or compensation, or the exercise thereof. Documents relating to structured finance deals or swaps that are specified explicitly by written contracts are also not relevant, even if the contracts themselves are electronic or electronically signed. Documents related to the use of online systems by Enron employees for their personal use are likewise outside this request and are not relevant.
Cite this article
Hyman, H., Sincich, T., Will, R. et al. A process model for information retrieval context learning and knowledge discovery. Artif Intell Law 23, 103–132 (2015). https://doi.org/10.1007/s10506-015-9165-y
DOI: https://doi.org/10.1007/s10506-015-9165-y