DOI: 10.1145/3331184.3331311
short-paper

On Tradeoffs Between Document Signature Methods for a Legal Due Diligence Corpus

Published: 18 July 2019

Abstract

While document signatures are a well-established tool in IR, they have primarily been investigated in the context of web documents. Legal due diligence documents, by their nature, are more similar to one another in structure and language than documents in standard web collections. Moreover, many due diligence systems strive to support real-time interaction, so the time from document ingestion to availability should be minimal. These constraints further limit the solution space for identifying near-duplicate documents. We examine the tradeoffs that document signature methods face in the due diligence domain. In particular, we quantify the tradeoff between signature length, time to compute, number of hash collisions, and number of nearest neighbours for a 90,000-document due diligence corpus.
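
To make the quantities in the abstract concrete, the following is a minimal sketch of one well-known family of document signatures (a SimHash-style fingerprint compared by Hamming distance). It is an illustration only and is not claimed to be the method, tokenisation, or hash function evaluated in the paper. The function names, the whitespace tokeniser, and the choice of BLAKE2b are all assumptions made for this example. The `bits` parameter stands in for signature length: shorter signatures are cheaper to compute and store but make collisions between distinct documents more likely.

```python
import hashlib
from collections import Counter


def token_hash(token: str, bits: int) -> int:
    """Map a token to a `bits`-wide integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=(bits + 7) // 8).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash-style signature from term-frequency-weighted tokens."""
    weights = Counter(text.lower().split())  # hypothetical whitespace tokeniser
    totals = [0] * bits
    for token, weight in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    signature = 0
    for i, total in enumerate(totals):
        if total > 0:
            signature |= 1 << i
    return signature


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def count_collisions(texts: list[str], bits: int = 64) -> int:
    """Count documents whose signature was already produced by an earlier document."""
    seen = Counter(simhash(t, bits) for t in texts)
    return sum(n - 1 for n in seen.values())
```

Under these assumptions, two documents are treated as near-duplicate candidates when `hamming_distance(simhash(a), simhash(b))` falls below a chosen threshold, and rerunning `count_collisions` with different `bits` values gives a rough feel for how signature length affects collisions on a given collection.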


Published In

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN: 9781450361729
DOI: 10.1145/3331184
Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGIR'19 paper acceptance rate: 84 of 426 submissions, 20%
Overall acceptance rate: 792 of 3,983 submissions, 20%
