DOI: 10.1145/2254129.2254153
research-article

On generating large-scale ground truth datasets for the deduplication of bibliographic records

Published: 13 June 2012

Abstract

Mendeley's crowd-sourced catalogue of research papers forms the basis of features such as searching for papers, finding papers related to the one currently being viewed, and personalised recommendations. To generate this catalogue, it is necessary to deduplicate the records uploaded from users' libraries and imported from external sources such as PubMed and arXiv. At Mendeley, this task is performed by an automated system.
However, the quality of the deduplication needs to be improved. "Ground truth" datasets are thus needed for evaluating the system's performance, but existing datasets are very small. In this paper, we tackle the problem of generating large-scale datasets from Mendeley's database. An approach based purely on random sampling produced very easy datasets, so we explored approaches that focus on more difficult examples. We found that selecting duplicates and non-duplicates from documents with similar titles produced more challenging datasets. Additionally, we established that a Solr-based deduplication system can achieve deduplication quality similar to that of the fingerprint-based system currently employed. Finally, we introduce a large-scale deduplication ground truth dataset that we hope will be useful to others tackling deduplication.
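
The idea of drawing evaluation pairs from documents with similar titles can be illustrated with a minimal sketch. It is not the procedure used in the paper. The record fields (`id`, `title`, `key`), the use of a trusted identifier such as a DOI as the label source, the similarity measure, and the threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations
import random


def title_similarity(a, b):
    """Character-level similarity in [0, 1] between two lowercased titles."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sample_hard_pairs(records, threshold=0.8, max_pairs=1000, seed=0):
    """Return (id_a, id_b, is_duplicate) tuples for pairs with similar titles.

    `records` is a list of dicts with the hypothetical keys 'id', 'title'
    and 'key'. Two records are labelled duplicates when their 'key' values
    match (an assumed label source, e.g. a shared DOI).
    """
    candidates = []
    # All-pairs comparison is quadratic; acceptable only for a small sketch.
    for a, b in combinations(records, 2):
        if title_similarity(a["title"], b["title"]) >= threshold:
            candidates.append((a["id"], b["id"], a["key"] == b["key"]))
    # Shuffle deterministically, then cap the number of returned pairs.
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:max_pairs]
```

Because only pairs above the title-similarity threshold are kept, the non-duplicates in the sample are near misses rather than unrelated documents, which is what makes the resulting set harder than one drawn at random. At catalogue scale the all-pairs loop would have to be replaced by an index lookup.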



Published In

WIMS '12: Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics
June 2012
571 pages
ISBN:9781450309158
DOI:10.1145/2254129

Sponsors

  • UCV: University of Craiova
  • WNRI: Western Norway Research Institute

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. bibliographic metadata
  2. fingerprinting
  3. near duplicate detection
  4. nearest neighbor search



Acceptance Rates

Overall Acceptance Rate 140 of 278 submissions, 50%


Cited By

  • (2019) Leveraging active learning to reduce human effort in the generation of ground‐truth for entity resolution. Computational Intelligence 36:2 (743-772). DOI: 10.1111/coin.12268. Online publication date: 27-Dec-2019
  • (2018) Mendeley's open data for science and learning. International Journal of Technology Enhanced Learning 4:1/2 (31-46). DOI: 10.1504/IJTEL.2012.048309. Online publication date: 15-Dec-2018
  • (2018) Comparison of downloads, citations and readership data for two information systems journals. Scientometrics 101:2 (1113-1128). DOI: 10.1007/s11192-014-1365-9. Online publication date: 27-Dec-2018
  • (2016) Evaluation of unique identifiers used as keys to match identical publications in Pure and SciVal – a case study from health science. F1000Research 5 (1539). DOI: 10.12688/f1000research.8913.2. Online publication date: 6-Sep-2016
  • (2016) Evaluation of unique identifiers used for citation linking. F1000Research 5 (1539). DOI: 10.12688/f1000research.8913.1. Online publication date: 29-Jun-2016
  • (2016) De-duplicating a large crowd-sourced catalogue of bibliographic records. Program 50:2 (138-156). DOI: 10.1108/PROG-02-2015-0021. Online publication date: 4-Apr-2016
  • (2015) Author gender metadata augmentation of hathitrust digital library. Proceedings of the American Society for Information Science and Technology 51:1 (1-4). DOI: 10.1002/meet.2014.14505101098. Online publication date: 24-Apr-2015
  • (2012) Harnessing user library statistics for research evaluation and knowledge domain visualization. Proceedings of the 21st International Conference on World Wide Web (1017-1024). DOI: 10.1145/2187980.2188236. Online publication date: 16-Apr-2012
