Replica identification using genetic programming

Published: 16 March 2008

Abstract

Identifying and handling replicas is important to guarantee the quality of the information made available by modern data storage services. Companies and governments have invested heavily in the development of effective methods for removing replicas from large databases. Typically, this investment has produced significant results, since cleaned, replica-free databases not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process and maintain the data. In this paper, we propose a GP-based approach to automatic replica identification that combines evidence based on the data content in order to find a similarity function that is able to identify whether two entries in a repository are replicas or not. As shown by our experiments, our approach outperforms an SVM-based method used as a baseline by at least 6.5%. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our approach is capable of automatically adapting to any given replica identification boundary.
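
As a rough illustration of the idea described in the abstract, the Python sketch below evaluates one candidate similarity function, written as an expression tree over per-attribute evidence, and scores it against labeled record pairs. Everything in it is an assumption made for illustration and is not taken from the paper: the Jaccard token similarity, the field names `name` and `city`, the tree shape, the boundary value of 1.0, and the use of F1 as the fitness measure.

```python
import operator

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed evidence function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def evidence(rec_a: dict, rec_b: dict) -> dict:
    """One similarity score per attribute; both records share the same keys."""
    return {field: jaccard(rec_a[field], rec_b[field]) for field in rec_a}

# Operators allowed at the internal nodes of a candidate tree (hypothetical set).
OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree, ev: dict) -> float:
    """Evaluate a tree: leaves are evidence names or constants,
    internal nodes are (op, left, right) tuples."""
    if isinstance(tree, str):
        return ev[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return OPS[op](evaluate(left, ev), evaluate(right, ev))

def is_replica(tree, rec_a: dict, rec_b: dict, boundary: float) -> bool:
    """Two records are flagged as replicas when the score reaches the boundary."""
    return evaluate(tree, evidence(rec_a, rec_b)) >= boundary

def f1_fitness(tree, labeled_pairs, boundary: float) -> float:
    """F1 of the tree's decisions over (rec_a, rec_b, is_replica) triples."""
    tp = fp = fn = 0
    for rec_a, rec_b, label in labeled_pairs:
        pred = is_replica(tree, rec_a, rec_b, boundary)
        if pred and label:
            tp += 1
        elif pred and not label:
            fp += 1
        elif not pred and label:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical records and candidate function: name + (name * city).
REC_A = {"name": "john smith", "city": "new york"}
REC_B = {"name": "john smith jr", "city": "new york"}
REC_C = {"name": "mary jones", "city": "boston"}
TREE = ("+", "name", ("*", "name", "city"))
BOUNDARY = 1.0
PAIRS = [(REC_A, REC_B, True), (REC_A, REC_C, False)]

print(f1_fitness(TREE, PAIRS, BOUNDARY))
```

In a genetic programming run, many such trees would be generated, scored with a fitness function of this kind, and recombined over successive generations; the sketch covers only the scoring of a single candidate.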



Published In

SAC '08: Proceedings of the 2008 ACM symposium on Applied computing
March 2008
2586 pages
ISBN:9781595937537
DOI:10.1145/1363686
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. genetic programming
  2. replica identification

Qualifiers

  • Research-article

Conference

SAC '08: The 2008 ACM Symposium on Applied Computing
March 16 - 20, 2008
Fortaleza, Ceara, Brazil

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Cited By

  • (2019) Do your Resources Sound Similar? Proceedings of the 10th International Conference on Knowledge Capture, pp. 53-60. DOI: 10.1145/3360901.3364426. Online publication date: 23-Sep-2019
  • (2019) LSVS: Link Specification Verbalization and Summarization. Natural Language Processing and Information Systems, pp. 66-78. DOI: 10.1007/978-3-030-23281-8_6. Online publication date: 26-Jun-2019
  • (2017) A genetic algorithm based entity resolution approach with active learning. Frontiers of Computer Science: Selected Publications from Chinese Universities 11:1, pp. 147-159. DOI: 10.1007/s11704-015-5276-6. Online publication date: 1-Feb-2017
  • (2017) Dynamic Secure Deduplication in Cloud Using Genetic Programming. Data Engineering and Intelligent Computing, pp. 493-502. DOI: 10.1007/978-981-10-3223-3_48. Online publication date: 1-Jun-2017
  • (2016) Contacts Deduplication in Mobile Devices Using Textual Similarity and Machine Learning. Proceedings of the XII Brazilian Symposium on Information Systems on Brazilian Symposium on Information Systems: Information Systems in the Cloud Computing Era - Volume 1, pp. 160-167. DOI: 10.5555/3021955.3021983. Online publication date: 17-May-2016
  • (2015) Citation Enrichment Improves Deduplication of Primary Evidence. Revised Selected Papers of the PAKDD 2015 Workshops on Trends and Applications in Knowledge Discovery and Data Mining - Volume 9441, pp. 237-244. DOI: 10.1007/978-3-319-25660-3_20. Online publication date: 19-May-2015
  • (2014) ERGP. Proceedings of the 2014 11th Web Information System and Application Conference, pp. 215-220. DOI: 10.1109/WISA.2014.46. Online publication date: 12-Sep-2014
  • (2013) Interlinking educational resources and the web of data. Program 47:1, pp. 60-91. DOI: 10.1108/00330331211296312. Online publication date: 8-Feb-2013
  • (2013) An evolutionary approach to complex schema matching. Information Systems 38:3, pp. 302-316. DOI: 10.1016/j.is.2012.10.002. Online publication date: 1-May-2013
  • (2012) Learning expressive linkage rules using genetic programming. Proceedings of the VLDB Endowment 5:11, pp. 1638-1649. DOI: 10.14778/2350229.2350276. Online publication date: 1-Jul-2012
