research-article

Replica identification using genetic programming

Authors:

Moisés G. Carvalho,

Alberto H. F. Laender,

Marcos André Gonçalves,

Altigran S. da Silva

SAC '08: Proceedings of the 2008 ACM symposium on Applied computing

Pages 1801 - 1806

https://doi.org/10.1145/1363686.1364118

Published: 16 March 2008

Abstract

Identifying and handling replicas is important for guaranteeing the quality of the information made available by modern data storage services. Companies and governments have invested heavily in the development of effective methods for removing replicas from large databases. This investment has typically produced significant results, since cleaned, replica-free databases not only allow the retrieval of higher-quality information but also lead to a more concise data representation and to potential savings in the computational time and resources needed to process and maintain this data. In this paper, we propose a genetic programming (GP) based approach to automatic replica identification that combines evidence based on the data content in order to find a similarity function that is able to identify whether two entries in a repository are replicas. As shown by our experiments, our approach outperforms an SVM-based method used as a baseline by at least 6.5%. Moreover, the suggested functions are computationally less demanding since they use less evidence. In addition, our approach is capable of automatically adapting to any given replica identification boundary.
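To make the idea of a similarity function that combines evidence more concrete, the following minimal sketch shows how one candidate function, written as an expression tree over evidence values, might be evaluated and scored against a replica identification boundary. Everything here is an assumption for illustration: the attribute names, the two string measures, the operator set, and the F1-based fitness are hypothetical choices, since the abstract does not specify the ones used in the paper. The sketch covers only the evaluation of a single candidate, not the evolutionary search itself.

```python
from difflib import SequenceMatcher


def char_sim(a, b):
    """Character-level similarity in [0, 1] (difflib ratio, an illustrative stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a, b):
    """Token-level Jaccard similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


# Hypothetical evidence: each entry pairs an attribute with a similarity measure.
EVIDENCE = {
    "name_char": lambda r, s: char_sim(r["name"], s["name"]),
    "name_tok": lambda r, s: jaccard(r["name"], s["name"]),
    "city_char": lambda r, s: char_sim(r["city"], s["city"]),
}


def evaluate(tree, r, s):
    """Evaluate an expression tree on a pair of records.

    A leaf is an evidence name or a numeric constant;
    an internal node is a tuple (operator, left, right).
    """
    if isinstance(tree, str):
        return EVIDENCE[tree](r, s)
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    x, y = evaluate(left, r, s), evaluate(right, r, s)
    if op == "+":
        return x + y
    if op == "-":
        return x - y
    if op == "*":
        return x * y
    raise ValueError(f"unknown operator: {op}")


def is_replica(tree, r, s, boundary):
    """Classify a pair as replicas when the combined value reaches the boundary."""
    return evaluate(tree, r, s) >= boundary


def f1_fitness(tree, labelled_pairs, boundary):
    """Score a candidate tree by F1 over (record, record, is_replica) triples."""
    tp = fp = fn = 0
    for r, s, label in labelled_pairs:
        predicted = is_replica(tree, r, s, boundary)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    a = {"name": "John Smith", "city": "Boston"}
    b = {"name": "Jon Smith", "city": "Boston"}
    c = {"name": "Mary Jones", "city": "Denver"}
    candidate = ("+", "name_char", "city_char")  # uses only two of the three evidence items
    print(f1_fitness(candidate, [(a, b, True), (a, c, False)], boundary=1.5))
```

In a GP setting, a population of such trees would be varied by crossover and mutation and ranked by a fitness of this kind. A tree that reaches a good score while referencing fewer evidence items is cheaper to apply, because each omitted leaf is one similarity computation that is never performed.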

Index Terms

Replica identification using genetic programming
1. Computing methodologies
  1. Artificial intelligence
    1. Philosophical/theoretical foundations of artificial intelligence
2. Information systems
  1. Information retrieval
  2. Information storage systems

Information & Contributors

Information

Published In

SAC '08: Proceedings of the 2008 ACM symposium on Applied computing

March 2008

2586 pages

ISBN:9781595937537

DOI:10.1145/1363686

Conference Chairs:
Roger L. Wainwright (University of Tulsa),
Hisham M. Haddad (Kennesaw State University)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGAPP: ACM Special Interest Group on Applied Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

MCT/CNPq/CT-INFO
Conselho Nacional de Desenvolvimento Científico e Tecnológico

Conference

SAC '08

Sponsor:

SIGAPP

SAC '08: The 2008 ACM Symposium on Applied Computing

March 16 - 20, 2008

Fortaleza, Ceara, Brazil

Acceptance Rates

Overall Acceptance Rate 1,650 of 6,669 submissions, 25%

Citations

Cited By

Ahmed A, Sherif M, Ngonga Ngomo A, Kejriwal M, Szekely P, Troncy R (2019). Do your Resources Sound Similar? Proceedings of the 10th International Conference on Knowledge Capture, pp. 53-60. DOI: 10.1145/3360901.3364426. Online publication date: 23-Sep-2019
https://dl.acm.org/doi/10.1145/3360901.3364426
Ahmed A, Sherif M, Ngomo A (2019). LSVS: Link Specification Verbalization and Summarization. Natural Language Processing and Information Systems, pp. 66-78. DOI: 10.1007/978-3-030-23281-8_6. Online publication date: 26-Jun-2019
https://dl.acm.org/doi/10.1007/978-3-030-23281-8_6
Sun C, Shen D, Kou Y, Nie T, Yu G (2017). A genetic algorithm based entity resolution approach with active learning. Frontiers of Computer Science: Selected Publications from Chinese Universities 11:1, pp. 147-159. DOI: 10.1007/s11704-015-5276-6. Online publication date: 1-Feb-2017
https://dl.acm.org/doi/10.1007/s11704-015-5276-6
Pandu Ranga Rao K, Krishna Reddy V, Yakoob S (2017). Dynamic Secure Deduplication in Cloud Using Genetic Programming. Data Engineering and Intelligent Computing, pp. 493-502. DOI: 10.1007/978-981-10-3223-3_48. Online publication date: 1-Jun-2017
https://doi.org/10.1007/978-981-10-3223-3_48
Machado R, Pinheiro R, Machado K, Borges E, Siqueira F, Vilain P, Cappelli C, Wazlawick R (2016). Contacts Deduplication in Mobile Devices Using Textual Similarity and Machine Learning. Proceedings of the XII Brazilian Symposium on Information Systems on Brazilian Symposium on Information Systems: Information Systems in the Cloud Computing Era - Volume 1, pp. 160-167. DOI: 10.5555/3021955.3021983. Online publication date: 17-May-2016
https://dl.acm.org/doi/10.5555/3021955.3021983
Choong M, Thorning S, Tsafnat G (2015). Citation Enrichment Improves Deduplication of Primary Evidence. Revised Selected Papers of the PAKDD 2015 Workshops on Trends and Applications in Knowledge Discovery and Data Mining - Volume 9441, pp. 237-244. DOI: 10.1007/978-3-319-25660-3_20. Online publication date: 19-May-2015
https://dl.acm.org/doi/10.1007/978-3-319-25660-3_20
Sun C, Shen D, Kou Y, Nie T, Yu G (2014). ERGP. Proceedings of the 2014 11th Web Information System and Application Conference, pp. 215-220. DOI: 10.1109/WISA.2014.46. Online publication date: 12-Sep-2014
https://dl.acm.org/doi/10.1109/WISA.2014.46
Dietze S, Sanchez‐Alonso S, Ebner H, Qing Yu H, Giordano D, Marenzi I, Pereira Nunes B (2013). Interlinking educational resources and the web of data. Program 47:1, pp. 60-91. DOI: 10.1108/00330331211296312. Online publication date: 8-Feb-2013
https://doi.org/10.1108/00330331211296312
De Carvalho M, Laender A, Gonçalves M, Da Silva A (2013). An evolutionary approach to complex schema matching. Information Systems 38:3, pp. 302-316. DOI: 10.1016/j.is.2012.10.002. Online publication date: 1-May-2013
https://dl.acm.org/doi/10.1016/j.is.2012.10.002
Isele R, Bizer C (2012). Learning expressive linkage rules using genetic programming. Proceedings of the VLDB Endowment 5:11, pp. 1638-1649. DOI: 10.14778/2350229.2350276. Online publication date: 1-Jul-2012
https://dl.acm.org/doi/10.14778/2350229.2350276
