demonstration

Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Author: Peter Christen

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1065 - 1068

https://doi.org/10.1145/1401890.1402020

Published: 24 August 2008

Abstract

Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as data from multiple sources often needs to be matched in order to enrich it or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented only in research proof-of-concept systems, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them in a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Febrl can therefore be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct its work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.
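For readers new to the topic, the following is a minimal sketch of what deduplication involves: records are grouped into blocks, pairs within each block are compared field by field, and pairs scoring above a threshold are classified as matches. It does not use Febrl or reproduce its API. The toy records, the choice of postcode as blocking key, the use of Python's `difflib.SequenceMatcher` as the string comparator, and the 0.8 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; not taken from the paper or from Febrl.
RECORDS = [
    {"id": 1, "given": "peter", "surname": "miller", "postcode": "2600"},
    {"id": 2, "given": "pete", "surname": "miller", "postcode": "2600"},
    {"id": 3, "given": "mary", "surname": "smith", "postcode": "2602"},
    {"id": 4, "given": "peter", "surname": "millar", "postcode": "2600"},
]


def similarity(a, b):
    """Approximate string similarity in [0, 1] (assumed comparator)."""
    return SequenceMatcher(None, a, b).ratio()


def find_duplicates(records, threshold=0.8):
    """Return id pairs whose averaged name similarity reaches the threshold."""
    # Blocking: only records sharing a postcode are compared, which
    # avoids comparing every record with every other record.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["postcode"]].append(rec)

    matches = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            # Comparison: average the per-field similarities.
            score = (
                similarity(r1["given"], r2["given"])
                + similarity(r1["surname"], r2["surname"])
            ) / 2
            # Classification: a simple threshold decides match or non-match.
            if score >= threshold:
                matches.append((r1["id"], r2["id"]))
    return matches


if __name__ == "__main__":
    print(find_duplicates(RECORDS))
```

With these toy records, the three name variants sharing postcode 2600 are paired with one another, while record 3 falls in a different block and is never compared. Real systems replace each step with more robust techniques, such as phonetic blocking keys or learned classifiers.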

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2008

1116 pages

ISBN: 9781605581934

DOI: 10.1145/1401890

General Chair: Ying Li (Microsoft adCenter Labs)

Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Demonstration

Conference

KDD08: The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2008

Las Vegas, Nevada, USA

Acceptance Rates

KDD '08 Paper Acceptance Rate: 118 of 593 submissions, 20%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
