DOI: 10.1145/1401890.1402020
Article type: Demonstration

Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface

Published: 24 August 2008

Abstract

Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as data from multiple sources often needs to be matched in order to enrich data or improve its quality. Significant advances in record linkage techniques have been made in recent years. However, many new techniques are either implemented only in research proof-of-concept systems, or they are hidden within expensive 'black box' commercial software. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques, and to compare existing techniques with new ones. The Febrl (Freely Extensible Biomedical Record Linkage) system aims to fill this gap. It contains many recently developed techniques for data cleaning, deduplication and record linkage, and encapsulates them in a graphical user interface (GUI). Febrl thus allows even inexperienced users to learn and experiment with both traditional and new record linkage techniques. Because Febrl is written in Python and its source code is available, it is fairly easy to integrate new record linkage techniques into it. Febrl can therefore be seen as a tool that allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct its work more efficiently. Additionally, Febrl is suitable as a training tool for new record linkage users, and it can also be used for practical linkage projects with data sets that contain up to several hundred thousand records.
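To make the deduplication task described above concrete, the following is a minimal sketch in Python using only the standard library. It is not Febrl's API or its algorithms: the record fields, the blocking key (first letter of the surname), the use of `difflib.SequenceMatcher` as the string comparator, and the 0.85 threshold are all assumptions chosen for illustration. It shows the general shape of a linkage pipeline: group records into blocks, compare pairs within each block, and classify pairs by a similarity threshold.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records; the field names are assumptions for illustration.
RECORDS = [
    {"id": 1, "given": "john", "surname": "smith", "suburb": "sydney"},
    {"id": 2, "given": "jon", "surname": "smith", "suburb": "sydney"},
    {"id": 3, "given": "mary", "surname": "jones", "suburb": "perth"},
    {"id": 4, "given": "joan", "surname": "smythe", "suburb": "hobart"},
]
FIELDS = ("given", "surname", "suburb")


def blocking_key(rec):
    """Assumed blocking key: first letter of the surname."""
    return rec["surname"][:1]


def compare(r1, r2, fields=FIELDS):
    """Average approximate string similarity over the compared fields."""
    scores = [SequenceMatcher(None, r1[f], r2[f]).ratio() for f in fields]
    return sum(scores) / len(scores)


def deduplicate(records, threshold=0.85):
    """Return (id, id, score) for pairs classified as likely duplicates."""
    # Blocking: only records sharing a key are compared, which avoids
    # comparing every record with every other record.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for block in blocks.values():
        for r1, r2 in combinations(block, 2):
            score = compare(r1, r2)
            # Threshold classification (an assumed, simplistic rule).
            if score >= threshold:
                matches.append((r1["id"], r2["id"], score))
    return matches


if __name__ == "__main__":
    for id1, id2, score in deduplicate(RECORDS):
        print(f"records {id1} and {id2} look like duplicates ({score:.2f})")
```

On the toy data, only records 1 and 2 are flagged as likely duplicates. Record 4 falls in the same block but scores well below the threshold. In a realistic setting the comparator, blocking key and classifier would each be chosen and tuned for the data at hand.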



    Published In

    KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2008
    1116 pages
    ISBN: 9781605581934
    DOI: 10.1145/1401890
    General Chair: Ying Li
    Program Chairs: Bing Liu, Sunita Sarawagi

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Python
    2. data cleaning
    3. data linkage
    4. data matching
    5. deduplication
    6. open source software

    Qualifiers

    • Demonstration

    Conference

    KDD08

    Acceptance Rates

    KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
