DOI: 10.1145/1963405.1963439

Semi-supervised truth discovery

Published: 28 March 2011

Abstract

Accessing online information from various data sources has become a necessary part of our everyday life. Unfortunately, such information is not always trustworthy: sources differ widely in quality and often provide inaccurate and conflicting information. Existing approaches attack this problem with unsupervised learning methods, inferring the confidence of each data value and the trustworthiness of each source from one another, on the assumption that values provided by more sources are more accurate. However, because false values can spread widely through copying among sources, and out-of-date data often overwhelm up-to-date data, such bootstrapping methods are often ineffective.
In this paper we propose a semi-supervised approach that finds true values with the help of ground truth data. Such ground truth data, even in very small amounts, can greatly help us identify trustworthy data sources. Unlike existing studies that only provide iterative algorithms, we derive the optimal solution to our problem and provide an iterative algorithm that converges to it. Experiments show that our method achieves higher accuracy than existing approaches, and that it can be applied to very large data sets when implemented with MapReduce.
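
The contrast the abstract draws can be illustrated with a toy sketch. This is not the algorithm derived in the paper: it is a generic trust/confidence bootstrapping loop with log-odds voting, chosen here only for illustration, in which labeled facts are clamped to their known values. The function name `discover_truth`, its parameters, the starting trust of 0.8, and the sample data are all assumptions made for the example.

```python
import math
from collections import defaultdict

def discover_truth(claims, ground_truth=None, iterations=20, init_trust=0.8):
    """Toy truth discovery. claims: (source, object, value) triples;
    ground_truth: optional {object: known true value}. Illustrative only."""
    claims = list(claims)
    ground_truth = ground_truth or {}
    values_of = defaultdict(set)
    for _, obj, val in claims:
        values_of[obj].add(val)
    trust = {s: init_trust for s, _, _ in claims}
    conf = {}
    for _ in range(iterations):
        # 1. Each source votes for its value with the log-odds of its trust.
        score = defaultdict(float)
        for s, obj, val in claims:
            t = min(max(trust[s], 0.01), 0.99)  # clip so the logit stays finite
            score[(obj, val)] += math.log(t / (1.0 - t))
        # 2. Normalize scores into per-object confidences (softmax).
        for obj, vals in values_of.items():
            top = max(score[(obj, v)] for v in vals)
            exps = {v: math.exp(score[(obj, v)] - top) for v in vals}
            total = sum(exps.values())
            for v in vals:
                conf[(obj, v)] = exps[v] / total
        # 3. Semi-supervised step: clamp labeled objects to the ground truth.
        for obj, true_val in ground_truth.items():
            for v in values_of.get(obj, ()):
                conf[(obj, v)] = 1.0 if v == true_val else 0.0
        # 4. A source's trust is the mean confidence of the values it claims.
        sums, counts = defaultdict(float), defaultdict(int)
        for s, obj, val in claims:
            sums[s] += conf[(obj, val)]
            counts[s] += 1
        trust = {s: sums[s] / counts[s] for s in sums}
    truths = {obj: max(vals, key=lambda v: conf[(obj, v)])
              for obj, vals in values_of.items()}
    return truths, trust

# Hypothetical data: A, B and C share the same (wrong) values, D disagrees.
claims = [(s, o, v) for s in ("A", "B", "C")
          for o, v in (("o1", "x"), ("o2", "p"), ("o3", "m"))]
claims += [("D", "o1", "y"), ("D", "o2", "q"), ("D", "o3", "n")]

unsupervised, _ = discover_truth(claims)
semi, trust = discover_truth(claims, ground_truth={"o1": "y"})
```

On this toy input the unlabeled run sides with the three agreeing sources on every object, while the single labeled fact for `o1` lowers the trust of A, B and C enough that D's values win on the unlabeled objects as well. How far a small amount of ground truth carries in practice depends on the model and the data; the sketch only shows the mechanism.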

    Published In

    WWW '11: Proceedings of the 20th international conference on World wide web
    March 2011
    840 pages
    ISBN:9781450306324
    DOI:10.1145/1963405

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data quality
    2. semi-supervised
    3. truth discovery

    Qualifiers

    • Research-article

    Conference

    WWW '11: 20th International World Wide Web Conference
    March 28 - April 1, 2011
    Hyderabad, India

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Cited By

    • (2024) Hypergraph-based Truth Discovery for Sparse Data in Mobile Crowdsensing. ACM Transactions on Sensor Networks 20:3 (1-23). DOI: 10.1145/3649894. Online publication date: 28-Feb-2024
    • (2024) Record Fusion via Inference and Data Augmentation. ACM / IMS Journal of Data Science 1:1 (1-23). DOI: 10.1145/3593579. Online publication date: 16-Jan-2024
    • (2024) Hybrid privacy preserving federated learning against irregular users in Next-generation Internet of Things. Journal of Systems Architecture (103088). DOI: 10.1016/j.sysarc.2024.103088. Online publication date: Feb-2024
    • (2023) Matching Roles from Temporal Data: Why Joe Biden is not only President, but also Commander-in-Chief. Proceedings of the ACM on Management of Data 1:1 (1-26). DOI: 10.1145/3588919. Online publication date: 30-May-2023
    • (2023) Discovery and Matching Numerical Attributes in Data Lakes. 2023 IEEE International Conference on Big Data (BigData) (423-432). DOI: 10.1109/BigData59044.2023.10386080. Online publication date: 15-Dec-2023
    • (2022) Multi-round Data Poisoning Attack and Defense against Truth Discovery in Crowdsensing Systems. 2022 23rd IEEE International Conference on Mobile Data Management (MDM) (109-118). DOI: 10.1109/MDM55031.2022.00036. Online publication date: Jun-2022
    • (2022) Reputation-Based Truth Discovery With Long-Term Quality of Source in Internet of Things. IEEE Internet of Things Journal 9:7 (5410-5421). DOI: 10.1109/JIOT.2021.3110511. Online publication date: 1-Apr-2022
    • (2022) Towards an axiomatic approach to truth discovery. Autonomous Agents and Multi-Agent Systems 36:2. DOI: 10.1007/s10458-022-09569-3. Online publication date: 1-Oct-2022
    • (2022) Truth validation with evidence. Knowledge and Information Systems 64:5 (1187-1209). DOI: 10.1007/s10115-022-01663-y. Online publication date: 15-Mar-2022
    • (2022) Enhancing domain-aware multi-truth data fusion using copy-based source authority and value similarity. The VLDB Journal 32:3 (475-500). DOI: 10.1007/s00778-022-00757-x. Online publication date: 19-Jul-2022