research-article

A Survey on Truth Discovery

Published: 25 February 2016

Abstract

Thanks to the information explosion, data about objects of interest can be collected from an ever-growing number of sources. For the same object, however, the information collected from different sources usually conflicts. To tackle this challenge, truth discovery, which integrates noisy multi-source information by estimating the reliability of each source, has emerged as a hot topic. Several truth discovery methods have been proposed for various scenarios, and they have been applied successfully in diverse application domains. In this survey, we provide a comprehensive overview of truth discovery methods and summarize them from different aspects. We also discuss some future directions for truth discovery research. We hope that this survey will promote a better understanding of current progress on truth discovery and offer some guidelines on how to apply these approaches in application domains.
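
The core idea in the abstract, integrating conflicting claims while estimating how reliable each source is, can be illustrated with a small sketch. This is not a method taken from the survey: the function name `discover_truths`, the `(source, object, value)` claim format, the weighted-vote rule, and the smoothed-accuracy weight update are all assumptions chosen for illustration.

```python
from collections import defaultdict


def discover_truths(claims, iterations=10):
    """Illustrative iterative truth discovery (hypothetical helper).

    claims: list of (source, obj, value) triples with categorical values.
    Returns (truths, weights): the estimated value per object and an
    estimated reliability weight per source.
    """
    sources = sorted({s for s, _, _ in claims})
    weights = {s: 1.0 for s in sources}  # assumption: all sources start equal
    truths = {}
    for _ in range(iterations):
        # Truth step: each object takes the value with the largest total weight.
        votes = defaultdict(lambda: defaultdict(float))
        for s, o, v in claims:
            votes[o][v] += weights[s]
        # sorted() makes tie-breaking deterministic (smallest value wins a tie).
        truths = {o: max(sorted(vals), key=vals.get) for o, vals in votes.items()}

        # Reliability step: weight = smoothed fraction of a source's claims
        # that agree with the current truth estimates (add-one smoothing).
        correct = defaultdict(int)
        total = defaultdict(int)
        for s, o, v in claims:
            total[s] += 1
            correct[s] += int(truths[o] == v)
        weights = {s: (correct[s] + 1) / (total[s] + 2) for s in sources}
    return truths, weights


# Toy data: s3 disagrees with the other sources on objects "a" and "c".
# Object "d" is claimed only by s1 and s3, so an unweighted vote is a tie.
example_claims = [
    ("s1", "a", "1"), ("s2", "a", "1"), ("s3", "a", "2"),
    ("s1", "b", "1"), ("s2", "b", "1"), ("s3", "b", "1"),
    ("s1", "c", "1"), ("s2", "c", "1"), ("s3", "c", "2"),
    ("s1", "d", "9"), ("s3", "d", "5"),
]
example_truths, example_weights = discover_truths(example_claims)
```

On this toy input, the estimate for object "d" settles on the value claimed by s1, because s1 agrees with the other sources more often than s3 does and therefore ends up with a higher weight. Real methods differ in how they define the truth and reliability updates, so this should be read only as a sketch of the general alternating scheme.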



Published In

ACM SIGKDD Explorations Newsletter, Volume 17, Issue 2
December 2015
41 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2897350

Publisher

Association for Computing Machinery

New York, NY, United States

