research-article

Global detection of complex copying relationships between sources

Published: 01 September 2010

Abstract

Web technologies have enabled data sharing between sources but also simplified copying (and often publishing without proper attribution). The copying relationships can be complex: some sources copy from multiple sources on different subsets of data; some co-copy from the same source, and some transitively copy from another. Understanding such copying relationships is desirable both for business purposes and for improving many key components in data integration, such as resolving conflicts across various sources, reconciling distinct references to the same real-world entity, and efficiently answering queries over multiple sources. Recent works have studied how to detect copying between a pair of sources, but the techniques can fall short in the presence of complex copying relationships.
In this paper we describe techniques that discover global copying relationships between a set of structured sources. Towards this goal we make two contributions. First, we propose a global detection algorithm that identifies co-copying and transitive copying, returning only source pairs with direct copying. Second, global detection requires accurate decisions on copying direction; we significantly improve over previous techniques on this by considering various types of evidence for copying and correlation of copying on different data items. Experimental results on real-world data and synthetic data show high effectiveness and efficiency of our techniques.
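
The three relationship types named above (direct copying, co-copying, and transitive copying) can be made concrete with a small sketch. This is not the paper's detection algorithm, which has to infer copying from the data itself. The sketch assumes the direct-copying graph is already known and only labels why two sources would share data. The function names, the `direct` mapping (copier to the set of sources it copies from), and the sources `S1` to `S4` are hypothetical, and the sketch ignores that real copying may cover only a subset of data items.

```python
from itertools import combinations


def ancestors(direct, s):
    """Return every source that s copies from, directly or through a chain."""
    seen, stack = set(), list(direct.get(s, ()))
    while stack:
        t = stack.pop()
        if t not in seen:  # the seen-set also guards against cycles
            seen.add(t)
            stack.extend(direct.get(t, ()))
    return seen


def classify_pairs(direct):
    """Label each pair of sources that shares data through copying."""
    sources = set(direct) | {t for ts in direct.values() for t in ts}
    anc = {s: ancestors(direct, s) for s in sources}
    labels = {}
    for a, b in combinations(sorted(sources), 2):
        if b in direct.get(a, ()) or a in direct.get(b, ()):
            labels[(a, b)] = "direct"
        elif b in anc[a] or a in anc[b]:
            labels[(a, b)] = "transitive"  # linked only via a chain of copiers
        elif anc[a] & anc[b]:
            labels[(a, b)] = "co-copying"  # both copy from a common source
    return labels


# Hypothetical example: S2 and S3 copy from S1; S4 copies from S3.
direct = {"S2": {"S1"}, "S3": {"S1"}, "S4": {"S3"}}
labels = classify_pairs(direct)
direct_only = sorted(pair for pair, kind in labels.items() if kind == "direct")
```

A pairwise detector looking only at shared data could flag all six pairs here as dependent. Keeping just the pairs labeled "direct" mirrors the stated goal of returning only source pairs with direct copying.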

Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Editors: Elisa Bertino, Paolo Atzeni, Kian Lee Tan, Yi Chen, Y. C. Tay

Publisher

VLDB Endowment

Qualifiers

  • Research-article
