research-article

Mobile Access Record Resolution on Large-Scale Identifier-Linkage Graphs

Authors:

Can WangAuthors Info & Claims

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 886 - 894

https://doi.org/10.1145/3219819.3219916

Published: 19 July 2018 Publication History

Abstract

The e-commerce era is witnessing a rapid increase of mobile Internet users. Major e-commerce companies nowadays see billions of mobile accesses every day. Hidden in these records are valuable user behavioral characteristics such as their shopping preferences and browsing patterns. And, to extract these knowledge from the huge dataset, we need to first link records to the corresponding mobile devices. This Mobile Access Records Resolution (MARR) problem is confronted with two major challenges: (1) device identifiers and other attributes in access records might be missing or unreliable; (2) the dataset contains billions of access records from millions of devices. To the best of our knowledge, as a novel challenge industrial problem of mobile Internet, no existing method has been developed to resolve entities using mobile device identifiers in such a massive scale. To address these issues, we propose a SParse Identifier-linkage Graph (SPI-Graph) accompanied with the abundant mobile device profiling data to accurately match mobile access records to devices. Furthermore, two versions (unsupervised and semi-supervised) of Parallel Graph-based Record Resolution (PGRR) algorithm are developed to effectively exploit the advantages of the large-scale server clusters comprising of more than 1,000 computing nodes. We empirically show superior performances of PGRR algorithms in a very challenging and sparse real data set containing 5.28 million nodes and 31.06 million edges from 2.15 billion access records compared to other state-of-the-arts methodologies.

References

[1]

Arvind Arasu, Michaela Götz, and Raghav Kaushik. 2010. On active learning of record matching packages. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 783--794.

Digital Library

[2]

David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1027--1035.

Digital Library

[3]

Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable k-means++. Proceedings of the VLDB Endowment 5, 7 (2012), 622--633.

Digital Library

[4]

Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal At The International Journal on Very Large Data Bases 18, 1 (2009), 255--276.

Digital Library

[5]

Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD) 1, 1 (2007), 5.

Digital Library

[6]

Mikhail Bilenko, Beena Kamath, and Raymond J Mooney. 2006. Adaptive blocking: Learning to scale up record linkage. In Data Mining, 2006. ICDM'06. Sixth International Conference on. IEEE, 87--96.

Digital Library

[7]

Mikhail Bilenko and Raymond J Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 39--48.

Digital Library

[8]

Guy E Blelloch. 1996. Programming parallel algorithms. Commun. ACM 39, 3 (1996), 85--97.

Digital Library

[9]

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3, 1 (2011), 1--122.

Digital Library

[10]

Jiajun Bu, Xin Shen, Bin Xu, Chun Chen, Xiaofei He, and Deng Cai. 2016. Improving collaborative recommendation via user-item subgroups. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2363--2375.

Digital Library

[11]

Zhaoqi Chen, Dmitri V Kalashnikov, and Sharad Mehrotra. 2009. Exploiting context analysis for combining multiple entity resolution systems. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 207--218.

Digital Library

[12]

Eric C Chi and Kenneth Lange. 2015. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24, 4 (2015), 994--1013.

[13]

Peter Christen. 2008. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 151--159.

Digital Library

[14]

Munir Cochinwala, Verghese Kurien, Gail Lalk, and Dennis Shasha. 2001. Efficient data reconciliation. Information Sciences 137, 1 (2001), 1--15.

Digital Library

[15]

Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, and Philip Bohannon. 2012. An automatic blocking mechanism for large-scale de-duplication tasks. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 1055--1064.

Digital Library

[16]

Mukund Deshpande and George Karypis. 2004. Selective markov models for predicting web page accesses. ACM Transactions on Internet Technology (TOIT) 4, 2 (2004), 163--184.

Digital Library

[17]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, Vol. 96. 226--231.

Digital Library

[18]

Christie I Ezeife and Yi Lu. 2005. Mining web log sequential patterns with position coded pre-order linked wap-tree. Data Mining and Knowledge Discovery 10, 1 (2005), 5--38.

Digital Library

[19]

Lise Getoor and Ashwin Machanavajjhala. 2013. Entity resolution for big data. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1527--1527.

Digital Library

[20]

Michael Greenacre. 2017. Correspondence analysis in practice. CRC press.

[21]

David Hallac, Jure Leskovec, and Stephen Boyd. 2015. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 387--396.

Digital Library

[22]

David Harris and Sarah Harris. 2010. Digital design and computer architecture. Morgan Kaufmann.

Digital Library

[23]

John A Hartigan and JA Hartigan. 1975. Clustering algorithms. Vol. 209. Wiley New York.

Digital Library

[24]

Toby Dylan Hocking, Armand Joulin, Francis Bach, and Jean-Philippe Vert. 2011. Clusterpath an algorithm for clustering using convex fusion penalties. In 28th international conference on machine learning. 1.

Digital Library

[25]

Frank Höppner. 1999. Fuzzy cluster analysis: methods for classification, data analysis and image recognition. John Wiley &Sons.

[26]

Ida Mele. 2013. Web usage mining for enhancing search-result delivery and helping users to find interesting web content. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 765--770.

Digital Library

[27]

Matthew Michelson and Craig A Knoblock. 2006. Learning blocking schemes for record linkage. In AAAI. 440--445.

Digital Library

[28]

Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava. 2000. Automatic personalization based on web usage mining. Commun. ACM 43, 8 (2000), 142--151.

Digital Library

[29]

Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems. 849--856.

Digital Library

[30]

Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. 2010. Efficient and robust feature selection via joint, 1-norms minimization. In Advances in neural information processing systems. 1813--1821.

Digital Library

[31]

Neal Parikh, Stephen Boyd, et al. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1, 3 (2014), 127--239.

Digital Library

[32]

Kristiaan Pelckmans, Joseph De Brabanter, JAK Suykens, and B De Moor. 2005. Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering Workshop.

[33]

Dimitrios Pierrakos, Georgios Paliouras, Christos Papatheodorou, and Constantine D Spyropoulos. 2003. Web usage mining as a tool for personalization: A survey. User modeling and user-adapted interaction 13, 4 (2003), 311--372.

Digital Library

[34]

Pradeep Ravikumar and William W Cohen. 2004. A hierarchical graphical model for record linkage. In Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 454--461.

Digital Library

[35]

Sheila Tejada, Craig A Knoblock, and Steven Minton. 2001. Learning object identification rules for information integration. Information Systems 26, 8 (2001), 607--633.

Digital Library

[36]

Norases Vesdapunt, Kedar Bellare, and Nilesh Dalvi. 2014. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment 7, 12 (2014), 1071--1082.

Digital Library

[37]

Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2012. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment 5, 11 (2012), 1483--1494.

Digital Library

[38]

Steven Euijong Whang, David Marmaros, and Hector Garcia-Molina. 2013. Pay-as-you-go entity resolution. IEEE Transactions on Knowledge and Data Engineering 25, 5 (2013), 1111--1124.

Digital Library

[39]

Xinhua Zhuang, Yan Huang, Kannappan Palaniappan, and Yunxin Zhao. 1996. Gaussian mixture density modeling, decomposition, and applications. IEEE Transactions on Image Processing 5, 9 (1996), 1293--1302.

Digital Library

Cited By

Xin SLi ZZou PLong CZhang JBu JZhou J(2021)ATNN: Adversarial Two-Tower Neural Network for New Item’s Popularity Prediction in E-commerce2021 IEEE 37th International Conference on Data Engineering (ICDE)10.1109/ICDE51399.2021.00282(2499-2510)Online publication date: Apr-2021
https://doi.org/10.1109/ICDE51399.2021.00282

Index Terms

Mobile Access Record Resolution on Large-Scale Identifier-Linkage Graphs
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms

Recommendations

PRES $^{3}$ : Private Record Linkage Using Services, Spark and Soundex
Service-Oriented Computing – ICSOC 2022 Workshops
Abstract
One of the most challenging tasks that emerged in the last few years is linking records from distinct organizations, while maintaining privacy. Private Record Linkage is by definition a resource demanding task. Considering the continuously ... $^{}$
Leveraging Social Media Signals for Record Linkage
WWW '18: Proceedings of the 2018 World Wide Web Conference

Many data-intensive applications collect (structured) data from a variety of sources. A key task in this process is record linkage, which is the problem of determining the records from these sources that refer to the same real-world entities. ...
Incremental record linkage

Record linkage clusters records such that each cluster corresponds to a single distinct real-world entity. It is a crucial step in data cleaning and data integration. In the big data era, the velocity of data updates is often high, quickly making ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

July 2018

2925 pages

ISBN:9781450355520

DOI:10.1145/3219819

General Chairs:
Yike Guo
Imperial College London
,
Faisal Farooq
IBM

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Zhejiang Provincial Key Research and Development Plan

Conference

KDD '18

Sponsor:

KDD '18: The 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 19 - 23, 2018

London, United Kingdom

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
746
Total Downloads

Downloads (Last 12 months)19
Downloads (Last 6 weeks)1

Reflects downloads up to 19 Nov 2024

Other Metrics

View Author Metrics

Citations

Cited By

Xin SLi ZZou PLong CZhang JBu JZhou J(2021)ATNN: Adversarial Two-Tower Neural Network for New Item’s Popularity Prediction in E-commerce2021 IEEE 37th International Conference on Data Engineering (ICDE)10.1109/ICDE51399.2021.00282(2499-2510)Online publication date: Apr-2021
https://doi.org/10.1109/ICDE51399.2021.00282

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents