Nothing Special   »   [go: up one dir, main page]

skip to main content
10.1145/3219819.3219916acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Mobile Access Record Resolution on Large-Scale Identifier-Linkage Graphs

Published: 19 July 2018 Publication History

Abstract

The e-commerce era is witnessing a rapid increase of mobile Internet users. Major e-commerce companies nowadays see billions of mobile accesses every day. Hidden in these records are valuable user behavioral characteristics such as their shopping preferences and browsing patterns. And, to extract these knowledge from the huge dataset, we need to first link records to the corresponding mobile devices. This Mobile Access Records Resolution (MARR) problem is confronted with two major challenges: (1) device identifiers and other attributes in access records might be missing or unreliable; (2) the dataset contains billions of access records from millions of devices. To the best of our knowledge, as a novel challenge industrial problem of mobile Internet, no existing method has been developed to resolve entities using mobile device identifiers in such a massive scale. To address these issues, we propose a SParse Identifier-linkage Graph (SPI-Graph) accompanied with the abundant mobile device profiling data to accurately match mobile access records to devices. Furthermore, two versions (unsupervised and semi-supervised) of Parallel Graph-based Record Resolution (PGRR) algorithm are developed to effectively exploit the advantages of the large-scale server clusters comprising of more than 1,000 computing nodes. We empirically show superior performances of PGRR algorithms in a very challenging and sparse real data set containing 5.28 million nodes and 31.06 million edges from 2.15 billion access records compared to other state-of-the-arts methodologies.

References

[1]
Arvind Arasu, Michaela Götz, and Raghav Kaushik. 2010. On active learning of record matching packages. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 783--794.
[2]
David Arthur and Sergei Vassilvitskii. 2007. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 1027--1035.
[3]
Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. 2012. Scalable k-means++. Proceedings of the VLDB Endowment 5, 7 (2012), 622--633.
[4]
Omar Benjelloun, Hector Garcia-Molina, David Menestrina, Qi Su, Steven Euijong Whang, and Jennifer Widom. 2009. Swoosh: a generic approach to entity resolution. The VLDB Journal At The International Journal on Very Large Data Bases 18, 1 (2009), 255--276.
[5]
Indrajit Bhattacharya and Lise Getoor. 2007. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD) 1, 1 (2007), 5.
[6]
Mikhail Bilenko, Beena Kamath, and Raymond J Mooney. 2006. Adaptive blocking: Learning to scale up record linkage. In Data Mining, 2006. ICDM'06. Sixth International Conference on. IEEE, 87--96.
[7]
Mikhail Bilenko and Raymond J Mooney. 2003. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 39--48.
[8]
Guy E Blelloch. 1996. Programming parallel algorithms. Commun. ACM 39, 3 (1996), 85--97.
[9]
Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3, 1 (2011), 1--122.
[10]
Jiajun Bu, Xin Shen, Bin Xu, Chun Chen, Xiaofei He, and Deng Cai. 2016. Improving collaborative recommendation via user-item subgroups. IEEE Transactions on Knowledge and Data Engineering 28, 9 (2016), 2363--2375.
[11]
Zhaoqi Chen, Dmitri V Kalashnikov, and Sharad Mehrotra. 2009. Exploiting context analysis for combining multiple entity resolution systems. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data. ACM, 207--218.
[12]
Eric C Chi and Kenneth Lange. 2015. Splitting methods for convex clustering. Journal of Computational and Graphical Statistics 24, 4 (2015), 994--1013.
[13]
Peter Christen. 2008. Automatic record linkage using seeded nearest neighbour and support vector machine classification. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 151--159.
[14]
Munir Cochinwala, Verghese Kurien, Gail Lalk, and Dennis Shasha. 2001. Efficient data reconciliation. Information Sciences 137, 1 (2001), 1--15.
[15]
Anish Das Sarma, Ankur Jain, Ashwin Machanavajjhala, and Philip Bohannon. 2012. An automatic blocking mechanism for large-scale de-duplication tasks. In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 1055--1064.
[16]
Mukund Deshpande and George Karypis. 2004. Selective markov models for predicting web page accesses. ACM Transactions on Internet Technology (TOIT) 4, 2 (2004), 163--184.
[17]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, Vol. 96. 226--231.
[18]
Christie I Ezeife and Yi Lu. 2005. Mining web log sequential patterns with position coded pre-order linked wap-tree. Data Mining and Knowledge Discovery 10, 1 (2005), 5--38.
[19]
Lise Getoor and Ashwin Machanavajjhala. 2013. Entity resolution for big data. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1527--1527.
[20]
Michael Greenacre. 2017. Correspondence analysis in practice. CRC press.
[21]
David Hallac, Jure Leskovec, and Stephen Boyd. 2015. Network lasso: Clustering and optimization in large graphs. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 387--396.
[22]
David Harris and Sarah Harris. 2010. Digital design and computer architecture. Morgan Kaufmann.
[23]
John A Hartigan and JA Hartigan. 1975. Clustering algorithms. Vol. 209. Wiley New York.
[24]
Toby Dylan Hocking, Armand Joulin, Francis Bach, and Jean-Philippe Vert. 2011. Clusterpath an algorithm for clustering using convex fusion penalties. In 28th international conference on machine learning. 1.
[25]
Frank Höppner. 1999. Fuzzy cluster analysis: methods for classification, data analysis and image recognition. John Wiley &Sons.
[26]
Ida Mele. 2013. Web usage mining for enhancing search-result delivery and helping users to find interesting web content. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 765--770.
[27]
Matthew Michelson and Craig A Knoblock. 2006. Learning blocking schemes for record linkage. In AAAI. 440--445.
[28]
Bamshad Mobasher, Robert Cooley, and Jaideep Srivastava. 2000. Automatic personalization based on web usage mining. Commun. ACM 43, 8 (2000), 142--151.
[29]
Andrew Y Ng, Michael I Jordan, and Yair Weiss. 2002. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems. 849--856.
[30]
Feiping Nie, Heng Huang, Xiao Cai, and Chris H Ding. 2010. Efficient and robust feature selection via joint, 1-norms minimization. In Advances in neural information processing systems. 1813--1821.
[31]
Neal Parikh, Stephen Boyd, et al. 2014. Proximal algorithms. Foundations and Trends® in Optimization 1, 3 (2014), 127--239.
[32]
Kristiaan Pelckmans, Joseph De Brabanter, JAK Suykens, and B De Moor. 2005. Convex clustering shrinkage. In PASCAL Workshop on Statistics and Optimization of Clustering Workshop.
[33]
Dimitrios Pierrakos, Georgios Paliouras, Christos Papatheodorou, and Constantine D Spyropoulos. 2003. Web usage mining as a tool for personalization: A survey. User modeling and user-adapted interaction 13, 4 (2003), 311--372.
[34]
Pradeep Ravikumar and William W Cohen. 2004. A hierarchical graphical model for record linkage. In Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 454--461.
[35]
Sheila Tejada, Craig A Knoblock, and Steven Minton. 2001. Learning object identification rules for information integration. Information Systems 26, 8 (2001), 607--633.
[36]
Norases Vesdapunt, Kedar Bellare, and Nilesh Dalvi. 2014. Crowdsourcing algorithms for entity resolution. Proceedings of the VLDB Endowment 7, 12 (2014), 1071--1082.
[37]
Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2012. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment 5, 11 (2012), 1483--1494.
[38]
Steven Euijong Whang, David Marmaros, and Hector Garcia-Molina. 2013. Pay-as-you-go entity resolution. IEEE Transactions on Knowledge and Data Engineering 25, 5 (2013), 1111--1124.
[39]
Xinhua Zhuang, Yan Huang, Kannappan Palaniappan, and Yunxin Zhao. 1996. Gaussian mixture density modeling, decomposition, and applications. IEEE Transactions on Image Processing 5, 9 (1996), 1293--1302.

Cited By

View all
  • (2021)ATNN: Adversarial Two-Tower Neural Network for New Item’s Popularity Prediction in E-commerce2021 IEEE 37th International Conference on Data Engineering (ICDE)10.1109/ICDE51399.2021.00282(2499-2510)Online publication date: Apr-2021

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. big data
  2. graph algorithms
  3. mobile access record resolution
  4. scalable algorithms

Qualifiers

  • Research-article

Funding Sources

  • Zhejiang Provincial Key Research and Development Plan

Conference

KDD '18
Sponsor:

Acceptance Rates

KDD '18 Paper Acceptance Rate 107 of 983 submissions, 11%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)19
  • Downloads (Last 6 weeks)1
Reflects downloads up to 19 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2021)ATNN: Adversarial Two-Tower Neural Network for New Item’s Popularity Prediction in E-commerce2021 IEEE 37th International Conference on Data Engineering (ICDE)10.1109/ICDE51399.2021.00282(2499-2510)Online publication date: Apr-2021

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media