Abstract
Modern information networks, such as social networks, communication networks, and citation networks, are often characterized by very large sizes and dynamically changing structures. Common solutions to graph mining tasks (e.g., node classification) usually employ an unrestricted sampling-then-mining paradigm to reduce a large network to a manageable size, followed by subsequent mining tasks. However, real-world networks may be unaccessible at once and must be crawled progressively. This can be due to the fact that the size of the network is too large, or some privacy/legal concerns. In this paper, we propose an Active Exploration framework for large graphs, where the goal is to simultaneously carry out network sampling and node labeling in order to build a sampled network from which the trained classifier can have the maximum node classification accuracy. To achieve this goal, we consider a network as a Markov chain and compute the stationary distribution of the nodes by deriving supervised random walks. The stationary distribution helps identify specific nodes to be sampled in the next step, and the labeling process labels the most informative node which in turn strengthens the sampling of the network. To improve the scalability of active exploration for large graphs, we also propose a more efficient multi-seed algorithm that simultaneously runs multiple, parallel exploration processes, and makes joint decisions to determine which nodes are to be sampled and labeled next. The simultaneous, mutually enhanced sampling and labeling processes ensure that the final sampled network contains a maximum number of nodes directly related to the underlying mining tasks. Experiments on both synthetic and real-world networks demonstrate that our active exploration algorithms have much better chance to include target nodes in the sampled networks than baseline methods.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Ahmed N, Neville J, Kompella R (2014) Network sampling: from static to streaming graphs. ACM Trans Knowl Discov Data 8(2):1–56
Ahn Y-Y, Han S, Kwak H, Moon S, Jeong H (2007) Analysis of topological characteristics of huge online social networking services. In: Proceedings of the 16th international conference on world wide web, pp 835–844. ACM, Banff, Alberta
Alamgir M, Von Luxburg U (2010) Multi-agent random walks for local clustering on graphs. In: Proceedings of the 10th IEEE international conference on data mining, pp 18–27. IEEE, Sydney
Alon N, Avin C, Koucky M, Kozma G, Lotker Z, Tuttle MR (2008) Many random walks are faster than one. In: Proceedings of the 20th annual symposium on parallelism in algorithms and architectures, pp. 119–128. ACM, Munich, Germany
Backstrom L, Leskovec J (2011) Supervised random walks: predicting and recommending links in social networks. In: Proceedings of the 4th ACM international conference on web search and data mining, pp 635–644. ACM, Hong Kong
Becchetti L, Castillo C, Donato D, Fazzone A (2006) A comparison of sampling techniques for web graph characterization. In: Proceedings of the 2006 workshop on link analysis: dynamics and static of large networks, Philadelphia, PA
Belkin M, Matveeva I, Niyogi P (2004) Regularization and semi-supervised learning on large graphs. Learning theory. Springer, Berlin, pp 624–638
Bilgic M, Getoor L (2008) Effective label acquisition for collective classification. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, pp 43–51. ACM
Bilgic M, Mihalkova L, Getoor L (2010) Active learning for networked data. In: Proceedings of the 27th international conference on machine learning, pp 79–86. Haifa, Israel
Catanese S, Meo P, Ferrara E, Fiumara G, Provetti A (2011) Crawling facebook for social network analysis purposes. In: Proceedings of the 1st international conference on web intelligence, mining and semantics, Article No. 52. ACM, Sogndal
Cesa-Bianchi N, Gentile C, Vitale F, Zappella G (2010) Active learning on trees and graphs. In: Proceedings of the 23rd annual conference on learning theory, pp 320–332. Haifa, Israel
Erdős P, Rényi A (1959) On random graphs I. Publ Math Debrecen 6:290–297
Fang M, Tao D (2014) Networked bandits with disjoint linear payoffs. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 1106–1115. ACM
Fang M, Yin J, Zhu X (2013) Active exploration: Simultaneous sampling and labeling for large graphs. In: Proceedings of the 22nd ACM international conference on information and knowledge management, pp 829–834. ACM, Burlingame, CA
Gjoka M, Kurant M, Butts C, Markopoulou A (2010) Walking in facebook: a case study of unbiased sampling of osns. In: Proceedings of the 29th conference on computer communications, pp 1–9. San Diego, CA
Gkantsidis C, Mihail M, Saberi A (2004) Random walks in peer-to-peer networks. In: Proceedings of the 23rd annual joint conference of the IEEE computer and communications societie, pp 120–130. IEEE
Gyongyi Z, Garcia-Molina H, Pedersen J (2006) Web content categorization using link information. Technical report, Stanford
Halperin S, Zwick U (1994) An optimal randomized logarithmic time connectivity algorithm for the EREW PRAM. In: Proceedings of the 6th annual ACM symposium on parallel algorithms and architectures, pp 1–10. ACM, Cape May, NJ
He J, Carbonell JG, Liu Y (2007) Graph-based semi-supervised learning as a generative model. In: Proceedings of the 20th international joint conference on artificial intelligence, pp 2492–2497
Henzinger MR, Heydon A, Mitzenmacher M, Najork M (2000) On near-uniform URL sampling. Comput Netw 33(1):295–308
Hübler C, Kriegel H-P, Borgwardt K, Ghahramani Z (2008) Metropolis algorithms for representative subgraph sampling. In: Proceedings of the 8th IEEE international conference on data mining, pp 283–292. Pisa
Karger DR, Nisan N, Parnas M (1992) Fast connected components algorithms for the erew pram. In: Proceedings of the 4th annual ACM symposium on parallel algorithms and architectures, pp 373–381. ACM, San Diego, CA
Kuwadekar A, Neville J (2011) Relational active learning for joint collective classification models. In: Proceedings of the 28th international conference on machine learning, pp 385–392. Bellevue, WA
Lee S, Kim P, Jeong H (2006) Statistical properties of sampled networks. Phys Rev E 73(1):016102
Leskovec J, Faloutsos C (2006) Sampling from large graphs. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, pp 631–636. Philadelphia, PA
Lin F, Cohen WW (2010) Semi-supervised classification of network data using very few labels. In: Proceedings of the 2010 international conference on advances in social networks analysis and mining, pp 192–199. IEEE, Adense
Lovász L (1993) Random walks on graphs: a survey. Combinatorics 2(1):1–46
Macskassy SA (2007) Improving learning in networked data by combining explicit and mined links. In: Proceedings of the 22nd conference on artificial intelligence, pp. 590–595. AAAI Press, Vancouver, BC
Maiya A, Berger-Wolf T (2010) Online sampling of high centrality individuals in social networks. In: Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining, pp 91–98. Hyderabad, India
Mislove A, Koppula H, Gummadi K, Druschel P, Bhattacharjee B (2008) Growth of the flickr social network. In: Proceedings of the 1st workshop on online social networks, pp 25–30. ACM, Seattle, WA
Mislove A, Marcon M, Gummadi K, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on internet measurement, pp 29–42. ACM, San Diego, CA
Namata G, Sen P, Bilgic M, Getoor L, Sahami M, Srivastava A (2009) Collective classification for text classification. In: Text mining: classification, clustering, and applications. CRC Press, Boca Rotan, pp 51–69
Papagelis M, Das G, Koudas N (2013) Sampling online social networks. IEEE Trans Knowl Data Eng 25:662–676
Pfeiffer III J, Neville J, Bennett P (2012) Active sampling of networks. In:Proceedings of the ICML workshop on mining and learning with graphs, Edinburgh, Scotland
Rafiei D, Curial S (2005) Effectively visualizing large networks through sampling. In: Proceedings of IEEE visualization, pp 375–382. IEEE, Minneapolis, MN
Rasti A, Torkjazi M, Rejaie R, Duffield N, Willinger W, Stutzbach D (2009) Respondent-driven sampling for characterizing unstructured overlays. In: Proceedings of the 28th conference on computer communications, pp 2701–2705. IEEE, Rio de Janeiro
Sarma A, Collapudi S, Panigrahy R (2011) Estimating pagerank on graph streams. J ACM 58(3):13
Sen P, Namata G, Bilgic M, Getoor L, Galligher B, Eliassi-Rad T (2008) Collective classification in network data. AI Mag 29(3):93
Stanton I, Kliot G (2012) Streaming graph partitioning for large distributed graphs. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Beijing
Stutzbach D, Rejaie R, Duffield N, Sen S, Willinger W (2009) On unbiased sampling for unstructured peer-to-peer networks. IEEE/ACM Trans Netw 17(2):377–390
Tong H, Faloutsos C, Pan J-Y (2006) Fast random walk with restart and its applications. In: Proceedings of the 6th international conference on data mining, pp 613–622. IEEE
Viswanath B, Mislove A, Cha M, Gummadi K (2009) On the evolution of user interaction in facebook. In: Proceedings of the 2nd ACM SIGCOMM workshop on online social networks, pp 37–42. ACM, Barcelona
Wasserman S, Faust K (1995) Social network analysis: methods and applications. Cambridge University Press, Cambridge
Wilson C, Boe B, Sala A, Puttaswamy K, Zhao B (2009) User interactions in social networks and their implications. In: Proceedings of the 4th ACM European conference on computer systems, pp 205–218. ACM, Nuremberg
Ye S, Lang J, Wu F (2010) Crawling online social graphs. In: Proceedings of the 12th International Asia-Pacific conference on web conference, pp 236–242. IEEE, Busan
Zhou D, Bousquet O, Lal TN, Weston J, Schölkopf B (2004) Learning with local and global consistency. Adv Neural Inf Process Syst 16(16):321–328
Zhou Z, Zhang N, Gong Z, Das G (2013) Faster random walks by rewiring online social networks on-the-fly. In: Proceedings of the 29th IEEE international conference on data engineering, pp 769–780. IEEE, Brisbane
Zhu X, Ghahramani Z (2002) Learning from labeled and unlabeled data with label propagation. Technical report, CMU-CALD-02-107, Carnegie Mellon University
Zhu X, Ghahramani Z, Lafferty J (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proceedings of the 20th International Conference on Machine Learning, vol 3, pp 912–919. Washington, DC
Zhu X, Lafferty J, Ghahramani Z (2003) Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In: Proceedings of the 2003 ICML workshop on the continuum from labeled to unlabeled data in machine learning and data mining, pp 58–65. Washington, DC
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Thomas Seidl.
Most of this research was done while Meng Fang was at the University of Technology, Sydney.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Fang, M., Yin, J. & Zhu, X. Active exploration for large graphs. Data Min Knowl Disc 30, 511–549 (2016). https://doi.org/10.1007/s10618-015-0424-z
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-015-0424-z