Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article

An efficient similarity search framework for SimRank over large dynamic graphs

Published: 01 April 2015 Publication History

Abstract

SimRank is an important measure of vertex-pair similarity according to the structure of graphs. The similarity search based on SimRank is an important operation for identifying similar vertices in a graph and has been employed in many data analysis applications. Nowadays, graphs in the real world become much larger and more dynamic. The existing solutions for similarity search are expensive in terms of time and space cost. None of them can efficiently support similarity search over large dynamic graphs. In this paper, we propose a novel two-stage random-walk sampling framework (TSF) for SimRank-based similarity search (e.g., top-k search). In the preprocessing stage, TSF samples a set of one-way graphs to index raw random walks in a novel manner within O(NRg) time and space, where N is the number of vertices and Rg is the number of one-way graphs. The one-way graph can be efficiently updated in accordance with the graph modification, thus TSF is well suited to dynamic graphs. During the query stage, TSF can search similar vertices fast by naturally pruning unqualified vertices based on the connectivity of one-way graphs. Furthermore, with additional Rq samples, TSF can estimate the SimRank score with probability [EQUATION] if the error of approximation is bounded by 1 -- ε. Finally, to guarantee the scalability of TSF, the one-way graphs can also be compactly stored on the disk when the memory is limited. Extensive experiments have demonstrated that TSF can handle dynamic billion-edge graphs with high performance.

References

[1]
I. Antonellis, H. G. Molina, and C. C. Chang. Simrank++: Query rewriting through link analysis of the click graph. In PVLDB, pages 408--421, 2008.
[2]
L. Arge. The buffer tree: A new technique for optimal i/o-algorithms. In WADS, volume 955, pages 334--345, 1995.
[3]
T. I. M. Association. Standard midi-file format spec. 1.1.
[4]
B. Balkenhol and Y. M. Shtarkov. One attempt of a compression algorithm using the bwt, 1999.
[5]
A. A. Benczúr, K. Csalogóny, and T. Sarlós. Link-based similarity search to fight web spam. In AIRWEB, pages 9--16, 2006.
[6]
L. Cao, B. Cho, H. D. Kim, Z. Li, M.-H. Tsai, and I. Gupta. Delta-simrank computing on mapreduce. In BigMine, pages 28--35, 2012.
[7]
T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. 2nd edition, 2001.
[8]
B. Cui, H. Mei, and B. C. Ooi. Big data: the driver for innovation in databases. National Science Review, 1(1): 27--30, 2014.
[9]
D. Fogaras and B. Rócz. Scaling link-based similarity search. In WWW, pages 641--650, 2005.
[10]
Y. Fujiwara, M. Nakatsuji, H. Shiokawa, and M. Onizuka. Efficient search algorithm for simrank. In ICDE, pages 589--600, 2013.
[11]
D. A. Harville. Matrix Algebra From a Statistician's Perspective. Springer, corrected edition, Nov. 2000.
[12]
G. He, H. Feng, C. Li, and H. Chen. Parallel simrank computation on large graphs with iterative aggregation. In KDD, pages 543--552, 2010.
[13]
W. Hoeffding. Probability inequalities for sums of bounded random variables, 1962.
[14]
G. Jeh and J. Widom. Simrank: A measure of structural-context similarity. In KDD, pages 538--543, 2002.
[15]
M. Kusumoto, T. Maehara, and K.-i. Kawarabayashi. Scalable similarity search for simrank. In SIGMOD, pages 325--336, 2014.
[16]
P. Lee, L. V. S. Lakshmanan, and J. X. Yu. On top-k structural similarity search. In ICDE, pages 774--785, 2012.
[17]
J. Leskovec and C. Faloutsos. Sampling from large graphs. In KDD, pages 631--636, 2006.
[18]
C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, and T. Wu. Fast computation of simrank for static and dynamic information networks. In EDBT, pages 465--476, 2010.
[19]
P. Li, Y. Cai, H. Liu, J. He, and X. Du. Exploiting the block structure of link graph for efficient similarity computation. In PAKDD, pages 389--400, 2009.
[20]
P. Li, H. Liu, J. Xu, Y. Jun, and H. X. Du. Fast single-pair simrank computation. In SDM, pages 571--582, 2010.
[21]
D. Lizorkin, P. Velikhov, M. Grinev, and D. Turdakov. Accuracy estimate and optimization techniques for simrank computation. In PVLDB, pages 45--66, 2010.
[22]
S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117--128, 2002.
[23]
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical recipes 3rd edition: The art of scientific computing. 2007.
[24]
C. P. Robert and G. Casella. Monte carlo statistical methods (springer texts in statistics). 2005.
[25]
W. Xi, E. A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, and D. Zhuang. Simfusion: Measuring similarity using unified relationship matrix. In SIGIR, pages 130--137, 2005.
[26]
W. Yu, X. Lin, and J. Le. A space and time efficient algorithm for simrank computation. In APWEB, pages 164--170, 2010.
[27]
W. Yu, X. Lin, and J. Le. Taming computational complexity: Efficient and parallel simrank optimizations on undirected graphs. In WAIM, pages 280--296, 2010.
[28]
W. Yu, X. Lin, and W. Zhang. Towards efficient simrank computation on large networks. In ICDE, pages 601--612, 2013.
[29]
W. Yu, X. Lin, and W. Zhang. Fast incremental simrank on link-evolving graphs. In ICDE, pages 304--315, 2014.
[30]
W. Yu, X. Lin, W. Zhang, L. Chang, and J. Pei. More is simpler: Effectively and efficiently assessing node-pair similarities based on hyperlinks. In PVLDB, pages 13--24, 2013.
[31]
W. Zheng, L. Zou, Y. Feng, L. Chen, and D. Zhao. Efficient simrank-based similarity join over large graphs. In PVLDB, pages 493--504, 2013.

Cited By

View all
  • (2024)I-CoSim: Efficient Dynamic CoSimRank Retrieval on Evolving NetworksCompanion Proceedings of the ACM Web Conference 202410.1145/3589335.3651523(923-926)Online publication date: 13-May-2024
  • (2023)Efficient and Accurate SimRank-Based Similarity Joins: Experiments, Analysis, and ImprovementProceedings of the VLDB Endowment10.14778/3636218.363621917:4(617-629)Online publication date: 1-Dec-2023
  • (2023)Efficient Single-Source SimRank Query by Path AggregationProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3580305.3599328(3342-3352)Online publication date: 6-Aug-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 8, Issue 8
April 2015
72 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 April 2015
Published in PVLDB Volume 8, Issue 8

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)16
  • Downloads (Last 6 weeks)1
Reflects downloads up to 19 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)I-CoSim: Efficient Dynamic CoSimRank Retrieval on Evolving NetworksCompanion Proceedings of the ACM Web Conference 202410.1145/3589335.3651523(923-926)Online publication date: 13-May-2024
  • (2023)Efficient and Accurate SimRank-Based Similarity Joins: Experiments, Analysis, and ImprovementProceedings of the VLDB Endowment10.14778/3636218.363621917:4(617-629)Online publication date: 1-Dec-2023
  • (2023)Efficient Single-Source SimRank Query by Path AggregationProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3580305.3599328(3342-3352)Online publication date: 6-Aug-2023
  • (2023)Hierarchical All-Pairs SimRank CalculationDatabase Systems for Advanced Applications10.1007/978-3-031-30675-4_17(252-268)Online publication date: 17-Apr-2023
  • (2021)DiskProceedings of the VLDB Endowment10.14778/3430915.343092514:3(351-363)Online publication date: 9-Dec-2021
  • (2021)RoleSim*: Scaling axiomatic role-based similarity ranking on large graphsWorld Wide Web10.1007/s11280-021-00925-z25:2(785-829)Online publication date: 11-Aug-2021
  • (2021)ExactSim: benchmarking single-source SimRank algorithms with high-precision ground truthsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-021-00672-730:6(989-1015)Online publication date: 5-Jun-2021
  • (2021)Memory-aware framework for fast and scalable second-order random walk over billion-edge natural graphsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-021-00669-230:5(769-797)Online publication date: 7-May-2021
  • (2020)SimTabProceedings of the VLDB Endowment10.14778/3407790.340781913:12(2202-2214)Online publication date: 14-Sep-2020
  • (2020)Realtime index-free single source SimRank processing on web-scale graphsProceedings of the VLDB Endowment10.14778/3384345.338434713:7(966-980)Online publication date: 26-Mar-2020
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media