Nothing Special   »   [go: up one dir, main page]

skip to main content
article

On unbiased sampling for unstructured peer-to-peer networks

Published: 01 April 2009 Publication History

Abstract

This paper presents a detailed examination of how the dynamic and heterogeneous nature of real-world peer-to-peer systems can introduce bias into the selection of representative samples of peer properties (e.g., degree, link bandwidth, number of files shared). We propose the Metropolized Random Walk with Backtracking (MRWB) as a viable and promising technique for collecting nearly unbiased samples and conduct an extensive simulation study to demonstrate that our technique works well for a wide variety of commonly-encountered peer-to-peer network conditions. We have implemented the MRWB algorithm for selecting peer addresses uniformly at random into a tool called ion-sampler. Using the Gnutella network, we empirically show that ion-sampler. yields more accurate samples than tools that rely on commonly-used sampling techniques and results in dramatic improvements in efficiency and scalability compared to performing a full crawl.

References

[1]
I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup protocol for Internet applications," IEEE/ACM Trans. Networking, vol. 11, no. 1, pp. 17-32, Feb. 2002.
[2]
S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker, "A scalable content-addressable network," presented at the ACM SIGCOMM 2001, San Diego, CA.
[3]
S. Saroiu, P. K. Gummadi, and S. D. Gribble, "Measuring and analyzing the characteristics of Napster and Gnutella hosts," Multimedia Syst. J., vol. 9, no. 2, pp. 170-184, Aug. 2003.
[4]
R. Bhagwan, S. Savage, and G. Voelker, "Understanding availability," presented at the 2003 Int. Workshop on Peer-to-Peer Systems, Berkeley, CA.
[5]
D. Stutzbach and R. Rejaie, "Understanding churn in peer-to-peer networks," presented at the 2006 Internet Measurement Conf., Rio de Janeiro, Brazil.
[6]
S. Chib and E. Greenberg, "Understanding the Metropolis-Hastings algorithm," The Americian Statistician, vol. 49, no. 4, pp. 327-335, Nov. 1995.
[7]
W. Hastings, "Monte carlo sampling methods using Markov chains and their applications," Biometrika, vol. 57, pp. 97-109, 1970.
[8]
N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller, "Equations of state calculations by fast computing machines," J. Chem. Phys., vol. 21, pp. 1087-1092, 1953.
[9]
A. Awan, R. A. Ferreira, S. Jagannathan, and A. Grama, "Distributed uniform sampling in unstructured peer-to-peer networks," presented at the 2006 Hawaii Int. Conf. System Sciences, Kauai, HI, Jan. 2006.
[10]
Z. Bar-Yossef and M. Gurevich, "Random sampling from a search engine's index," presented at the 2006 WWW Conf., Edinburgh, Scotland.
[11]
D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger, "Sampling techniques for large, dynamic graphs," presented at the 2006 Global Internet Symp., Barcelona, Spain, Apr. 2006.
[12]
D. Stutzbach and R. Rejaie, "Capturing accurate snapshots of the Gnutella network," in Proc. 2005 Global Internet Symp., Miami, FL, Mar. 2005, pp. 127-132.
[13]
B. Bollobás, "A probabilistic proof of an asymptotic formula for the number of labelled regular graphs," Eur. J. Combinator., vol. 1, pp. 311-316, 1980.
[14]
M. Jerrum and A. Sinclair, "Fast uniform generation of regular graphs," Theoret. Comput. Sci., vol. 73, pp. 91-100, 1990.
[15]
C. Cooper, M. Dyer, and C. Greenhill, "Sampling regular graphs and a peer-to-peer network," in Proc. Symp. Discrete Algorithms, 2005, pp. 980-988.
[16]
V. Krishnamurthy, J. Sun, M. Faloutsos, and S. Tauro, "Sampling Internet topologies: How small can we go?," in Proc. 2003 Int. Conf. Internet Computing, Las Vegas, NV, Jun. 2003, pp. 577-580.
[17]
V. Krishnamurthy, M. Faloutsos, M. Chrobak, L. Lao, and J.-H. C. G. Percus, "Reducing large Internet topologies for faster simulations," presented at the 2005 IFIP Networking Conf., Waterloo, Ontario, Canada, May 2005.
[18]
M. P. H. Stumpf, C. Wiuf, and R. M. May, "Subnets of scale-free networks are not scale-free: Sampling properties of networks," Proc. National Academy of Sciences, vol. 102, no. 12, pp. 4221-4224, Mar. 2005.
[19]
A. A. Tsay, W. S. Lovejoy, and D. R. Karger, "Random sampling in cut, flow, and network design problems," Math. Oper. Res., vol. 24, no. 2, pp. 383-413, Feb. 1999.
[20]
A. Lakhina, J. W. Byers, M. Crovella, and P. Xie, "Sampling biases in IP topology measurements," presented at the IEEE INFOCOM 2003, San Francisco, CA.
[21]
D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, "On the bias of traceroute sampling; or, power-law degree distributions in regular graphs," presented at the 2005 Symp. Theory of Computing, Baltimore, MD, May 2005.
[22]
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles, "Methods for sampling pages uniformly from the World Wide Web," in Proc. AAAI Fall Symp. Using Uncertainty Within Computation, 2001, pp. 121-128.
[23]
M. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork, "On near-uniform URL sampling," in Proc. Int. World Wide Web Conf., May 2001, pp. 295-308.
[24]
L. Lovász, "Random walks on graphs: A survey," Combinatorics: Paul Erdös is Eighty, vol. 2, pp. 1-46, 1993.
[25]
Q. Lv, P. Cao, E. Cohen, K. Li, and S. Shenker, "Search and replication in unstructured peer-to-peer networks," presented at the 2002 Int. Conf. Supercomputing, New York, NY.
[26]
Y. Chawathe, S. Ratnasamy, and L. Breslau, "Making Gnutella-like P2P systems scalable," presented at the ACM SIGCOMM 2003, Karlsruhe, Germany.
[27]
C. Gkantsidis, M. Mihail, and A. Saberi, "Random walks in peer-to-peer networks," presented at the IEEE INFOCOM 2004, Hong Kong.
[28]
V. Vishnumurthy and P. Francis, "On heterogeneous overlay construction and random node selection in unstructured P2P networks," presented at the IEEE INFOCOM 2006, Barcelona, Spain, Apr. 2006.
[29]
M. Salganik and D. Heckathorn, "Sampling and estimation in hidden populations using respondent-driven sampling," Sociological Methodology , vol. 34, pp. 193-239, 2004.
[30]
D. Heckathorn, "Respondent-driven sampling: a new approach to the study of hidden populations," Social Problems, vol. 44, no. 2, pp. 174-199, 1997.
[31]
L. Goodman, "Snowball sampling," Ann. Math. Statist., vol. 32, pp. 148-170, 1961.
[32]
E. Deaux and J. Callaghan, "Key informant versus self-report estimates of health behavior," Eval. Rev., vol. 9, pp. 365-368, 1985.
[33]
J. Watters and P. Biernack, "Targeted sampling: Options for the study of hidden populations," Annu. Rev. Sociol., vol. 12, pp. 401-429, 1989.
[34]
M. Salganik, "Variance estimation, design effects, and sample size calculation for respondent-driven sampling," J. Urban Health, vol. 83, pp. 98-112, 2006, Suppl. 1.
[35]
J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: Densification laws, shrinking diameters and possible explanations," presented at the ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Chicago, IL, Aug. 2005.
[36]
A. Bonato, "A survey of models of the web graph," Combinatorial and Algorithmic Aspects of Networking, pp. 159-172, 2004.
[37]
J. Pouwelse, P. Garbacki, D. Epema, and H. Sips, "The BitTorrent P2P file-sharing system: Measurements and analysis," presented at the 2005 Int. Workshop on Peer-to-Peer Systems (IPTPS), Ithaca, NY.
[38]
M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. A. Hamra, and L. Garces-Erice, "Dissecting BitTorrent: Five months in a torrent's lifetime," presented at the Passive and Active Measurement Conf. (PAM), Antibes Juan-les-Pins, France, Apr. 2004.
[39]
D. Stutzbach, R. Rejaie, and S. Sen, "Characterizing unstructured overlay topologies in modern P2P file-sharing systems," in Proc. Internet Measurement Conf., Berkeley, CA, Oct. 2005, pp. 49-62.
[40]
K. P. Gummadi, S. Saroiu, and S. D. Gribble, "King: Estimating latency between arbitrary Internet end hosts," presented at the Internet Measurement Workshop, Marseille, France, Nov. 2002.
[41]
D. Liben-Nowell, H. Balakrishnan, and D. Karger, "Analysis of the evolution of peer-to-peer systems," in Principles of Distributed Computing , Monterey, CA, Jul. 2002.
[42]
S. Rhea, D. Geels, and J. Kubiatowicz, "Handling Churn in a DHT," in Proc. USENIX, 2004, pp. 127-140.
[43]
J. Li, J. Stribling, F. Kaashoek, R. Morris, and T. Gil, "A performance vs. cost framework for evaluating DHT design tradeoffs under churn," presented at the IEEE INFOCOM 2005, Miami, FL.
[44]
F. E. Bustamante and Y. Qiao, "Friendships that last: Peer lifespan and its role in P2P protocols," in Proc. Int. Workshop on Web Content Caching and Distribution, 2003, p. 2.
[45]
S. Sen and J. Wang, "Analyzing peer-to-peer traffic across large networks," IEEE/ACM Trans. Networking, vol. 12, pp. 219-232, Apr. 2004.
[46]
K. P. Gummadi, R. J. Dunn, S. Saroiu, S. D. Gribble, H. M. Levy, and J. Zahorjan, "Measurement, modeling, and analysis of a peer-to-peer file-sharing workload," in Proc. 19th ACM Symp. Operating Systems Principles (SOSP 2003), Bolton Landing, NY, 2003, pp. 314-329.
[47]
D. Leonard, V. Rai, and D. Loguinov, "On lifetime-based node failure and stochastic resilience of decentralized peer-to-peer networks," presented at the 2005 ACM SIGMETRICS, Banff, Alberta, Canada.
[48]
H. Dämpfling, Gnutella Web Caching System: Version 2 Specifications Client Developers' Guide. Jun. 2003 {Online}. Available: http://www. gnucleus.com/gwebcache/newgwc.html
[49]
P. Karbhari, M. Ammar, A. Dhamdhere, H. Raj, G. Riley, and E. Zegura, "Bootstrapping in Gnutella: A measurement study," presented at the 2004 Passive and Active Measurement Conf. (PAM) Antibes Juan-les-Pins, France, Apr. 2004.
[50]
R. Srinivasan, Importance Sampling--Application in Communications and Detection. Berlin, Germany: Springer-Verlag, 2002.
[51]
D. Stutzbach and R. Rejaie, "Improving lookup performance over a widely-deployed DHT," presented at the IEEE INFOCOM 2006, Barcelona, Spain, Apr. 2006.

Cited By

View all
  • (2024)A Sampling-Based Framework for Hypothesis Testing on Large Attributed GraphsProceedings of the VLDB Endowment10.14778/3681954.368199317:11(3192-3200)Online publication date: 1-Jul-2024
  • (2022)Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costsKnowledge and Information Systems10.1007/s10115-022-01691-864:7(1909-1935)Online publication date: 1-Jul-2022
  • (2021)PR-sketchProceedings of the VLDB Endowment10.14778/3467861.346786814:10(1783-1796)Online publication date: 26-Oct-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Networking
IEEE/ACM Transactions on Networking  Volume 17, Issue 2
April 2009
319 pages

Publisher

IEEE Press

Publication History

Published: 01 April 2009
Revised: 23 January 2008
Received: 25 March 2007
Published in TON Volume 17, Issue 2

Author Tags

  1. peer-to-peer
  2. sampling

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Nov 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Sampling-Based Framework for Hypothesis Testing on Large Attributed GraphsProceedings of the VLDB Endowment10.14778/3681954.368199317:11(3192-3200)Online publication date: 1-Jul-2024
  • (2022)Random walk on node cliques for high-quality samples to estimate large graphs with high accuracies and low costsKnowledge and Information Systems10.1007/s10115-022-01691-864:7(1909-1935)Online publication date: 1-Jul-2022
  • (2021)PR-sketchProceedings of the VLDB Endowment10.14778/3467861.346786814:10(1783-1796)Online publication date: 26-Oct-2021
  • (2021)Enabling Privacy-Preserving Rule Mining in Decentralized Social NetworksProceedings of the 16th International Conference on Availability, Reliability and Security10.1145/3465481.3465482(1-11)Online publication date: 17-Aug-2021
  • (2021)Noise Corrected Sampling of Online Social NetworksACM Transactions on Knowledge Discovery from Data10.1145/343474915:2(1-21)Online publication date: 5-Mar-2021
  • (2020)Little Ball of FurProceedings of the 29th ACM International Conference on Information & Knowledge Management10.1145/3340531.3412758(3133-3140)Online publication date: 19-Oct-2020
  • (2019)Measuring the sampling robustness of complex networksProceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining10.1145/3341161.3342873(294-301)Online publication date: 27-Aug-2019
  • (2019)Non-Markovian Monte Carlo on Directed GraphsProceedings of the ACM on Measurement and Analysis of Computing Systems10.1145/3322205.33110863:1(1-31)Online publication date: 26-Mar-2019
  • (2019)Characterizing Directed and Undirected Networks via Multidimensional Walks with JumpsACM Transactions on Knowledge Discovery from Data10.1145/329987713:1(1-33)Online publication date: 23-Jan-2019
  • (2019)Fast crawling methods of exploring content distributed over large graphsKnowledge and Information Systems10.1007/s10115-018-1178-x59:1(67-92)Online publication date: 1-Apr-2019
  • Show More Cited By

View Options

Get Access

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media