Abstract
Given a social network, which of its nodes have a stronger impact in determining its structure? More precisely, which node-removal order has the greatest impact on the network structure? We approach this well-known problem for the first time in a setting that combines both web graphs and social networks. Our experiments are performed on datasets that are of orders of magnitude larger than those appearing in the previous literature: this is possible, thanks to some recently developed algorithms and software tools that approximate accurately the number of reachable pairs and the distribution of distances in large graphs. Our experiments highlight deep differences in the structure of social networks and web graphs, show significant limitations of previous experimental results; at the same time, they reveal clustering by label propagation as a new and very effective way of locating nodes that are important from a structural viewpoint.
Similar content being viewed by others
Notes
The reader might find this definition a bit vague, and some variants are often spotted in the literature: this is a general problem, also highlighted recently by Li et al. (2005).
It should be remarked that the Milgram’s experiment tried to prove two properties at the same time. First, the average distance between individuals is much smaller than expected; second, the individuals are able to exploit such a feature to route messages along short paths, albeit they only possess local information about the network they live in. This second property is, in a sense, not only more interesting than the former, but also more difficult to describe and study, because it has to do with some information that the nodes possess about the environment they inhabit.
A reachable pair is a pair of nodes \(\langle x, y\rangle\) such that there is a directed path from x to y; the distance distribution of a graph is a discrete distribution that gives, for every t, the fraction of reachable pairs of nodes that are at distance t.
Observe that we delete nodes but count the percentage of arcs (rather than nodes) that have been removed: this choice is justified by the fact that otherwise node orderings that put large-degree nodes first would certainly be considered (unfairly) more disruptive.
We remark that several proposals have been made to find features that highlight such structural differences in a computationwise-feasible way (e.g., assortative mixing (Newman and Park 2003)), but all instances we are aware of have been questioned by the subsequent literature, so no clear-cut results are known so far. An exception is the idea of considering the spid [shortest-path index of dispersion (Boldi et al. 2011a)], which is experimentally larger than the one for web graphs and smaller than the one for social networks. For instance, the spid of the entire Facebook graph is 0.09 (Backstrom et al. 2012).
Actually, the notion had been introduced before by Marchiori and Latora (2000) and named connectivity length, but we find the name “harmonic diameter” much more insightful.
We purposely use the word “divergence” between distributions, instead of “distance”, to avoid confusion with the notion of distance in a graph.
It is mostly a matter of taste whether to use directed symmetric graphs or simple undirected graphs. In our case, since we have to cope with both directed and undirected graph, we prefer to speak of directed graphs that are symmetric, that is, for every arc x→ y, there is a symmetric arc y→ x.
Label propagation has been independently proposed under the name of peer pressure clustering by Gilbert et al. (2007).
There are sampled variants of Brandes’s algorithm (Brandes and Erlebach 2005a), but the Hoeffding bound providing precision guarantees requires \(\Uptheta(n^4\log n/\varepsilon^2)\) visits to obtain absolute precision \(\varepsilon. \)
In Boldi et al. (2011b), we also presented the outcomes of similar experiments performed on other networks (Amazon, Enron and . it ) that agree with the ones shown here. Our tables and graphs slightly differ from those previously published in (Boldi et al. 2011b), because we had time to generate more runs, and thus, increase the precision of our results.
http://law.di.unimi.it/. In particular, the graphs we used are the datasets named hollywood-2011, ljournal-2008, orkut-2007, in-2004 and uk-2007-05. Note that isolated nodes have been removed from hollywood-2011.
We remark that in some cases, the measure is negative or does not decrease monotonically. This is sometimes an artifact of the probabilistic technique used to estimate our measures—small relative errors are unavoidable.
In this paper, like in all the other experimental research on the same topic, conclusions about social networks should be taken with a grain of salt, due to the heterogeneity of such networks and the lack of a large repertoire of examples.
References
Albert R, Jeong H, Barabási A-L (2000) Error and attack tolerance of complex networks. Nature 406:378–382
Anthonisse JM (1971) The rush in a graph. Technical report, University of Amsterdam Mathematical Centre, Amsterdam
Backstrom L, Boldi P, Rosa M, Ugander J, Vigna S (2012) Four degrees of separation. In: ACM Web Science 2012: Conference Proceedings, pp 45–54 (Best paper award)
Bavelas A (1950) Communication patterns in task-oriented groups. J Acoust Soc Am 57:271–282
Boldi P, Vigna S (2012a) Four degrees of separation, really. Arxiv preprint arxiv:1205.5509
Boldi P, Vigna S (2012b). Harmonic centrality (in preparation)
Boldi P, Santini M, Vigna S (2009) Page Rank: functional dependencies. ACM Trans Inf Sys 27(4):1–23
Boldi P, Rosa M, Vigna S (2011a) HyperANF: approximating the neighbourhood function of very large graphs on a budget. In: Srinivasan S, Ramamritham S, Kumar A, Ravindra MP, Bertino E, Kumar R (eds) Proceedings of the 20th international conference on World Wide Web. ACM, pp 625–634
Boldi P, Rosa M, Vigna S (2011b) Robustness of social networks: Comparative results based on distance distributions. In: Social Informatics, Third International Conference, SocInfo 2011. Lecture Notes in Computer Science, vol 6894. Springer, Berlin, pp 8–21
Borgatti SP (2005) Centrality and network flow. Soc Netw 27(1):55–71
Borgatti SP (2006) Identifying sets of key players in a social network. Comput Math Organ Theory 12:21–34
Borgatti SP, Carley KM, Krackhardt D (2006) On the robustness of centrality measures under conditions of imperfect data. Soc Netw 28(2):124–136
Brandes U (2001) A faster algorithm for betweenness centrality. J Math Sociol 25(2):163–177
Brandes U, Erlebach T (eds) (2005a) Network analysis: methodological foundations. Lecture Notes in Computer Science, vol 3418. Springer, Berlin
Brandes U, Erlebach T (2005b) Network analysis: methodological foundations (Lecture Notes in Computer Science). Number 3418 in Lecture Notes in Computer Science. Springer, Berlin
Chierichetti F, Kumar R, Lattanzi S, Mitzenmacher M, Panconesi A, Raghavan P (2009) On compressing social networks. In: KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 219–228
Cohen E (1997) Size-estimation framework with applications to transitive closure and reachability. J Comput Syst Sci 55:441–453
Cohen R, Havlin S (2010) Complex networks: structure, robustness and function. Cambridge University Press, Cambridge
Donato D, Leonardi S, Millozzi S, Tsaparas P (2008) Mining the inner structure of the web graph. J Phys A: Math Theor 41(22):224017
Efron B, Gong G (1983) A leisurely look at the bootstrap, the jackknife, and cross-validation. Am Stat 37(1):36–48
Flajolet P, Fusy É, Gandouet O, Meunier F (2007) HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Proceedings of the 13th conference on analysis of algorithm (AofA 07), Juan-les-Pins, pp 127–146
Fogaras D (2003) Where to start browsing the web? In: Innovative Internet Community Systems, Third International Workshop, IICS 2003. Lecture Notes in Computer Science, vol 2877. Springer, Leipzig, pp 65–79
Freeman LC (1977) A set of measures of centrality based on betweenness. Sociometry 40(1):35–41
Gilbert JR, Reinhardt S, Shah V (2007) High-performance graph algorithms from parallel sparse matrices. In: Kagstrom B, Elmroth E, Dongarra J, Wasniewski J (eds) Applied parallel computing. State of the Art in Scientific Computing (8th PARA’06). Lecture Notes in Computer Science, vol 4699. Springer, New York, pp 260–269
Kendall MG (1945) The treatment of ties in ranking problems. Biometrika 33(3):239–251
Langville AN, Meyer CD (2004) Deeper inside Page Rank. Internet Math 1(3):355–400
Li L, Alderson DL, Doyle J, Willinger W (2005) Towards a theory of scale-free graphs: definition, properties, and implications. Internet Math 2(4):431–523
Marchiori M, Latora V (2000) Harmony in the small-world. Physica A: Stat Mech Appl 285(3-4):539–546
Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07), San Diego
Newman MEJ, Park J (2003) Why social networks are different from other types of networks. Phys Rev E 68(3):036122
Page L, Brin S, Motwani R, Winograd T (1998) The Page Rank citation ranking: bringing order to the web. Technical report, Stanford Digital Library Technologies Project, Stanford University, Stanford
Palmer CR, Gibbons PB, Faloutsos C (2002) Anf: a fast and scalable tool for data mining in massive graphs. In: KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, New York, pp 81–90
Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3):036106. doi:10.1103/PhysRevE.76.036106
Travers J, Milgram S (1969) An experimental study of the small world problem. Sociometry 32(4):425–443
Author information
Authors and Affiliations
Corresponding author
Additional information
Paolo Boldi and Sebastiano Vigna have been partially supported by a Yahoo! faculty grant, by MIUR (Italian Ministry of University and Research) and by the EU-FET grant NADINE (GA 288956).
Rights and permissions
About this article
Cite this article
Boldi, P., Rosa, M. & Vigna, S. Robustness of social and web graphs to node removal. Soc. Netw. Anal. Min. 3, 829–842 (2013). https://doi.org/10.1007/s13278-013-0096-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13278-013-0096-x