
Zipf's law, power laws and maximum entropy

Published 16 April 2013. © IOP Publishing and Deutsche Physikalische Gesellschaft.

Citation: Matt Visser 2013 New J. Phys. 15 043021. DOI: 10.1088/1367-2630/15/4/043021

Abstract

Zipf's law, and power laws in general, have attracted and continue to attract considerable attention in a wide variety of disciplines—from astronomy to demographics to software structure to economics to linguistics to zoology, and even warfare. A recent model of random group formation (RGF) attempts a general explanation of such phenomena based on Jaynes' notion of maximum entropy applied to a particular choice of cost function. In the present paper I argue that the specific cost function used in the RGF model is in fact unnecessarily complicated, and that power laws can be obtained in a much simpler way by applying maximum entropy ideas directly to the Shannon entropy subject only to a single constraint: that the average of the logarithm of the observable quantity is specified.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Zipf's law [1–3], and power laws in general [4–6], have attracted and continue to attract considerable attention in a wide variety of disciplines—from astronomy to demographics to software structure to economics to zoology, and even to warfare [7]. Typically one is dealing with integer-valued observables (numbers of objects, people, cities, words, animals, corpses), with n ∈ {1,2,3,...}. Sometimes the range of values is allowed to be infinite (at least in principle), sometimes a hard upper bound N is fixed (e.g., total population if one is interested in subdividing a fixed population into sub-classes). Particularly interesting are probability laws of the form

  • Zipf's law: $p_n \propto 1/n$;
  • power laws: $p_n \propto 1/n^z$;
  • hybrid geometric/power laws: $p_n \propto w^n/n^z$.

Specifically, a recent model of random group formation (RGF), see [8], attempts a general explanation of such phenomena based on Jaynes' notion of maximum entropy [9–13] applied to a particular choice of cost function [8]. (For recent related work largely in a demographic context see [14–18]. For related work in a fractal context implemented using an iterative framework see [19].)

In the present paper I shall argue that the specific cost function used in the RGF model is in fact unnecessarily complicated (in fact RGF most typically leads to a hybrid geometric/power law, not a pure power law), and that power laws can be obtained in a much simpler way by applying maximum entropy ideas directly to the Shannon entropy itself [20, 21] subject only to a single constraint: that the average of the logarithm of the observable quantity is specified. Similarly, I would argue that (at least as long as the main issue one is interested in is 'merely' the minimum requirements for obtaining a power law) the appeal to a fractal framework and the iterative model adopted by Pastor-Satorras and Wagensberg [19] is also unnecessarily complicated.

To place this observation in perspective, I will explore several variations on this theme, modifying both the relevant state space and the number of constraints, and will briefly discuss the relevant special functions of mathematical physics that one encounters (zeta functions, harmonic series, poly-logarithms). I shall also discuss an extremely general Gibbs-like model, and the use of non-Shannon entropies (the Rényi [22] and Tsallis [23] entropies and their generalizations). There is a very definite trade-off between simplicity and generality, and I shall very much focus on keeping the discussion as technically simple as possible, and on identifying the simplest model with minimalist assumptions.

2. Power laws in infinite state space

Let us define the set of observable quantities to be positive integers n ∈ {1,2,3,...}, without any a priori upper bound. The maximum entropy approach [9–13] seeks to estimate the probabilities $p_n$ by maximizing the Shannon entropy [20, 21]

$$S = -\sum_{n=1}^{\infty} p_n \ln p_n \qquad (1)$$

subject to a (small) number of constraints/cost functions—representing our limited state of knowledge regarding the underlying process. For example, the RGF model of Baek et al [8] uses one constraint and one (relatively complicated) cost function [8], in addition to the 'trivial' normalization constraint $\sum _n p_n = 1$ . Instead, let us consider the single constraint

$$\langle \ln n \rangle = \sum_{n=1}^{\infty} p_n \ln n = \chi. \qquad (2)$$

Let us now maximize the Shannon entropy subject to this constraint. This is best done by introducing a Lagrange multiplier z corresponding to the constraint 〈ln n〉, plus a second Lagrange multiplier λ corresponding to the 'trivial' normalization constraint, and considering the quantity

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;-\; \lambda\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (3)$$

Of course there is no loss of generality in redefining the Lagrange multiplier λ → Z as follows:

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (4)$$

Varying with respect to the $p_n$ yields the extremality condition

$$-\ln p_n - z \ln n - \ln Z = 0, \qquad (5)$$

with explicit solution

$$p_n = \frac{n^{-z}}{\zeta(z)}. \qquad (6)$$

Here ζ(z) is the Riemann zeta function [24–27], a relatively common and well-known special function, and the condition z > 1 is required to make the sum $\sum_{n=1}^\infty n^{-z} = \zeta(z)$ converge. This is enough to tell you that one will never exactly reproduce Zipf's law (z = 1) in this particular manner, though one can get arbitrarily close. The value of the Lagrange multiplier z (which becomes the exponent in the power law) is determined self-consistently in terms of χ by demanding

$$\chi = \langle \ln n \rangle = \frac{\sum_{n=1}^{\infty} n^{-z}\ln n}{\zeta(z)} = -\frac{\zeta'(z)}{\zeta(z)}. \qquad (7)$$

Then z ∈ (1,∞) while χ ∈ (0,∞). For practical calculations it is often best to view the exponent z as the single free parameter and χ(z) as the derived quantity, but this viewpoint can easily be inverted if desired. Near z = 1 we have the analytic estimate

$$\chi = \frac{1}{z-1} - \gamma + O(z-1), \qquad (8)$$

where γ denotes Euler's constant. At maximum entropy, imposing the extremality condition and summing, we have

$$S = \ln \zeta(z) + z\chi. \qquad (9)$$

Near z = 1 we have the analytic estimate

$$S = \frac{1}{z-1} - \ln(z-1) + (1-\gamma) + O(z-1). \qquad (10)$$

For future reference, it is useful to observe that [27, see p 118]

$$\zeta(z) = \frac{1}{z-1} + \sum_{m=0}^{\infty} \frac{(-1)^m}{m!}\,\gamma_m\,(z-1)^m, \qquad (11)$$

where the Stieltjes constants $\gamma_m$ satisfy

$$\gamma_m = \lim_{N\to\infty}\left\{\sum_{k=1}^{N} \frac{(\ln k)^m}{k} - \frac{(\ln N)^{m+1}}{m+1}\right\}. \qquad (12)$$

A better estimate, using Euler–Maclaurin summation, is

Equation (13)

The quick lesson to take is this: by applying maximum entropy considerations to the single constraint 〈ln n〉 = χ you can get a pure power law with any exponent z > 1. Furthermore, note that the quantity

$$\exp\langle \ln n \rangle = \exp\left(\sum_{n=1}^{\infty} p_n \ln n\right) = \prod_{n=1}^{\infty} n^{\,p_n} \qquad (14)$$

is the geometric mean of the integers {1,2,3,...} with the exponents weighted by the probabilities $p_n$. So one can just as easily obtain the pure power laws considered above by maximizing the entropy subject to the constraint that this geometric mean takes on a specified value.
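For practical calculations the self-consistency relation between z and χ is easy to handle numerically. The following minimal sketch (an illustration added here, not part of the original paper) evaluates χ(z) = −ζ′(z)/ζ(z) and inverts it to recover the exponent from a specified logarithmic average; the function names are arbitrary and the use of the mpmath library is simply one convenient choice.

```python
# Illustrative sketch: chi(z) = <ln n> = -zeta'(z)/zeta(z) for p_n = n^(-z)/zeta(z),
# and its numerical inversion z(chi). Assumes the mpmath library is available.
from mpmath import mp, zeta, diff, findroot

mp.dps = 30  # working precision

def chi_of_z(z):
    """Logarithmic average <ln n> implied by p_n = n^(-z)/zeta(z), valid for z > 1."""
    return -diff(zeta, z) / zeta(z)

def z_of_chi(chi):
    """Numerically invert chi(z) to recover the exponent z from a specified <ln n>."""
    z0 = 1 + 1 / (chi + mp.euler)  # seed from the near-z=1 estimate chi ~ 1/(z-1) - gamma
    return findroot(lambda z: chi_of_z(z) - chi, z0)

if __name__ == "__main__":
    for z in (1.1, 1.5, 2.0, 3.0):
        print(f"z = {z}:  chi = <ln n> = {chi_of_z(z)}")
    print("recovered exponent for chi = 2.5:", z_of_chi(2.5))
```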

3. Power laws in finite state space

If one desires an exact Zipf law (exponent z = 1, so that $p_n \propto 1/n$), then because the harmonic series diverges, $\sum_{n=1}^\infty 1/n = \infty$, it is clear that something in the above formulation needs to change. Perhaps the easiest thing to do is to introduce an explicit maximum value of n, call it N, so that we take the set of observables to be positive integers n ∈ {1,2,3,...,N}. (Physicists would call this an infra-red cutoff, or large-distance cutoff.) The maximum entropy approach now amounts to considering

$$-\sum_{n=1}^{N} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{N} p_n \ln n - \chi\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{N} p_n - 1\right\}. \qquad (15)$$

Varying with respect to the $p_n$ and maximizing again yields the same extremality condition

$$-\ln p_n - z \ln n - \ln Z = 0, \qquad (16)$$

but now implying

$$p_n = \frac{n^{-z}}{H_N(z)}. \qquad (17)$$

Here $H_N(z)$ is the (reasonably well known) generalized harmonic function [27]

$$H_N(z) = \sum_{n=1}^{N} \frac{1}{n^{z}}. \qquad (18)$$

Compared with the previous case the only real difference lies in the normalization function. However, because the sum is now always finite, there is no longer any constraint on the value of the exponent z; in fact we can have z ∈ (−∞,∞). The case z = 1 is Zipf's law, while z = 0 is a uniform distribution, and z < 0 corresponds to an 'inverted hierarchy' where large values are more common than small values. The price paid for this extra flexibility is that the model now has two free parameters, which can be chosen to be z and N. One has the self-consistency constraint

$$\chi = \langle \ln n \rangle = \frac{\sum_{n=1}^{N} n^{-z}\ln n}{H_N(z)} = -\frac{H_N'(z)}{H_N(z)}. \qquad (19)$$

It is easy to check that χ is now bounded by χ∈(0,ln N). At maximum entropy we now have

$$S = \ln H_N(z) + z\chi. \qquad (20)$$

Because this is now a two-parameter model, it will always (naively) be a 'better' fit to observational data than a single-parameter model. Sometimes (for z ⩽ 1) retreating to this two-parameter model is necessary, but for z > 1 the one-parameter model of the previous section (N → ∞) should be preferred.
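A matching numerical sketch for the finite state space (again illustrative, not from the paper) evaluates H_N(z), the induced logarithmic average, and solves the self-consistency condition for z given a target χ ∈ (0, ln N); the bracketing interval and all names are arbitrary choices.

```python
# Illustrative sketch: the finite-state-space power law p_n = n^(-z)/H_N(z) and the
# self-consistency condition <ln n> = chi that fixes z once N is chosen.
import numpy as np
from scipy.optimize import brentq

def H(N, z):
    """Generalized harmonic number H_N(z) = sum_{n=1}^{N} n^(-z)."""
    n = np.arange(1, N + 1, dtype=float)
    return np.sum(n ** (-z))

def chi_of_z(N, z):
    """Logarithmic average <ln n> for p_n = n^(-z)/H_N(z)."""
    n = np.arange(1, N + 1, dtype=float)
    p = n ** (-z) / H(N, z)
    return np.sum(p * np.log(n))

def z_of_chi(N, chi, z_lo=-50.0, z_hi=50.0):
    """Solve <ln n> = chi for the exponent z; chi must lie strictly inside (0, ln N)."""
    return brentq(lambda z: chi_of_z(N, z) - chi, z_lo, z_hi)

if __name__ == "__main__":
    N = 1000
    print("Zipf, z = 1:     <ln n> =", chi_of_z(N, 1.0))   # roughly (ln N)/2
    print("uniform, z = 0:  <ln n> =", chi_of_z(N, 0.0))   # roughly ln N - 1
    print("exponent needed for chi = 2.0:", z_of_chi(N, 2.0))
```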

4. Zipf's law in finite state space

If for observational or theoretical reasons one is certain that z = 1 (Zipf's law), then the model reduces as follows: the state space is n∈{1,2,3,...,N} where N is now the only free parameter. Then explicitly forcing z → 1 one considers

$$-\sum_{n=1}^{N} p_n \ln p_n \;-\; \left\{\sum_{n=1}^{N} p_n \ln n - \chi\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{N} p_n - 1\right\}. \qquad (21)$$

This is completely equivalent to considering

$$-\sum_{n=1}^{N} p_n \ln(n\,p_n) \;+\; \chi \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{N} p_n - 1\right\}, \qquad (22)$$

but writing the quantity to be maximized in this way hides the role of the Shannon entropy. Varying with respect to the pn and maximizing now yields a (very) slightly different extremality condition

$$-\ln p_n - \ln n - \ln Z = 0, \qquad (23)$$

and so

$$p_n = \frac{1}{n\,H_N}. \qquad (24)$$

Here $H_N$ is the (ordinary) harmonic number [27]

$$H_N = H_N(1) = \sum_{n=1}^{N} \frac{1}{n}. \qquad (25)$$

Then

$$\chi = \langle \ln n \rangle = \frac{1}{H_N}\sum_{n=1}^{N} \frac{\ln n}{n}. \qquad (26)$$

Furthermore, at maximum entropy

$$S = \ln H_N + \chi. \qquad (27)$$

Now we have already seen

$$H_N = \ln N + \gamma + O(1/N), \qquad (28)$$

and

$$\sum_{n=1}^{N} \frac{\ln n}{n} = \frac{(\ln N)^2}{2} + \gamma_1 + O\!\left(\frac{\ln N}{N}\right). \qquad (29)$$

Therefore

$$\chi = \langle \ln n \rangle = \frac{(\ln N)^2/2 + \gamma_1 + \cdots}{\ln N + \gamma + \cdots} = \frac{\ln N - \gamma}{2} + O\!\left(\frac{1}{\ln N}\right), \qquad (30)$$

and at maximum entropy

$$S = \ln H_N + \chi \approx \ln\ln N + \frac{\ln N - \gamma}{2}. \qquad (31)$$

Note that you can use this to estimate the size N of the state space one needs to adopt in order to be compatible with the (observed) logarithmic average 〈ln n〉. Indeed

$$\ln N \approx 2\chi + \gamma, \qquad\text{that is}\qquad N \approx e^{\gamma}\left[\exp\langle \ln n\rangle\right]^{2}. \qquad (32)$$

This relates the size of the required state space N to the square of the geometric mean $\exp \langle \ln n \rangle $ .
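A small numerical check (illustrative, not from the paper) of this estimate: compute the exact ⟨ln n⟩ for a Zipf law on {1,...,N} and reconstruct N from it using the leading-order relation ln N ≈ 2χ + γ, i.e. N is set by the square of the geometric mean up to a factor of order one.

```python
# Illustrative sketch: for an exact Zipf law p_n = 1/(n H_N), the observed <ln n>
# fixes the required state-space size via ln N ~ 2*chi + gamma.
import numpy as np

def zipf_chi(N):
    """Exact <ln n> for the Zipf law p_n = 1/(n H_N) on {1, ..., N}."""
    n = np.arange(1, N + 1, dtype=float)
    H_N = np.sum(1.0 / n)
    return np.sum(np.log(n) / n) / H_N

for N in (10**3, 10**5, 10**6):
    chi = zipf_chi(N)
    N_est = np.exp(np.euler_gamma) * np.exp(2 * chi)   # e^gamma * [exp<ln n>]^2
    print(f"true N = {N:>8},  chi = {chi:.4f},  estimated N = {N_est:.3e}")
```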

5. Hybrid geometric/power models in infinite state space

To generate a hybrid geometric/power law model, similar in output to the RGF model [8], but with considerably simpler input assumptions, simply take both the logarithmic average 〈ln n〉 = χ and the arithmetic average 〈n〉 = μ to be specified—and then maximize the Shannon entropy subject to these two constraints (plus the trivial normalization constraint). That is, introduce two Lagrange multipliers z and w, and maximize

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;+\; \ln w\left\{\sum_{n=1}^{\infty} n\,p_n - \mu\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (33)$$

Varying with respect to the $p_n$ yields

$$-\ln p_n - z \ln n + n \ln w - \ln Z = 0, \qquad (34)$$

with solution

$$p_n = \frac{w^{n}\, n^{-z}}{\mathrm{Li}_z(w)}. \qquad (35)$$

Here the normalizing constant is the well-known poly-logarithm function [24–27]

$$Z = \mathrm{Li}_z(w) = \sum_{n=1}^{\infty} \frac{w^{n}}{n^{z}}. \qquad (36)$$

Then

$$\chi = \langle \ln n \rangle = -\frac{\partial_z\,\mathrm{Li}_z(w)}{\mathrm{Li}_z(w)}, \qquad (37)$$

while

$$\mu = \langle n \rangle = \frac{w\,\partial_w\,\mathrm{Li}_z(w)}{\mathrm{Li}_z(w)} = \frac{\mathrm{Li}_{z-1}(w)}{\mathrm{Li}_z(w)}. \qquad (38)$$

Furthermore, at maximum entropy

$$S = \ln \mathrm{Li}_z(w) + z\chi - \mu\ln w. \qquad (39)$$

Note that the probability function arising in this model is fully as general as that arising in the RGF model [8], but with what is perhaps a somewhat clearer interpretation.

This is because the RGF model uses what may be viewed as an unnecessarily complicated 'cost function', with an unnecessary degeneracy in the parameters. Indeed, from [8] one sees (in their notation, slightly different from current notation)

Equation (40)

That is

Equation (41)

Equation (42)

Equation (43)

So the RGF cost function [8] is simply a linear combination of the Shannon entropy, the logarithmic mean 〈ln k〉, and a redundant constant offset ln N. (Unfortunately the N of Baek et al [8] is not the same as the N used in this paper: the RGF parameter N corresponds to the number of independent realizations of the underlying statistical process one considers—it is the number of simulations, or the number of universes in the statistical ensemble.) The additional complexity implicit in the RGF model can to some extent be viewed as a side-effect of forcing data into discrete ordinal boxes when one does not necessarily have good physical/mathematical/demographic reasons for knowing which particular boxes are 'best', how the boxes are to be assigned ordinal numbers, and where the box boundaries should most profitably be placed. Apart from the issues raised above, one could in addition explicitly restrict the state space to be finite, adding yet another free parameter (M in the language of Baek et al [8], N in the language of this note), but there is little purpose in doing so—the key insight is this: once the data are assigned to ordinal boxes, hybrid geometric/power laws drop out automatically and straightforwardly by maximizing the Shannon entropy subject to the two very simple constraints 〈ln n〉 = χ and 〈n〉 = μ.
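As a concrete illustration of the hybrid law (added here, not taken from the paper or from the RGF implementation), the sketch below evaluates p_n = w^n n^{−z}/Li_z(w) and the two averages ⟨ln n⟩ and ⟨n⟩ that the Lagrange multiplier pair (z, w) must reproduce; it relies on mpmath's polylog, and all names and parameter values are arbitrary.

```python
# Illustrative sketch: the hybrid geometric/power law p_n = w^n n^(-z) / Li_z(w)
# and its logarithmic and arithmetic averages. Assumes mpmath is available.
from mpmath import mp, polylog, nsum, inf, mpf, log, power

mp.dps = 25

def hybrid_p(n, z, w):
    """Probability p_n = w^n n^(-z) / Li_z(w), for 0 < w < 1."""
    return power(w, n) * power(n, -z) / polylog(z, w)

def averages(z, w):
    """Return (chi, mu) = (<ln n>, <n>) for the hybrid geometric/power law."""
    Z = polylog(z, w)
    chi = nsum(lambda n: power(w, n) * power(n, -z) * log(n), [1, inf]) / Z
    mu = polylog(z - 1, w) / Z      # uses the identity w d/dw Li_z(w) = Li_{z-1}(w)
    return chi, mu

if __name__ == "__main__":
    z, w = mpf("1.5"), mpf("0.9")
    chi, mu = averages(z, w)
    print("chi = <ln n> =", chi, "   mu = <n> =", mu)
    print("normalization check:", nsum(lambda n: hybrid_p(n, z, w), [1, inf]))
```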

6. Zipf's law: geometric version in infinite state space

If for observational or theoretical reasons one is certain that z = 1 (Zipf's law), but for whatever reason feels a finite state space cutoff N is inappropriate, then a geometric version of Zipf's law can be extracted from the hybrid model. Setting z = 1 the model reduces as follows: the state space is now n ∈ {1,2,3,...}, while the quantity to be maximized becomes

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; \left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;+\; \ln w\left\{\sum_{n=1}^{\infty} n\,p_n - \mu\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (44)$$

This is completely equivalent to maximizing

$$-\sum_{n=1}^{\infty} p_n \ln(n\,p_n) \;+\; \chi \;+\; \ln w\left\{\sum_{n=1}^{\infty} n\,p_n - \mu\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}, \qquad (45)$$

but writing the quantity to be maximized in this way hides the role of the Shannon entropy. Varying with respect to the pn yields

$$-\ln p_n - \ln n + n \ln w - \ln Z = 0, \qquad (46)$$

with solution

$$p_n = \frac{w^{n}}{n\,\left[-\ln(1-w)\right]}. \qquad (47)$$

Note the normalizing function is now extremely simple—the natural logarithm. Then

$$\chi = \langle \ln n \rangle = \frac{1}{-\ln(1-w)}\sum_{n=1}^{\infty} \frac{w^{n}\ln n}{n}, \qquad (48)$$

while

$$\mu = \langle n \rangle = \frac{\mathrm{Li}_0(w)}{\mathrm{Li}_1(w)} = \frac{w}{(1-w)\,\left[-\ln(1-w)\right]}. \qquad (49)$$

Furthermore, at maximum entropy

$$S = \ln\left[-\ln(1-w)\right] + \chi - \mu \ln w. \qquad (50)$$

This is a one-parameter model with a geometrical cutoff, which for w ≲ 1 (more precisely, in the limit w → 1) approximates the naive un-normalizable Zipf law with arbitrary accuracy.
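A tiny numerical illustration (not from the paper) makes the w → 1 limit visible: for p_n = w^n/(n[−ln(1−w)]) the products n p_n become essentially n-independent as w → 1, which is precisely the Zipf behaviour.

```python
# Illustrative sketch: the geometric regularization of Zipf's law and its w -> 1 limit.
import numpy as np

def geometric_zipf(N_show, w):
    """First N_show probabilities of p_n = w^n / (n * [-ln(1 - w)])."""
    n = np.arange(1, N_show + 1, dtype=float)
    return w ** n / (n * (-np.log1p(-w)))

for w in (0.9, 0.99, 0.999):
    p = geometric_zipf(5, w)
    # As w -> 1 the products n * p_n become n-independent, i.e. pure Zipf behaviour.
    print(f"w = {w}:  n * p_n for n = 1..5 ->", np.round(p * np.arange(1, 6), 5))
```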

7. Very general Gibbs-like model

Let us now consider an arbitrary number of constraints of the form

$$\langle f_a(n) \rangle = \sum_{n} p_n\, f_a(n) = F_a, \qquad (51)$$

and

$$\langle \ln g_i(n) \rangle = \sum_{n} p_n \ln g_i(n) = \chi_i. \qquad (52)$$

One could always transform a g-type constraint into an f-type constraint or vice versa, but as we shall soon see there are advantages to keeping the logarithm explicit. Applying the maximum entropy principle amounts to considering

$$-\sum_{n} p_n \ln p_n \;-\; \sum_i z_i\left\{\sum_{n} p_n \ln g_i(n) - \chi_i\right\} \;-\; \sum_a \beta_a\left\{\sum_{n} p_n\, f_a(n) - F_a\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n} p_n - 1\right\}, \qquad (53)$$

where with malice aforethought we have now relabelled the Lagrange multipliers for the f constraints as follows: ln w → − β. Then maximizing over the $p_n$ we have the extremality condition

$$-\ln p_n - \sum_i z_i \ln g_i(n) - \sum_a \beta_a f_a(n) - \ln Z = 0, \qquad (54)$$

with explicit solution

$$p_n = \frac{1}{Z}\left\{\prod_i g_i(n)^{-z_i}\right\}\exp\!\left(-\sum_a \beta_a f_a(n)\right), \qquad (55)$$

where now the normalizing constant is

$$Z = \sum_{n}\left\{\prod_i g_i(n)^{-z_i}\right\}\exp\!\left(-\sum_a \beta_a f_a(n)\right). \qquad (56)$$

This can be viewed as a generalization/modification of the Gibbs distribution where we have explicitly pulled out some of the constraints (the g-type constraints) to make them look power-law-like, while the remaining constraints (the f-type constraints) are left in Boltzmann-like form. Then

$$\langle \ln g_i(n) \rangle = -\frac{\partial \ln Z}{\partial z_i}, \qquad (57)$$

while

$$\langle f_a(n) \rangle = -\frac{\partial \ln Z}{\partial \beta_a}. \qquad (58)$$

Furthermore, at maximum entropy

$$S = \ln Z + \sum_i z_i \langle \ln g_i(n) \rangle + \sum_a \beta_a \langle f_a(n) \rangle. \qquad (59)$$

It is only once one specifies particular choices for the functions fa and gi that the model becomes concrete, and only at that stage might one need to focus on particular special functions of mathematical physics. The model is extremely general—the drawback is that, because it can fit almost anything, it can 'explain' almost anything, and so can predict almost nothing.
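To make the structure concrete, here is an illustrative sketch (not from the paper) of such a Gibbs-like distribution on a truncated state space; the functions g_i and f_a, the truncation, and all parameter values are hypothetical stand-ins chosen purely for demonstration.

```python
# Illustrative sketch: p_n proportional to prod_i g_i(n)^(-z_i) * exp(-sum_a beta_a f_a(n)).
import numpy as np

def gibbs_like(N, g_funcs, z_vals, f_funcs, beta_vals):
    """Normalized p_n for n = 1..N with power-law-like (g) and Boltzmann-like (f) factors."""
    n = np.arange(1, N + 1, dtype=float)
    log_p = np.zeros_like(n)
    for g, z in zip(g_funcs, z_vals):
        log_p -= z * np.log(g(n))        # power-law-like (g-type) factors
    for f, beta in zip(f_funcs, beta_vals):
        log_p -= beta * f(n)             # Boltzmann-like (f-type) factors
    p = np.exp(log_p - log_p.max())      # subtract the maximum for numerical stability
    return p / p.sum()

# One g-type factor with g(n) = n and one f-type factor with f(n) = n reproduces the
# hybrid geometric/power law of section 5 (here beta corresponds to -ln w).
p = gibbs_like(10_000, [lambda n: n], [1.5], [lambda n: n], [0.01])
n = np.arange(1, 10_001, dtype=float)
print("normalization:", p.sum(), "  <ln n> =", np.sum(p * np.log(n)), "  <n> =", np.sum(p * n))
```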

8. Non-Shannon entropies

Shannon's entropy is by far the best motivated of the entropy functions infesting the literature. Without making any particular commitment to the advisability of doing so, we can certainly ask what happens if we apply maximum entropy ideas to non-Shannon entropies (such as the Rényi [22] or Tsallis [23] entropies and their generalizations). Let us define an entropic zeta function by

$$\zeta_S(s) = \sum_{n} p_n^{\,s}, \qquad (60)$$

which certainly converges for s ⩾ 1 and may converge on a wider region. Then

$$S_{\text{Rényi}} = -\frac{\ln \zeta_S(1+a)}{a}; \qquad S_{\text{Tsallis}} = \frac{1 - \zeta_S(1+a)}{a}, \qquad (61)$$

where in both cases the Shannon entropy is recovered in the limit a → 0. More generally let us consider a generalized entropy of the form

$$S = f\left(\zeta_S(1+a)\right), \qquad (62)$$

for an arbitrary smooth function f(·). Let us further impose a constraint on the bth moment

$$\langle n^{b} \rangle = \sum_{n} p_n\, n^{b}. \qquad (63)$$

With power laws being explicitly built in as input into both the generalized entropy and the constraint, it is perhaps not too surprising that we will manage to get power laws dropping out. One is now interested in maximizing

Equation (64)

Varying the $p_n$ leads to

Equation (65)

with solution

$$p_n = \left[\lambda\left(n^{b} + w\right)\right]^{1/a}. \qquad (66)$$

(An overall factor of (1 + a) f'(ζ_S(1 + a)) simply drops out of the calculation.) If the number of states is finite, then we cannot a priori discard the parameter w, and we have derived a distorted power law. (The probability distribution then interpolates between a pure power law for w = 0 and a uniform distribution for w = ∞.) If, on the other hand, the number of states is infinite, then normalizability enforces w → 0, and λ factors out. We then have a pure power law

$$p_n = \frac{n^{\,b/a}}{Z}. \qquad (67)$$

If the state space is the positive integers n∈{1,2,3,...} then Z → ζ(−b/a), the Riemann zeta function, and the sum converges only for −b/a > 1. So one of the two parameters (a, b) must be negative. In this situation

$$\langle n^{b} \rangle = \frac{\zeta(-b/a - b)}{\zeta(-b/a)}, \qquad (68)$$

now requiring both −b − b/a > 1 and −b/a > 1, while at maximum entropy

$$S = f\!\left(\frac{\zeta(-b/a - b)}{\left[\zeta(-b/a)\right]^{1+a}}\right). \qquad (69)$$

So yes, one can also extract power laws from maximum entropy applied to non-Shannon entropies (in particular, the generalized Rényi–Tsallis entropies), but the process is (at best) rather clumsy, and seems an exercise in overkill. Apart from the whole question of whether or not non-Shannon entropies are particularly interesting, one should ask whether the result is particularly useful. This derivation does not seem to be in any way an improvement over the simpler one based directly on the Shannon entropy, so its utility is dubious.

9. Summary and discussion

The main point of this paper is that power laws (and their variants, including hybrid geometric/power laws) have a very natural and straightforward interpretation in terms of the maximum entropy formalism pioneered by Jaynes. The key to obtaining a pure power law in the simplest possible manner lies in maximizing the Shannon entropy while imposing the simple constraint 〈ln n〉 = χ. Depending on other features of the specific model under consideration, detailed analysis leads to certain of the special functions of mathematics (the Riemann zeta function, generalized harmonic functions, poly-logarithms or even ordinary logarithms), but these are relatively well-known mathematical objects, which are still tolerably simple to deal with.
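As a purely numerical illustration of this central claim (added here, not part of the original paper), one can maximize the Shannon entropy directly, subject only to normalization and ⟨ln n⟩ = χ on a truncated state space, and confirm that the maximizer is (to numerical accuracy) a pure power law; the truncation, the target value of χ, and the choice of scipy's SLSQP optimizer are all arbitrary.

```python
# Illustrative sketch: direct numerical maximization of the Shannon entropy subject to
# normalization and <ln n> = chi, followed by a log-log fit to confirm a power law.
import numpy as np
from scipy.optimize import minimize

N_TRUNC = 200    # truncation of the state space {1, ..., N_TRUNC}, for numerics only
CHI = 0.57       # target logarithmic average <ln n>; roughly corresponds to z ~ 2

n = np.arange(1, N_TRUNC + 1, dtype=float)
log_n = np.log(n)

def neg_entropy(p):
    p = np.clip(p, 1e-300, None)
    return float(np.sum(p * np.log(p)))     # minus the Shannon entropy

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # normalization
    {"type": "eq", "fun": lambda p: np.sum(p * log_n) - CHI},   # <ln n> = chi
]

p0 = np.ones(N_TRUNC) / N_TRUNC
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(0.0, 1.0)] * N_TRUNC, options={"maxiter": 1000})

# A pure power law is a straight line in log-log coordinates: ln p_n = -z ln n - ln Z.
p = res.x
mask = p > 1e-10
slope, intercept = np.polyfit(log_n[mask], np.log(p[mask]), 1)
print("fitted exponent z ≈", -slope)   # should be close to the z with -zeta'(z)/zeta(z) = chi
```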

Adding additional features (finite size state space, extra constraints) can (and typically will) modify both the functional form and the normalization constants appearing in the probability distribution. As always there is a trade-off between simplicity and flexibility. A more complicated model (with more free parameters) has a more flexible probability distribution, but this comes at a real (if often unacknowledged) cost in terms of internal complexity. A rather general Gibbs-like model is laid out and briefly discussed. We also briefly discuss applying maximum entropy ideas to non-Shannon entropies (such as the Rényi and Tsallis entropies). There is very definitely a trade-off in both elegance and plausibility, and I would argue strongly that the simplest and most elegant model consists of the Shannon entropy, a constraint on 〈ln n〉 = χ, and a trivial normalization constraint on the sum of probabilities.

The fact that the logarithmic average 〈ln n〉 plays such an important role in power laws seems to have a connection with the fact that logarithmic scales are ubiquitous in classifying various natural and social phenomena. For instance

  • stellar magnitudes are logarithmic in stellar luminosity,
  • earthquake magnitudes (modified Richter scale) are logarithmic in energy release,
  • sound intensity decibels are logarithmic in pressure,
  • the acidity/alkalinity pH scale is logarithmic in hydrogen ion concentration,
  • musical octaves are logarithmic in frequency,
  • war severity can be characterized as being logarithmic in casualty count [7].

In many cases the utility of a logarithmic scale can be traced back to an approximate logarithmic sensitivity in human perceptual systems, but it is very easy to confound cause and effect. After all, in the presence of power-law distributed external stimuli, there is a significant disadvantage in having the human perceptual system overwhelmed by large numbers of low-impact events, suggesting an evolutionary pressure towards suppressing sensitivity to low-impact events. This suggests that logarithmic sensitivity in human (and animal) perceptual systems is evolutionarily preferred for those senses that are subject to an external bath of power-law distributed stimuli.

Fortunately, for the purposes of applying maximum entropy ideas one does not need to know which is the cause and which is the effect—one is 'merely' using Bayesian principles to estimate underlying probabilities in the presence of limited knowledge; for current purposes this is most typically the single piece of information that 〈ln n〉 = χ.

Acknowledgments

MV acknowledges support via the Marsden Fund, and via a James Cook Fellowship, both administered by the Royal Society of New Zealand.
