
Zipf's law, power laws and maximum entropy

Published 16 April 2013. © IOP Publishing and Deutsche Physikalische Gesellschaft.

Citation: Matt Visser 2013 New J. Phys. 15 043021. DOI: 10.1088/1367-2630/15/4/043021

Abstract

Zipf's law, and power laws in general, have attracted and continue to attract considerable attention in a wide variety of disciplines—from astronomy to demographics to software structure to economics to linguistics to zoology, and even warfare. A recent model of random group formation (RGF) attempts a general explanation of such phenomena based on Jaynes' notion of maximum entropy applied to a particular choice of cost function. In the present paper I argue that the specific cost function used in the RGF model is in fact unnecessarily complicated, and that power laws can be obtained in a much simpler way by applying maximum entropy ideas directly to the Shannon entropy subject only to a single constraint: that the average of the logarithm of the observable quantity is specified.


Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Zipf's law [1–3], and power laws in general [4–6], have attracted and continue to attract considerable attention in a wide variety of disciplines—from astronomy to demographics to software structure to economics to zoology, and even to warfare [7]. Typically one is dealing with integer-valued observables (numbers of objects, people, cities, words, animals, corpses), with n ∈ {1,2,3,...}. Sometimes the range of values is allowed to be infinite (at least in principle), sometimes a hard upper bound N is fixed (e.g., total population if one is interested in subdividing a fixed population into sub-classes). Particularly interesting are probability laws of the form

  • Zipf's law: $p_n \propto 1/n$;
  • power laws: $p_n \propto 1/n^z$;
  • hybrid geometric/power laws: $p_n \propto w^n/n^z$.

Specifically, a recent model of random group formation (RGF), see [8], attempts a general explanation of such phenomena based on Jaynes' notion of maximum entropy [9–13] applied to a particular choice of cost function [8]. (For recent related work largely in a demographic context see [14–18]. For related work in a fractal context implemented using an iterative framework see [19].)

In the present paper I shall argue that the specific cost function used in the RGF model is in fact unnecessarily complicated (in fact RGF most typically leads to a hybrid geometric/power law, not a pure power law), and that power laws can be obtained in a much simpler way by applying maximum entropy ideas directly to the Shannon entropy itself [20, 21] subject only to a single constraint: that the average of the logarithm of the observable quantity is specified. Similarly, I would argue that (at least as long as the main issue one is interested in is 'merely' the minimum requirements for obtaining a power law) the appeal to a fractal framework and the iterative model adopted by Pastor-Satorras and Wagensberg [19] is also unnecessarily complicated.

To place this observation in perspective, I will explore several variations on this theme, modifying both the relevant state space and the number of constraints, and will briefly discuss the relevant special functions of mathematical physics that one encounters (zeta functions, harmonic series, poly-logarithms). I shall also discuss an extremely general Gibbs-like model, and the use of non-Shannon entropies (the Rényi [22] and Tsallis [23] entropies and their generalizations). There is a very definite trade-off between simplicity and generality, and I shall very much focus on keeping the discussion as technically simple as possible, and on identifying the simplest model with minimalist assumptions.

2. Power laws in infinite state space

Let us define the set of observable quantities to be positive integers n ∈ {1,2,3,...}, without any a priori upper bound. The maximum entropy approach [9–13] seeks to estimate the probabilities $p_n$ by maximizing the Shannon entropy [20, 21]

$$S = -\sum_{n=1}^{\infty} p_n \ln p_n \qquad (1)$$

subject to a (small) number of constraints/cost functions—representing our limited state of knowledge regarding the underlying process. For example, the RGF model of Baek et al [8] uses one constraint and one (relatively complicated) cost function [8], in addition to the 'trivial' normalization constraint $\sum _n p_n = 1$ . Instead, let us consider the single constraint

$$\langle \ln n \rangle = \sum_{n=1}^{\infty} p_n \ln n = \chi. \qquad (2)$$

Let us now maximize the Shannon entropy subject to this constraint. This is best done by introducing a Lagrange multiplier z corresponding to the constraint 〈ln n〉, plus a second Lagrange multiplier λ corresponding to the 'trivial' normalization constraint, and considering the quantity

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;-\; \lambda\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (3)$$

Of course there is no loss of generality in redefining the Lagrange multiplier λ → Z as follows:

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (4)$$

Varying with respect to the $p_n$ yields the extremality condition

$$-\ln p_n - z \ln n - \ln Z = 0, \qquad (5)$$

with explicit solution

$$p_n = \frac{n^{-z}}{\zeta(z)}. \qquad (6)$$

Here ζ(z) is the Riemann zeta function [24–27], a relatively common and well-known special function, and the condition z > 1 is required to make the sum $\sum_{n=1}^\infty n^{-z} = \zeta(z)$ converge. This is enough to tell you that one will never exactly reproduce Zipf's law (z = 1) in this particular manner, though one can get arbitrarily close. The value of the Lagrange multiplier z (which becomes the exponent in the power law) is determined self-consistently in terms of χ by demanding

$$\chi = \langle \ln n \rangle = \frac{\sum_{n=1}^{\infty} n^{-z}\ln n}{\zeta(z)} = -\frac{\zeta'(z)}{\zeta(z)}. \qquad (7)$$

Then z ∈ (1,∞) while χ ∈ (0,∞). For practical calculations it is often best to view the exponent z as the single free parameter and χ(z) as the derived quantity, but this viewpoint can easily be inverted if desired. Near z = 1 we have the analytic estimate

$$\chi = \frac{1}{z-1} - \gamma + O(z-1), \qquad (8)$$

where γ denotes Euler's constant. At maximum entropy, imposing the extremality condition and summing, we have

$$S = \ln \zeta(z) + z\chi. \qquad (9)$$

Near z = 1 we have the analytic estimate

$$S = \frac{1}{z-1} - \ln(z-1) + (1-\gamma) + O(z-1). \qquad (10)$$

For future reference, it is useful to observe that [27, see p 118]

$$\zeta(z) = \frac{1}{z-1} + \sum_{m=0}^{\infty} \frac{(-1)^m}{m!}\,\gamma_m\,(z-1)^m, \qquad (11)$$

where the Stieltjes constants $\gamma_m$ satisfy

$$\gamma_m = \lim_{N\to\infty}\left\{\sum_{k=1}^{N} \frac{(\ln k)^m}{k} - \frac{(\ln N)^{m+1}}{m+1}\right\}. \qquad (12)$$

A better estimate, using Euler–Maclaurin summation, is

Equation (13)

The quick lesson to take is this: by applying maximum entropy considerations to the single constraint 〈ln n〉 = χ you can get a pure power law with any exponent z > 1. Furthermore, note that the quantity

$$\exp\langle \ln n \rangle = \exp\left(\sum_{n=1}^{\infty} p_n \ln n\right) = \prod_{n=1}^{\infty} n^{\,p_n} \qquad (14)$$

is the geometric mean of the integers {1,2,3,...} with the exponents weighted by the probabilities $p_n$. So one can just as easily obtain the pure power laws considered above by maximizing the entropy subject to the constraint that this geometric mean takes on a specified value.
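For practical calculations the self-consistency relation between z and χ is easy to handle numerically. The following minimal sketch (an illustration added here, not part of the original paper) evaluates χ(z) = −ζ′(z)/ζ(z) and inverts it to recover the exponent from a specified logarithmic average; the function names are arbitrary and the use of the mpmath library is simply one convenient choice.

```python
# Illustrative sketch: chi(z) = <ln n> = -zeta'(z)/zeta(z) for p_n = n^(-z)/zeta(z),
# and its numerical inversion z(chi). Assumes the mpmath library is available.
from mpmath import mp, zeta, diff, findroot

mp.dps = 30  # working precision

def chi_of_z(z):
    """Logarithmic average <ln n> implied by p_n = n^(-z)/zeta(z), valid for z > 1."""
    return -diff(zeta, z) / zeta(z)

def z_of_chi(chi):
    """Numerically invert chi(z) to recover the exponent z from a specified <ln n>."""
    z0 = 1 + 1 / (chi + mp.euler)  # seed from the near-z=1 estimate chi ~ 1/(z-1) - gamma
    return findroot(lambda z: chi_of_z(z) - chi, z0)

if __name__ == "__main__":
    for z in (1.1, 1.5, 2.0, 3.0):
        print(f"z = {z}:  chi = <ln n> = {chi_of_z(z)}")
    print("recovered exponent for chi = 2.5:", z_of_chi(2.5))
```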

3. Power laws in finite state space

If one desires an exact Zipf law (exponent z = 1, so that $p_n \propto 1/n$), then because the harmonic series diverges, $\sum_{n=1}^\infty 1/n = \infty$, it is clear that something in the above formulation needs to change. Perhaps the easiest thing to do is to introduce an explicit maximum value of n, call it N, so that we take the set of observables to be positive integers n ∈ {1,2,3,...,N}. (Physicists would call this an infra-red cutoff, or large-distance cutoff.) The maximum entropy approach now amounts to considering

$$-\sum_{n=1}^{N} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{N} p_n \ln n - \chi\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{N} p_n - 1\right\}. \qquad (15)$$

Varying with respect to the $p_n$ and maximizing again yields the same extremality condition

$$-\ln p_n - z \ln n - \ln Z = 0, \qquad (16)$$

but now implying

$$p_n = \frac{n^{-z}}{H_N(z)}. \qquad (17)$$

Here $H_N(z)$ is the (reasonably well known) generalized harmonic function [27]

$$H_N(z) = \sum_{n=1}^{N} \frac{1}{n^{z}}. \qquad (18)$$

Compared with the previous case the only real difference lies in the normalization function. However, because the sum is now always finite, there is no longer any constraint on the value of the exponent z; in fact we can have z ∈ (−∞,∞). The case z = 1 is Zipf's law, while z = 0 is a uniform distribution, and z < 0 corresponds to an 'inverted hierarchy' where large values are more common than small values. The price paid for this extra flexibility is that the model now has two free parameters, which can be chosen to be z and N. One has the self-consistency constraint

$$\chi = \langle \ln n \rangle = \frac{\sum_{n=1}^{N} n^{-z}\ln n}{H_N(z)} = -\frac{H_N'(z)}{H_N(z)}. \qquad (19)$$

It is easy to check that χ is now bounded by χ∈(0,ln N). At maximum entropy we now have

$$S = \ln H_N(z) + z\chi. \qquad (20)$$

Because this is now a two-parameter model, it will always (naively) be a 'better' fit to observational data than a single-parameter model. Sometimes (for z ⩽ 1) retreating to this two-parameter model is necessary, but for z > 1 the one-parameter model of the previous section (N → ∞) should be preferred.
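A matching numerical sketch for the finite state space (again illustrative, not from the paper) evaluates H_N(z), the induced logarithmic average, and solves the self-consistency condition for z given a target χ ∈ (0, ln N); the bracketing interval and all names are arbitrary choices.

```python
# Illustrative sketch: the finite-state-space power law p_n = n^(-z)/H_N(z) and the
# self-consistency condition <ln n> = chi that fixes z once N is chosen.
import numpy as np
from scipy.optimize import brentq

def H(N, z):
    """Generalized harmonic number H_N(z) = sum_{n=1}^{N} n^(-z)."""
    n = np.arange(1, N + 1, dtype=float)
    return np.sum(n ** (-z))

def chi_of_z(N, z):
    """Logarithmic average <ln n> for p_n = n^(-z)/H_N(z)."""
    n = np.arange(1, N + 1, dtype=float)
    p = n ** (-z) / H(N, z)
    return np.sum(p * np.log(n))

def z_of_chi(N, chi, z_lo=-50.0, z_hi=50.0):
    """Solve <ln n> = chi for the exponent z; chi must lie strictly inside (0, ln N)."""
    return brentq(lambda z: chi_of_z(N, z) - chi, z_lo, z_hi)

if __name__ == "__main__":
    N = 1000
    print("Zipf, z = 1:     <ln n> =", chi_of_z(N, 1.0))   # roughly (ln N)/2
    print("uniform, z = 0:  <ln n> =", chi_of_z(N, 0.0))   # roughly ln N - 1
    print("exponent needed for chi = 2.0:", z_of_chi(N, 2.0))
```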

4. Zipf's law in finite state space

If for observational or theoretical reasons one is certain that z = 1 (Zipf's law), then the model reduces as follows: the state space is n∈{1,2,3,...,N} where N is now the only free parameter. Then explicitly forcing z → 1 one considers

$$-\sum_{n=1}^{N} p_n \ln p_n \;-\; \left\{\sum_{n=1}^{N} p_n \ln n - \chi\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{N} p_n - 1\right\}. \qquad (21)$$

This is completely equivalent to considering

$$-\sum_{n=1}^{N} p_n \ln(n\,p_n) \;+\; \chi \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{N} p_n - 1\right\}, \qquad (22)$$

but writing the quantity to be maximized in this way hides the role of the Shannon entropy. Varying with respect to the pn and maximizing now yields a (very) slightly different extremality condition

$$-\ln p_n - \ln n - \ln Z = 0, \qquad (23)$$

and so

$$p_n = \frac{1}{n\,H_N}. \qquad (24)$$

Here $H_N$ is the (ordinary) harmonic number [27]

$$H_N = H_N(1) = \sum_{n=1}^{N} \frac{1}{n}. \qquad (25)$$

Then

$$\chi = \langle \ln n \rangle = \frac{1}{H_N}\sum_{n=1}^{N} \frac{\ln n}{n}. \qquad (26)$$

Furthermore, at maximum entropy

$$S = \ln H_N + \chi. \qquad (27)$$

Now we have already seen

$$H_N = \ln N + \gamma + O(1/N), \qquad (28)$$

and

$$\sum_{n=1}^{N} \frac{\ln n}{n} = \frac{(\ln N)^2}{2} + \gamma_1 + O\!\left(\frac{\ln N}{N}\right). \qquad (29)$$

Therefore

$$\chi = \langle \ln n \rangle = \frac{(\ln N)^2/2 + \gamma_1 + \cdots}{\ln N + \gamma + \cdots} = \frac{\ln N - \gamma}{2} + O\!\left(\frac{1}{\ln N}\right), \qquad (30)$$

and at maximum entropy

$$S = \ln H_N + \chi \approx \ln\ln N + \frac{\ln N - \gamma}{2}. \qquad (31)$$

Note that you can use this to estimate the size N of the state space one needs to adopt in order to be compatible with the (observed) logarithmic average 〈ln n〉. Indeed

$$\ln N \approx 2\chi + \gamma, \qquad\text{that is}\qquad N \approx e^{\gamma}\left[\exp\langle \ln n\rangle\right]^{2}. \qquad (32)$$

This relates the size of the required state space N to the square of the geometric mean $\exp \langle \ln n \rangle $ .
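A small numerical check (illustrative, not from the paper) of this estimate: compute the exact ⟨ln n⟩ for a Zipf law on {1,...,N} and reconstruct N from it using the leading-order relation ln N ≈ 2χ + γ, i.e. N is set by the square of the geometric mean up to a factor of order one.

```python
# Illustrative sketch: for an exact Zipf law p_n = 1/(n H_N), the observed <ln n>
# fixes the required state-space size via ln N ~ 2*chi + gamma.
import numpy as np

def zipf_chi(N):
    """Exact <ln n> for the Zipf law p_n = 1/(n H_N) on {1, ..., N}."""
    n = np.arange(1, N + 1, dtype=float)
    H_N = np.sum(1.0 / n)
    return np.sum(np.log(n) / n) / H_N

for N in (10**3, 10**5, 10**6):
    chi = zipf_chi(N)
    N_est = np.exp(np.euler_gamma) * np.exp(2 * chi)   # e^gamma * [exp<ln n>]^2
    print(f"true N = {N:>8},  chi = {chi:.4f},  estimated N = {N_est:.3e}")
```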

5. Hybrid geometric/power models in infinite state space

To generate a hybrid geometric/power law model, similar in output to the RGF model [8], but with considerably simpler input assumptions, simply take both the logarithmic average 〈ln n〉 = χ and the arithmetic average 〈n〉 = μ to be specified—and then maximize the Shannon entropy subject to these two constraints (plus the trivial normalization constraint). That is, introduce two Lagrange multipliers z and w, and maximize

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; z\left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;+\; \ln w\left\{\sum_{n=1}^{\infty} n\,p_n - \mu\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (33)$$

Varying with respect to the $p_n$ yields

$$-\ln p_n - z \ln n + n \ln w - \ln Z = 0, \qquad (34)$$

with solution

$$p_n = \frac{w^{n}\, n^{-z}}{\mathrm{Li}_z(w)}. \qquad (35)$$

Here the normalizing constant is the well-known poly-logarithm function [24–27]

$$Z = \mathrm{Li}_z(w) = \sum_{n=1}^{\infty} \frac{w^{n}}{n^{z}}. \qquad (36)$$

Then

$$\chi = \langle \ln n \rangle = -\frac{\partial_z\,\mathrm{Li}_z(w)}{\mathrm{Li}_z(w)}, \qquad (37)$$

while

$$\mu = \langle n \rangle = \frac{w\,\partial_w\,\mathrm{Li}_z(w)}{\mathrm{Li}_z(w)} = \frac{\mathrm{Li}_{z-1}(w)}{\mathrm{Li}_z(w)}. \qquad (38)$$

Furthermore, at maximum entropy

$$S = \ln \mathrm{Li}_z(w) + z\chi - \mu\ln w. \qquad (39)$$

Note that the probability function arising in this model is fully as general as that arising in the RGF model [8], but with what is perhaps a somewhat clearer interpretation.

This is because the RGF model uses what may be viewed as an unnecessarily complicated 'cost function', with an unnecessary degeneracy in the parameters. Indeed, from [8] one sees (in their notation, slightly different from current notation)

Equation (40)

That is

Equation (41)

Equation (42)

Equation (43)

So the RGF cost function [8] is simply a linear combination of the Shannon entropy, the logarithmic mean 〈ln k〉, and a redundant constant offset ln N. (Unfortunately the N of Baek et al [8] is not the same as the N used in this paper: the RGF parameter N corresponds to the number of independent realizations of the underlying statistical process one considers—it is the number of simulations, or the number of universes in the statistical ensemble.) The additional complexity implicit in the RGF model can to some extent be viewed as a side-effect of forcing data into discrete ordinal boxes when one does not necessarily have good physical/mathematical/demographic reasons for knowing which particular boxes are 'best', how the boxes are to be assigned ordinal numbers, and where the box boundaries should most profitably be placed. Apart from the issues raised above, one could in addition explicitly restrict the state space to be finite, adding yet another free parameter (M in the language of Baek et al [8], N in the language of this note), but there is little purpose in doing so—the key insight is this: once the data are assigned to ordinal boxes, hybrid geometric/power laws drop out automatically and straightforwardly by maximizing the Shannon entropy subject to the two very simple constraints 〈ln n〉 = χ and 〈n〉 = μ.
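As a concrete illustration of the hybrid law (added here, not taken from the paper or from the RGF implementation), the sketch below evaluates p_n = w^n n^{−z}/Li_z(w) and the two averages ⟨ln n⟩ and ⟨n⟩ that the Lagrange multiplier pair (z, w) must reproduce; it relies on mpmath's polylog, and all names and parameter values are arbitrary.

```python
# Illustrative sketch: the hybrid geometric/power law p_n = w^n n^(-z) / Li_z(w)
# and its logarithmic and arithmetic averages. Assumes mpmath is available.
from mpmath import mp, polylog, nsum, inf, mpf, log, power

mp.dps = 25

def hybrid_p(n, z, w):
    """Probability p_n = w^n n^(-z) / Li_z(w), for 0 < w < 1."""
    return power(w, n) * power(n, -z) / polylog(z, w)

def averages(z, w):
    """Return (chi, mu) = (<ln n>, <n>) for the hybrid geometric/power law."""
    Z = polylog(z, w)
    chi = nsum(lambda n: power(w, n) * power(n, -z) * log(n), [1, inf]) / Z
    mu = polylog(z - 1, w) / Z      # uses the identity w d/dw Li_z(w) = Li_{z-1}(w)
    return chi, mu

if __name__ == "__main__":
    z, w = mpf("1.5"), mpf("0.9")
    chi, mu = averages(z, w)
    print("chi = <ln n> =", chi, "   mu = <n> =", mu)
    print("normalization check:", nsum(lambda n: hybrid_p(n, z, w), [1, inf]))
```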

6. Zipf's law: geometric version in infinite state space

If for observational or theoretical reasons one is certain that z = 1 (Zipf's law), but for whatever reason feels a finite state space cutoff N is inappropriate, then a geometric version of Zipf's law can be extracted from the hybrid model. Setting z = 1 the model reduces as follows: the state space is now n ∈ {1,2,3,...}, while the quantity to be maximized becomes

$$-\sum_{n=1}^{\infty} p_n \ln p_n \;-\; \left\{\sum_{n=1}^{\infty} p_n \ln n - \chi\right\} \;+\; \ln w\left\{\sum_{n=1}^{\infty} n\,p_n - \mu\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}. \qquad (44)$$

This is completely equivalent to maximizing

$$-\sum_{n=1}^{\infty} p_n \ln(n\,p_n) \;+\; \chi \;+\; \ln w\left\{\sum_{n=1}^{\infty} n\,p_n - \mu\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n=1}^{\infty} p_n - 1\right\}, \qquad (45)$$

but writing the quantity to be maximized in this way hides the role of the Shannon entropy. Varying with respect to the pn yields

$$-\ln p_n - \ln n + n \ln w - \ln Z = 0, \qquad (46)$$

with solution

$$p_n = \frac{w^{n}}{n\,\left[-\ln(1-w)\right]}. \qquad (47)$$

Note the normalizing function is now extremely simple—the natural logarithm. Then

$$\chi = \langle \ln n \rangle = \frac{1}{-\ln(1-w)}\sum_{n=1}^{\infty} \frac{w^{n}\ln n}{n}, \qquad (48)$$

while

$$\mu = \langle n \rangle = \frac{\mathrm{Li}_0(w)}{\mathrm{Li}_1(w)} = \frac{w}{(1-w)\,\left[-\ln(1-w)\right]}. \qquad (49)$$

Furthermore, at maximum entropy

$$S = \ln\left[-\ln(1-w)\right] + \chi - \mu \ln w. \qquad (50)$$

This is a one-parameter model with a geometrical cutoff, which for w ≲ 1 (more precisely, in the limit w → 1) approximates the naive un-normalizable Zipf law with arbitrary accuracy.
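A tiny numerical illustration (not from the paper) makes the w → 1 limit visible: for p_n = w^n/(n[−ln(1−w)]) the products n p_n become essentially n-independent as w → 1, which is precisely the Zipf behaviour.

```python
# Illustrative sketch: the geometric regularization of Zipf's law and its w -> 1 limit.
import numpy as np

def geometric_zipf(N_show, w):
    """First N_show probabilities of p_n = w^n / (n * [-ln(1 - w)])."""
    n = np.arange(1, N_show + 1, dtype=float)
    return w ** n / (n * (-np.log1p(-w)))

for w in (0.9, 0.99, 0.999):
    p = geometric_zipf(5, w)
    # As w -> 1 the products n * p_n become n-independent, i.e. pure Zipf behaviour.
    print(f"w = {w}:  n * p_n for n = 1..5 ->", np.round(p * np.arange(1, 6), 5))
```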

7. Very general Gibbs-like model

Let us now consider an arbitrary number of constraints of the form

$$\langle f_a(n) \rangle = \sum_{n} p_n\, f_a(n) = F_a, \qquad (51)$$

and

$$\langle \ln g_i(n) \rangle = \sum_{n} p_n \ln g_i(n) = \chi_i. \qquad (52)$$

One could always transform a g-type constraint into an f-type constraint or vice versa, but as we shall soon see there are advantages to keeping the logarithm explicit. Applying the maximum entropy principle amounts to considering

$$-\sum_{n} p_n \ln p_n \;-\; \sum_i z_i\left\{\sum_{n} p_n \ln g_i(n) - \chi_i\right\} \;-\; \sum_a \beta_a\left\{\sum_{n} p_n\, f_a(n) - F_a\right\} \;-\; \left\{\ln Z - 1\right\}\left\{\sum_{n} p_n - 1\right\}, \qquad (53)$$

where with malice aforethought we have now relabelled the Lagrange multipliers for the f constraints as follows: ln w → − β. Then maximizing over the $p_n$ we have the extremality condition

$$-\ln p_n - \sum_i z_i \ln g_i(n) - \sum_a \beta_a f_a(n) - \ln Z = 0, \qquad (54)$$

with explicit solution

$$p_n = \frac{1}{Z}\left\{\prod_i g_i(n)^{-z_i}\right\}\exp\!\left(-\sum_a \beta_a f_a(n)\right), \qquad (55)$$

where now the normalizing constant is

$$Z = \sum_{n}\left\{\prod_i g_i(n)^{-z_i}\right\}\exp\!\left(-\sum_a \beta_a f_a(n)\right). \qquad (56)$$

This can be viewed as a generalization/modification of the Gibbs distribution where we have explicitly pulled out some of the constraints (the g-type constraints) to make them look power-law-like, while the remaining constraints (the f-type constraints) are left in Boltzmann-like form. Then

$$\langle \ln g_i(n) \rangle = -\frac{\partial \ln Z}{\partial z_i}, \qquad (57)$$

while

$$\langle f_a(n) \rangle = -\frac{\partial \ln Z}{\partial \beta_a}. \qquad (58)$$

Furthermore, at maximum entropy

$$S = \ln Z + \sum_i z_i \langle \ln g_i(n) \rangle + \sum_a \beta_a \langle f_a(n) \rangle. \qquad (59)$$

It is only once one specifies particular choices for the functions fa and gi that the model becomes concrete, and only at that stage might one need to focus on particular special functions of mathematical physics. The model is extremely general—the drawback is that, because it can fit almost anything, it can 'explain' almost anything, and so can predict almost nothing.
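To make the structure concrete, here is an illustrative sketch (not from the paper) of such a Gibbs-like distribution on a truncated state space; the functions g_i and f_a, the truncation, and all parameter values are hypothetical stand-ins chosen purely for demonstration.

```python
# Illustrative sketch: p_n proportional to prod_i g_i(n)^(-z_i) * exp(-sum_a beta_a f_a(n)).
import numpy as np

def gibbs_like(N, g_funcs, z_vals, f_funcs, beta_vals):
    """Normalized p_n for n = 1..N with power-law-like (g) and Boltzmann-like (f) factors."""
    n = np.arange(1, N + 1, dtype=float)
    log_p = np.zeros_like(n)
    for g, z in zip(g_funcs, z_vals):
        log_p -= z * np.log(g(n))        # power-law-like (g-type) factors
    for f, beta in zip(f_funcs, beta_vals):
        log_p -= beta * f(n)             # Boltzmann-like (f-type) factors
    p = np.exp(log_p - log_p.max())      # subtract the maximum for numerical stability
    return p / p.sum()

# One g-type factor with g(n) = n and one f-type factor with f(n) = n reproduces the
# hybrid geometric/power law of section 5 (here beta corresponds to -ln w).
p = gibbs_like(10_000, [lambda n: n], [1.5], [lambda n: n], [0.01])
n = np.arange(1, 10_001, dtype=float)
print("normalization:", p.sum(), "  <ln n> =", np.sum(p * np.log(n)), "  <n> =", np.sum(p * n))
```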

8. Non-Shannon entropies

Shannon's entropy is by far the best motivated of the entropy functions infesting the literature. Without making any particular commitment to the advisability of doing so, we can certainly ask what happens if we apply maximum entropy ideas to non-Shannon entropies (such as the Rényi [22] or Tsallis [23] entropies and their generalizations). Let us define an entropic zeta function by

$$\zeta_S(s) = \sum_{n} p_n^{\,s}, \qquad (60)$$

which certainly converges for s ⩾ 1 and may converge on a wider region. Then

$$S_{\text{Rényi}} = -\frac{\ln \zeta_S(1+a)}{a}; \qquad S_{\text{Tsallis}} = \frac{1 - \zeta_S(1+a)}{a}, \qquad (61)$$

where in both cases the Shannon entropy is recovered in the limit a → 0. More generally let us consider a generalized entropy of the form

$$S = f\left(\zeta_S(1+a)\right), \qquad (62)$$

for an arbitrary smooth function f(·). Let us further impose a constraint on the bth moment

$$\langle n^{b} \rangle = \sum_{n} p_n\, n^{b}. \qquad (63)$$

With power laws being explicitly built in as input into both the generalized entropy and the constraint, it is perhaps not too surprising that we will manage to get power laws dropping out. One is now interested in maximizing

Equation (64)

Varying the $p_n$ leads to

Equation (65)

with solution

$$p_n = \left[\lambda\left(n^{b} + w\right)\right]^{1/a}. \qquad (66)$$

(An overall factor of (1 + a) f'(ζ_S(1 + a)) simply drops out of the calculation.) If the number of states is finite, then we cannot a priori discard the parameter w, and we have derived a distorted power law. (The probability distribution then interpolates between a pure power law for w = 0 and a uniform distribution for w = ∞.) If, on the other hand, the number of states is infinite, then normalizability enforces w → 0, and λ factors out. We then have a pure power law

$$p_n = \frac{n^{\,b/a}}{Z}. \qquad (67)$$

If the state space is the positive integers n∈{1,2,3,...} then Z → ζ(−b/a), the Riemann zeta function, and the sum converges only for −b/a > 1. So one of the two parameters (a, b) must be negative. In this situation

$$\langle n^{b} \rangle = \frac{\zeta(-b/a - b)}{\zeta(-b/a)}, \qquad (68)$$

now requiring both −b − b/a > 1 and −b/a > 1, while at maximum entropy

$$S = f\!\left(\frac{\zeta(-b/a - b)}{\left[\zeta(-b/a)\right]^{1+a}}\right). \qquad (69)$$

So yes, one can also extract power laws from maximum entropy applied to non-Shannon entropies (in particular, the generalized Rényi–Tsallis entropies), but the process is (at best) rather clumsy, and seems an exercise in overkill. Apart from the whole question of whether or not non-Shannon entropies are particularly interesting, one should ask whether the result is particularly useful. This derivation does not seem to be in any way an improvement over the simpler one based directly on the Shannon entropy, so its utility is dubious.

9. Summary and discussion

The main point of this paper is that power laws (and their variants, including hybrid geometric/power laws) have a very natural and straightforward interpretation in terms of the maximum entropy formalism pioneered by Jaynes. The key to obtaining a pure power law in the simplest possible manner lies in maximizing the Shannon entropy while imposing the simple constraint 〈ln n〉 = χ. Depending on other features of the specific model under consideration, detailed analysis leads to certain of the special functions of mathematics (the Riemann zeta function, generalized harmonic functions, poly-logarithms or even ordinary logarithms), but these are relatively well-known mathematical objects, which are still tolerably simple to deal with.
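As a purely numerical illustration of this central claim (added here, not part of the original paper), one can maximize the Shannon entropy directly, subject only to normalization and ⟨ln n⟩ = χ on a truncated state space, and confirm that the maximizer is (to numerical accuracy) a pure power law; the truncation, the target value of χ, and the choice of scipy's SLSQP optimizer are all arbitrary.

```python
# Illustrative sketch: direct numerical maximization of the Shannon entropy subject to
# normalization and <ln n> = chi, followed by a log-log fit to confirm a power law.
import numpy as np
from scipy.optimize import minimize

N_TRUNC = 200    # truncation of the state space {1, ..., N_TRUNC}, for numerics only
CHI = 0.57       # target logarithmic average <ln n>; roughly corresponds to z ~ 2

n = np.arange(1, N_TRUNC + 1, dtype=float)
log_n = np.log(n)

def neg_entropy(p):
    p = np.clip(p, 1e-300, None)
    return float(np.sum(p * np.log(p)))     # minus the Shannon entropy

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},           # normalization
    {"type": "eq", "fun": lambda p: np.sum(p * log_n) - CHI},   # <ln n> = chi
]

p0 = np.ones(N_TRUNC) / N_TRUNC
res = minimize(neg_entropy, p0, method="SLSQP", constraints=constraints,
               bounds=[(0.0, 1.0)] * N_TRUNC, options={"maxiter": 1000})

# A pure power law is a straight line in log-log coordinates: ln p_n = -z ln n - ln Z.
p = res.x
mask = p > 1e-10
slope, intercept = np.polyfit(log_n[mask], np.log(p[mask]), 1)
print("fitted exponent z ≈", -slope)   # should be close to the z with -zeta'(z)/zeta(z) = chi
```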

Adding additional features (finite size state space, extra constraints) can (and typically will) modify both the functional form and the normalization constants appearing in the probability distribution. As always there is a trade-off between simplicity and flexibility. A more complicated model (with more free parameters) has a more flexible probability distribution, but this comes at a real (if often unacknowledged) cost in terms of internal complexity. A rather general Gibbs-like model is laid out and briefly discussed. We also briefly discuss applying maximum entropy ideas to non-Shannon entropies (such as the Rényi and Tsallis entropies). There is very definitely a trade-off in both elegance and plausibility, and I would argue strongly that the simplest and most elegant model consists of the Shannon entropy, a constraint on 〈ln n〉 = χ, and a trivial normalization constraint on the sum of probabilities.

The fact that the logarithmic average 〈ln n〉 plays such an important role in power laws seems to have a connection with the fact that logarithmic scales are ubiquitous in classifying various natural and social phenomena. For instance

  • stellar magnitudes are logarithmic in stellar luminosity,
  • earthquake magnitudes (modified Richter scale) are logarithmic in energy release,
  • sound intensity decibels are logarithmic in pressure,
  • the acidity/alkalinity pH scale is logarithmic in hydrogen ion concentration,
  • musical octaves are logarithmic in frequency,
  • war severity can be characterized as being logarithmic in casualty count [7].

In many cases the utility of a logarithmic scale can be traced back to an approximate logarithmic sensitivity in human perceptual systems, but it is very easy to confound cause and effect. After all, in the presence of power-law distributed external stimuli, there is a significant disadvantage in having the human perceptual system overwhelmed by large numbers of low-impact events, suggesting an evolutionary pressure towards suppressing sensitivity to low-impact events. This suggests that logarithmic sensitivity in human (and animal) perceptual systems is evolutionarily preferred for those senses that are subject to an external bath of power-law distributed stimuli.

Fortunately, for the purposes of applying maximum entropy ideas one does not need to know which is the cause and which is the effect—one is 'merely' using Bayesian principles to estimate underlying probabilities in the presence of limited knowledge; for current purposes this is most typically the single piece of information that 〈ln n〉 = χ.

Acknowledgments

MV acknowledges support via the Marsden Fund, and via a James Cook Fellowship, both administered by the Royal Society of New Zealand.
