QUANTIZATION OF DISCRETE PROBABILITY DISTRIBUTIONS

Yuriy A. Reznik

arXiv:1008.3597v1 [cs.IT] 21 Aug 2010

Qualcomm Inc.
5775 Morehouse Drive, San Diego, CA 92121
yreznik@ieee.org
ABSTRACT
We study the problem of quantization of discrete probability distributions, arising in universal coding, as well as other applications. We show that in many situations this problem can be reduced to the covering problem for the unit simplex. Such a setting yields a precise asymptotic characterization in the high-rate regime. Our main contribution is a simple and asymptotically optimal algorithm for solving this problem. The performance of this algorithm is studied and compared with several known solutions.

1. INTRODUCTION

The problem of coding of probability distributions has surfaced many times in the history of source coding. The first universal codes, developed in the late 1960s, such as the Lynch-Davisson [21, 9], combinatorial [28], and enumerative [7] codes, used lossless encoding of the frequencies of symbols in an input sequence. The Rice machine [24], developed in the early 1970s, transmitted a quantized estimate of the variance of the source's distribution. Two-step universal codes, developed by J. Rissanen in the 1980s, explicitly estimate, quantize, and transmit the parameters of the distribution as the first step of the encoding process [25, 26]. Vector quantization techniques for two-step universal coding were proposed in [30, 4]. In practice, two-step coding was often implemented by constructing a Huffman tree, then encoding and transmitting this code tree, and then encoding and transmitting the data. Such algorithms became very popular in the 1980s and 1990s, and were used, for example, in the ZIP archiver [17] and the JPEG image compression standard [16].

In recent years, the problem of coding of distributions has attracted a new wave of interest coming from other fields. For example, in computer vision, it is now customary to use histogram-derived descriptions of image features. Examples of such descriptors include SIFT [20], SURF [1], and CHoG [2], differing mainly in the way they quantize histograms. Several other uses of coding of distributions are described in [11].

To the best of our knowledge, most related prior studies were motivated by the optimal design of universal source codes [25, 26, 30, 4, 15]. In this context, quantization of distributions becomes a small sub-problem in a complex rate optimization process, and the final solutions yield very few insights about it.

In this paper, we study quantization of distributions as a stand-alone problem. In Section 2, we introduce notation and formulate the problem. In Section 3, we study achievable performance limits. In Section 4, we propose and study an algorithm for solving this problem. In Section 5, we provide comparisons with other known techniques. Conclusions are drawn in Section 6.

2. DESCRIPTION OF THE PROBLEM

Let A = {r_1, ..., r_m}, m < ∞, denote a discrete set of events, and let Ω_m denote the set of probability distributions over A:

    Ω_m = { [ω_1, ..., ω_m] ∈ R^m | ω_i ≥ 0, Σ_i ω_i = 1 } .   (1)

¹ Presented at the Information Theory and Applications (ITA) Workshop, San Diego, CA, February 15, 2010, and at the Workshop on Information Theoretic Methods in Science and Engineering (WITMSE), Tampere, Finland, August 16-18, 2010.

Let p ∈ Ω_m be an input distribution that we need to encode, and let Q ⊂ Ω_m be a set of distributions that we will be able to reproduce. We will call the elements of Q reconstruction points or centers in Ω_m. We will further assume that Q is finite, |Q| < ∞, and that its elements are enumerated and encoded by using a fixed-rate code. The rate of such a code is R(Q) = log₂ |Q| bits. By d(p, q) we will denote a distance measure between distributions p, q ∈ Ω_m.

In order to complete the traditional setting of the quantization problem for a distribution p ∈ Ω_m, it remains to assume that it is produced by some random process, e.g., a memoryless process with density ρ over Ω_m. Then the problem of quantization can be formulated as the minimization of the average distance to the nearest reconstruction point (cf. [14, Lemma 3.1]):

    d̄(Ω_m, ρ, R) = inf_{Q ⊂ Ω_m, |Q| ≤ 2^R}  E_{p∼ρ} min_{q∈Q} d(p, q) .   (2)

However, we notice that in most applications, the best possible accuracy of the reconstructed distribution is needed instantaneously. For example, in the design of a two-part universal code, a sample-derived distribution is quantized and immediately used for encoding of this sample [26]. Similarly, in computer vision / image recognition applications, histograms of gradients from an image are extracted, quantized, and used right away to find the nearest match for this image.

In all such cases, instead of minimizing the expected distance, it makes more sense to design a quantizer that minimizes the worst-case (maximal) distance to the nearest reconstruction point. In other words, we need to solve the following problem²:

    d(Ω_m, R) = inf_{Q ⊂ Ω_m, |Q| ≤ 2^R}  max_{p∈Ω_m} min_{q∈Q} d(p, q) .   (3)

We next survey some known results about it.


3. ACHIEVABLE PERFORMANCE LIMITS

We note that the problem (3) is purely geometric in nature. It is equivalent to the problem of covering Ω_m with at most 2^R balls of equal radius. Related and immediately applicable results can be found in Graf and Luschgy [14, Chapter 10].

First, observe that Ω_m is a compact set in R^{m-1} (the unit (m-1)-simplex), and that its volume in R^{m-1} can be computed as follows [29]:

    λ_{m-1}(Ω_m) = [ (a^k / k!) √((k+1)/2^k) ]_{k = m-1, a = √2} = √m / (m-1)! .   (4)

Here and below we assume that m ≥ 3.
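As a quick numeric sanity check of (4), the following Python snippet (ours, not part of the paper) substitutes k = m-1 and a = √2 into Sommerville's regular-simplex volume formula and compares the result with √m/(m-1)!:

    from math import sqrt, factorial

    # check (4): (a^k / k!) * sqrt((k+1)/2^k) at k = m-1, a = sqrt(2)
    # should equal sqrt(m)/(m-1)!
    for m in range(3, 9):
        k, a = m - 1, sqrt(2.0)
        lhs = (a ** k / factorial(k)) * sqrt((k + 1) / 2 ** k)
        rhs = sqrt(m) / factorial(m - 1)
        assert abs(lhs - rhs) <= 1e-12 * rhs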

Figure 1. Examples of type lattices and their Voronoi cells in 3 dimensions (m = 3; n = 1, 2, 3).

Next, we bring the result for the asymptotic covering radius [14, Theorem 10.7]:

    lim_{R→∞} 2^{R/(m-1)} d(Ω_m, R) = C_{m-1} · λ_{m-1}(Ω_m)^{1/(m-1)} ,   (5)

where C_{m-1} > 0 is a constant known as the covering coefficient for the unit cube:

    C_{m-1} = inf_{R>0} 2^{R/(m-1)} d([0,1]^{m-1}, R) .   (6)

The exact value of C_{m-1} depends on the distance measure d(p, q). For example, for the L_∞ norm

    d_∞(p, q) = ‖p − q‖_∞ = max_i |p_i − q_i| ,

it is known that C_{m-1,∞} = 1/2. Hereafter, when we work with specific L_r-norms:

    d_r(p, q) = ‖p − q‖_r = ( Σ_i |p_i − q_i|^r )^{1/r} ,   (7)

we will attach subscripts r to the covering radius d(·) and other expressions to indicate the type of norm being used.

By putting all these facts together, we obtain:

Proposition 1. With R → ∞:

    d_∞(Ω_m, R) ≃ (1/2) · ( √m / (m-1)! )^{1/(m-1)} · 2^{-R/(m-1)} ,   (8)

and more generally (for other L_r-norms, r ≥ 1):

    d_r(Ω_m, R) ≃ C_{m-1,r} · ( √m / (m-1)! )^{1/(m-1)} · 2^{-R/(m-1)} ,   (9)

where C_{m-1,r} are some constants.

We further note that with large m the leading term in (9) turns into

    ( √m / (m-1)! )^{1/(m-1)} = e/m + O(1/m²) ,   (10)

which is a decaying function of m. This highlights an interesting property and distinction of the problem of quantization of m-ary distributions, compared to quantization of the unit (m-1)-dimensional cube.

Our next task is to design a practical algorithm for solving this problem.

² The dual problem

    R(ε) = inf { log₂ |Q| : Q ⊂ Ω_m, max_{p∈Ω_m} min_{q∈Q} d(p, q) ≤ ε }

may also be posed. The resulting quantity R(ε) can be understood as Kolmogorov's ε-entropy for the metric space (Ω_m, d) [18].

4. PRACTICAL ALGORITHM FOR CODING OF DISTRIBUTIONS

4.1. Algorithm design

Our algorithm can be viewed as a custom-designed lattice quantizer. It is interesting in the sense that its lattice coincides with the concept of types in universal coding [8].

4.1.1. Type Lattice

Given some integer n ≥ 1, define a lattice:

    Q_n = { [q_1, ..., q_m] ∈ ℚ^m | q_i = k_i / n, Σ_i k_i = n } ,   (11)

where n, k_1, ..., k_m ∈ Z_+. The parameter n serves as a common denominator to all fractions, and can be used to control the density and number of points in Q_n. By analogy with the concept of types in universal coding [8], we will refer to distributions q ∈ Q_n as types. For the same reason we will call Q_n a type lattice. Several examples of type lattices are shown in Figure 1.
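To make the definition (11) concrete, here is a small Python sketch (ours; the paper itself contains no code) that lists all types in Q_n for m = 3, n = 2, cf. Figure 1:

    from itertools import product

    m, n = 3, 2
    Qn = [(k1 / n, k2 / n, (n - k1 - k2) / n)
          for k1, k2 in product(range(n + 1), repeat=2)
          if k1 + k2 <= n]
    print(len(Qn), Qn)   # 6 types: (0,0,1), (0,.5,.5), (0,1,0), (.5,0,.5), ...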

4.1.2. Quantization

The task of finding the nearest type in Q_n can be solved by using the following simple algorithm³.

³ This algorithm is similar in concept to Conway and Sloane's quantizer for the lattice A_n [5, Chapter 20]. It works within the (m-1)-simplex.

Algorithm 1. Given p and n, find the nearest type q = [k_1/n, ..., k_m/n]:

1. Compute the values (i = 1, ..., m):

       k'_i = ⌊ n p_i + 1/2 ⌋ ,    n' = Σ_i k'_i .

2. If n' = n, the nearest type is given by k_i = k'_i. Otherwise, compute the errors

       δ_i = k'_i − n p_i ,

   and sort them such that

       −1/2 ≤ δ_{j_1} ≤ δ_{j_2} ≤ ... ≤ δ_{j_m} ≤ 1/2 .

3. Let Δ = n' − n. If Δ > 0, then decrement the Δ values k'_i with the largest errors:

       k_{j_i} = k'_{j_i} ,        i = 1, ..., m − Δ ,
       k_{j_i} = k'_{j_i} − 1 ,    i = m − Δ + 1, ..., m ;

   otherwise, if Δ < 0, increment the |Δ| values k'_i with the smallest errors:

       k_{j_i} = k'_{j_i} + 1 ,    i = 1, ..., |Δ| ,
       k_{j_i} = k'_{j_i} ,        i = |Δ| + 1, ..., m .

The logic of this algorithm is obvious: it finds points that are nearest in terms of L_r-norms. By using quick-select instead of full sorting in step 2, its run time can be reduced to O(m).
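For concreteness, a minimal NumPy rendering of Algorithm 1 follows (a sketch; the language, the function name nearest_type, and the use of np.argsort in place of the sorting of step 2 are our choices):

    import numpy as np

    def nearest_type(p, n):
        # step 1: k'_i = floor(n*p_i + 1/2), n' = sum_i k'_i
        p = np.asarray(p, dtype=float)
        k = np.floor(n * p + 0.5).astype(int)
        delta = int(k.sum()) - n                # Delta = n' - n
        if delta != 0:
            # step 2: errors delta_i = k'_i - n*p_i, in ascending order
            order = np.argsort(k - n * p)
            if delta > 0:
                k[order[-delta:]] -= 1          # step 3: fix the largest errors
            else:
                k[order[:-delta]] += 1          # step 3: fix the smallest errors
        return k                                # the nearest type is q = k/n

    print(nearest_type([0.26, 0.61, 0.13], 4))  # [1 2 1], i.e. q = (0.25, 0.5, 0.25)

A quick-select routine would replace np.argsort in an O(m) implementation.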
4.1.3. Enumeration and Encoding

As mentioned earlier, the number of types in the lattice Q_n depends on the parameter n. It is essentially the number of partitions of n into m terms k_1 + ... + k_m = n:

    |Q_n| = binom(n+m−1, m−1) .   (12)

In order to encode a type with parameters k_1, ..., k_m, we need to obtain its unique index ξ(k_1, ..., k_m). We suggest to compute it as follows:

    ξ(k_1, ..., k_m) = Σ_{j=1}^{m−2} Σ_{i=0}^{k_j − 1} binom( n − i − Σ_{l=1}^{j−1} k_l + m − j − 1 , m − j − 1 ) + k_{m−1} .   (13)

This formula follows by induction (starting with m = 2, 3, etc.), and it implements a lexicographic enumeration of types. For example:

    ξ(0, 0, ..., 0, n) = 0 ,
    ξ(0, 0, ..., 1, n−1) = 1 ,
    ...
    ξ(n, 0, ..., 0, 0) = binom(n+m−1, m−1) − 1 .

Similar combinatorial enumeration techniques were discussed in [28, 7, 27]. With a precomputed array of binomial coefficients, the computation of an index by using this formula requires only O(n) additions⁴.

Once the index is computed, it is transmitted by using its direct binary representation at the rate:

    R(n) = ⌈ log₂ binom(n+m−1, m−1) ⌉ .   (14)

⁴ Considering that log |Q_n| = O(log n), this translates into at most O(n log n) bit-additions.
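The following sketch (ours; math.comb plays the role of the precomputed array of binomial coefficients) implements (13) and verifies, for m = 3 and n = 4, that it enumerates all |Q_n| = 15 types lexicographically:

    from math import comb, log2, ceil
    from itertools import product

    def type_index(k):
        # xi(k_1, ..., k_m) per (13)
        m, n = len(k), sum(k)
        xi, prefix = 0, 0                   # prefix = k_1 + ... + k_{j-1}
        for j in range(1, m - 1):           # j = 1, ..., m-2
            for i in range(k[j - 1]):       # i = 0, ..., k_j - 1
                xi += comb(n - i - prefix + m - j - 1, m - j - 1)
            prefix += k[j - 1]
        return xi + k[m - 2]                # + k_{m-1}

    m, n = 3, 4
    types = [k + (n - sum(k),)              # all (k_1, ..., k_m) with sum n,
             for k in product(range(n + 1), repeat=m - 1)
             if sum(k) <= n]                # generated in lexicographic order
    assert len(types) == comb(n + m - 1, m - 1)                    # (12)
    assert [type_index(t) for t in types] == list(range(len(types)))
    print(ceil(log2(comb(n + m - 1, m - 1))))                      # (14): 4 bits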

4.2. Analysis

The type lattice Q_n is related to the so-called lattice A_n in lattice theory [5, Chapter 4]. It can be understood as a bounded subset of A_n with n = m − 1 dimensions, which is subsequently scaled and placed in the unit simplex.

Using this analogy, we can show that the vertices of Voronoi cells (so-called holes) in the type lattice Q_n are located at:

    q* = q + v_i ,    q ∈ Q_n ,    i = 1, ..., m−1 ,   (15)

where the v_i are so-called glue vectors [5, Chapter 21]:

    v_i = (1/n) [ (m−i)/m, ..., (m−i)/m, −i/m, ..., −i/m ] .   (16)
                  \_____ i times _____/  \__ m−i times __/

We next compute maximum distances (covering radii).

Proposition 2. Let a = ⌊m/2⌋. The following holds:

    max_{p∈Ω_m} min_{q∈Q_n} d_∞(p, q) = (1/n) (1 − 1/m) ,   (17)

    max_{p∈Ω_m} min_{q∈Q_n} d_2(p, q) = (1/n) √( a(m−a)/m ) ,   (18)

    max_{p∈Ω_m} min_{q∈Q_n} d_1(p, q) = (1/n) · 2a(m−a)/m .   (19)
Proof. We use the vectors (16). The largest component values appear when i = 1 or i = m − 1. E.g., for i = 1:

    v_1 = (1/n) [ (m−1)/m, −1/m, ..., −1/m ] .

This produces the L_∞-radius. The largest absolute sum is achieved when all components are approximately the same in magnitude. This happens when i = a:

    v_a = (1/n) [ (m−a)/m, ..., (m−a)/m, −a/m, ..., −a/m ] .
                  \_____ a times _____/  \__ m−a times __/

This produces the L_1-radius. The L_2-radius is produced by the same vector v_a.
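A short numeric check of (17)-(19) against the glue vectors (a Python/NumPy sketch, ours):

    import numpy as np

    def glue(i, m, n):
        # v_i from (16): i entries (m-i)/m, then m-i entries -i/m, scaled by 1/n
        return np.concatenate([np.full(i, (m - i) / m),
                               np.full(m - i, -i / m)]) / n

    m, n = 5, 7
    a = m // 2
    v1, va = glue(1, m, n), glue(a, m, n)
    assert np.isclose(np.abs(v1).max(), (1 - 1 / m) / n)                 # (17)
    assert np.isclose(np.linalg.norm(va), np.sqrt(a * (m - a) / m) / n)  # (18)
    assert np.isclose(np.abs(va).sum(), 2 * a * (m - a) / (m * n))       # (19)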
It remains to evaluate the distance/rate characteristics of the type-lattice quantizer:

    d_r[Q_n](Ω_m, R) = min_{n : |Q_n| ≤ 2^R}  max_{p∈Ω_m} min_{q∈Q_n} d_r(p, q) .

We report the following.


Theorem 1. Let a = ⌊m/2⌋. Then, with R → ∞:

    d_∞[Q_n](Ω_m, R) ≃ 2^{-R/(m-1)} · (1 − 1/m) / ((m−1)!)^{1/(m−1)} ,   (20)

    d_2[Q_n](Ω_m, R) ≃ 2^{-R/(m-1)} · √( a(m−a)/m ) / ((m−1)!)^{1/(m−1)} ,   (21)

    d_1[Q_n](Ω_m, R) ≃ 2^{-R/(m-1)} · ( 2a(m−a)/m ) / ((m−1)!)^{1/(m−1)} .   (22)

Proof. We first obtain an asymptotic (with n → ∞) expansion for the rate of our code (14):

    R = (m−1) log₂ n − log₂ (m−1)! + O(1/n) .

This implies that

    n ≈ 2^{R/(m−1)} · ((m−1)!)^{1/(m−1)} .

The statements of the theorem are obtained by combining this relation with the expressions (17)-(19).
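Combining (14) with (17)-(19) gives a direct way to evaluate this tradeoff numerically: for a rate budget R, take the largest n with |Q_n| ≤ 2^R and read off the radii. A sketch (ours; the function name radii_at_rate is hypothetical):

    from math import comb, sqrt

    def radii_at_rate(m, R):
        n = 1
        while comb(n + m, m - 1) <= 2 ** R:   # |Q_{n+1}| = binom(n+m, m-1)
            n += 1                            # largest n with |Q_n| <= 2^R
        a = m // 2
        return ((1 - 1 / m) / n,              # (17)
                sqrt(a * (m - a) / m) / n,    # (18)
                2 * a * (m - a) / (m * n))    # (19)

    print(radii_at_rate(5, 16))   # (d_inf, d_2, d_1) for m = 5, R = 16 bits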
4.2.1. Optimality

We now compare the result of Theorem 1 with the theoretical asymptotic estimates (8), (9) for the covering radius of Ω_m. As evident, the maximum distance in our scheme decays with the rate R as

    d_∞[Q_n](Ω_m, R) ∼ 2^{-R/(m-1)} ,

which matches the decay rate of the theoretical estimates. The only difference is in a constant factor. For example, under the L_∞ norm, such a factor in expression (8) is

    (1/2) · (√m)^{1/(m−1)} = 1/2 + O( log m / m ) .

Our algorithm, on the other hand, uses a factor

    1/2 ≤ 1 − 1/m < 1 ,

which starts with 1/2 when m = 2. This suggests that even in terms of the leading constant our algorithm is close to the optimal.
4.2.2. Performance in terms of KL-distance

All previous results are obtained using L_r-norms. Such distance measures are common in computer vision applications [20, 22, 1]. In source coding, of main interest is the Kullback-Leibler (KL) distance:

    d_KL(p, q) = D(p‖q) = Σ_i p_i log₂ ( p_i / q_i ) .   (23)

It is not a true distance, so the exact analysis is complicated. Yet, by using Pinsker's inequality [23]

    d_KL(p, q) ≥ (1 / (2 ln 2)) · d_1(p, q)² ,   (24)

we can at least show that for deep holes

    d_KL(q*, q) ≥ (1 / (2 ln 2)) · ( (1/n) · 2a(m−a)/m )² .

By translating n to bitrate, we obtain

    d_KL(q*, q) ≳ (1 / (2 ln 2)) · ( 2^{-R/(m-1)} · (2a(m−a)/m) / ((m−1)!)^{1/(m−1)} )² .   (25)

More precise bounds can be obtained by using the inequalities described in [10].
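A quick numeric illustration of this bound (ours, in Python/NumPy): for m = 4, n = 8, the KL-distance from the deep hole q* = q + v_a of the uniform type q indeed exceeds the Pinsker-based estimate:

    import numpy as np

    m, n, a = 4, 8, 2                          # a = floor(m/2)
    q = np.full(m, 1.0 / m)                    # the uniform type (2,2,2,2)/8
    va = np.concatenate([np.full(a, (m - a) / m),
                         np.full(m - a, -a / m)]) / n
    qs = q + va                                # deep hole q* = q + v_a
    kl = float(np.sum(qs * np.log2(qs / q)))   # d_KL(q*, q), cf. (23)
    bound = (2 * a * (m - a) / (m * n)) ** 2 / (2 * np.log(2))
    assert kl >= bound                         # 0.0456... >= 0.0451...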

4.3. Additional improvements

4.3.1. Introducing bias

As easily observed, the type lattice Q_n places reconstruction points with k_i = 0 precisely on the edges of the probability simplex Ω_m. This is not the best placement from the quantization standpoint, particularly when n is small. This placement can be improved by using biased types:

    q_i = (k_i + β) / (n + mβ) ,    i = 1, ..., m ,

where β > 0 is a constant that defines a shift towards the middle of the simplex. In traditional source coding applications, it is customary to use β = 1/2 [19]. In our case, setting β = 1/m appears to work best for L_r-norms, as it introduces the same distance to the edges of the simplex as the covering radius of the lattice.

Algorithm 1 can be easily adjusted to find the nearest points in such a modified lattice.
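The paper does not spell this adjustment out; one natural way to do it (our sketch, with hypothetical names, ignoring the rare clamping needed at k_i = 0) is to quantize in "count space": since q_i = (k_i + β)/(n + mβ), the real-valued targets t_i = p_i (n + mβ) − β satisfy Σ_i t_i = n, and the rounding and sum-correction steps of Algorithm 1 then apply to them unchanged:

    import numpy as np

    def nearest_biased_type(p, n, beta):
        p = np.asarray(p, dtype=float)
        m = len(p)
        t = p * (n + m * beta) - beta      # targets with sum(t) = n
        k = np.floor(t + 0.5).astype(int)
        delta = int(k.sum()) - n
        if delta != 0:
            order = np.argsort(k - t)      # same correction as Algorithm 1
            if delta > 0:
                k[order[-delta:]] -= 1
            else:
                k[order[:-delta]] += 1
        return (k + beta) / (n + m * beta)

    print(nearest_biased_type([0.7, 0.2, 0.1], n=4, beta=1/3))  # beta = 1/m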
4.3.2. Using the dual type lattice Q_n*

Another idea for improving the performance of our quantization algorithm is to define and use a dual type lattice Q_n*. Such a lattice consists of all points:

    q* = q + v_i ,    q ∈ Q_n ,    q* ∈ Ω_m ,    i = 0, ..., m−1 ,

where the v_i are the glue vectors (16).

The main advantage of using the dual lattice would be a thinner covering at high dimensions (cf. [5, Chapter 2]). But even at small dimensions, it may sometimes be useful. An example of this for m = 3 is shown in Figure 2.
5. COMPARISON WITH OTHER TECHNIQUES

Given a probability distribution p ∈ Ω_m, one way of compressing it that is popular in practice is to first design a prefix code (for example, a Huffman code) for this distribution, and then encode the binary tree of such a code. Below we summarize some known results about the performance of such schemes.
Figure 2. Two 10-point lattices: Q_3 (left) and Q_2* (right). The second has much smaller cells.

5.1. Performance of tree-based quantizers

By denoting by ℓ_1, ..., ℓ_m the lengths of prefix codes, recalling that they satisfy the Kraft inequality [6], and noting that the quantities 2^{-ℓ_i} can be used to map lengths back to probabilities, we arrive at the following set:

    Q_tree = { [q_1, ..., q_m] ∈ ℚ^m | q_i = 2^{-ℓ_i}, Σ_i 2^{-ℓ_i} ≤ 1 } .
Figure 3. Maximal L_1 distance vs. rate characteristics d_1[H](R), d_1[GM](R), d_1[Q_n](R) achievable by Huffman-, Gilbert-Moore-, and type-based quantization schemes.

There are several specific algorithms that one can employ for the construction of codes, producing different subsets of Q_tree. Below we only consider the use of the classic Huffman and Gilbert-Moore [13] codes. Some additional tree-based quantization schemes can be found in [11].
Proposition 3. There exists a set Q_GM ⊂ Q_tree, such that

    d_KL[Q_GM](R_GM) ≤ 2 ,   (26)

    d_1[Q_GM](R_GM) ≤ 2 √(ln 2) ,   (27)

    d_∞[Q_GM](R_GM) ≤ 1 ,   (28)

where

    R_GM = log₂ |Q_GM| = log₂ C_{m−1} = 2m − (3/2) log₂ m + O(1) ,   (29)

and C_n = (1/(n+1)) binom(2n, n) is the Catalan number.

Proof. We use the Gilbert-Moore code [13]. The upper bound for the KL-distance is well known [13]. The L_1 bound follows by Pinsker's inequality (24). The L_∞ bound is obvious: p_i, q_i ∈ (0, 1). The Gilbert-Moore code uses a fixed assignment (e.g., from left to right) of letters to the codewords. Any binary rooted tree with m leaves can serve as a code. The number of such trees is given by the Catalan number C_{m−1}.
Proposition 4. There exists a set Q_H ⊂ Q_tree, such that

    d_KL[Q_H](R_H) ≤ 1 ,   (30)

    d_1[Q_H](R_H) ≤ √(2 ln 2) ,   (31)

    d_∞[Q_H](R_H) ≤ 1/2 ,   (32)

where

    R_H = log₂ |Q_H| = m log₂ m + O(m) .   (33)

Proof. We use the Huffman code. Its KL-distance bound is well known [6]. The L_1 bound follows by Pinsker's inequality. The L_∞ bound follows from the sibling property of Huffman trees [12]. It remains to estimate the number of Huffman trees T_m with m leaves. Consider a skewed tree, with leaves at depths 1, 2, ..., m−1, m−1. The last two leaves can be labeled by binom(m, 2) combinations of letters, whereas the other leaves by (m−2)! possible combinations. Hence T_m ≥ binom(m, 2) (m−2)! = (1/2) m!. An upper bound is obtained by arbitrarily labeling all binary trees with m leaves: T_m < m! · C_{m−1}, where C_{m−1} is the Catalan number. Combining both, we obtain:

    −(1/ln 2) m < log₂ T_m − m log₂ m < (2 − 1/ln 2) m .
5.2. Comparison

We present a comparison of the maximal L_1 distances achievable by tree-based and type-based quantization schemes in Figure 3. We consider the cases of m = 5 and m = 10 dimensions. It can be observed that the proposed type-based scheme is more efficient and much more versatile, allowing a wide range of possible rate/distance tradeoffs.
6. CONCLUSIONS

The problem of quantization of discrete probability distributions is studied. It is shown that, in many cases, this problem can be reduced to the covering radius problem for the unit simplex. A precise characterization of this problem in the high-rate regime is reported. A simple algorithm for solving this problem is also presented, analyzed, and compared to other known solutions.
7. ACKNOWLEDGMENTS

The author would like to thank Prof. K. Zeger (UCSD) for reviewing and providing very useful comments on an initial draft of this paper. The author also wishes to thank Prof. B. Girod and his students V. Chandrasekhar, G. Takacs, D. Chen, and S. Tsai (Stanford University), and Dr. R. Grzeszczuk (Nokia Research Center, Palo Alto) for an introduction to the field of computer vision, and for the collaboration that prompted the study of this quantization problem [3].
8. REFERENCES

[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346-359, 2008.

[2] V. Chandrasekhar, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, "CHoG: Compressed histogram of gradients - A low bit-rate feature descriptor," in Proc. Computer Vision and Pattern Recognition (CVPR'09), 2009, pp. 2504-2511.

[3] V. Chandrasekhar, Y. Reznik, G. Takacs, D. Chen, S. Tsai, R. Grzeszczuk, and B. Girod, "Quantization Schemes for Low Bitrate Compressed Histogram of Gradient Descriptors," in Proc. IEEE Int. Workshop on Mobile Vision (CVPR-IWMV'10), 2010.

[4] P. A. Chou, M. Effros, and R. M. Gray, "A vector quantization approach to universal noiseless coding and quantization," IEEE Trans. Information Theory, vol. 42, no. 4, pp. 1109-1138, 1996.

[5] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups. New York: Springer-Verlag, 1998.

[6] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York: John Wiley & Sons, 2006.

[7] T. Cover, "Enumerative source coding," IEEE Trans. Inform. Theory, vol. 19, pp. 73-76, Jan. 1973.

[8] I. Csiszar, "The method of types," IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2505-2523, 1998.

[9] L. D. Davisson, "Comments on 'Sequence time coding for data compression'," Proc. IEEE, vol. 54, p. 2010, Dec. 1966.

[10] A. A. Fedotov, P. Harremoës, and F. Topsøe, "Refinements of Pinsker's inequality," IEEE Trans. Inf. Theory, vol. 49, no. 6, pp. 1491-1498, 2003.

[11] T. Gagie, "Compressing Probability Distributions," Information Processing Letters, vol. 97, no. 4, pp. 133-137, 2006.

[12] R. Gallager, "Variations on a theme by Huffman," IEEE Trans. Inform. Theory, vol. 24, no. 6, pp. 668-674, Nov. 1978.

[13] E. N. Gilbert and E. F. Moore, "Variable-Length Binary Encodings," The Bell System Tech. Journal, vol. 38, pp. 933-967, 1959.

[14] S. Graf and H. Luschgy, Foundations of Quantization for Probability Distributions. Berlin: Springer-Verlag, 2000.

[15] T. S. Han and K. Kobayashi, Mathematics of Information and Coding. Boston: American Mathematical Society, 2001.

[16] ITU-T and ISO/IEC JTC1, "Digital Compression and Coding of Continuous-Tone Still Images," ISO/IEC 10918-1 | ITU-T Rec. T.81, Sept. 1992.

[17] P. W. Katz, PKZIP. Commercial compression system, version 1.1, 1990.

[18] A. N. Kolmogorov and V. M. Tikhomirov, "ε-entropy and ε-capacity of sets in metric spaces," Uspekhi Mat. Nauk, vol. 14, no. 2, pp. 3-86, 1959 (in Russian).

[19] R. E. Krichevsky and V. K. Trofimov, "The Performance of Universal Encoding," IEEE Trans. Information Theory, vol. 27, pp. 199-207, 1981.

[20] D. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.

[21] T. J. Lynch, "Sequence time coding for data compression," Proc. IEEE, vol. 54, pp. 1490-1491, 1966.

[22] K. Mikolajczyk and C. Schmid, "Performance Evaluation of Local Descriptors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 10, pp. 1615-1630, 2005.

[23] M. S. Pinsker, Information and Information Stability of Random Variables and Processes, Problemy Peredachi Informacii, vol. 7, AN SSSR, Moscow, 1960 (in Russian).

[24] R. F. Rice and J. R. Plaunt, "Adaptive variable-length coding for efficient compression of spacecraft television data," IEEE Trans. Comm. Tech., vol. 19, pp. 889-897, 1971.

[25] J. Rissanen, "Universal coding, information, prediction and estimation," IEEE Trans. Inform. Theory, vol. 30, pp. 629-636, 1984.

[26] J. Rissanen, "Fisher Information and Stochastic Complexity," IEEE Trans. Inform. Theory, vol. 42, pp. 40-47, 1996.

[27] J. P. M. Schalkwijk, "An algorithm for source coding," IEEE Trans. Inform. Theory, vol. 18, pp. 395-399, May 1972.

[28] Yu. M. Shtarkov and V. F. Babkin, "Combinatorial encoding for discrete stationary sources," in Proc. 2nd Int. Symp. on Information Theory, Akademiai Kiado, 1971, pp. 249-256.

[29] D. M. Y. Sommerville, An Introduction to the Geometry of n Dimensions. New York: Dover, 1958.

[30] K. Zeger, A. Bist, and T. Linder, "Universal Source Coding with Codebook Transmission," IEEE Trans. Communications, vol. 42, no. 2, pp. 336-346, 1994.
