WDIBS: Wasserstein deterministic information bottleneck for state abstraction to balance state-compression and performance

Xianchao Zhu¹,
Tianyi Huang¹,
Ruiyuan Zhang¹ &
…
William Zhu ORCID: orcid.org/0000-0001-8898-9244¹

515 Accesses
Explore all metrics

Abstract

As an important branch of reinforcement learning, Apprenticeship learning studies how an agent learns good behavioral decisions by observing an expert policy from the environment. It has made many encouraging breakthroughs in real-world applications. State abstraction is typically used to compress the state space of the environment to eliminate redundant information, thereby improving learning efficiency. However, excessive compression results in poor decision performance. Therefore, it is important to balance the compression degree and decision performance. Deterministic Information Bottleneck for State abstraction (DIBS) attempts to solve this problem. Specifically, DIBS uses the information rate to represent the compression degree at first. Then, decision performance after compression is measured using the Kullback-Leibler (KL) divergence of distributions between the policy after state compression and the expert policy. However, if the two distributions do not have exactly overlapping support sets, then the KL divergence is usually infinity, which leads to poor decision performance under the low information rate. In this paper, we propose the Wasserstein DIBS (WDIBS) algorithm to optimize the trade-off between the compression degree and decision performance. Specifically, we use the Wasserstein distance to calculate the difference of the distributions between the policy after state compression and the expert policy. Even if the two distributions do not have precisely overlapping support sets, the Wasserstein distance can still reflect their actual difference, thereby ensuring that WDIBS has good decision performance under the low information rate. Theoretical analyses and experiments demonstrate that our method provides a better trade-off between the compression degree and decision performance than DIBS.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Hierarchical Representation Policy Iteration Algorithm for Reinforcement Learning

Model Compression for Deep Reinforcement Learning Through Mutual Information

Generalized exploration in policy search

Article 13 July 2017

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT press, Cambridge
MATH Google Scholar
Zhou W, Li W Safety-aware apprenticeship learning, Springer, Cham
Lindenstrauss E, Tsukamoto M (2018) From rate distortion theory to metric mean dimension: variational principle. IEEE Trans Inf Theory 64(5):3590–3609
Article MathSciNet Google Scholar
Abe D (2020) A theory of abstraction in reinforcement learning. Brown University
Lecarpentier E, Abel D, Asadi K, Jinnai Y, Rachelson E, Littman ML (2021) Lipschitz Lifelong Reinforcement Learning. In: Thirty-fifth AAAI conference on artificial intelligence, AAAI 2021, Virtual Event, February 2-9. AAAI Press, pp 8270–8278
Jonsson A, Gómez V (2016) Hierarchical linearly-solvable Markov decision problems. In: Proceedings of the twenty-sixth international conference on automated planning and scheduling, ICAPS 2016, London, UK, June 12-17. AAAI Press, pp 193–201
Menashe J, Stone P (2018) State abstraction synthesis for discrete models of continuous domains. In: AAAI spring symposia, Stanford University, Palo Alto, California, USA, March 26-28, 2018. AAAI Press
Vezhnevets AS, Osindero S, Schaul T, Heess N, Jaderberg M, Silver D, Kavukcuoglu K (2017) Feudal networks for hierarchical reinforcement learning. In: Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6-11 August. PMLR, pp 3540–3549
Cai Z, Yang X, Huang T, Zhu W (2020) A new similarity combining reconstruction coefficient with pairwise distance for agglomerative clustering. Inform Sci 508:173–182
Article MathSciNet Google Scholar
Cai Z, Zhu W (2018) Multi-label feature selection via feature manifold learning and sparsity regularization. Int J Mach Learn Cybern 9(8):1321–1334
Article Google Scholar
Huang T, Wang S, Zhu W (2020) An adaptive kernelized rank-order distance for clustering non-spherical data with high noise. Int J Mach Learn Cybern 11:1735–1747
Article Google Scholar
Yang X, Cai Z, Li R, Zhu W (2020) GDPC: Generalized density peaks clustering algorithm based on order similarity. Int J Mach Learn Cybern
Guo Z, Huang T, Cai Z, Zhu W (2018) A new local density for density peak clustering. In: 22nd Pacific-Asia conference on knowledge discovery and data mining, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018. Springer, pp 426–438
Li R, Yang X, Qin X, Zhu W (2019) Local gap density for clustering high-dimensional data with varying densities. Knowl-Based Syst 184:104905
Article Google Scholar
Bai A, Srivastava S, Russell SJ (2016) Markovian state and action abstractions for MDPs via hierarchical MCTS. In: The 25th Morgan Kaufmann international joint conference on artificial intelligence, IJCAI 2016, New York, NY, USA, July 9-15, 2016, Morgan Kaufmann. pp 3029–3039
Bellemare AG, Dabney W, Dadashi R, Taïga AA, Castro PS, Le Roux N, Schuurmans D, Lattimore T, Lyle C (2019a) A geometric perspective on optimal representations for reinforcement learning. In: Advances in neural information processing systems 32, NeurIPS 2019, December 8-14, Vancouver, BC, Canada, pp 4360–4371
MacKay DJC, Mac Kay DJC (2003) Information theory, inference and learning algorithms. Cambridge university press, Cambridge
Google Scholar
Unal S, Wagner AB (2016) A rate-distortion approach to index coding. IEEE Trans Inf Theory 62(11):6359–6378
Article MathSciNet Google Scholar
Li Q, Chen Y (2020) Rate distortion via deep learning. IEEE Trans Commun 68(1):456–465
Article Google Scholar
Cheraghchi M, Ribeiro JL (2021) An Overview of capacity results for synchronization channels. IEEE Trans Inf Theory 67(6):3207–3232
Article MathSciNet Google Scholar
Goyal A, Islam R, Strouse D, Ahmed Z, Botvinick M, Larochelle H, Levine S, Bengio Y (2019) InfoBot: Transfer and exploration via the information bottleneck. In: 7th international conference on learning representations, New Orleans, USA, May 6-9
Bacon P-L, Harb J, Precup D (2017) The option-critic architecture. In: Proceedings of the thirty-first AAAI conference on artificial intelligence, February 4-9, 2017, San Francisco, California. AAAI Press, pp 1726–1734
Puterman M (2014) Markov decision processes: Discrete stochastic dynamic programming. Wiley, Hoboken
MATH Google Scholar
Sutton RS, Barto A (2018) Reinforcement learning: An introduction. MIT press, Cambridge
MATH Google Scholar
Abel D, Arumugam D, Lehnert L, Littman ML (2018) State abstractions for lifelong reinforcement learning. In: The 35th ACM international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. ACM, pp 10–19
Jothimurugan K, Bastani O, Alur R (2021) Abstract value iteration for hierarchical reinforcement learning. In: The 24th international conference on artificial intelligence and statistics, AISTATS 2021, April 13-15, Virtual Event. PMLR, pp 1162–1170
Sutter D, Sutter T, Esfahani PM, Renner R (2015) Efficient approximation of channel capacities. IEEE Trans Inf Theory 61(4):1649–1666
Article MathSciNet Google Scholar
Strouse D, Schwab DJ (2017) The deterministic information bottleneck. Neural Comput 29 (6):1611–1630
Article MathSciNet Google Scholar
Nikolaidis S, Nath S, Procaccia AD, Srinivasa S (2017) Game-theoretic modeling of human adaptation in human-robot collaboration. In: ACM/IEEE international conference on human-robot interaction, HRI 2017, Vienna, Austria, March 6-9. ACM, pp 323–331
Finn C, Levine S, Abbeel P (2016) Guided cost learning: Deep inverse optimal control via policy optimization. In: Proceedings of the 33nd international conference on machine learning, ICML 2016, New York City, NY, USA, June 19-24. JMLR, pp 49–58
Kretzschmar H, Spies M, Sprunk C, Burgard W (2016) Socially compliant mobile robot navigation via inverse reinforcement learning. Int J Robot Res 57(5):1289–1307
Article Google Scholar
Abel D, Arumugam D, Lehnert L, Littman ML (2016) Near optimal behavior via approximate state abstraction. In: The 33th ACM international conference on machine learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. ACM, pp 10–19
Abel D, Arumugam D, Lehnert L, Littman ML (2017) Toward good abstractions for lifelong learning. In: NeurIPS workshop on hierarchical reinforcement learning
Peyré G, Cuturi M (2019) Computational optimal transport. Found Trends Mach Learn 11 (5-6):355–607
Article Google Scholar
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: The 33th ACM international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. ACM, pp 214–223
Lopez R, Regier J, Jordan MI, Yosef N (2018) Information constraints on auto-encoding variational Bayes. In: Advances in neural information processing systems 31, NeurIPS 2018, December 3-8, Montréal, Canada. MIT Press, pp 6117–6128
Kim H, Mnih A (2018) Disentangling by factorising. In: The 35th ACM international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. ACM, pp 2654–2663
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap TP, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In: The 33th ACM International conference on machine learning, ICML 2016, New York City, NY, USA. ACM, pp 1928–1937
Bellemare MG, Naddaf Y, Veness J, Bowling M (2013) The arcade learning environment: An evaluation platform for general agents. J Artif Intell Res 47:253–279
Article Google Scholar
Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: 3rd international conference on learning representations, San Diego, CA, USA, May 7-9
Ozair S, Lynch C, Bengio Y, van den Oord A, Levine S, Sermanet P (2019) Wasserstein dependency measure for representation learning. In: Annual conference on neural information processing systems 32, NeurIPS 2019, Vancouver, BC, Canada, 8-14 December, 2019. MIT Press, pp 15604–15614
Gelada C, Kumar S, Buckman J, Nachum O, Bellemare MG (2019) DeepMDP: Learning continuous latent space models for representation learning. In: The 36th ACM international conference on machine learning, ICML 2019, Long Beach, California, USA. ACM, pp 2170–2179
Abel D, Arumugam D, Asadi K, Jinnai Y, Littman ML, Wong LLS (2019) State abstraction as compression in apprenticeship learning. In: The 33th AAAI conference on artificial intelligence, AAAI 2019, Honolulu, Hawaii, USA. AAAI Press, pp 3134–3142
Dupont E (2018) Learning disentangled joint continuous and discrete representations. In: Annual conference on neural information processing systems 31, NeurIPS 2018, Montréal, Canada, December 3-8, 2018, pages 708–718. MIT Press
Devraj AM, Bušić A, Meyn S (2021) Fundamental design principles for reinforcement learning algorithms. In: Handbook of reinforcement learning and control. Springer, Cham, pp 75–137
Sørensen RA, Nielsen M, Karstoft H (2020) Routing in congested baggage handling systems using deep reinforcement learning. IOS Press, vol 27, pp 139–152
Rajeswaran A, Mordatch I, Kumar V (2020) A game theoretic framework for model based reinforcement learning. In: Proceedings of the 37th international conference on machine learning, ICML 2020, 13-18 July, Virtual Event. PMLR, pp 7953–7963
Abel D, Umbanhowar N, Khetarpal K, Arumugam D, Precup D, Littman ML (2020) Value Preserving State-Action Abstractions. In: The 23rd international conference on artificial intelligence and statistics, AISTATS 2020, 26-28 August, Online [Palermo, Sicily, Italy]. PMLR, pp 1639–1650
Hanten C, Katsuya F (2020) Reinforcement learning with convolutional reservoir computing. Appl Intell 50(8):2400–2410. Springer
Article Google Scholar
Lin E, Chen Q, Qi X (2020) Deep reinforcement learning for imbalanced classification. Appl Intell 50(8):2488–2502. Springer
Article Google Scholar
Pateria S, Subagdja B, Tan A, Quek C (2021) Hierarchical reinforcement learning: a comprehensive survey. ACM Comput Surv (CSUR) 54(5):1–35
Article Google Scholar

Download references

Acknowledgments

This work is supported in part by The National Nature Science Foundation of China under Grant Nos. 61772120. We thank Maxine Garcia, PhD, from Liwen Bianji (Edanz) (www.liwenbianji.cn/) for editing the English text of a draft of this manuscript.

Author information

Authors and Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
Xianchao Zhu, Tianyi Huang, Ruiyuan Zhang & William Zhu

Authors

Xianchao Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Tianyi Huang
View author publications
You can also search for this author in PubMed Google Scholar
Ruiyuan Zhang
View author publications
You can also search for this author in PubMed Google Scholar
William Zhu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to William Zhu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: A

In this section, we prove theorems and lemmas involved in algorithms.

1.1 A.1 Proofs of Theorem 1

Firstly, we provide the proof of Theorem 1. Specifically, to prove Theorem 1, we need two lemmas.

Usually, it is difficult to calculate the size of the abstract state space $\left | {{S}_{\phi }} \right |$ in $\mathcal {J}$. We use the size of the abstract state space $\left | {{S}_{\phi }} \right |_{{{\rho }_{\phi }}\left (s\right )}^{{{\delta }_{{\min \limits } }}}$ in ${{\hat {\mathcal {J}}}_{WDIB}}$ to approximate $\left | {{S}_{\phi }} \right |$. $\left | {{S}_{\phi }} \right |_{{{\rho }_{\phi }}\left (s\right )}^{{{\delta }_{{\min \limits } }}}$ can be calculated by the size of the non-ignorable part of the alphabet under PMF defined by the following formula.

Definition 3

PMF-Used Alphabet Size: The PMF-used alphabet size of $\mathcal {X}$ is the number of the elements whose probability under p(x) is higher than some negligibility threshold ${{\delta }_{{\min \limits } }}\in \left (0,1 \right )$:

$$ \left| \mathcal {X} \right|_{p\left( x \right)}^{{{\delta }_{\min }}}:=min\{\left| \{x\in {\mathcal {X}}: p\left( x \right)>{{\delta }_{\min }}\} \right|,\left| \mathcal {X} \right|\}. $$

(27)

Therefore, we use the first lemma, which associates the entropy of PMF with the maximum size of the alphabet used by that PMF [43]:

Lemma 1

Consider a discrete stochastic variable X, with alphabet $\mathcal {X}$ and some PMF p(x). For a given threshold ${{\delta }_{{\min \limits } }}\in \left (0,1 \right )$, the PMF-used alphabet size of the alphabet is bounded, relative to some PMF p(x):

$$ {\mathcal {X}}_{p\left( x \right)}^{{{\delta }_{\min }}}\le \frac{H\left( X \right)}{{{\delta }_{\min }}\log \left( \frac{1}{{{\delta }_{\min }}} \right)}. $$

(28)

This bound is relatively large: it is not difficult to know that $H\left (X \right )\le {{\log }_{2}}\left | \mathcal {X} \right |$ [3]. Therefore, in the worst case, the boundary can be $\tilde {O}({1}/{{{\delta }_{{\min \limits } }}})$ times larger than the real alphabet. Nevertheless, the result allows us to correlate the entropy of a stochastic variable with the size of the alphabet it uses. In addition, by definition, the entropy $H\left ({{\rho }_{\phi }} \right )$ of the abstract state distribution provides a lower bound for the number of bits to represent the used part of S_ϕ. In this way, the entropy is a measure of the degree of compression, which uses the fact that the entropy of the most probable state can be expressed as 0. Therefore, lower entropy means a higher compression degree.

Proof

The entropy of a random variable X is as follows:

$$ H\left( X \right):=-\sum\limits_{x\in \mathcal {X} }{p\left( x \right){{\log }_{2}}p\left( x \right)}. $$

(29)

Then, for a given maximum entropy $H\left (X \right )\le N$, we seek to understand the largest possible $\left | \mathcal {X} \right |_{p\left (x \right )}^{{{\delta }_{\min \limits }}}$. That is:

$$ \underset{p\left( x \right):H\left( X \right)\le N}{\max } \left| \mathcal {X} \right|_{p\left( x \right)}^{{{\delta }_{\min }}}. $$

(30)

If the alphabet $\mathcal {X}$ takes the maximum size, the variable X obeys a uniform distribution, where the probability of each element is ${{\delta }_{{\min \limits } }}$ such that $H\left (X \right )=N$:

$$ \begin{array}{@{}rcl@{}} H\left( X \right)&=&-\sum\limits_{x\in \mathcal {X} }{p\left( x \right){{\log }_{2}}p\left( x \right)} \\ &=&-\sum\limits_{x\in \mathcal {X} }{{{\delta }_{\min }}{{\log }_{2}}{{\delta }_{\min}}} \\ &=&-\left| \mathcal {X} \right|{{\delta }_{\min }}{{\log }_{2}}{{\delta }_{\min }} \\ &=&\left| \mathcal {X} \right|{{\delta }_{\min }}{{\log }_{2}}\frac{1}{{{\delta }_{\min }}}. \end{array} $$

(31)

Therefore, for a given PMF p(x) with entropy $H\left (X \right )$, and a minimum threshold of probability, the minimum size of the alphabet $\mathcal {X}$ is upper bounded:

$$ \left| \mathcal {X} \right|\le \frac{H\left( X \right)}{{{\delta }_{\min }}{{\log }_{2}}\frac{1}{{{\delta }_{\min }}}}. $$

(32)

□

Next, we introduce a second lemma that relates the expected Wasserstein distance between two policies to the difference in value realized by the policies, in expectation under some state distribution:

Lemma 2

We denote by π the set of 1-Lipschitz-policies, s.t. for two stochastic policies, π₁ and π₂ on state space S, and a fixed probability distribution over S, p(s). If, for $\varepsilon \in {{\mathbb {R}}_{\ge 0}}$:

$$ \left\{ \pi :\pi \in \prod ,\text{ }\underset{a}{\sum}{\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right|} \right.\le \left. \varepsilon \right\}, $$

(33)

then:

$$ \begin{array}{ll} &\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right]\le \frac{\varepsilon \gamma }{1-\gamma }, \\ &\underset{p\left( s \right)}{\mathbb{E}} \left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{\pi }_{2}}}}\left( s \right) \right]\le \frac{\varepsilon \gamma }{1-\gamma }VMAX, \end{array} $$

(34)

where VMAX is an upper bound on the value function.

The above boundary allows us to correlate the distortion metric controlled by WIB with the distortion metric of the CVA objective.

Proof

Recall the Wasserstein distance between policies π₁ and π₂ for a given state s is defined as:

$$ \begin{array}{ll} &\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right] =\underset{s\in p\left( s \right)}{\sum}{\left| {{\rho }_{{{\pi }_{1}},{{s}_{0}}}}\left( s \right)-{{\rho }_{{{\pi }_{2}},{{s}_{0}}}}\left( s \right) \right|}, \end{array} $$

(35)

where ${{\rho }_{{{\pi }},{{s}_{0}}}}$ denotes the stationary distribution over states under policy π, starting in state s₀ ∈ S.

We first bound the difference between the two state distributions after t steps:

$$ \begin{array}{@{}rcl@{}} &&\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right) \right|}\\ &=&\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{\left| p\left( {{s}_{t}}={s}^{\prime}\left| {{s}_{0}},{{\pi }_{1}} \right. \right)-p\left( {{s}_{t}}={s}^{\prime}\left| {{s}_{0}},{{\pi }_{2}} \right. \right) \right|} \\ &\le& \underset{s\in p\left( s \right)}{\sum}{p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{1}} \right. \right)\underset{a}{\sum}{\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right|}}\\ &&\times\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{p\left( {s}^{\prime}\left| s,a \right. \right)} \\ &&+\underset{s\in p\left( s \right)}{\sum}{\left| p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{1}} \right. \right)-p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{2}} \right. \right) \right|} \\ &&\times\underset{a}{\sum}{{{\pi }_{2}}\left( a\left| s \right. \right)\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{p\left( {s}^{\prime}\left| s,a \right. \right)}} \\ &\le& \varepsilon +\underset{s\in p\left( s \right)}{\sum}{\left| p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{1}} \right. \right)-p\left( {{s}_{t-1}}=s\left| {{s}_{0}},{{\pi }_{2}} \right. \right) \right|} \\ &=&\varepsilon +\underset{{s}^{\prime}\in p\left( s \right)}{\sum}{\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t-1}\left( {{s}^{\prime}} \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t-1}\left( {{s}^{\prime}} \right) \right|}. \end{array} $$

(36)

From the above bound, we get the following inequality:

$$ \underset{{s}^{\prime}\in p\left( s \right)}{\sum} {\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( {{s}^{\prime}} \right) \right|}\le t\varepsilon, $$

(37)

then:

$$ \begin{array}{@{}rcl@{}} &&\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right] \\ &=&\underset{s\in p\left( s \right)}{\sum} {\left| {{\rho }_{{{\pi }_{1}},{{s}_{0}}}}\left( s \right)-{{\rho }_{{{\pi }_{2}},{{s}_{0}}}}\left( s \right) \right|} \\ &=&\underset{s\in p\left( s \right)}{\sum}\left| \left( 1-\gamma \right)\underset{t}{\sum}{{\gamma }^{t}}\rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( s \right)-\left( 1-\gamma \right)\right.\\ &&\times\left.\underset{t}{\sum}{{{\gamma }^{t}}\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( s \right)} \right| \\ &\le& \left( 1-\gamma \right)\underset{t}{\sum}{{{\gamma }^{t}}\underset{s}{\sum}{\left| \rho_{{{\pi }_{1}},{{s}_{0}}}^{t}\left( s \right)-\rho_{{{\pi }_{2}},{{s}_{0}}}^{t}\left( s \right) \right|}} \\ &\le& \left( 1-\gamma \right)\underset{t}{\sum}{{{\gamma }^{t}}t\varepsilon =\left( 1-\gamma \right)}\frac{\gamma \varepsilon }{{{\left( 1-\gamma \right)}^{2}}}=\frac{\gamma \varepsilon }{1-\gamma}. \end{array} $$

(38)

The expectation of the value function loss is as follows:

$$ \begin{array}{@{}rcl@{}} &&\underset{p\left( s \right)}{\sum} {\left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{\pi }_{2}}}}\left( s \right) \right]} \\ &\le& \underset{s}{\sum}{p\left( s \right)}\left( \underset{a}{\sum}{\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right|R\left( s,a \right)} \right) \\ &&+\underset{s}{\sum}{p\left( s \right)}\left( \gamma \underset{s^{\prime}}{\sum}{p\left( s,a,{s}^{\prime} \right)\left| {{V}^{{{\pi }_{1}}}}\left( {{s}^{\prime}} \right)-{{V}^{{{\pi }_{2}}}}\left( {{s}^{\prime}} \right) \right|} \right) \\ &=&\underset{s}{\sum}{\underset{a}{\sum}{p\left( s \right)}}\left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right| \\ &&\left( R\left( s,a \right) + \gamma \underset{s^{\prime}}{\sum}{p\left( s,a,{s}^{\prime} \right)\left| {{V}^{{{\pi }_{1}}}}\left( {{s}^{\prime}} \right) - {{V}^{{{\pi }_{2}}}}\left( {{s}^{\prime}} \right) \right|} \right). \end{array} $$

(39)

Then, we apply the upper bound on the possible value $VMAX=\frac {RMAX}{1-\gamma }$ to the above equation:

$$ \underset{p\left( s \right)}{\sum} {\left[ {{V}^{{{\pi }_{1}}}}\left( s \right) - {{V}^{{{\pi }_{2}}}}\left( s \right) \right]} \!\le\! VMAX \underset{p\left( s \right)}{\sum} {\left[ \left| {{\pi }_{1}}\left( a\left| s \right. \right) - {{\pi }_{2}}\left( a\left| s \right. \right) \right| \right]}. $$

(40)

Then, by Pinsker’s inequality, we conclude:

$$ \begin{array}{@{}rcl@{}} && \underset{p\left( s \right)}{\sum} {\left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{\pi }_{2}}}}\left( s \right) \right]}\\ &\le& VMAX \underset{p\left( s \right)}{\sum} {\left[ \left| {{\pi }_{1}}\left( a\left| s \right. \right)-{{\pi }_{2}}\left( a\left| s \right. \right) \right| \right]} \\ &\le& VMAX\underset{p\left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{1}}\left( a\left| s \right. \right),{{\pi }_{2}}\left( a\left| s \right. \right) \right) \right] \\ &\le& \frac{\gamma \varepsilon}{1-\gamma}VMAX. \end{array} $$

(41)

□

Now we use Lemma 1 and Lemma 2 to prove Theorem 1.

Theorem 2

The function f of WDIB objective ${{\hat {\mathcal {J}}}_{WDIB}}$ is an upper bound of the CVA Objective $\mathcal {J}$ that is tighter than ${{\hat {\mathcal {J}}}_{DIB}}$. That is:

$$ {{\forall }_{\phi }}: \mathcal{J}\left[ \phi \right]\le f\left( {{{\hat{\mathcal{J}}}}_{WDIB}}\left[ \phi \right] \right)\le f\left( {{{\hat{\mathcal{J}}}}_{DIB}}\left[ \phi \right] \right). $$

(42)

Proof

Consider the ϕ that minimizes ${{\hat {\mathcal {J}}}_{WDIB}}$, yielding the value of at most N + βε, where:

$$ \begin{array}{ll} &N:=H\left( {{S}_{\phi }} \right) \\ &\frac{\gamma \varepsilon }{1-\gamma }:=\underset{s\sim {{\rho }_{E}}}{\mathbb{E}} \left[ W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right) \right]. \end{array} $$

(43)

Then, by Lemma 1, we know:

$$ \left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}\le \frac{N}{{{\delta }_{\min }}{{\log }_{2}}\left( \frac{1}{{{\delta }_{\min }}} \right)}. $$

(44)

By Lemma 2, we know:

$$ \underset{\rho \left( s \right)}{\mathbb{E}} \left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{{\pi }_{\phi }}}}}\left( s \right) \right]\le \frac{\varepsilon \gamma }{1-\gamma}VMAX. $$

(45)

Therefore, we conclude:

$$ \begin{array}{@{}rcl@{}} \mathcal{J}\left[ \phi \right]&=&\left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}+\beta \underset{\rho \left( s \right)}{\mathbb{E}} \left[ {{V}^{{{\pi }_{1}}}}\left( s \right)-{{V}^{{{{\pi }_{\phi }}}}}\left( s \right) \right] \\ &\le& \left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}+\beta VMAX\underset{\rho \left( s \right)}{\mathbb{E}} \\ &&\times\left[ W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right) \right] \\ &\le& \frac{N}{{{\delta }_{\min }}{{\log }_{2}}\left( \frac{1}{{{\delta }_{\min }}} \right)}+\beta \frac{\varepsilon \gamma }{1-\gamma}VMAX. \end{array} $$

(46)

Thus, we can obtain the upper bound of $\mathcal {J}$, which can be seen as a function of the ${{\hat {\mathcal {J}}}_{WDIB}}$. □

The DIBS algorithm uses the following function as the upper bound of:

$$ \begin{array}{ll} &{{\hat{\mathcal{J}}}_{DIB}} \le \left| {{S}_{\phi }} \right|_{{{\rho }_{\phi }}\left( s \right)}^{{{\delta }_{\min }}}+2\beta VMAX \\ &\qquad \qquad \underset{\rho \left( s \right)}{\mathbb{E}} \left[ \sqrt{\frac{1}{2}{{D}_{KL}}\!\left( {{\pi }_{E}}\left( a\left| s \right. \right)\left\| {{\pi }_{\phi }}\!\left( a\left| \phi \left( s \right) \right. \!\right) \right. \right)} \right]. \end{array} $$

(47)

According to Pinsker’s inequality, the following inequality can be derived:

$$ \begin{array}{ll} &W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right) \!\le\! \sqrt{\frac{1}{2}{{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)\left\| {{\pi }_{\phi }}\left( a\left| \phi \left( s \right) \right. \right) \right. \right)}. \end{array} $$

(48)

Then,

$$ \mathcal{J}\left[ \phi \right]\le f\left( {{{\hat{\mathcal{J}}}}_{WDIB}}\left[ \phi \right] \right)\le f\left( {{{\hat{\mathcal{J}}}}_{DIB}}\left[ \phi \right] \right). $$

(49)

1.2 A.2 Extensions

Lastly, we briefly discuss the situation where the agent is located in high-dimensional observation space. In this case, the state abstractions are usually stochastic. We utilize a known result relating the KL divergence to mutual information by variational autoencoder (VAE) [36]. During the process of training VAE, the KL divergence between ${{q}_{\psi }}\left (z\left | x \right . \right )$ and some prior distribution over latent codes, $p\left (z \right )$, is minimized in expectation over the data distribution, p(x); by the way, ${\mathbb {E}_{p\left (x \right )}}\left [ {{D}_{KL}}\left ({{q}_{\psi }}\left (z\left | x \right . \right )\left \| p\left (z \right ) \right . \right ) \right ]$ is minimized. We treat x as the ground state representation S and z as the abstract state S_ϕ.

Proof

To prove that our objective function ${{{\hat {\mathcal {J}}}}_{WSIB}}$ is a tighter upper bound of ${{{\hat {\mathcal {J}}}}_{SIB}}$ than ${{{\hat {\mathcal {J}}}}_{SIB\_N}}$, we utilize a known result [44]; [37] whose proof we replicate here for completeness with $q\left (z,x \right )=p\left (x \right ){{q}_{\psi }}\left (z\left | x \right . \right )$ and $q\left (z \right )={\mathbb {E}_{p\left (x \right )}}\left [ {{q}_{\psi }}\left (z\left | x \right . \right ) \right ]$:

$$ \begin{array}{@{}rcl@{}} &&{\mathbb{E}_{p\left( x \right)}}\left[ {{D}_{KL}}\left( {{q}_{\psi }}\left( z\left| x \right. \right)||p\left( z \right) \right) \right] \\ &=&{\mathbb{E}_{p\left( x \right)}}\left[ {\mathbb{E}_{{{q}_{\psi }}\left( z\left| x \right. \right)}}\left[ \log \frac{{{q}_{\psi }}\left( z\left| x \right. \right)}{p\left( z \right)} \right] \right]\\ &=&{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \left( \frac{{{q}_{\psi }}\left( z\left| x \right. \right)}{p\left( z \right)}\frac{q\left( z \right)}{q\left( z \right)} \right) \right]\\ &=&{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \frac{{{q}_{\psi }}\left( z\left| x \right. \right)}{q\left( z \right)} \right]+{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \frac{q\left( z \right)}{p\left( z \right)} \right] \\ &=&{\mathbb{E}_{q\left( z,x \right)}}\left[ \log \frac{q\left( z,x \right)}{q\left( z \right)p\left( x \right)} \right]+{\mathbb{E}_{q\left( z \right)}}\left[ \log \frac{q\left( z \right)}{p\left( z \right)} \right] \\ &=&I\left( X;Z \right)+{{D}_{KL}}\left( q\left( z \right)||p\left( z \right) \right) \\ &\ge& I\left( X;Z \right). \end{array} $$

(50)

So the following inequality can be derived:

$$ \begin{array}{@{}rcl@{}} &&\underset{\phi }{\min} \underset{s\sim {{\rho }_{E}}}{\mathbb{E}} {{D}_{KL}}\left( {{q}_{\psi }}\left( {{S}_{\phi }}\left| S \right. \right)||p\left( {{S}_{\phi }} \right) \right)+ \\ &&\beta \underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ {{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)||{{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right) \right] \end{array} $$

(51)

$$ \begin{array}{@{}rcl@{}} &\ge& \underset{\phi }{\min} I\left( S;{{S}_{\phi }} \right)+ \\ &&\beta \underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ {{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)\left\| {{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right. \right) \right]. \end{array} $$

(52)

This formula is equivalent to the following form:

$$ {{{\hat{\mathcal{J}}}}_{SIB\_N}}\ge {{\hat{\mathcal{J}}}_{SIB}}. $$

(53)

It usually supposes that both ${{q}_{\psi }}\left (z\left | x \right . \right )$ and $p\left (z \right )$ obey the Gaussian distribution to use the Gaussian reparameterization technique [36]. Let us suppose that these distributions have dimension k, means μ_i, and covariance matrices $\sum \nolimits _{i}$, for i = 1, 2. The following two formulas can be obtained:

$$ \begin{array}{ll} \begin{aligned} &{{W}}{{\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right)}}=\left\| {{\mu }_{1}}-{{\mu }_{2}} \right\|_{2}^{2}\\ &+tr\left( \sum\nolimits_{1}{+\sum\nolimits_{2}{-2{{\left( \sum\nolimits_{2}^{{1}/{2} }{\sum\nolimits_{1}{\sum\nolimits_{2}^{{1}/{2} }}} \right)}^{{1}/{2} }}}} \right) \end{aligned} \end{array} $$

(54)

and

$$ \begin{array}{@{}rcl@{}} &&{{D}_{KL}}\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right)=\frac{1}{2}tr\left( \sum\nolimits_{1}^{-1}{\sum\nolimits_{2}} \right) \\ &&+\frac{1}{2}\left( {{\left( {{\mu }_{1}} - {{\mu }_{2}} \right)}^{T}}\sum\nolimits_{1}^{-1}{\left( {{\mu }_{1}} - {{\mu }_{2}} \right)}-k + \ln \left( \frac{\det \sum\nolimits_{1}}{\det \sum\nolimits_{2}} \right) \right). \end{array} $$

(55)

For simplicity, we consider $\sum \nolimits _{1}{\text {=}\sum \nolimits _{2}{\text {=}w{{I}_{k}}}}$, where $w \in \left [ 0,1 \right ]$ is the weight and μ₁≠μ₂. When combined with the − k term, the trace term in Wasserstein distance, the trace term in the KL divergence and the log-determinant ratio will be 0. So these two equations become:

$$ {{W}}{{\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right)}}=\left\| {{\mu }_{1}}-{{\mu }_{2}} \right\|_{2}^{2} $$

(56)

and

$$ {{D}_{KL}}\left( {{{q}_{\psi }}\left( z\left| x \right. \right)},{p\left( z \right)} \right) = {{\left( {{\mu }_{1}} - {{\mu }_{2}} \right)}^{T}}\sum\nolimits_{1}^{-1}{\left( {{\mu }_{1}} - {{\mu }_{2}} \right)}. $$

(57)

Note that Wasserstein distance does not change if the variance changes whereas the KL divergence does. Therefore, the following inequality can be derived:

$$ \begin{array}{@{}rcl@{}} &&\underset{\phi }{\min} \underset{s\sim {{\rho }_{E}}}{\mathbb{E}} {{D}_{KL}}\left( {{q}_{\psi }}\left( {{S}_{\phi }}\left| S \right. \right)||p\left( {{S}_{\phi }} \right) \right) \\ &\ge& \underset{\phi }{\min} \underset{s\sim {{\rho }_{E}}}{\mathbb{E}} W\left( {{q}_{\psi }}\left( {{S}_{\phi }}\left| S \right. \right),p\left( {{S}_{\phi }} \right) \right). \end{array} $$

(58)

At same time, according to Equation (46), the following inequality can be derived:

$$ \begin{array}{@{}rcl@{}} &&\underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ {{D}_{KL}}\left( {{\pi }_{E}}\left( a\left| s \right. \right)||{{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right) \right] \\ &\ge& \underset{s\sim {{\rho }_{E}},{{s}_{\phi }}\sim \phi \left( s \right)}{\mathbb{E}} \left[ W\left( {{\pi }_{E}}\left( a\left| s \right. \right),{{\pi }_{\phi }}\left( a\left| {{s}_{\phi }} \right. \right) \right) \right]. \end{array} $$

(59)

So the following inequality can be derived:

$$ {{{\hat{\mathcal{J}}}}_{SIB\_N}}\ge {{{\hat{\mathcal{J}}}}_{WSIB}}. $$

(60)

Therefore, we conclude:

$$ {{{\hat{\mathcal{J}}}}_{SIB\_N}}\ge {{{\hat{\mathcal{J}}}}_{WSIB}}\ge {{\hat{\mathcal{J}}}_{SIB}}. $$

(61)

□

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhu, X., Huang, T., Zhang, R. et al. WDIBS: Wasserstein deterministic information bottleneck for state abstraction to balance state-compression and performance. Appl Intell 52, 6316–6329 (2022). https://doi.org/10.1007/s10489-021-02787-4

Download citation

Accepted: 23 August 2021
Published: 07 September 2021
Issue Date: April 2022
DOI: https://doi.org/10.1007/s10489-021-02787-4

WDIBS: Wasserstein deterministic information bottleneck for state abstraction to balance state-compression and performance

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

A Hierarchical Representation Policy Iteration Algorithm for Reinforcement Learning

Model Compression for Deep Reinforcement Learning Through Mutual Information

Generalized exploration in policy search

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Appendix: A

1.1 A.1 Proofs of Theorem 1

Definition 3

Lemma 1

Proof

Lemma 2

Proof

Theorem 2

Proof

1.2 A.2 Extensions

Proof

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

WDIBS: Wasserstein deterministic information bottleneck for state abstraction to balance state-compression and performance

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

A Hierarchical Representation Policy Iteration Algorithm for Reinforcement Learning

Model Compression for Deep Reinforcement Learning Through Mutual Information

Generalized exploration in policy search

Explore related subjects

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher’s note

Appendix: A

Appendix: A

1.1 A.1 Proofs of Theorem 1

Definition 3

Lemma 1

Proof

Lemma 2

Proof

Theorem 2

Proof

1.2 A.2 Extensions

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation