Deep Generative Modeling


Jakub M. Tomczak
Vrije Universiteit Amsterdam
Amsterdam
Noord-Holland, The Netherlands

ISBN 978-3-030-93157-5 ISBN 978-3-030-93158-2 (eBook)


https://doi.org/10.1007/978-3-030-93158-2

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my beloved wife Ewelina,
my parents, and brother.
Foreword

In the last decade, with the advance of deep learning, machine learning has made
enormous progress. It has completely changed entire subfields of AI such as
computer vision, speech recognition, and natural language processing. And more
fields are being disrupted as we speak, including robotics, wireless communication,
and the natural sciences.
Most advances have come from supervised learning, where the input (e.g., an
image) and the target label (e.g., a “cat”) are available for training. Deep neural
networks have become uncannily good at predicting objects in visual scenes and
translating between languages. But obtaining labels to train such models is often
time consuming, expensive, unethical, or simply impossible. That’s why the field
has come to the realization that unsupervised (or self-supervised) methods are key
to making further progress.
This is no different for human learning: when human children grow up, the
amount of information that is consumed to learn about the world is mostly
unlabeled. How often does anyone really tell you what you see or hear in the
world? We must learn the regularities of the world unsupervised, and we do this
by searching for patterns and structure in the data.
And there is lots of structure to be learned! To illustrate this, imagine that we
choose the three color values of each pixel of an image uniformly at random. The result
will be an image that with overwhelmingly large probability will look like gibberish.
The vast majority of image space is filled with images that do not look like anything
we see when we open our eyes. This means that there is a huge amount of structure
that can be discovered, and so there is a lot to learn for children!
Of course, kids do not just stare into the world. Instead, they constantly interact
with it. When children play, they test their hypotheses about the laws of physics,
sociology, and psychology. When predictions are wrong, they are surprised and
presumably update their internal models to make better predictions next time. It is
reasonable to assume that this interactive play of an embodied intelligence is key to
at least arriving at the type of human intelligence we are used to. This type of learning
has clear parallels with reinforcement learning, where machines make plans, say to


play a game of chess, observe if they win or lose, and update their models of the
world and strategies to act in them.
But it’s difficult to make robots move around in the world to test hypotheses and
actively acquire their own annotations. So, the more practical approach to learning
with lots of data is unsupervised learning. This field has gained a huge amount of
attention and has seen stunning progress recently. One only needs to look at the
kind of images of non-existent human faces that we can now generate effortlessly
to experience the uncanny sense of progress the field has made.
Unsupervised learning comes in many flavors. This book is about the kind we
call probabilistic generative modeling. The goal of this subfield is to estimate a
probabilistic model of the input data. Once we have such a model, we can generate
new samples from it (i.e., new images of faces of people that do not exist).
A second goal is to learn abstract representations of the input. This latter field is
called representation learning. The high-level representations self-organize the input
into “disentangled” concepts, which could be the objects we are familiar with, such
as cars and cats, and their relationships.
While disentangling has a clear intuitive meaning, it has proven to be a
rather slippery concept to properly define. In the 1990s, people were thinking of
statistically independent latent variables. The goal of the brain was to transform the
highly dependent pixel representation into a much more efficient and less redundant
representation of independent latent variables, which compresses the input and
makes the brain more energy and information efficient.
Learning and compression are deeply connected concepts. Learning requires
lossy compression of data because we are interested in generalization and not in
storing the data. At the level of datasets, machine learning itself is about transferring
a tiny fraction of the information present in a dataset into the parameters of a model
and forgetting everything else.
Similarly, at the level of a single datapoint, when we process for example an
input image, we are ultimately interested in the abstract high-level concepts present
in that image, such as objects and their relations, and not in detailed, pixel-level
information. With our internal models we can reason about these objects, manipulate
them in our head and imagine possible counterfactual futures for them. Intelligence
is about squeezing out the relevant predictive information from the correlated soup
of pixel-level information that hits our senses and representing that information in a
useful manner that facilitates mental manipulation.
But the objects that we are familiar with in our everyday lives are not really
all that independent. A cat that is chasing a bird is not statistically independent
of it. And so, people also made attempts to define disentangling in terms of
(subspaces of) variables that exhibit certain simple transformation properties when
we transform the input (a.k.a. equivariant representations), or as variables that one
can independently control in order to manipulate the world around us, or as causal
variables that are activating certain independent mechanisms that describe the world,
and so on.
The simplest way to train a model without labels is to learn a probabilistic
generative model (or density) of the input data. There are a number of techniques
in the field of probabilistic generative models that focus directly on maximizing
the log-probability (or a bound on the log probability) of the data under the
generative model. Besides VAEs and GANs, this book explains normalizing flows,
autoregressive models, energy-based models, and the latest cool kid on the block:
deep diffusion models.
One can also learn representations that are good for a broad range of subsequent
prediction tasks without ever training a generative model. The idea is to design
tasks for the representation to solve that do not require one to acquire annotations.
For instance, when considering time varying data, one can simply predict the future,
which is fortunately always there for you. Or one can invent more exotic tasks such
as predicting whether a patch was to the right or to the left of another patch, or whether
a movie is playing forward or backward, or predicting a word in the middle of
a sentence from the words around it. This type of unsupervised learning is often
called self-supervised learning, although I should admit that also this term seems to
be used in different ways by different people.
Many approaches can indeed be understood in this “auxiliary tasks” view
of unsupervised learning, including some probabilistic generative models. For
instance, a variational autoencoder (VAE) can be understood as predicting its own
input back by first pushing the information through an information bottleneck. A
GAN can be understood as predicting whether a presented input is a real image
(datapoint) or a fake (self-generated) one. Noise contrastive estimation can be seen
as predicting in latent space whether the embedding of an input patch was close or
far in space and/or time.
This book discusses the latest advances in deep probabilistic generative models.
And it does so in a very accessible way. What makes this book special is that, like
the child who is building a tower of bricks to understand the laws of physics, the
student who uses this book can learn about deep probabilistic generative models
by playing with code. And it really helps that the author has earned his spurs by
having published extensively in this field. It is a great tool to teach this topic in
the classroom.
What will the future of our field bring? It seems obvious that progress towards
AGI will heavily rely on unsupervised learning. It’s interesting to see that the
scientific community seems to be divided into two camps: the “scaling camp”
believes that we achieve AGI by scaling our current technology to ever larger models
trained with more data and more compute power. Intelligence will automatically
emerge from this scaling. The other camp believes we need new theory and new
ideas to make further progress, such as the manipulation of discrete symbols (a.k.a.
reasoning), causality, and the explicit incorporation of common-sense knowledge.
And then there is of course the increasingly important and urgent discussion
of how humans will interact with these models: can they still understand what is
happening under the hood or should we simply give up on interpretability? How will
our lives change by models that understand us better than we do, and where humans
who follow the recommendations of algorithms are more successful than those who
resist? Or what information can we still trust if deepfakes become so realistic that
we cannot distinguish them anymore from the real thing? Will democracy still be
able to function under this barrage of fake news? One thing is certain: this field is one
of the hottest in town, and this book is an excellent introduction to start engaging
with it. But everyone should be keenly aware that mastering this technology comes
with new responsibilities towards society. Let’s progress the field with caution.

October 30, 2021 Max Welling


Preface

We live in a world where Artificial Intelligence (AI) has become a widely used
term: there are movies about AI, journalists writing about AI, and CEOs talking
about AI. Most importantly, there is AI in our daily lives, turning our phones,
TVs, fridges, and vacuum cleaners into smartphones, smart TVs, smart fridges,
and vacuum robots. We use AI; however, we still do not fully understand what
“AI” is and how to formulate it, even though AI was established as a separate
field in the 1950s. Since then, many researchers have pursued the holy grail of creating
an artificial intelligence system that is capable of mimicking, understanding, and
aiding humans through processing data and knowledge. In many cases, we have
succeeded in outperforming human beings on particular tasks in terms of speed and
accuracy! Current AI methods do not necessarily imitate human processing (neither
biologically nor cognitively) but rather are aimed at making a quick and accurate
decision, like navigating while cleaning a room or enhancing the quality of a displayed
movie. In such tasks, probability theory is key since limited or poor quality of data
or intrinsic behavior of a system forces us to quantify uncertainty. Moreover, deep
learning has become a leading learning paradigm that allows learning hierarchical
data representations. It draws its motivation from biological neural networks;
however, the correspondence between deep learning and biological neurons is rather
far-fetched. Nevertheless, deep learning has brought AI to the next level, achieving
state-of-the-art performance in many decision-making tasks. The next step seems
to be a combination of these two paradigms, probability theory and deep learning,
to obtain powerful AI systems that are able to quantify their uncertainties about
environments they operate in.
What Is This Book About Then? This book tackles the problem of formulating AI
systems by combining probabilistic modeling and deep learning. Moreover, it goes
beyond the typical predictive modeling and brings together supervised learning and
unsupervised learning. The resulting paradigm, called deep generative modeling,
utilizes the generative perspective on perceiving the surrounding world. It assumes
that each phenomenon is driven by an underlying generative process that defines
a joint distribution over random variables and their stochastic interactions, i.e.,


how events occur and in what order. The adjective “deep” comes from the fact
that the distribution is parameterized using deep neural networks. There are two
distinct traits of deep generative modeling. First, the application of deep neural
networks allows rich and flexible parameterization of distributions. Second, the
principled manner of modeling stochastic dependencies using probability theory
ensures rigorous formulation and prevents potential flaws in reasoning. Moreover,
probability theory provides a unified framework where the likelihood function plays
a crucial role in quantifying uncertainty and defining objective functions.
Who Is This Book for Then? The book is designed to appeal to curious students,
engineers, and researchers with a modest mathematical background in under-
graduate calculus, linear algebra, probability theory, and the basics in machine
learning, deep learning, and programming in Python and PyTorch (or other deep
learning libraries). It should appeal to students and researchers from a variety of
backgrounds, including computer science, engineering, data science, physics, and
bioinformatics that wish to get familiar with deep generative modeling. In order
to engage with a reader, the book introduces fundamental concepts with specific
examples and code snippets. The full code accompanying the book is available
online at:
https://github.com/jmtomczak/intro_dgm
The ultimate aim of the book is to outline the most important techniques in deep
generative modeling and, eventually, enable readers to formulate new models and
implement them.
The Structure of the Book The book consists of eight chapters that could be read
separately and in (almost) any order. Chapter 1 introduces the topic and highlights
important classes of deep generative models and general concepts. Chapters 2, 3
and 4 discuss modeling of marginal distributions, while Chaps. 5 and 6 outline the
material on modeling of joint distributions. Chapter 7 presents a class of latent
variable models that are not learned through the likelihood-based objective. The
last chapter, Chap. 8, indicates how deep generative modeling could be used in the
fast-growing field of neural compression. All chapters are accompanied by code
snippets to help understand how the presented methods could be implemented. The
references are generally to indicate the original source of the presented material
and provide further reading. Deep generative modeling is a broad field of study,
and including all fantastic ideas is nearly impossible. Therefore, I would like to
apologize for missing any paper. If anyone feels left out, it was not intentional from
my side.
In the end, I would like to thank my wife, Ewelina, for her help and presence that
gave me the strength to carry on with writing this book. I am also grateful to my
parents for always supporting me, and my brother who spent a lot of time checking
the first version of the book and the code.

Amsterdam, The Netherlands Jakub M. Tomczak


November 1, 2021
Acknowledgments

This book, like many other books, would not have been possible without the con-
tribution and help from many people. During my career, I was extremely privileged
and lucky to work on deep generative modeling with an amazing set of people
whom I would like to thank here (in alphabetical order): Tameem Adel, Rianne
van den Berg, Taco Cohen, Tim Davidson, Nicola De Cao, Luka Falorsi, Eliseo
Ferrante, Patrick Forré, Ioannis Gatopoulos, Efstratios Gavves, Adam Gonczarek,
Amirhossein Habibian, Leonard Hasenclever, Emiel Hoogeboom, Maximilian Ilse,
Thomas Kipf, Anna Kuzina, Christos Louizos, Yura Perugachi-Diaz, Ties van
Rozendaal, Victor Satorras, Jerzy Świątek, Max Welling, Szymon Zaręba, and
Maciej Zięba.
I would like to thank other colleagues with whom I worked on AI and had plenty
of fascinating discussions (in alphabetical order): Davide Abati, Ilze Auzina, Babak
Ehteshami Bejnordi, Erik Bekkers, Tijmen Blankevoort, Matteo De Carlo, Fuda van
Diggelen, A.E. Eiben, Ali El Hassouni, Arkadiusz Gertych, Russ Greiner, Mark
Hoogendoorn, Emile van Krieken, Gongjin Lan, Falko Lavitt, Romain Lepert, Jie
Luo, ChangYong Oh, Siamak Ravanbakhsh, Diederik Roijers, David W. Romero,
Annette ten Teije, Auke Wiggers, and Alessandro Zonta.
I am especially thankful to my brother, Kasper, who patiently read all sections,
and ran and checked every single line of code in this book. You can’t even imagine
my gratitude for that!
I would like to thank my wife, Ewelina, for supporting me all the time and giving
me the strength to finish this book. Without her help and understanding, it would
be nearly impossible to accomplish this project. I would like to also express my
gratitude to my parents, Elżbieta and Ryszard, for their support at different stages
of my life because without them I would never be who I am now.

Contents

1 Why Deep Generative Modeling?
   1.1 AI Is Not Only About Decision Making
   1.2 Where Can We Use (Deep) Generative Modeling?
   1.3 How to Formulate (Deep) Generative Modeling?
      1.3.1 Autoregressive Models
      1.3.2 Flow-Based Models
      1.3.3 Latent Variable Models
      1.3.4 Energy-Based Models
      1.3.5 Overview
   1.4 Purpose and Content of This Book
   References
2 Autoregressive Models
   2.1 Introduction
   2.2 Autoregressive Models Parameterized by Neural Networks
      2.2.1 Finite Memory
      2.2.2 Long-Range Memory Through RNNs
      2.2.3 Long-Range Memory Through Convolutional Nets
   2.3 Deep Generative Autoregressive Model in Action!
      2.3.1 Code
   2.4 Is It All? No!
   References
3 Flow-Based Models
   3.1 Flows for Continuous Random Variables
      3.1.1 Introduction
      3.1.2 Change of Variables for Deep Generative Modeling
      3.1.3 Building Blocks of RealNVP
         3.1.3.1 Coupling Layers
         3.1.3.2 Permutation Layers
         3.1.3.3 Dequantization
      3.1.4 Flows in Action!
      3.1.5 Code
      3.1.6 Is It All? Really?
      3.1.7 ResNet Flows and DenseNet Flows
   3.2 Flows for Discrete Random Variables
      3.2.1 Introduction
      3.2.2 Flows in R or Maybe Rather in Z?
      3.2.3 Integer Discrete Flows
      3.2.4 Code
      3.2.5 What's Next?
   References
4 Latent Variable Models
   4.1 Introduction
   4.2 Probabilistic Principal Component Analysis
   4.3 Variational Auto-Encoders: Variational Inference for Non-linear Latent Variable Models
      4.3.1 The Model and the Objective
      4.3.2 A Different Perspective on the ELBO
      4.3.3 Components of VAEs
         4.3.3.1 Parameterization of Distributions
         4.3.3.2 Reparameterization Trick
      4.3.4 VAE in Action!
      4.3.5 Code
      4.3.6 Typical Issues with VAEs
      4.3.7 There Is More!
   4.4 Improving Variational Auto-Encoders
      4.4.1 Priors
         4.4.1.1 Standard Gaussian
         4.4.1.2 Mixture of Gaussians
         4.4.1.3 VampPrior: Variational Mixture of Posterior Prior
         4.4.1.4 GTM: Generative Topographic Mapping
         4.4.1.5 GTM-VampPrior
         4.4.1.6 Flow-Based Prior
         4.4.1.7 Remarks
      4.4.2 Variational Posteriors
         4.4.2.1 Variational Posteriors with Householder Flows [20]
         4.4.2.2 Variational Posteriors with Sylvester Flows [16]
         4.4.2.3 Hyperspherical Latent Space
   4.5 Hierarchical Latent Variable Models
      4.5.1 Introduction
      4.5.2 Hierarchical VAEs
         4.5.2.1 Two-Level VAEs
         4.5.2.2 Top-Down VAEs
         4.5.2.3 Code
         4.5.2.4 Further Reading
      4.5.3 Diffusion-Based Deep Generative Models
         4.5.3.1 Introduction
         4.5.3.2 Model Formulation
         4.5.3.3 Code
         4.5.3.4 Discussion
   References
5 Hybrid Modeling
   5.1 Introduction
      5.1.1 Approach 1: Let's Be Naive!
      5.1.2 Approach 2: Shared Parameterization!
   5.2 Hybrid Modeling
   5.3 Let's Implement It!
   5.4 Code
   5.5 What's Next?
   References
6 Energy-Based Models
   6.1 Introduction
   6.2 Model Formulation
   6.3 Training
   6.4 Code
   6.5 Restricted Boltzmann Machines
   6.6 Final Remarks
   References
7 Generative Adversarial Networks
   7.1 Introduction
   7.2 Implicit Modeling with Generative Adversarial Networks (GANs)
   7.3 Implementing GANs
   7.4 There Are Many GANs Out There!
   References
8 Deep Generative Modeling for Neural Compression
   8.1 Introduction
   8.2 General Compression Scheme
   8.3 A Short Detour: JPEG
   8.4 Neural Compression: Components
   8.5 What's Next?
   References
A Useful Facts from Algebra and Calculus
   A.1 Norms & Inner Products
   A.2 Matrix Calculus
B Useful Facts from Probability Theory and Statistics
   B.1 Commonly Used Probability Distributions
   B.2 Statistics
Index
Chapter 1
Why Deep Generative Modeling?

1.1 AI Is Not Only About Decision Making

Before we start thinking about (deep) generative modeling, let us consider a simple
example. Imagine we have trained a deep neural network that classifies images (x ∈ Z^D)
of animals (y ∈ Y, where Y = {cat, dog, horse}). Further, let us assume that
this neural network is trained really well so that it always assigns a high probability
p(y|x) to the proper class. So far so good, right? A problem could occur,
though. As pointed out in [1], adding noise to images could result in completely
false classification. An example of such a situation is presented in Fig. 1.1 where
adding noise could shift predicted probabilities of labels; however, the image is
barely changed (at least to us, human beings).
This example indicates that neural networks that are used to parameterize the
conditional distribution p(y|x) seem to lack semantic understanding of images.
Further, we even hypothesize that learning discriminative models is not enough for
proper decision making and creating AI. A machine learning system cannot rely on
learning how to make a decision without understanding the reality and being able to
express uncertainty about the surrounding world. How can we trust such a system
if even a small amount of noise could change its internal beliefs and also shift its
certainty from one decision to the other? How can we communicate with such a
system if it is unable to properly express its opinion about whether its surroundings
are new or not?
To motivate the importance of concepts like uncertainty and understanding
in decision making, let us consider a system that classifies objects, but this time
into two classes: orange and blue. We assume we have some two-dimensional data
(Fig. 1.2, left) and a new datapoint to be classified (a black cross in Fig. 1.2). We
can make decisions using two approaches. First, a classifier could be formulated
explicitly by modeling the conditional distribution p(y|x) (Fig. 1.2, middle). Sec-
ond, we can consider a joint distribution p(x, y) that could be further decomposed
as p(x, y) = p(y|x) p(x) (Fig. 1.2, right).


Fig. 1.1 An example of adding noise to an almost perfectly classified image that results in a shift
of predicted label

Fig. 1.2 An example of data (left) and two approaches to decision making: (middle) a discriminative approach and (right) a generative approach

After training a model using the discriminative approach, namely, the conditional
distribution p(y|x), we obtain a clear decision boundary. Then, we see that the black
cross is farther away from the orange region; thus, the classifier assigns a higher
probability to the blue label. As a result, the classifier is certain about the decision!
On the other hand, if we additionally fit a distribution p(x), we observe that the
black cross is not only farther away from the decision boundary, but also it is distant
from the region where the blue datapoints lie. In other words, the black point is far away
from the region of high probability mass. As a result, the (marginal) probability
of the black cross p(x = black cross) is low, and the joint distribution p(x =
black cross, y = blue) will be low as well and, thus, the decision is uncertain!
This simple example clearly indicates that if we want to build AI systems that
make reliable decisions and can communicate with us, human beings, they must
understand the environment first. For this purpose, they cannot simply learn how
to make decisions, but they should be able to quantify their beliefs about their
surrounding using the language of probability [2, 3]. In order to do that, we claim
that estimating the distribution over objects, p(x), is crucial.
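To make this concrete, here is a minimal numerical sketch of the two approaches on toy 2D data. This is an illustration added for this overview, not code from the book; it assumes scikit-learn and SciPy are available, and the class means, the location of the "black cross," and the per-class Gaussian model of p(x) are arbitrary choices. A purely discriminative classifier can be very confident about a point that lies far from all training data, while additionally modeling p(x) reveals that the point is unlikely, so the joint p(x, y) is small and the decision should be treated as uncertain.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Two well-separated 2D classes: "orange" (label 0) and "blue" (label 1).
x_orange = rng.normal(loc=[-2.0, 0.0], scale=0.5, size=(100, 2))
x_blue = rng.normal(loc=[2.0, 0.0], scale=0.5, size=(100, 2))
X = np.vstack([x_orange, x_blue])
y = np.array([0] * 100 + [1] * 100)

# A new point (the "black cross") far away from ALL the training data.
x_new = np.array([2.0, 8.0])

# Discriminative approach: model p(y|x) only.
clf = LogisticRegression().fit(X, y)
p_y_given_x = clf.predict_proba(x_new[None, :])[0, 1]    # high: the classifier is "certain"

# Generative ingredient: additionally fit p(x) (here, one Gaussian per class).
p_x_blue = multivariate_normal(x_blue.mean(0), np.cov(x_blue.T)).pdf(x_new)
p_x_orange = multivariate_normal(x_orange.mean(0), np.cov(x_orange.T)).pdf(x_new)
p_x = 0.5 * p_x_blue + 0.5 * p_x_orange                   # marginal p(x): tiny for x_new

print(f"p(y=blue|x)  = {p_y_given_x:.3f}")
print(f"p(x)         = {p_x:.3e}")
print(f"p(x, y=blue) = {0.5 * p_x_blue:.3e}")             # low joint -> uncertain decision
```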
From the generative perspective, knowing the distribution p(x) is essential
because:
• It could be used to assess whether a given object has been observed in the past or
not.
• It could help to properly weight the decision.
• It could be used to assess uncertainty about the environment.
• It could be used to actively learn by interacting with the environment (e.g., by
asking to label objects with low p(x)).
• And, eventually, it could be used to generate (synthesize) new objects.
Typically, in the literature of deep learning, generative models are treated as
generators of new data. However, here we try to convey a new perspective where
having p(x) has much broader applicability, and this could be essential for building
successful AI systems. Lastly, we would like to also make an obvious connection to
generative modeling in machine learning, where formulating a proper generative
process is crucial for understanding the phenomena of interest [3, 4]. However,
in many cases, it is easier to focus on the other factorization, namely, p(x, y) =
p(x|y) p(y). We claim that considering p(x, y) = p(y|x) p(x) has clear advantages
as mentioned before.

1.2 Where Can We Use (Deep) Generative Modeling?

With the development of neural networks and the increase in computational power,
deep generative modeling has become one of the leading directions in AI. Its
applications vary from typical modalities considered in machine learning, i.e., text
analysis (e.g., [5]), image analysis (e.g., [6]), audio analysis (e.g., [7]), to problems
in active learning (e.g., [8]), reinforcement learning (e.g., [9]), graph analysis (e.g.,
[10]), and medical imaging (e.g., [11]). In Fig. 1.3, we present graphically potential
applications of deep generative modeling.
Fig. 1.3 Various potential applications of deep generative modeling: text, images, audio, graphs, medical imaging, active learning (querying for labels for unlabeled data), and reinforcement learning

In some applications, it is indeed important to generate (synthesize) objects or
modify features of objects to create new ones (e.g., an app that turns a young person
into an old one). However, in others, like active learning, it is important to ask for
uncertain objects, i.e., objects with low p(x), that should be labeled by an oracle. In
reinforcement learning, on the other hand, generating the next most likely situation
(states) is crucial for taking actions by an agent. For medical applications, explaining
a decision, e.g., in terms of the probability of the label and the object, is definitely
more informative to a human doctor than simply assisting with a diagnosis label.
If an AI system were able to indicate how certain it is and also quantify
whether the object is suspicious (i.e., low p(x)) or not, then it might be used as
an independent specialist that outlines its own opinion.
These examples clearly indicate that many fields, if not all, could highly benefit
from (deep) generative modeling. Obviously, there are many mechanisms that AI
systems should be equipped with. However, we claim that the generative modeling
capability is definitely one of the most important ones, as outlined in the above-
mentioned cases.

1.3 How to Formulate (Deep) Generative Modeling?

At this point, after highlighting the importance and wide applicability of (deep)
generative modeling, we should ask ourselves how to formulate (deep) generative
models. In other words, how can we express p(x), which we have already mentioned
multiple times?
We can divide (deep) generative modeling into four main groups (see Fig. 1.4):
• Autoregressive generative models (ARM)
• Flow-based models
• Latent variable models
• Energy-based models
We use deep in brackets because most of what we have discussed so far could be
modeled without using neural networks. However, neural networks are flexible and
powerful and, therefore, they are widely used to parameterize generative models.
From now on, we focus entirely on deep generative models.
As a side note, please treat this taxonomy as a guideline that helps us to navigate
through this book, not something written in stone. Personally, I am not a big fan of
spending too much time on categorizing and labeling science because it very often
results in antagonizing and gatekeeping. Anyway, there is also a group of models
based on the score matching principle [12–14] that do not necessarily fit our simple
taxonomy. However, as pointed out in [14], these models share a lot of similarities
with latent variable models (if we treat consecutive steps of a stochastic process as
latent variables) and, thus, we treat them as such.
Fig. 1.4 A taxonomy of deep generative models: autoregressive models (e.g., PixelCNN), flow-based models (e.g., RealNVP), latent variable models, and energy-based models; latent variable models are further divided into implicit models (e.g., GANs) and prescribed models (e.g., VAEs)

1.3.1 Autoregressive Models

The first group of deep generative models utilizes the idea of autoregressive
modeling (ARM). In other words, the distribution over x is represented in an
autoregressive manner:


$$p(x) = p(x_0) \prod_{i=1}^{D} p(x_i \mid x_{<i}), \qquad (1.1)$$

where x_{<i} denotes all x's up to the index i.


Modeling all conditional distributions p(x_i | x_{<i}) would be computationally
inefficient. However, we can take advantage of causal convolutions as presented
in [7] for audio and in [15, 16] for images. We will discuss ARMs more in depth in
Chap. 2.
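To make Eq. (1.1) concrete, below is a minimal PyTorch sketch of an ARM over binary vectors; it is a sketch written for this overview (not the book's code), and the architecture, kernel sizes, and hidden width are arbitrary. The key ingredient is a causal 1D convolution, so that the prediction for x_i depends only on x_{<i}:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1D convolution that only looks at past positions (causal padding)."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        left_pad = (kernel_size - 1) * dilation
        super().__init__(in_channels, out_channels, kernel_size,
                         padding=left_pad, dilation=dilation)
        self.left_pad = left_pad

    def forward(self, x):
        out = super().forward(x)
        # Drop the extra positions on the right so position i never sees inputs > i.
        return out[:, :, :-self.left_pad] if self.left_pad > 0 else out

class ARM(nn.Module):
    """Autoregressive model over binary vectors x in {0,1}^D."""

    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(1, hidden, kernel_size=3), nn.ReLU(),
            CausalConv1d(hidden, hidden, kernel_size=3, dilation=2), nn.ReLU(),
            CausalConv1d(hidden, 1, kernel_size=3, dilation=4),
        )

    def log_prob(self, x):
        # x: (batch, D) with values in {0, 1}.
        # Shift the input right by one so the prediction for x_i uses only x_<i.
        x_in = torch.cat([torch.zeros(x.size(0), 1), x[:, :-1]], dim=1)
        logits = self.net(x_in.unsqueeze(1)).squeeze(1)          # (batch, D)
        log_p = -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none")
        return log_p.sum(dim=1)                                  # sum_i log p(x_i | x_<i)

model = ARM()
x = torch.bernoulli(0.5 * torch.ones(8, 16))   # toy batch of binary "images"
loss = -model.log_prob(x).mean()               # negative log-likelihood
```

Chapter 2 develops this idea properly, including sampling, which has to proceed one dimension at a time.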

1.3.2 Flow-Based Models

The change of variables formula provides a principled manner of expressing a
density of a random variable by transforming it with an invertible transformation
f [17]:

$$p(x) = p\big(z = f(x)\big)\, \big|J_{f(x)}\big|, \qquad (1.2)$$

where J_{f(x)} denotes the Jacobian matrix.


We can parameterize f using deep neural networks; however, it cannot be an
arbitrary neural network, because we must be able to calculate the Jacobian matrix.
The first ideas of using the change of variables formula focused on linear, volume-preserving
transformations that yield |J_{f(x)}| = 1 [18, 19]. Further attempts utilized
theorems on matrix determinants that resulted in specific non-linear transformations,
namely, planar flows [20] and Sylvester flows [21, 22]. A different approach focuses
on formulating invertible transformations for which the Jacobian determinant can
be calculated easily, as for the coupling layers in RealNVP [23]. Recently, arbitrary
neural networks have been constrained in such a way that they are invertible and the
Jacobian determinant is approximated [24–26].
In the case of discrete distributions (e.g., over integers), for probability mass
functions there is no change of volume and, therefore, the change of variables
formula takes the following form:

$$p(x) = p\big(z = f(x)\big). \qquad (1.3)$$

Integer discrete flows propose to use affine coupling layers with rounding
operators to ensure the integer-valued output [27]. A generalization of the affine
coupling layer was further investigated in [28].
All generative models that take advantage of the change of variables formula
are referred to as flow-based models or flows for short. We will discuss flows in
Chap. 3.
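As a small illustration of Eq. (1.2), the following PyTorch sketch shows a single RealNVP-style affine coupling layer and how its easily computed log-Jacobian-determinant enters the log-likelihood. This is a sketch written for this overview rather than the book's implementation; the layer sizes and the tanh-bounded scale are arbitrary choices, and a real model stacks many such layers interleaved with permutations (see Chap. 3).

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """RealNVP-style coupling layer: keep x_a, transform x_b conditioned on x_a."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.dim_a = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.dim_a, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.dim_a)),
        )

    def forward(self, x):
        xa, xb = x[:, :self.dim_a], x[:, self.dim_a:]
        log_s, t = self.net(xa).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep the scale well-behaved
        zb = xb * torch.exp(log_s) + t            # invertible, element-wise
        log_det = log_s.sum(dim=1)                # log|det J| of the transformation
        return torch.cat([xa, zb], dim=1), log_det

    def inverse(self, z):
        za, zb = z[:, :self.dim_a], z[:, self.dim_a:]
        log_s, t = self.net(za).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=1)

# Log-likelihood via the change of variables formula, Eq. (1.2):
# log p(x) = log p_base(f(x)) + log|det J_f(x)|.
dim = 4
flow = AffineCoupling(dim)
base = torch.distributions.Normal(torch.zeros(dim), torch.ones(dim))
x = torch.randn(8, dim)
z, log_det = flow(x)
log_px = base.log_prob(z).sum(dim=1) + log_det
```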

1.3.3 Latent Variable Models

The idea behind latent variable models is to assume a lower-dimensional latent
space and the following generative process:

z ∼ p(z)
x ∼ p(x|z).

In other words, the latent variables correspond to hidden factors in data, and the
conditional distribution p(x|z) could be treated as a generator.
The most widely known latent variable model is the probabilistic Principal
Component Analysis (pPCA) [29] where p(z) and p(x|z) are Gaussian distribu-
tions, and the dependency between z and x is linear.
A non-linear extension of the pPCA with arbitrary distributions is the Varia-
tional Auto-Encoder (VAE) framework [30, 31]. To make the inference tractable,
variational inference is utilized to approximate the posterior p(z|x), and neural
networks are used to parameterize the distributions. Since the publication of the
seminal papers by Kingma and Welling [30], Rezende et al. [31], there were multiple
extensions of this framework, including working on more powerful variational
posteriors [19, 21, 22, 32], priors [33, 34], and decoders [35]. Interesting directions
include considering different topologies of the latent space, e.g., the hyperspherical
latent space [36]. In VAEs and the pPCA all distributions must be defined upfront
and, therefore, they are called prescribed models. We will pay special attention to
this group of deep generative models in Chap. 4.
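As a preview of the prescribed approach, here is a minimal PyTorch sketch of a VAE with a Gaussian encoder, a Bernoulli decoder, a standard Gaussian prior, and the reparameterization trick. It is a sketch under arbitrary architectural choices, not the book's implementation; the objective (the ELBO) and its derivation are presented in detail in Chap. 4.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal Gaussian VAE: encoder q(z|x), decoder p(x|z), standard normal prior p(z)."""

    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

    def elbo(self, x):
        mu, log_var = self.encoder(x).chunk(2, dim=1)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        logits = self.decoder(z)
        # log p(x|z) for binary x (Bernoulli decoder).
        rec = -nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
        # KL(q(z|x) || N(0, I)) in closed form.
        kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1)
        return rec - kl    # the ELBO, to be maximized

vae = TinyVAE()
x = torch.bernoulli(0.5 * torch.ones(8, 784))   # toy batch of binarized images
loss = -vae.elbo(x).mean()
```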
So far, ARMs, flows, the pPCA, and VAEs are probabilistic models with
the objective function being the log-likelihood function that is closely related
to using the Kullback–Leibler divergence between the data distribution and the
model distribution. A different approach utilizes an adversarial loss in which a
discriminator D(·) determines a difference between real data and synthetic data
provided by a generator in the implicit form, namely, p(x|z) = δ(x − G(z)),
where δ(·) is the Dirac delta. This group of models is called implicit models, and
Generative Adversarial Networks (GANs) [6] become one of the first successful
deep generative models for synthesizing realistic-looking objects (e.g., images). See
Chap. 7 for more details.

1.3.4 Energy-Based Models

Physics provides an interesting perspective on defining a group of generative
models through the definition of an energy function, E(x), and, eventually, the Boltzmann
distribution:

$$p(x) = \frac{\exp\{-E(x)\}}{Z}, \qquad (1.4)$$

where $Z = \sum_{x} \exp\{-E(x)\}$ is the partition function.
In other words, the distribution is defined by the exponentiated energy function
that is further normalized to obtain values between 0 and 1 (i.e., probabilities). There
is much more to it if we think about physics, but we do not need to delve into
that here. I refer to [37] as a great starting point.
Models defined by an energy function are referred to as energy-based models
(EBMs) [38]. The main idea behind EBMs is to formulate the energy function and
calculate (or rather approximate) the partition function. The largest group of EBMs
consists of Boltzmann Machines that entangle x’s through a bilinear form, i.e.,
E(x) = x^T Wx [39, 40]. Introducing latent variables and taking E(x, z) = x^T Wz
results in Restricted Boltzmann Machines [41]. The idea of Boltzmann machines
could be further extended to the joint distribution over x and y as it is done, e.g., in
classification Restricted Boltzmann Machines [42]. Recently, it has been shown that
an arbitrary neural network could be used to define the joint distribution [43]. We
will discuss how this could be accomplished in Chap. 6.
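To see Eq. (1.4) in action, the following PyTorch sketch (written for this overview, not taken from the book) defines an energy function with a small neural network and evaluates the Boltzmann distribution exactly for a toy problem with D = 10 binary variables, where the 2^D states can still be enumerated. For high-dimensional data such enumeration is impossible, which is why EBMs must approximate the partition function, e.g., with Monte Carlo methods (see Chap. 6).

```python
import itertools
import torch
import torch.nn as nn

# Energy function E(x) parameterized by a small neural network.
D = 10
energy_net = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 1))

def energy(x):
    return energy_net(x).squeeze(-1)

# For D = 10 binary variables we can still enumerate all 2^D states and compute
# the partition function Z exactly; for images this is hopeless, which is exactly
# why EBMs need approximations (e.g., MCMC sampling).
all_states = torch.tensor(list(itertools.product([0.0, 1.0], repeat=D)))
with torch.no_grad():
    log_Z = torch.logsumexp(-energy(all_states), dim=0)
    x = torch.bernoulli(0.5 * torch.ones(4, D))       # some query points
    log_px = -energy(x) - log_Z                       # Boltzmann distribution, Eq. (1.4)
print(log_px)
```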

1.3.5 Overview

In Table 1.1, we compared all four groups of models (with a distinction between
implicit latent variable models and prescribed latent variable models) using arbitrary
criteria like:
• Whether training is typically stable
• Whether it is possible to calculate the likelihood function
• Whether one can use a model for lossy or lossless compression
• Whether a model could be used for representation learning
All likelihood-based models (i.e., ARMs, flows, EBMs, and prescribed models
like VAEs) can be trained in a stable manner, while implicit models like GANs
suffer from instabilities. In the case of the non-linear prescribed models like VAEs,
we must remember that the likelihood function cannot be exactly calculated, and
only a lower-bound could be provided. Similarly, EBMs require calculating the
partition function, which is an analytically intractable problem. As a result, we can get
the unnormalized probability or an approximation at best. ARMs constitute one
of the best likelihood-based models; however, their sampling process is extremely
slow due to the autoregressive manner of generating new content. EBMs require
running a Monte Carlo method to receive a sample. Since we operate on high-
dimensional objects, this is a great obstacle for using EBMs widely in practice. All
other approaches are relatively fast. In the case of compression, VAEs are models
that allow us to use a bottleneck (the latent space). On the other hand, ARMs and
flows could be used for lossless compression since they are density estimators
and provide the exact likelihood value. Implicit models cannot be directly used
for compression; however, recent works use GANs to improve image compression
[44]. Flows, prescribed models, and EBMs (if they use latents) could be used for
representation learning, namely, learning a set of random variables that summarize
data in some way and/or disentangle factors in data. The question of what makes a
good representation is a different story, and we refer the curious reader to the literature,
e.g., [45].

Table 1.1 A comparison of deep generative models


| Generative models     | Training | Likelihood   | Sampling  | Compression | Representation |
|-----------------------|----------|--------------|-----------|-------------|----------------|
| Autoregressive models | Stable   | Exact        | Slow      | Lossless    | No             |
| Flow-based models     | Stable   | Exact        | Fast/slow | Lossless    | Yes            |
| Implicit models       | Unstable | No           | Fast      | No          | No             |
| Prescribed models     | Stable   | Approximate  | Fast      | Lossy       | Yes            |
| Energy-based models   | Stable   | Unnormalized | Slow      | Rather not  | Yes            |
1.4 Purpose and Content of This Book

This book is intended as an introduction to the field of deep generative modeling.


Its goal is to win you, dear reader, over to the philosophy of generative modeling
and to show you its beauty! Deep generative modeling is an interesting hybrid
that combines probability theory, statistics, probabilistic machine learning, and
deep learning in a single framework. However, to be able to follow the ideas
presented in this book it is advised to possess knowledge in algebra and calculus,
probability theory and statistics, the basics of machine learning and deep learning,
and programming with Python. Knowing PyTorch¹ is highly recommended since
all code snippets are written in PyTorch. However, knowing other deep learning
frameworks like Keras, Tensorflow, or JAX should be sufficient to understand the
code.
In this book, we will not review machine learning concepts or building blocks
in deep learning unless it is essential to comprehend a given topic. Instead, we
will delve into models and training algorithms of deep generative models. We
will either discuss the marginal models, such as autoregressive models (Chap. 2),
flow-based models (Chap. 3): RealNVP, Integer Discrete Flows, and residual and
DenseNet flows, latent variable models (Chap. 4): Variational Auto-Encoder and its
components, hierarchical VAEs, and Diffusion-based deep generative models, or
frameworks for modeling the joint distribution like hybrid modeling (Chap. 5) and
energy-based models (Chap. 6). Eventually, we will present how deep generative
modeling could be useful for data compression within the neural compression
framework (Chap. 8). In general, the book is organized in such a way that each
chapter could be followed independently from the others and in an order that suits a
reader best.
So who is the target audience of this book? Well, hopefully everybody who is
interested in AI, but there are two groups who could definitely benefit from the
presented content. The first target audience is university students who want to go
beyond standard courses on machine learning and deep learning. The second group
is research engineers who want to broaden their knowledge on AI or prefer to make
the next step in their careers and learn about the next generation of AI systems.
Either way, the book is intended for curious minds who want to understand AI and
learn not only about theory but also how to implement the discussed material. For
this purpose, each topic is associated with general discussion and introduction that
is further followed by formal formulations and a piece of code (in PyTorch). The
intention of this book is to help the reader truly understand deep generative modeling,
which, in the humble opinion of the author, is only possible if one can not only derive
a model but also implement it. Therefore, this book is accompanied by the following
code repository:
https://github.com/jmtomczak/intro_dgm

¹ https://pytorch.org/.

References

1. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International
Conference on Learning Representations, ICLR 2014, 2014.
2. Christopher M Bishop. Model-based machine learning. Philosophical Transactions of the
Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20120222,
2013.
3. Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature,
521(7553):452–459, 2015.
4. Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative
and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’06), volume 1, pages 87–94. IEEE, 2006.
5. Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy
Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Language Learning, pages 10–21, 2016.
6. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint
arXiv:1406.2661, 2014.
7. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
8. Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–
5981, 2019.
9. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
10. Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs
using variational autoencoders. In International Conference on Artificial Neural Networks,
pages 412–422. Springer, 2018.
11. Maximilian Ilse, Jakub M Tomczak, Christos Louizos, and Max Welling. DIVA: Domain
invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348.
PMLR, 2020.
12. Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.
13. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. arXiv preprint arXiv:1907.05600, 2019.
14. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
15. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
16. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
17. Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. arXiv preprint arXiv:1302.5125, 2013.
18. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.
19. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder
flow. arXiv preprint arXiv:1611.09630, 2016.
20. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Inter-
national Conference on Machine Learning, pages 1530–1538. PMLR, 2015.

21. Rianne Van Den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 393–402. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
22. Emiel Hoogeboom, Victor Garcia Satorras, Jakub M Tomczak, and Max Welling. The
convolution exponential and generalized Sylvester flows. arXiv preprint arXiv:2006.01910,
2020.
23. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv preprint arXiv:1605.08803, 2016.
24. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen.
Invertible residual networks. In International Conference on Machine Learning, pages 573–
582. PMLR, 2019.
25. Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows
for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
26. Yura Perugachi-Diaz, Jakub M Tomczak, and Sandjai Bhulai. Invertible DenseNets with
Concatenated LipSwish. Advances in Neural Information Processing Systems, 2021.
27. Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete
flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
28. Jakub M Tomczak. General invertible transformations for flow-based generative modeling.
INNF+, 2021.
29. Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622,
1999.
30. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
31. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. In International conference on machine
learning, pages 1278–1286. PMLR, 2014.
32. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural
Information Processing Systems, 29:4743–4751, 2016.
33. Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schul-
man, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint
arXiv:1611.02731, 2016.
34. Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on
Artificial Intelligence and Statistics, pages 1214–1223. PMLR, 2018.
35. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
36. Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.
Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
37. Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.
38. Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-
based learning. Predicting structured data, 1(0), 2006.
39. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
40. Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in Boltzmann
machines. Parallel distributed processing: Explorations in the microstructure of cognition,
1(282-317):2, 1986.
41. Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural
networks: Tricks of the trade, pages 599–619. Springer, 2012.

42. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on Machine learning, pages
536–543, 2008.
43. Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad
Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should
treat it like one. In International Conference on Learning Representations, 2019.
44. Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity
generative image compression. Advances in Neural Information Processing Systems, 33, 2020.
45. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
Chapter 2
Autoregressive Models

2.1 Introduction

Before we start discussing how we can model the distribution p(x), we refresh our
memory about the core rules of probability theory, namely, the sum rule and the
product rule. Let us introduce two random variables x and y. Their joint distribution
is p(x, y). The product rule allows us to factorize the joint distribution in two
manners, namely:

p(x, y) = p(x|y)p(y) (2.1)


= p(y|x)p(x). (2.2)

In other words, the joint distribution could be represented as a product of a
marginal distribution and a conditional distribution. The sum rule tells us that if
we want to calculate the marginal distribution over one of the variables, we must
integrate out (or sum out) the other variable, that is:

p(x) = \sum_{y} p(x, y).   (2.3)

These two rules will play a crucial role in probability theory and statistics and, in
particular, in formulating deep generative models.
Now, let us consider a high-dimensional random variable x ∈ X^D where X =
{0, 1, . . . , 255} (e.g., pixel values) or X = R. Our goal is to model p(x). Before we
jump into thinking of specific parameterization, let us first apply the product rule to
express the joint distribution in a different manner:

p(x) = p(x_1) \prod_{d=2}^{D} p(x_d | x_{<d}),   (2.4)


where x_{<d} = [x_1, x_2, . . . , x_{d-1}]^⊤. For instance, for x = [x_1, x_2, x_3]^⊤, we have

p(x) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2).
As we can see, the product rule applied multiple times to the joint distribu-
tion provides a principled manner of factorizing the joint distribution into many
conditional distributions. That’s great news! However, modeling all conditional
distributions p(xd |x<d ) separately is simply infeasible! If we did that, we would
obtain D separate models, and the complexity of each model would grow due to
varying conditioning. A natural question is whether we can do better, and the answer
is yes.

2.2 Autoregressive Models Parameterized by Neural Networks

As mentioned earlier, we aim to model the joint distribution p(x) using
conditional distributions. A potential solution to the issue of using D separate models
is utilizing a single, shared model for the conditional distribution. However, we
need to make some assumptions to use such a shared model. In other words, we
look for an autoregressive model (ARM). In the following subsections, we outline ARMs
parameterized with various neural networks. After all, we are talking about deep
generative models, so using a neural network should not be surprising, should it?

2.2.1 Finite Memory

The first attempt at limiting the complexity of a conditional model is to assume a
finite memory. For instance, we can assume that each variable is dependent on no
more than two other variables, namely:

p(x) = p(x_1) p(x_2|x_1) \prod_{d=3}^{D} p(x_d | x_{d-1}, x_{d-2}).   (2.5)

Then, we can use a small neural network, e.g., a multi-layered perceptron (MLP),
to predict the distribution of x_d. If X = {0, 1, . . . , 255}, the MLP takes x_{d-1}, x_{d-2}
and outputs probabilities for the categorical distribution of x_d, θ_d. The MLP could
be of the following form:

[x_{d-1}, x_{d-2}] → Linear(2, M) → ReLU → Linear(M, 256) → softmax → θ_d,   (2.6)

where M denotes the number of hidden units, e.g., M = 300. An example of this
approach is depicted in Fig. 2.1.

Fig. 2.1 An example of applying a shared MLP depending on two last inputs. Inputs are denoted
by blue nodes (bottom), intermediate representations are denoted by orange nodes (middle), and
output probabilities are denoted by green nodes (top). Notice that a probability θd is not dependent
on xd

It is important to notice that now we use a single, shared MLP to predict
probabilities for x_d. Such a model is not only non-linear, but its parameterization
is also convenient due to a relatively small number of weights to be trained. However,
the obvious drawback of this approach is a limited memory (i.e., only the last two
variables in our example). Moreover, it is unclear a priori how many variables we
should use in the conditioning. In many problems, e.g., image processing, learning long-
range statistics is crucial to understand complex patterns in data; therefore, having
a long-range memory is essential.
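To make Eq. (2.6) concrete, a minimal sketch of such a shared conditional in PyTorch could look as
follows; the layer sizes follow the text, while the class name and the way the pair of parents is
fed in are only illustrative choices, not code from the accompanying repository.

import torch
import torch.nn as nn

# A shared MLP that maps (x_{d-1}, x_{d-2}) to the categorical probabilities of x_d, cf. Eq. (2.6).
class FiniteMemoryConditional(nn.Module):
    def __init__(self, M=300, num_vals=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, M),
            nn.ReLU(),
            nn.Linear(M, num_vals),
            nn.Softmax(dim=-1))

    def forward(self, x_prev):
        # x_prev: a batch of pairs [x_{d-1}, x_{d-2}] of shape (batch_size, 2)
        return self.net(x_prev)

# The very same module (i.e., shared weights) is applied at every position d >= 3:
theta_d = FiniteMemoryConditional()(torch.rand(16, 2))  # probabilities of shape (16, 256)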

2.2.2 Long-Range Memory Through RNNs

A possible solution to the problem of a short-range memory modeled by an MLP
relies on applying a recurrent neural network (RNN) [1, 2]. In other words, we can
model the conditional distributions as follows [3]:

p(x_d | x_{<d}) = p(x_d | RNN(x_{d-1}, h_{d-1})),   (2.7)

where h_d = RNN(x_{d-1}, h_{d-1}), and h_d is a hidden context, which acts as a memory
that allows learning long-range dependencies. An example of using an RNN is
presented in Fig. 2.2.
This approach gives a single parameterization, thus, it is efficient and also solves
the problem of a finite memory. So far so good! Unfortunately, RNNs suffer from
other issues, namely:
• They are sequential, hence, slow.
• If they are badly conditioned (i.e., the eigenvalues of the weight matrix are
larger or smaller than 1), then they suffer from exploding or vanishing gradients,
respectively, which hinders learning long-range dependencies.

Fig. 2.2 An example of applying an RNN depending on the last two inputs. Inputs are denoted by blue
nodes (bottom), intermediate representations are denoted by orange nodes (middle), and output
probabilities are denoted by green nodes (top). Notice that, compared to the approach with a shared
MLP, there is an additional dependency between the intermediate nodes h_d
There exist methods to help with training RNNs, like gradient clipping or, more generally,
gradient regularization [4] or orthogonal weights [5]. However, here we are not
interested in looking into rather specific solutions to new problems. We seek a
different parameterization that could solve our original problem, namely, modeling
long-range dependencies in an ARM.
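To see how Eq. (2.7) could be turned into code, below is a rough sketch of an RNN-parameterized
ARM; the choice of a GRU cell, the embedding of pixel values, and all names are illustrative
assumptions rather than the reference implementation.

import torch
import torch.nn as nn

# Sketch of p(x_d | x_{<d}) with h_d = RNN(x_{d-1}, h_{d-1}) acting as a long-range memory.
class RNNConditional(nn.Module):
    def __init__(self, num_vals=256, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(num_vals, hidden_size)
        self.rnn = nn.GRUCell(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, num_vals)

    def forward(self, x):
        # x: (batch_size, D) sequence of integer-valued (long) pixels
        h = torch.zeros(x.shape[0], self.rnn.hidden_size)
        logits = []
        for d in range(1, x.shape[1]):
            h = self.rnn(self.embed(x[:, d - 1]), h)  # h_d depends on x_{d-1} and h_{d-1}
            logits.append(self.out(h))                # parameters of p(x_d | x_{<d})
        return torch.stack(logits, dim=1)             # (batch_size, D - 1, num_vals)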

2.2.3 Long-Range Memory Through Convolutional Nets

In [6, 7] it was noticed that convolutional neural networks (CNNs) could be used
instead of RNNs to model long-range dependencies. To be more precise, one-
dimensional convolutional layers (Conv1D) could be stacked together to process
sequential data. The advantages of such an approach are the following:
• Kernels are shared (i.e., an efficient parameterization).
• The processing is done in parallel, which greatly speeds up computations.
• By stacking more layers, the effective kernel size grows with the network depth.
These three traits seem to place Conv1D-based neural networks as a perfect solution
to our problem. However, can we indeed use them straight away?
A Conv1D can be applied to calculate embeddings like in [7], but it cannot be
used for autoregressive models. Why? Because we need convolutions to be causal
[8]. Causal in this context means that a Conv1D layer is dependent on the last k
inputs, excluding the current one (option A) or including the current one (option B). In other
words, we must "cut" the kernel in half and forbid it to look into the next variables
(i.e., to look into the future). Importantly, option A is required in the first layer because
the final output (i.e., the probabilities θ_d) cannot be dependent on x_d. Additionally,
if we are concerned about the effective kernel size, we can use a dilation larger
than 1.

Fig. 2.3 An example of applying causal convolutions. The kernel size is 2, but by applying dilation
in higher layers, a much larger input could be processed (red edges), thus, a larger memory is
utilized. Notice that the first layer must be of option A to ensure proper processing
In Fig. 2.3 we present an example of a neural network consisting of 3 causal
Conv1D layers. The first CausalConv1D is of type A, i.e., it takes into
account only the last k inputs, without the current one. Then, in the next two layers,
we use CausalConv1D (option B) with dilations 2 and 3. Typically, the dilation
values are 1, 2, 4, and 8 [9]; however, taking 2 and 4 would
not nicely fit in a figure. We highlight in red all connections that go from the output
layer to the input layer. As we can notice, stacking CausalConv1D layers with a
dilation larger than 1 allows us to learn long-range dependencies (in this example,
by looking at the last 7 inputs).
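As a quick sanity check of this claim, the number of past inputs seen by such a stack can be
computed as the sum of (kernel_size - 1) * dilation over the layers, plus one extra position coming
from the option A layer; the tiny helper below is only an arithmetic aid, not code from the book.

def causal_receptive_field(kernel_sizes, dilations, first_layer_A=True):
    # Number of past inputs (excluding the current one) that influence a single output position.
    rf = sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))
    return rf + (1 if first_layer_A else 0)

# The stack from Fig. 2.3: kernel size 2 everywhere, dilations 1, 2 and 3, the first layer of option A:
print(causal_receptive_field([2, 2, 2], [1, 2, 3]))  # 7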
An example of an implementation of CausalConv1D layer is presented below. If
you are still confused about the option A and the option B, please analyze the code
snippet step-by-step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, A=False, **kwargs):
        super(CausalConv1d, self).__init__()

        # The general idea is the following: we take the built-in PyTorch Conv1d. Then, we must
        # pick a proper padding, because we must ensure the convolution is causal. Eventually, we
        # must remove some final elements of the output, because we simply don't need them!
        # Since CausalConv1d is still a convolution, we must define the kernel size, dilation,
        # and whether it is option A (A=True) or option B (A=False). Remember that by playing
        # with dilation we can enlarge the size of the memory.

        # attributes:
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.A = A  # whether option A (A=True) or option B (A=False)
        self.padding = (kernel_size - 1) * dilation + A * 1

        # we will do the padding by ourselves in the forward pass!
        self.conv1d = nn.Conv1d(in_channels, out_channels, kernel_size,
                                stride=1, padding=0, dilation=dilation, **kwargs)

    def forward(self, x):
        # We pad only from the left! This is a more efficient implementation.
        x = F.pad(x, (self.padding, 0))
        conv1d_out = self.conv1d(x)
        if self.A:
            # Remember, we cannot depend on the current component; therefore,
            # the last element is removed.
            return conv1d_out[:, :, :-1]
        else:
            return conv1d_out

Listing 2.1 Causal convolution 1D
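As a quick, illustrative check that the layer is indeed causal, we can perturb a future position of
the input and verify that earlier outputs do not change; the snippet assumes the imports and the
CausalConv1d class defined above.

# A small causality check (not from the book).
torch.manual_seed(0)
layer_A = CausalConv1d(in_channels=1, out_channels=4, kernel_size=2, dilation=1, A=True)

x = torch.randn(1, 1, 10)
x_perturbed = x.clone()
x_perturbed[:, :, 5] = 100.0  # change the input at position 5

y = layer_A(x)
y_perturbed = layer_A(x_perturbed)

# Outputs at positions 0..5 must stay the same; option A does not even look at the current input.
assert torch.allclose(y[:, :, :6], y_perturbed[:, :, :6])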

The CausalConv1D layers are better suited to modeling sequential data than
RNNs. They obtain not only better results (e.g., classification accuracy) but also
allow learning long-range dependencies more efficiently than RNNs [8]. Moreover,
they do not suffer from exploding/vanishing gradient issues. As a result, they seem
to be a perfect parameterization for autoregressive models! Their supremacy has
been proven in many cases, including audio processing by WaveNet, a neural
network consisting of CausalConv1D layers [9], or image processing by PixelCNN,
a model with CausalConv2D components [10].
Then, is there any drawback of applying autoregressive models parameterized by
causal convolutions? Unfortunately, yes, there is and it is connected with sampling.
If we want to evaluate probabilities for given inputs, we need to calculate the
forward pass where all calculations are done in parallel. However, if we want to
sample new objects, we must iterate through all positions (think of a big for-loop,
from the first variable to the last one) and iteratively predict probabilities and sample
new values. Since we use convolutions to parameterize the model, we must do D full
forward passes to get the final sample. That is a big waste, but, unfortunately, that
is the price we must pay for all “goodies” following from the convolutional-based
parameterization of the ARM. Fortunately, there is ongoing research on speeding
up computations, e.g., see [11].

2.3 Deep Generative Autoregressive Model in Action!

Alright, let us talk more about details and how to implement an ARM. Here, and in
the whole book, we focus on images, e.g., x ∈ {0, 1, . . . , 15}^{64}. Since images are
represented by integers, we will use the categorical distribution to represent them (in
the next chapters, we will comment on the choice of distribution for images and present
some alternatives). We model p(x) using an ARM parameterized by CausalConv1D
layers. As a result, each conditional is the following:

p(x_d | x_{<d}) = Categorical(x_d | θ_d(x_{<d}))   (2.8)
               = \prod_{l=1}^{L} θ_{d,l}^{[x_d = l]},   (2.9)

where [a = b] is the Iverson bracket (i.e., [a = b] = 1 if a = b, and [a = b] = 0
if a ≠ b), and θ_d(x_{<d}) ∈ [0, 1]^{16} is the output of the CausalConv1D-based neural
network with the softmax in the last layer, so \sum_{l=1}^{L} θ_{d,l} = 1. To be very clear, the
last layer must have 16 output channels (because there are 16 possible values per
pixel), and the softmax is taken over these 16 values. We stack CausalConv1D layers
with non-linear activation functions in between (e.g., LeakyReLU). Of course,
we must remember about taking the option A CausalConv1D as the first layer!
Otherwise, we would break the assumption that x_d is not taken into account when predicting θ_d.
What about the objective function? ARMs are likelihood-based models, so for
given N i.i.d. datapoints D = {x_1, . . . , x_N}, we aim at maximizing the logarithm of
the likelihood function, that is (we will use the product and sum rules again):

ln p(D) = ln p(xn ) (2.10)
n

= ln p(xn ) (2.11)
n
 
= ln p(xn,d |xn,<d ) (2.12)
n d
 
 
= ln p(xn,d |xn,<d ) (2.13)
n d
 
 
= ln Categorical (xd |θd (x<d )) (2.14)
n d
20 2 Autoregressive Models

  L 
  
= [xd = l] ln θd (x<d ) . (2.15)
n d l=1

For simplicity, we assumed that x_{<1} = ∅, i.e., no conditioning. As we can notice,
the objective function takes a very nice form! First, the logarithm applied to the likelihood of the
i.i.d. data D results in a sum over datapoints of the logarithms of the individual distributions p(x_n).
Second, applying the product rule, together with the logarithm, results in another
sum, this time over dimensions. Eventually, by parameterizing the conditionals by
CausalConv1D layers, we can calculate all θ_d in one forward pass and then check the pixel
value (see the last line of ln p(D)). Ideally, we want θ_{d,l} to be as close to 1 as
possible if x_d = l.
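The log-categorical corresponds exactly to the last line of ln p(D); a sketch of how such a helper
could be implemented is given below (the function of the same name in the accompanying repository
may differ in details).

import torch
import torch.nn.functional as F

def log_categorical(x, p, num_classes=256, reduction=None, dim=None):
    # One-hot encode the integer-valued pixels and pick the log-probability of the observed value,
    # i.e., sum_l [x = l] * ln(theta_l), as in Eq. (2.15).
    x_one_hot = F.one_hot(x.long(), num_classes=num_classes)
    log_p = x_one_hot * torch.log(torch.clamp(p, 1.e-5, 1. - 1.e-5))
    if reduction == 'sum':
        return torch.sum(log_p, dim)
    elif reduction == 'mean':
        return torch.mean(log_p, dim)
    else:
        return log_p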

2.3.1 Code

Uff... Alright, let’s take a look at some code. The full code is available under the
following: https://github.com/jmtomczak/intro_dgm. Here, we focus only on the
code for the model. We provide details in the comments.
class ARM(nn.Module):
    def __init__(self, net, D=2, num_vals=256):
        super(ARM, self).__init__()

        # Remember, always credit the author, even if it's you ;)
        print('ARM by JT.')

        # This is a definition of a network. See the example below (Listing 2.3).
        self.net = net
        # This is how many values a pixel can take.
        self.num_vals = num_vals
        # This is the problem dimensionality (the number of pixels).
        self.D = D

    # This function calculates the ARM output.
    def f(self, x):
        # First, we apply causal convolutions.
        h = self.net(x.unsqueeze(1))
        # In channels, we have the number of values. Therefore, we change the order of dims.
        h = h.permute(0, 2, 1)
        # We apply softmax to calculate probabilities.
        p = torch.softmax(h, 2)
        return p

    # The forward pass calculates the log-probability of an image.
    def forward(self, x, reduction='avg'):
        if reduction == 'avg':
            return -(self.log_prob(x).mean())
        elif reduction == 'sum':
            return -(self.log_prob(x).sum())
        else:
            raise ValueError('reduction could be either `avg` or `sum`.')

    # This function calculates the log-probability (log-categorical).
    # See the full code in the separate file for details.
    def log_prob(self, x):
        mu_d = self.f(x)
        log_p = log_categorical(x, mu_d, num_classes=self.num_vals, reduction='sum', dim=-1).sum(-1)

        return log_p

    # This function implements the sampling procedure.
    def sample(self, batch_size):
        # As you can notice, we first initialize a tensor with zeros.
        x_new = torch.zeros((batch_size, self.D))

        # Then, iteratively, we sample a value for each pixel.
        for d in range(self.D):
            p = self.f(x_new)
            x_new_d = torch.multinomial(p[:, d, :], num_samples=1)
            x_new[:, d] = x_new_d[:, 0]

        return x_new

Listing 2.2 Autoregressive model parameterized by causal convolutions 1D

# An example of a network. NOTICE: The first layer is A=True, while all the others are A=False.
# At this point we should know already why :)
MM = 256        # the number of channels
kernel = 7      # the kernel size (an example value)
num_vals = 16   # the number of values per pixel, as discussed in the text

net = nn.Sequential(
    CausalConv1d(in_channels=1, out_channels=MM, dilation=1, kernel_size=kernel, A=True, bias=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=MM, out_channels=MM, dilation=1, kernel_size=kernel, A=False, bias=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=MM, out_channels=MM, dilation=1, kernel_size=kernel, A=False, bias=True),
    nn.LeakyReLU(),
    CausalConv1d(in_channels=MM, out_channels=num_vals, dilation=1, kernel_size=kernel, A=False, bias=True))

Listing 2.3 An example of a network
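To connect the pieces, a bare-bones training loop could look as follows; the optimizer, the learning
rate, and the data loader (assumed to yield batches of flattened images with integer pixel values)
are illustrative choices, not prescriptions from the book.

# A minimal training sketch (illustrative).
model = ARM(net, D=64, num_vals=num_vals)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):
    for batch in train_loader:
        loss = model(batch.float())  # the forward pass returns the (average) negative log-likelihood
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()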




Fig. 2.4 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the ARM. (c) The validation curve during training

3x3 kernel

*
masked
kernel

masked
input image feature map
3x3 kernel

Fig. 2.5 An example of a masked 3×3 kernel (i.e., a causal 2D kernel): (left) A difference between
a standard kernel (all weights are used; denoted by green) and a masked kernel (some weights are
masked, i.e., not used; in red). For the masked kernel, we denoted the node (pixel) in the middle in
violet, because it is either masked (option A) or not (option B). (middle) An example of an image
(light orange nodes: zeros, light blue nodes: ones) and a masked kernel (option A). (right) The
result of applying the masked kernel to the image (with padding equal to 1)

Perfect! Now we are ready to run the full code. After training our ARM, we
should obtain results similar to those in Fig. 2.4.

2.4 Is It All? No!

First of all, we discussed one-dimensional causal convolutions that are typically
insufficient for modeling images due to their spatial dependencies in 2D (or 3D if
we consider more than 1 channel; for simplicity, we focus on the 2D case). In [10], a
CausalConv2D was proposed. The idea is similar to that discussed so far, but now
we need to ensure that the kernel will not look into future pixels along both the x-axis
and the y-axis. In Fig. 2.5, we present the difference between a standard kernel, where
all kernel weights are used, and a masked kernel with some weights zeroed-out (or
masked). Notice that in CausalConv2D we must also use option A for the first layer
(i.e., we skip the pixel in the middle) and we can pick option B for the remaining
layers. In Fig. 2.6, we present the same example as in Fig. 2.5 but using numeric
values.
Fig. 2.6 The same example as in Fig. 2.5 but with numeric values
In [12], the authors propose a further improvement on the causal convolutions.
The main idea relies on creating a block that consists of vertical and horizontal
convolutional layers. Moreover, they use a gated non-linearity, namely:

h = tanh(Wx) ⊙ σ(Vx).   (2.16)

See Figure 2 in [12] for details.
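As a small illustration of Eq. (2.16), the gated non-linearity itself is a one-liner; in a
PixelCNN-style block, wx and vx would be the outputs of two (causal) convolutions.

import torch

def gated_activation(wx, vx):
    # h = tanh(Wx) ⊙ sigmoid(Vx)
    return torch.tanh(wx) * torch.sigmoid(vx)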


Further improvements on ARMs applied to images are presented in [13]. Therein,
the authors propose to replace the categorical distribution used for modeling pixel
values with the discretized logistic distribution. Moreover, they suggest using a
mixture of discretized logistic distributions to further increase the flexibility of their
ARMs.
The introduction of the causal convolution opened multiple opportunities for
deep generative modeling and allowed obtaining state-of-the-art generations and
density estimations. It is impossible to review all papers here, we just name a few
interesting directions/applications that are worth remembering:
• An alternative ordering of pixels was proposed in [14]. Instead of using the
ordering from left to right, a “zig–zag” pattern was proposed that allows pixels
to depend on pixels previously sampled to the left and above.
• ARMs could be used as stand-alone models or they can be used in a combination
with other approaches. For instance, they can be used for modeling a prior in the
(Variational) Auto-Encoders [15].
• ARMs could be also used to model videos [16]. Factorization of sequential data
like video is very natural, and ARMs fit this scenario perfectly.
• A possible drawback of ARMs is a lack of latent representation because all
conditionals are modeled explicitly from data. To overcome this issue, [17]
proposed to use a PixelCNN-based decoder in a Variational Auto-Encoder.
• An interesting and important research direction is about proposing new architec-
tures/components of ARMs or speeding them up. As mentioned earlier, sampling
from ARMs could be slow, but there are ideas to improve on that by predictive
sampling [11, 18].
• Alternatively, we can replace the likelihood function with other similarity met-
rics, e.g., the Wasserstein distance between distributions as in quantile regression.
In the context of ARMs, quantile regression was applied in [19], requiring only
minor architectural changes, which resulted in improved quality scores.
• Transformers [20] constitute an important class of models. These models use
self-attention layers instead of causal convolutions.
• Multi-scale ARMs were proposed to scale ARMs to high-quality images, with a cost
that grows logarithmically instead of quadratically. The idea is to make local independence assumptions
[21] or to impose a partitioning on the spatial dimensions [22]. Even though these
ideas allow lowering the memory requirements, sampling remains rather slow.

References

1. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
2. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
3. Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural
networks. In ICML, 2011.
4. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. In International conference on machine learning, pages 1310–1318. PMLR,
2013.
5. Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks.
In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
6. Ronan Collobert and Jason Weston. A unified architecture for natural language processing:
Deep neural networks with multitask learning. In Proceedings of the 25th international
conference on Machine learning, pages 160–167, 2008.
7. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network
for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, pages 212–217. Association for Computational Linguistics, 2014.
8. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolu-
tional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
9. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
10. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
11. Auke Wiggers and Emiel Hoogeboom. Predictive sampling with forecasting autoregressive
models. In International Conference on Machine Learning, pages 10260–10269. PMLR, 2020.
12. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with pixelcnn decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
13. Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the
pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint
arXiv:1701.05517, 2017.
14. Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved
autoregressive generative model. In International Conference on Machine Learning, pages
864–872. PMLR, 2018.
15. Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video
compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7033–7042, 2019.