Deep Generative Modeling
Jakub M. Tomczak
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To my beloved wife Ewelina,
my parents, and brother.
Foreword
In the last decade, with the advance of deep learning, machine learning has made
enormous progress. It has completely changed entire subfields of AI such as
computer vision, speech recognition, and natural language processing. And more
fields are being disrupted as we speak, including robotics, wireless communication,
and the natural sciences.
Most advances have come from supervised learning, where the input (e.g., an
image) and the target label (e.g., a “cat”) are available for training. Deep neural
networks have become uncannily good at predicting objects in visual scenes and
translating between languages. But obtaining labels to train such models is often
time consuming, expensive, unethical, or simply impossible. That’s why the field
has come to the realization that unsupervised (or self-supervised) methods are key
to make further progress.
This is no different for human learning: when human children grow up, the
amount of information that is consumed to learn about the world is mostly
unlabeled. How often does anyone really tell you what you see or hear in the
world? We must learn the regularities of the world unsupervised, and we do this
by searching for patterns and structure in the data.
And there is lots of structure to be learned! To illustrate this, imagine that we
choose the three colors of each pixel of an image uniformly at random. The result
will be an image that with overwhelmingly large probability will look like gibberish.
The vast majority of image space is filled with images that do not look like anything
we see when we open our eyes. This means that there is a huge amount of structure
that can be discovered, and so there is a lot to learn for children!
Of course, kids do not just stare into the world. Instead, they constantly interact
with it. When children play, they test their hypotheses about the laws of physics,
sociology, and psychology. When predictions are wrong, they are surprised and
presumably update their internal models to make better predictions next time. It is
reasonable to assume that this interactive play of an embodied intelligence is key to
at least arrive at the type of human intelligence we are used to. This type of learning
has clear parallels with reinforcement learning, where machines make plans, say to
play a game of chess, observe if they win or lose, and update their models of the
world and strategies to act in them.
But it’s difficult to make robots move around in the world to test hypotheses and
actively acquire their own annotations. So, the more practical approach to learning
with lots of data is unsupervised learning. This field has gained a huge amount of
attention and has seen stunning progress recently. One only needs to look at the
kind of images of non-existent human faces that we can now generate effortlessly
to experience the uncanny sense of progress the field has made.
Unsupervised learning comes in many flavors. This book is about the kind we
call probabilistic generative modeling. The goal of this subfield is to estimate a
probabilistic model of the input data. Once we have such a model, we can generate
new samples from it (i.e., new images of faces of people that do not exist).
A second goal is to learn abstract representations of the input. This latter field is
called representation learning. The high-level representations self-organize the input
into “disentangled” concepts, which could be the objects we are familiar with, such
as cars and cats, and their relationships.
While disentangling has a clear intuitive meaning, it has proven to be a
rather slippery concept to properly define. In the 1990s, people were thinking of
statistically independent latent variables. The goal of the brain was to transform the
highly dependent pixel representation into a much more efficient and less redundant
representation of independent latent variables, which compresses the input and
makes the brain more energy and information efficient.
Learning and compression are deeply connected concepts. Learning requires
lossy compression of data because we are interested in generalization and not in
storing the data. At the level of datasets, machine learning itself is about transferring
a tiny fraction of the information present in a dataset into the parameters of a model
and forgetting everything else.
Similarly, at the level of a single datapoint, when we process for example an
input image, we are ultimately interested in the abstract high-level concepts present
in that image, such as objects and their relations, and not in detailed, pixel-level
information. With our internal models we can reason about these objects, manipulate
them in our head and imagine possible counterfactual futures for them. Intelligence
is about squeezing out the relevant predictive information from the correlated soup of
pixel-level information that hits our senses and representing that information in a
useful manner that facilitates mental manipulation.
But the objects that we are familiar with in our everyday lives are not really
all that independent. A cat that is chasing a bird is not statistically independent
of it. And so, people also made attempts to define disentangling in terms of
(subspaces of) variables that exhibit certain simple transformation properties when
we transform the input (a.k.a. equivariant representations), or as variables that one
can independently control in order to manipulate the world around us, or as causal
variables that are activating certain independent mechanisms that describe the world,
and so on.
The simplest way to train a model without labels is to learn a probabilistic
generative model (or density) of the input data. There are a number of techniques
able function under this barrage of fake news? One thing is certain, this field is one
of the hottest in town, and this book is an excellent introduction to start engaging
with it. But everyone should be keenly aware that mastering this technology comes
with new responsibilities towards society. Let’s progress the field with caution.
Preface

We live in a world where Artificial Intelligence (AI) has become a widely used
term: there are movies about AI, journalists writing about AI, and CEOs talking
about AI. Most importantly, there is AI in our daily lives, turning our phones,
TVs, fridges, and vacuum cleaners into smartphones, smart TVs, smart fridges,
and vacuum robots. We use AI; however, we still do not fully understand what
"AI" is and how to formulate it, even though AI was established as a separate
field in the 1950s. Since then, many researchers have pursued the holy grail of creating
an artificial intelligence system that is capable of mimicking, understanding, and
aiding humans through processing data and knowledge. In many cases, we have
succeeded in outperforming human beings on particular tasks in terms of speed and
accuracy! Current AI methods do not necessarily imitate human processing (neither
biologically nor cognitively) but rather are aimed at making a quick and accurate
decision, such as navigating while cleaning a room or enhancing the quality of a displayed
movie. In such tasks, probability theory is key since limited or poor-quality data
or the intrinsic behavior of a system forces us to quantify uncertainty. Moreover, deep
learning has become a leading learning paradigm that allows learning hierarchical
data representations. It draws its motivation from biological neural networks;
however, the correspondence between deep learning and biological neurons is rather
far-fetched. Nevertheless, deep learning has brought AI to the next level, achieving
state-of-the-art performance in many decision-making tasks. The next step seems
to be a combination of these two paradigms, probability theory and deep learning,
to obtain powerful AI systems that are able to quantify their uncertainties about
environments they operate in.
What Is This Book About Then? This book tackles the problem of formulating AI
systems by combining probabilistic modeling and deep learning. Moreover, it goes
beyond the typical predictive modeling and brings together supervised learning and
unsupervised learning. The resulting paradigm, called deep generative modeling,
utilizes the generative perspective on perceiving the surrounding world. It assumes
that each phenomenon is driven by an underlying generative process that defines
a joint distribution over random variables and their stochastic interactions, i.e.,
how events occur and in what order. The adjective “deep” comes from the fact
that the distribution is parameterized using deep neural networks. There are two
distinct traits of deep generative modeling. First, the application of deep neural
networks allows rich and flexible parameterization of distributions. Second, the
principled manner of modeling stochastic dependencies using probability theory
ensures rigorous formulation and prevents potential flaws in reasoning. Moreover,
probability theory provides a unified framework where the likelihood function plays
a crucial role in quantifying uncertainty and defining objective functions.
Who Is This Book for Then? The book is designed to appeal to curious students,
engineers, and researchers with a modest mathematical background in undergraduate
calculus, linear algebra, probability theory, and the basics of machine
learning, deep learning, and programming in Python and PyTorch (or other deep
learning libraries). It should appeal to students and researchers from a variety of
backgrounds, including computer science, engineering, data science, physics, and
bioinformatics, who wish to become familiar with deep generative modeling. In order
to engage the reader, the book introduces fundamental concepts with specific
examples and code snippets. The full code accompanying the book is available
online at:
https://github.com/jmtomczak/intro_dgm
The ultimate aim of the book is to outline the most important techniques in deep
generative modeling and, eventually, enable readers to formulate new models and
implement them.
The Structure of the Book The book consists of eight chapters that could be read
separately and in (almost) any order. Chapter 1 introduces the topic and highlights
important classes of deep generative models and general concepts. Chapters 2, 3
and 4 discuss modeling of marginal distributions, while Chaps. 5 and 6 outline the
material on modeling of joint distributions. Chapter 7 presents a class of latent
variable models that are not learned through the likelihood-based objective. The
last chapter, Chap. 8, indicates how deep generative modeling could be used in the
fast-growing field of neural compression. All chapters are accompanied by code
snippets to help understand how the presented methods could be implemented. The
references are generally intended to indicate the original source of the presented material
and to provide further reading. Deep generative modeling is a broad field of study,
and including all fantastic ideas is nearly impossible. Therefore, I would like to
apologize for missing any paper. If anyone feels left out, it was not intentional on
my side.
In the end, I would like to thank my wife, Ewelina, for her help and presence that
gave me the strength to carry on with writing this book. I am also grateful to my
parents for always supporting me, and my brother who spent a lot of time checking
the first version of the book and the code.
Acknowledgements

This book, like many other books, would not have been possible without the contribution
and help from many people. During my career, I was extremely privileged
and lucky to work on deep generative modeling with an amazing set of people
whom I would like to thank here (in alphabetical order): Tameem Adel, Rianne
van den Berg, Taco Cohen, Tim Davidson, Nicola De Cao, Luka Falorsi, Eliseo
Ferrante, Patrick Forré, Ioannis Gatopoulos, Efstratios Gavves, Adam Gonczarek,
Amirhossein Habibian, Leonard Hasenclever, Emiel Hoogeboom, Maximilian Ilse,
Thomas Kipf, Anna Kuzina, Christos Louizos, Yura Perugachi-Diaz, Ties van
Rozendaal, Victor Satorras, Jerzy Świątek, Max Welling, Szymon Zaręba, and
Maciej Zięba.
I would like to thank other colleagues with whom I worked on AI and had plenty
of fascinating discussions (in alphabetical order): Davide Abati, Ilze Auzina, Babak
Ehteshami Bejnordi, Erik Bekkers, Tijmen Blankevoort, Matteo De Carlo, Fuda van
Diggelen, A.E. Eiben, Ali El Hassouni, Arkadiusz Gertych, Russ Greiner, Mark
Hoogendoorn, Emile van Krieken, Gongjin Lan, Falko Lavitt, Romain Lepert, Jie
Luo, ChangYong Oh, Siamak Ravanbakhsh, Diederik Roijers, David W. Romero,
Annette ten Teije, Auke Wiggers, and Alessandro Zonta.
I am especially thankful to my brother, Kasper, who patiently read all sections,
and ran and checked every single line of code in this book. You can’t even imagine
my gratitude for that!
I would like to thank my wife, Ewelina, for supporting me all the time and giving
me the strength to finish this book. Without her help and understanding, it would
be nearly impossible to accomplish this project. I would like to also express my
gratitude to my parents, Elżbieta and Ryszard, for their support at different stages
of my life because without them I would never be who I am now.
Contents
3.1.5 Code
3.1.6 Is It All? Really?
3.1.7 ResNet Flows and DenseNet Flows
3.2 Flows for Discrete Random Variables
3.2.1 Introduction
3.2.2 Flows in R or Maybe Rather in Z?
3.2.3 Integer Discrete Flows
3.2.4 Code
3.2.5 What's Next?
References
4 Latent Variable Models
4.1 Introduction
4.2 Probabilistic Principal Component Analysis
4.3 Variational Auto-Encoders: Variational Inference for Non-linear Latent Variable Models
4.3.1 The Model and the Objective
4.3.2 A Different Perspective on the ELBO
4.3.3 Components of VAEs
4.3.3.1 Parameterization of Distributions
4.3.3.2 Reparameterization Trick
4.3.4 VAE in Action!
4.3.5 Code
4.3.6 Typical Issues with VAEs
4.3.7 There Is More!
4.4 Improving Variational Auto-Encoders
4.4.1 Priors
4.4.1.1 Standard Gaussian
4.4.1.2 Mixture of Gaussians
4.4.1.3 VampPrior: Variational Mixture of Posterior Prior
4.4.1.4 GTM: Generative Topographic Mapping
4.4.1.5 GTM-VampPrior
4.4.1.6 Flow-Based Prior
4.4.1.7 Remarks
4.4.2 Variational Posteriors
4.4.2.1 Variational Posteriors with Householder Flows [20]
4.4.2.2 Variational Posteriors with Sylvester Flows [16]
4.4.2.3 Hyperspherical Latent Space
4.5 Hierarchical Latent Variable Models
4.5.1 Introduction
4.5.2 Hierarchical VAEs
4.5.2.1 Two-Level VAEs
Index
Chapter 1
Why Deep Generative Modeling?
Before we start thinking about (deep) generative modeling, let us consider a simple
example. Imagine we have trained a deep neural network that classifies images (x ∈
Z^D) of animals (y ∈ Y, and Y = {cat, dog, horse}). Further, let us assume that
this neural network is trained really well so that it always classifies a proper class
with a high probability p(y|x). So far so good, right? A problem could occur,
though. As pointed out in [1], adding noise to images could result in a completely
false classification. An example of such a situation is presented in Fig. 1.1, where
adding noise could shift predicted probabilities of labels; however, the image is
barely changed (at least to us, human beings).
This example indicates that neural networks that are used to parameterize the
conditional distribution p(y|x) seem to lack semantic understanding of images.
Further, we even hypothesize that learning discriminative models is not enough for
proper decision making and creating AI. A machine learning system cannot rely on
learning how to make a decision without understanding the reality and being able to
express uncertainty about the surrounding world. How can we trust such a system
if even a small amount of noise could change its internal beliefs and also shift its
certainty from one decision to the other? How can we communicate with such a
system if it is unable to properly express its opinion about whether its surroundings
are new or not?
To motivate the importance of the concepts like uncertainty and understanding
in decision making, let us consider a system that classifies objects, but this time
into two classes: orange and blue. We assume we have some two-dimensional data
(Fig. 1.2, left) and a new datapoint to be classified (a black cross in Fig. 1.2). We
can make decisions using two approaches. First, a classifier could be formulated
explicitly by modeling the conditional distribution p(y|x) (Fig. 1.2, middle). Sec-
ond, we can consider a joint distribution p(x, y) that could be further decomposed
as p(x, y) = p(y|x) p(x) (Fig. 1.2, right).
Fig. 1.1 An example of adding noise to an almost perfectly classified image that results in a shift
of predicted label
Fig. 1.2 An example of data (left) and two approaches to decision making: (middle) a discrimi-
native approach and (right) a generative approach
After training a model using the discriminative approach, namely, the conditional
distribution p(y|x), we obtain a clear decision boundary. Then, we see that the black
cross is farther away from the orange region; thus, the classifier assigns a higher
probability to the blue label. As a result, the classifier is certain about the decision!
On the other hand, if we additionally fit a distribution p(x), we observe that the
black cross is not only farther away from the decision boundary, but it is also distant
from the region where the blue datapoints lie. In other words, the black point is far away
from the region of high probability mass. As a result, the (marginal) probability
of the black cross p(x = black cross) is low, and the joint distribution p(x =
black cross, y = blue) will be low as well and, thus, the decision is uncertain!
This simple example clearly indicates that if we want to build AI systems that
make reliable decisions and can communicate with us, human beings, they must
understand the environment first. For this purpose, they cannot simply learn how
to make decisions, but they should be able to quantify their beliefs about their
surrounding using the language of probability [2, 3]. In order to do that, we claim
that estimating the distribution over objects, p(x), is crucial.
From the generative perspective, knowing the distribution p(x) is essential
because:
• It could be used to assess whether a given object has been observed in the past or
not.
• It could help to properly weight the decision.
• It could be used to assess uncertainty about the environment.
1.2 Where Can We Use (Deep) Generative Modeling?
With the development of neural networks and the increase in computational power,
deep generative modeling has become one of the leading directions in AI. Its
applications vary from typical modalities considered in machine learning, i.e., text
analysis (e.g., [5]), image analysis (e.g., [6]), audio analysis (e.g., [7]), to problems
in active learning (e.g., [8]), reinforcement learning (e.g., [9]), graph analysis (e.g.,
[10]), and medical imaging (e.g., [11]). In Fig. 1.3, we present graphically potential
applications of deep generative modeling.
Fig. 1.3 Potential applications of deep generative modeling, including learning from labeled and unlabeled data, text, images, audio, graphs, active learning (querying), reinforcement learning, and medical imaging

In some applications, it is indeed important to generate (synthesize) objects or
modify features of objects to create new ones (e.g., an app that turns a young person
into an old one). However, in others, like active learning, it is important to ask for
uncertain objects, i.e., objects with low p(x), that should be labeled by an oracle. In
reinforcement learning, on the other hand, generating the next most likely situations
(states) is crucial for an agent to take actions. For medical applications, explaining
a decision, e.g., in terms of the probability of the label and the object, is definitely
more informative to a human doctor than simply assisting with a diagnosis label.
If an AI system were able to indicate how certain it is, and also quantify
whether the object is suspicious (i.e., has low p(x)) or not, then it might be used as
an independent specialist that outlines its own opinion.
These examples clearly indicate that many fields, if not all, could highly benefit
from (deep) generative modeling. Obviously, there are many mechanisms that AI
systems should be equipped with. However, we claim that the generative modeling
capability is definitely one of the most important ones, as outlined in the above-
mentioned cases.
1.3 How to Formulate (Deep) Generative Modeling?

At this point, after highlighting the importance and wide applicability of (deep)
generative modeling, we should ask ourselves how to formulate (deep) generative
models. In other words, how can we express p(x), which we have already mentioned
multiple times?
We can divide (deep) generative modeling into four main groups (see Fig. 1.4):
• Autoregressive generative models (ARM)
• Flow-based models
• Latent variable models
• Energy-based models
We use deep in brackets because most of what we have discussed so far could be
modeled without using neural networks. However, neural networks are flexible and
powerful and, therefore, they are widely used to parameterize generative models.
From now on, we focus entirely on deep generative models.
As a side note, please treat this taxonomy as a guideline that helps us to navigate
through this book, not something written in stone. Personally, I am not a big fan of
spending too much time on categorizing and labeling science because it very often
results in antagonizing and gatekeeping. Anyway, there is also a group of models
based on the score matching principle [12–14] that do not necessarily fit our simple
taxonomy. However, as pointed out in [14], these models share a lot of similarities
with latent variable models (if we treat consecutive steps of a stochastic process as
latent variables) and, thus, we treat them as such.
Fig. 1.4 A taxonomy of deep generative models: autoregressive models (e.g., PixelCNN), flow-based models (e.g., RealNVP), latent variable models, and energy-based models
The first group of deep generative models utilizes the idea of autoregressive
modeling (ARM). In other words, the distribution over x is represented in an
autoregressive manner:
$p(\mathbf{x}) = p(x_0) \prod_{i=1}^{D} p(x_i \mid \mathbf{x}_{<i})$.  (1.1)
Integer discrete flows propose to use affine coupling layers with rounding
operators to ensure the integer-valued output [27]. A generalization of the affine
coupling layer was further investigated in [28].
All generative models that take advantage of the change of variables formula
are referred to as flow-based models or flows for short. We will discuss flows in
Chap. 3.
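To make the coupling-layer idea concrete, here is a minimal, hypothetical sketch of a real-valued (RealNVP-style) affine coupling layer in PyTorch; it is not the integer-valued variant mentioned above, and the network sizes and names (scale_net, translate_net) are assumptions made purely for illustration (flows are treated properly in Chap. 3).

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    # Split x into (x_a, x_b); keep x_a and transform x_b with a scale and a
    # translation predicted from x_a. The transformation is easy to invert and
    # its Jacobian determinant is cheap to compute.
    def __init__(self, dim, hidden=64):
        super().__init__()
        assert dim % 2 == 0, "dim is assumed to be even"
        self.scale_net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                       nn.Linear(hidden, dim // 2), nn.Tanh())
        self.translate_net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                           nn.Linear(hidden, dim // 2))

    def forward(self, x):
        x_a, x_b = x.chunk(2, dim=1)
        s, t = self.scale_net(x_a), self.translate_net(x_a)
        y_b = x_b * torch.exp(s) + t        # element-wise, invertible transformation
        log_det_jac = s.sum(dim=1)          # log|det J| used in the change of variables formula
        return torch.cat([x_a, y_b], dim=1), log_det_jac

    def inverse(self, y):
        y_a, y_b = y.chunk(2, dim=1)
        s, t = self.scale_net(y_a), self.translate_net(y_a)
        return torch.cat([y_a, (y_b - t) * torch.exp(-s)], dim=1)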
z ∼ p(z)
x ∼ p(x|z).
In other words, the latent variables correspond to hidden factors in data, and the
conditional distribution p(x|z) could be treated as a generator.
The most widely known latent variable model is the probabilistic Principal
Component Analysis (pPCA) [29] where p(z) and p(x|z) are Gaussian distribu-
tions, and the dependency between z and x is linear.
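As a rough illustration of this two-step generative process, the sketch below samples from a pPCA-like model; the dimensionalities and the parameters W, b, and sigma are arbitrary assumptions made for illustration.

import torch

# Hypothetical pPCA-style generative process: z ~ N(0, I), x | z ~ N(Wz + b, sigma^2 I)
latent_dim, data_dim, sigma = 2, 5, 0.1
W = torch.randn(data_dim, latent_dim)    # the linear dependency between z and x
b = torch.zeros(data_dim)

z = torch.randn(16, latent_dim)                        # z ~ p(z)
x = z @ W.T + b + sigma * torch.randn(16, data_dim)    # x ~ p(x|z)
print(x.shape)   # torch.Size([16, 5])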
A non-linear extension of the pPCA with arbitrary distributions is the Varia-
tional Auto-Encoder (VAE) framework [30, 31]. To make the inference tractable,
variational inference is utilized to approximate the posterior p(z|x), and neural
networks are used to parameterize the distributions. Since the publication of the
seminal papers by Kingma and Welling [30], Rezende et al. [31], there were multiple
extensions of this framework, including working on more powerful variational
posteriors [19, 21, 22, 32], priors [33, 34], and decoders [35]. Interesting directions
include considering different topologies of the latent space, e.g., the hyperspherical
latent space [36]. In VAEs and the pPCA all distributions must be defined upfront
and, therefore, they are called prescribed models. We will pay special attention to
this group of deep generative models in Chap. 4.
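To give a flavor of how such a prescribed model is trained, here is a minimal, hypothetical sketch of evaluating the ELBO with a Gaussian variational posterior, the reparameterization trick, and a Bernoulli decoder; the architecture and all names are illustrative assumptions (the actual formulation is developed in Chap. 4).

import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))

    def elbo(self, x):
        # x is assumed to be in [0, 1] (e.g., binarized images)
        mu, log_var = self.enc(x).chunk(2, dim=1)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
        logits = self.dec(z)
        # reconstruction term: E_q[ln p(x|z)] with a Bernoulli decoder
        rec = -nn.functional.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(1)
        # analytic KL(q(z|x) || N(0, I))
        kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(1)
        return (rec - kl).mean()   # the ELBO, a lower bound on ln p(x)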
So far, ARMs, flows, the pPCA, and VAEs are probabilistic models with
the objective function being the log-likelihood function that is closely related
to using the Kullback–Leibler divergence between the data distribution and the
model distribution. A different approach utilizes an adversarial loss in which a
discriminator D(·) determines a difference between real data and synthetic data
provided by a generator in the implicit form, namely, p(x|z) = δ(x − G(z)),
where δ(·) is the Dirac delta. This group of models is called implicit models, and
Generative Adversarial Networks (GANs) [6] became one of the first successful
deep generative models for synthesizing realistic-looking objects (e.g., images). See
Chap. 7 for more details.
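For contrast with the likelihood-based objectives above, the following is a minimal, hypothetical sketch of the adversarial losses; the generator and discriminator are placeholder networks, and the non-saturating generator loss shown here is one common choice rather than necessarily the exact objective of [6].

import torch
import torch.nn as nn

z_dim, x_dim = 16, 784
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim), nn.Tanh())   # generator
D = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 1))                   # discriminator
bce = nn.BCEWithLogitsLoss()

def gan_losses(x_real):
    z = torch.randn(x_real.size(0), z_dim)
    x_fake = G(z)                                   # the implicit model: x = G(z)
    ones = torch.ones(x_real.size(0), 1)
    zeros = torch.zeros(x_real.size(0), 1)
    # discriminator: distinguish real data from synthetic data
    d_loss = bce(D(x_real), ones) + bce(D(x_fake.detach()), zeros)
    # generator: fool the discriminator (non-saturating loss)
    g_loss = bce(D(x_fake), ones)
    return d_loss, g_loss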
1.3.5 Overview
In Table 1.1, we compare all four groups of models (with a distinction between
implicit latent variable models and prescribed latent variable models) using arbitrary
criteria like:
• Whether training is typically stable
• Whether it is possible to calculate the likelihood function
• Whether one can use a model for lossy or lossless compression
• Whether a model could be used for representation learning
All likelihood-based models (i.e., ARMs, flows, EBMs, and prescribed models
like VAEs) can be trained in a stable manner, while implicit models like GANs
suffer from instabilities. In the case of the non-linear prescribed models like VAEs,
we must remember that the likelihood function cannot be exactly calculated, and
only a lower-bound could be provided. Similarly, EBMs require calculating the
partition function, which is an analytically intractable problem. As a result, we can get
the unnormalized probability or an approximation at best. ARMs constitute one
of the best likelihood-based models; however, their sampling process is extremely
slow due to the autoregressive manner of generating new content. EBMs require
running a Monte Carlo method to receive a sample. Since we operate on high-
dimensional objects, this is a great obstacle for using EBMs widely in practice. All
other approaches are relatively fast. In the case of compression, VAEs are models
that allow us to use a bottleneck (the latent space). On the other hand, ARMs and
flows could be used for lossless compression since they are density estimators
and provide the exact likelihood value. Implicit models cannot be directly used
for compression; however, recent works use GANs to improve image compression
[44]. Flows, prescribed models, and EBMs (if they use latents) could be used for
representation learning, namely, learning a set of random variables that summarize
data in some way and/or disentangle factors in data. The question about what is a
good representation is a different story, and we refer the curious reader to the literature,
e.g., [45].
References
1. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International
Conference on Learning Representations, ICLR 2014, 2014.
2. Christopher M Bishop. Model-based machine learning. Philosophical Transactions of the
Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20120222,
2013.
3. Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature,
521(7553):452–459, 2015.
4. Julia A Lasserre, Christopher M Bishop, and Thomas P Minka. Principled hybrids of generative
and discriminative models. In 2006 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’06), volume 1, pages 87–94. IEEE, 2006.
5. Samuel Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy
Bengio. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL
Conference on Computational Natural Language Learning, pages 10–21, 2016.
6. Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv preprint
arXiv:1406.2661, 2014.
7. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
8. Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning.
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5972–
5981, 2019.
9. David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
10. Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs
using variational autoencoders. In International Conference on Artificial Neural Networks,
pages 412–422. Springer, 2018.
11. Maximilian Ilse, Jakub M Tomczak, Christos Louizos, and Max Welling. DIVA: Domain
invariant variational autoencoders. In Medical Imaging with Deep Learning, pages 322–348.
PMLR, 2020.
12. Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score
matching. Journal of Machine Learning Research, 6(4), 2005.
13. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data
distribution. arXiv preprint arXiv:1907.05600, 2019.
14. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and
Ben Poole. Score-based generative modeling through stochastic differential equations. In
International Conference on Learning Representations, 2020.
15. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
16. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
17. Oren Rippel and Ryan Prescott Adams. High-dimensional probability estimation with deep
density models. arXiv preprint arXiv:1302.5125, 2013.
18. Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components
estimation. arXiv preprint arXiv:1410.8516, 2014.
19. Jakub M Tomczak and Max Welling. Improving variational auto-encoders using householder
flow. arXiv preprint arXiv:1611.09630, 2016.
20. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Inter-
national Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
21. Rianne Van Den Berg, Leonard Hasenclever, Jakub M Tomczak, and Max Welling. Sylvester
normalizing flows for variational inference. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 393–402. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
22. Emiel Hoogeboom, Victor Garcia Satorras, Jakub M Tomczak, and Max Welling. The
convolution exponential and generalized Sylvester flows. arXiv preprint arXiv:2006.01910,
2020.
23. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP.
arXiv preprint arXiv:1605.08803, 2016.
24. Jens Behrmann, Will Grathwohl, Ricky TQ Chen, David Duvenaud, and Jörn-Henrik Jacobsen.
Invertible residual networks. In International Conference on Machine Learning, pages 573–
582. PMLR, 2019.
25. Ricky TQ Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows
for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
26. Yura Perugachi-Diaz, Jakub M Tomczak, and Sandjai Bhulai. Invertible DenseNets with
Concatenated LipSwish. Advances in Neural Information Processing Systems, 2021.
27. Emiel Hoogeboom, Jorn WT Peters, Rianne van den Berg, and Max Welling. Integer discrete
flows and lossless compression. arXiv preprint arXiv:1905.07376, 2019.
28. Jakub M Tomczak. General invertible transformations for flow-based generative modeling.
INNF+, 2021.
29. Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622,
1999.
30. Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
31. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation
and approximate inference in deep generative models. In International conference on machine
learning, pages 1278–1286. PMLR, 2014.
32. Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max
Welling. Improved variational inference with inverse autoregressive flow. Advances in Neural
Information Processing Systems, 29:4743–4751, 2016.
33. Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schul-
man, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint
arXiv:1611.02731, 2016.
34. Jakub Tomczak and Max Welling. VAE with a VampPrior. In International Conference on
Artificial Intelligence and Statistics, pages 1214–1223. PMLR, 2018.
35. Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David
Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. arXiv
preprint arXiv:1611.05013, 2016.
36. Tim R Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M Tomczak.
Hyperspherical variational auto-encoders. In 34th Conference on Uncertainty in Artificial
Intelligence 2018, UAI 2018, pages 856–865. Association For Uncertainty in Artificial
Intelligence (AUAI), 2018.
37. Edwin T Jaynes. Probability theory: The logic of science. Cambridge university press, 2003.
38. Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-
based learning. Predicting structured data, 1(0), 2006.
39. David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
40. Geoffrey E Hinton, Terrence J Sejnowski, et al. Learning and relearning in Boltzmann
machines. Parallel distributed processing: Explorations in the microstructure of cognition,
1(282-317):2, 1986.
41. Geoffrey E Hinton. A practical guide to training restricted Boltzmann machines. In Neural
networks: Tricks of the trade, pages 599–619. Springer, 2012.
42. Hugo Larochelle and Yoshua Bengio. Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on Machine learning, pages
536–543, 2008.
43. Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad
Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should
treat it like one. In International Conference on Learning Representations, 2019.
44. Fabian Mentzer, George D Toderici, Michael Tschannen, and Eirikur Agustsson. High-fidelity
generative image compression. Advances in Neural Information Processing Systems, 33, 2020.
45. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review
and new perspectives. IEEE transactions on pattern analysis and machine intelligence,
35(8):1798–1828, 2013.
Chapter 2
Autoregressive Models
2.1 Introduction
Before we start discussing how we can model the distribution p(x), we refresh our
memory about the core rules of probability theory, namely, the sum rule and the
product rule. Let us introduce two random variables x and y. Their joint distribution
is p(x, y). The sum rule allows us to obtain a marginal distribution by summing (or integrating)
the joint distribution over one of the variables, namely, p(x) = Σ_y p(x, y), and the product rule
allows us to factorize the joint distribution in two manners, namely:

p(x, y) = p(y | x) p(x) = p(x | y) p(y).

These two rules will play a crucial role in probability theory and statistics and, in
particular, in formulating deep generative models.
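As a quick numerical sanity check of these two rules (using a made-up joint distribution over two binary random variables):

import torch

# a made-up joint distribution p(x, y) for x, y in {0, 1}; rows index x, columns index y
p_xy = torch.tensor([[0.1, 0.3],
                     [0.2, 0.4]])

p_x = p_xy.sum(dim=1)                   # sum rule: p(x) = sum_y p(x, y)
p_y_given_x = p_xy / p_x[:, None]       # conditional: p(y|x) = p(x, y) / p(x)

# product rule: p(x, y) = p(y|x) p(x)
assert torch.allclose(p_y_given_x * p_x[:, None], p_xy)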
Now, let us consider a high-dimensional random variable x ∈ X^D where X =
{0, 1, . . . , 255} (e.g., pixel values) or X = R. Our goal is to model p(x). Before we
jump into thinking of specific parameterization, let us first apply the product rule to
express the joint distribution in a different manner:
$p(\mathbf{x}) = p(x_1) \prod_{d=2}^{D} p(x_d \mid \mathbf{x}_{<d})$.  (2.4)
As mentioned earlier, we aim to model the joint distribution p(x) using
conditional distributions. A potential solution to the issue of using D separate models
is utilizing a single, shared model for the conditional distribution. However, we
need to make some assumptions to use such a shared model. In other words, we
look for an autoregressive model (ARM). In the next subsection, we outline ARMs
parameterized with various neural networks. After all, we are talking about deep
generative models, so using a neural network is not surprising, is it?
2.2 Autoregressive Models Parameterized by Neural Networks

For instance, we can assume that each variable depends only on the two previous ones:

$p(\mathbf{x}) = p(x_1)\, p(x_2 \mid x_1) \prod_{d=3}^{D} p(x_d \mid x_{d-1}, x_{d-2})$.  (2.5)
Then, we can use a small neural network, e.g., a multi-layer perceptron (MLP),
to predict the distribution of x_d. If X = {0, 1, . . . , 255}, the MLP takes x_{d-1}, x_{d-2}
and outputs the probabilities θ_d of the categorical distribution of x_d. The MLP could,
for instance, take the following form (a sketch is given below):
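Below is a minimal sketch of such a shared MLP in PyTorch; the hidden size (300) and the way the two previous pixels are fed in (as a pair of normalized values) are assumptions made purely for illustration.

import torch
import torch.nn as nn

num_vals = 256   # x_d takes values in {0, 1, ..., 255}

# A shared MLP mapping (x_{d-1}, x_{d-2}) to the categorical probabilities theta_d
mlp = nn.Sequential(
    nn.Linear(2, 300),
    nn.ReLU(),
    nn.Linear(300, num_vals),
    nn.Softmax(dim=1),
)

x_prev = torch.rand(8, 2)    # a batch of (x_{d-1}, x_{d-2}) pairs (here: normalized pixel values)
theta_d = mlp(x_prev)        # probabilities of the categorical distribution of x_d
print(theta_d.sum(dim=1))    # each row sums to 1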
Fig. 2.1 An example of applying a shared MLP depending on the last two inputs. Inputs are denoted
by blue nodes (bottom), intermediate representations are denoted by orange nodes (middle), and
output probabilities are denoted by green nodes (top). Notice that a probability θd is not dependent
on xd
Fig. 2.2 An example of applying an RNN depending on the last two inputs. Inputs are denoted by blue
nodes (bottom), intermediate representations are denoted by orange nodes (middle), and output
probabilities are denoted by green nodes (top). Notice that compared to the approach with a shared
MLP, there is an additional dependency between intermediate nodes hd
• If they are badly conditioned (i.e., the eigenvalues of a weight matrix are
larger or smaller than 1), then they suffer from exploding or vanishing gradients,
respectively, which hinders learning long-range dependencies.
There exist methods that help in training RNNs, like gradient clipping or, more generally,
gradient regularization [4] or orthogonal weights [5]. However, here we are not
interested in looking into rather specific solutions to new problems. We seek a
different parameterization that could solve our original problem, namely, modeling
long-range dependencies in an ARM.
In [6, 7] it was noticed that convolutional neural networks (CNNs) could be used
instead of RNNs to model long-range dependencies. To be more precise, one-
dimensional convolutional layers (Conv1D) could be stacked together to process
sequential data. The advantages of such an approach are the following:
• Kernels are shared (i.e., an efficient parameterization).
• The processing is done in parallel, which greatly speeds up computations.
• By stacking more layers, the effective kernel size grows with the network depth.
These three traits seem to place Conv1D-based neural networks as a perfect solution
to our problem. However, can we indeed use them straight away?
A Conv1D can be applied to calculate embeddings like in [7], but it cannot be
used for autoregressive models. Why? Because we need convolutions to be causal
[8]. Causal in this context means that a Conv1D layer depends on the last k
inputs but not the current one (option A) or on the last k inputs including the current one
(option B). In other words, we must "cut" the kernel in half and forbid it from looking
at the next variables (i.e., from looking into the future). Importantly, option A is required
in the first layer because the final output (i.e., the probabilities θ_d) cannot depend on x_d.
Additionally, if we are concerned about the effective kernel size, we can use a dilation
larger than 1.

Fig. 2.3 An example of applying causal convolutions. The kernel size is 2, but by applying dilation
in higher layers, a much larger input could be processed (red edges); thus, a larger memory is
utilized. Notice that the first layer must be option A to ensure proper processing
In Fig. 2.3 we present an example of a neural network consisting of 3 causal
Conv1D layers. The first CausalConv1D is of type A, i.e., it takes into
account only the last k inputs without the current one. Then, in the next two layers,
we use CausalConv1D (option B) with dilations 2 and 3. Typically, the dilation
values are 1, 2, 4, and 8 [9]; however, taking 2 and 4 would
not nicely fit in a figure. We highlight in red all connections that go from the output
layer to the input layer. As we can notice, stacking CausalConv1D layers with a
dilation larger than 1 allows us to learn long-range dependencies (in this example,
by looking at the last 7 inputs).
An example of an implementation of the CausalConv1D layer is presented below. If
you are still confused about option A and option B, please analyze the code
snippet step-by-step.
class CausalConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, dilation, A=False, **kwargs):
        super(CausalConv1d, self).__init__()

        # attributes:
        self.kernel_size = kernel_size
        self.dilation = dilation
        self.A = A  # whether option A (A=True) or B (A=False)
        self.padding = (kernel_size - 1) * dilation + A * 1

        # module: a standard Conv1d; the causal padding is applied manually in forward
        self.conv1d = nn.Conv1d(in_channels, out_channels, kernel_size,
                                stride=1, padding=0, dilation=dilation, **kwargs)

    def forward(self, x):
        # pad only on the left so that the output never looks into the future
        x = nn.functional.pad(x, (self.padding, 0))
        out = self.conv1d(x)
        if self.A:
            return out[:, :, :-1]   # option A: additionally drop the dependence on the current input
        else:
            return out
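As a quick usage sketch (assuming the implementation above), we can stack CausalConv1d layers with dilations 1, 2, and 3, as in Fig. 2.3, and check numerically that an output position never depends on the current or future inputs:

import torch
import torch.nn as nn

net = nn.Sequential(
    CausalConv1d(1, 8, kernel_size=2, dilation=1, A=True),   # the first layer must be option A
    nn.LeakyReLU(),
    CausalConv1d(8, 8, kernel_size=2, dilation=2, A=False),
    nn.LeakyReLU(),
    CausalConv1d(8, 16, kernel_size=2, dilation=3, A=False),
)

x = torch.randn(1, 1, 10, requires_grad=True)
y = net(x)
y[0, :, 5].sum().backward()
# The gradient w.r.t. the input is zero at positions 5, 6, ..., 9 (the current and
# future inputs) and, in general, non-zero at positions 0, ..., 4 (the past inputs).
print(x.grad[0, 0])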
Alright, let us talk more about the details and how to implement an ARM. Here, and in
the whole book, we focus on images, e.g., x ∈ {0, 1, . . . , 15}^64. Since images are
represented by integers, we will use the categorical distribution to represent them (in
the next chapters, we will comment on the choice of distribution for images and present
some alternatives). We model p(x) using an ARM parameterized by CausalConv1D
layers. As a result, each conditional is a categorical distribution with probabilities
θ_d(x_{<d}) returned by the causal network, and the training objective is the
log-likelihood of the data:

$\ln p(\mathcal{D}) = \sum_{n} \sum_{d} \sum_{l=1}^{L} [x_{n,d} = l] \ln \theta_{d,l}(\mathbf{x}_{n,<d})$,  (2.15)

where the sums run over the training examples, the D dimensions, and the L possible values.
2.3.1 Code
Uff... Alright, let's take a look at some code. The full code is available under the
following link: https://github.com/jmtomczak/intro_dgm. Here, we focus only on the
code for the model; a compact version is sketched below, and the sampling routine
(which generates x_new dimension-by-dimension) is available in the online code.
We provide details in the comments.

class ARM(nn.Module):
    def __init__(self, net, D=2, num_vals=256):
        super(ARM, self).__init__()
        self.net = net              # a causal network (e.g., stacked CausalConv1d layers)
        self.D = D                  # dimensionality of x
        self.num_vals = num_vals    # number of values each x_d can take

    def f(self, x):
        # categorical probabilities theta_d(x_{<d}) for every position d
        h = self.net(x.float().unsqueeze(1))     # (batch, num_vals, D)
        return torch.softmax(h, dim=1)

    def forward(self, x):
        # log-likelihood: ln p(x) = sum_d ln theta_{d, x_d}(x_{<d})
        theta = self.f(x).permute(0, 2, 1)       # (batch, D, num_vals)
        log_p = torch.distributions.Categorical(probs=theta).log_prob(x).sum(-1)
        return log_p
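A brief, hypothetical sketch of how such a model could be trained is given below; the network architecture, optimizer settings, and batch shapes are illustrative assumptions, not the exact setup used in the book.

import torch
import torch.nn as nn

# A causal network producing num_vals = 16 logits per position, as assumed above
net = nn.Sequential(
    CausalConv1d(1, 64, kernel_size=7, dilation=1, A=True),    # first layer: option A
    nn.LeakyReLU(),
    CausalConv1d(64, 64, kernel_size=7, dilation=1, A=False),
    nn.LeakyReLU(),
    CausalConv1d(64, 16, kernel_size=7, dilation=1, A=False),
)
model = ARM(net, D=64, num_vals=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randint(0, 16, (32, 64))   # a batch of 32 "images" with 64 pixels in {0, ..., 15}
loss = -model(x).mean()              # negative log-likelihood
optimizer.zero_grad()
loss.backward()
optimizer.step()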
Fig. 2.4 An example of outcomes after the training: (a) Randomly selected real images. (b)
Unconditional generations from the ARM. (c) The validation curve during training
Fig. 2.5 An example of a masked 3×3 kernel (i.e., a causal 2D kernel): (left) A difference between
a standard kernel (all weights are used; denoted by green) and a masked kernel (some weights are
masked, i.e., not used; in red). For the masked kernel, we denoted the node (pixel) in the middle in
violet, because it is either masked (option A) or not (option B). (middle) An example of an image
(light orange nodes: zeros, light blue nodes: ones) and a masked kernel (option A). (right) The
result of applying the masked kernel to the image (with padding equal to 1)
Perfect! Now we are ready to run the full code. After training our ARM, we
should obtain results similar to those in Fig. 2.4.
References
1. Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical
evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint
arXiv:1412.3555, 2014.
2. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
3. Ilya Sutskever, James Martens, and Geoffrey E Hinton. Generating text with recurrent neural
networks. In ICML, 2011.
4. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. In International conference on machine learning, pages 1310–1318. PMLR,
2013.
5. Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks.
In International Conference on Machine Learning, pages 1120–1128. PMLR, 2016.
6. Ronan Collobert and Jason Weston. A unified architecture for natural language processing:
Deep neural networks with multitask learning. In Proceedings of the 25th international
conference on Machine learning, pages 160–167, 2008.
7. Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network
for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, pages 212–217. Association for Computational Linguistics, 2014.
8. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolu-
tional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
9. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex
Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative
model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
10. Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks.
In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
11. Auke Wiggers and Emiel Hoogeboom. Predictive sampling with forecasting autoregressive
models. In International Conference on Machine Learning, pages 10260–10269. PMLR, 2020.
12. Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray
Kavukcuoglu. Conditional image generation with PixelCNN decoders. In Proceedings of the
30th International Conference on Neural Information Processing Systems, pages 4797–4805,
2016.
13. Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the
PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint
arXiv:1701.05517, 2017.
14. Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved
autoregressive generative model. In International Conference on Machine Learning, pages
864–872. PMLR, 2018.
15. Amirhossein Habibian, Ties van Rozendaal, Jakub M Tomczak, and Taco S Cohen. Video
compression with rate-distortion autoencoders. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pages 7033–7042, 2019.