Deep Learning
Release 0.16.1
Preface 1
Installation 9
Notation 13
1 Introduction 17
1.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2 Key Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Kinds of Machine Learning Problems . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4 Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.5 The Road to Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.6 Success Stories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
1.7 Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2 Preliminaries 43
2.1 Data Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.1.1 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.1.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.1.3 Broadcasting Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.1.4 Indexing and Slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.1.5 Saving Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.1.6 Conversion to Other Python Objects . . . . . . . . . . . . . . . . . . . . . 50
2.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.1 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2.2 Handling Missing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.2.3 Conversion to the Tensor Format . . . . . . . . . . . . . . . . . . . . . . . 53
2.3 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.3.1 Scalars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3.2 Vectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3.3 Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.4 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.3.5 Basic Properties of Tensor Arithmetic . . . . . . . . . . . . . . . . . . . . 58
2.3.6 Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.3.7 Dot Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.8 Matrix-Vector Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
2.3.9 Matrix-Matrix Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . 62
2.3.10 Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.3.11 More on Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.4 Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
2.4.1 Derivatives and Differentiation . . . . . . . . . . . . . . . . . . . . . . . . 67
2.4.2 Partial Derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.4.3 Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.4.4 Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
2.5 Automatic Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.5.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.5.2 Backward for Non-Scalar Variables . . . . . . . . . . . . . . . . . . . . . . 73
2.5.3 Detaching Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2.5.4 Computing the Gradient of Python Control Flow . . . . . . . . . . . . . . 74
2.6 Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
2.6.1 Basic Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2.6.2 Dealing with Multiple Random Variables . . . . . . . . . . . . . . . . . . 80
2.6.3 Expectation and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
2.7 Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.7.1 Finding All the Functions and Classes in a Module . . . . . . . . . . . . . 84
2.7.2 Finding the Usage of Specific Functions and Classes . . . . . . . . . . . . 85
3.6 Implementation of Softmax Regression from Scratch . . . . . . . . . . . . . . . . 117
3.6.1 Initializing Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . 117
3.6.2 Defining the Softmax Operation . . . . . . . . . . . . . . . . . . . . . . . 118
3.6.3 Defining the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.6.4 Defining the Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 119
3.6.5 Classification Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
3.6.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.7 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
3.7 Concise Implementation of Softmax Regression . . . . . . . . . . . . . . . . . . . 124
3.7.1 Initializing Model Parameters . . . . . . . . . . . . . . . . . . . . . . . . 125
3.7.2 Softmax Implementation Revisited . . . . . . . . . . . . . . . . . . . . . . 125
3.7.3 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.7.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.9.1 Types of Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.9.2 Examples of Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . 178
4.9.3 Correction of Distribution Shift . . . . . . . . . . . . . . . . . . . . . . . . 180
4.9.4 A Taxonomy of Learning Problems . . . . . . . . . . . . . . . . . . . . . . 183
4.9.5 Fairness, Accountability, and Transparency in Machine Learning . . . . . 185
4.10 Predicting House Prices on Kaggle . . . . . . . . . . . . . . . . . . . . . . . . . . 186
4.10.1 Downloading and Caching Datasets . . . . . . . . . . . . . . . . . . . . . 186
4.10.2 Kaggle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
4.10.3 Accessing and Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . 189
4.10.4 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
4.10.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
4.10.6 K-Fold Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
4.10.7 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.10.8 Submitting Predictions on Kaggle . . . . . . . . . . . . . . . . . . . . . . 194
6.3 Padding and Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.3.1 Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
6.3.2 Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
6.4 Multiple Input and Multiple Output Channels . . . . . . . . . . . . . . . . . . . . 241
6.4.1 Multiple Input Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
6.4.2 Multiple Output Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
6.4.3 1 × 1 Convolutional Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
6.5 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
6.5.1 Maximum Pooling and Average Pooling . . . . . . . . . . . . . . . . . . . 245
6.5.2 Padding and Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
6.5.3 Multiple Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
6.6 Convolutional Neural Networks (LeNet) . . . . . . . . . . . . . . . . . . . . . . . 249
6.6.1 LeNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250
6.6.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
8 Recurrent Neural Networks 299
8.1 Sequence Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
8.1.1 Statistical Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
8.1.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
8.1.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305
8.2 Text Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
8.2.1 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.2.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
8.2.3 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
8.2.4 Putting All Things Together . . . . . . . . . . . . . . . . . . . . . . . . . . 311
8.3 Language Models and the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
8.3.1 Learning a Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . 313
8.3.2 Markov Models and n-grams . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.3.3 Natural Language Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 314
8.3.4 Reading Long Sequence Data . . . . . . . . . . . . . . . . . . . . . . . . . 317
8.4 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8.4.1 Neural Networks without Hidden States . . . . . . . . . . . . . . . . . . . 322
8.4.2 Recurrent Neural Networks with Hidden States . . . . . . . . . . . . . . . 322
8.4.3 RNN-based Character-Level Language Models . . . . . . . . . . . . . . . . 324
8.4.4 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
8.5 Implementation of Recurrent Neural Networks from Scratch . . . . . . . . . . . . 327
8.5.1 One-Hot Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
8.5.2 Initializing the Model Parameters . . . . . . . . . . . . . . . . . . . . . . 328
8.5.3 RNN Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
8.5.4 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.5.5 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
8.5.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
8.6 Concise Implementation of Recurrent Neural Networks . . . . . . . . . . . . . . . 335
8.6.1 Defining the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.6.2 Training and Predicting . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
8.7 Backpropagation Through Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
8.7.1 Analysis of Gradients in RNNs . . . . . . . . . . . . . . . . . . . . . . . . 338
8.7.2 Backpropagation Through Time in Detail . . . . . . . . . . . . . . . . . . 341
9.5 Machine Translation and the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 368
9.5.1 Downloading and Preprocessing the Dataset . . . . . . . . . . . . . . . . 369
9.5.2 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
9.5.3 Vocabulary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9.5.4 Loading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
9.5.5 Putting All Things Together . . . . . . . . . . . . . . . . . . . . . . . . . . 373
9.6 Encoder-Decoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
9.6.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
9.6.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
9.6.3 Putting the Encoder and Decoder Together . . . . . . . . . . . . . . . . . 375
9.7 Sequence to Sequence Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
9.7.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
9.7.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
9.7.3 Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
9.7.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
9.7.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
9.7.6 Evaluation of Predicted Sequences . . . . . . . . . . . . . . . . . . . . . . 384
9.8 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
9.8.1 Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
9.8.2 Exhaustive Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
9.8.3 Beam Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 388
10.7.4 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427
10.7.5 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
10.7.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
11.11.3 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
13.4.1 Generating Multiple Anchor Boxes . . . . . . . . . . . . . . . . . . . . . . 583
13.4.2 Intersection over Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
13.4.3 Labeling Training Set Anchor Boxes . . . . . . . . . . . . . . . . . . . . . 587
13.4.4 Bounding Boxes for Prediction . . . . . . . . . . . . . . . . . . . . . . . . 592
13.5 Multiscale Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596
13.6 The Object Detection Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
13.6.1 Downloading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . 599
13.6.2 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
13.6.3 Demonstration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 601
13.7 Single Shot Multibox Detection (SSD) . . . . . . . . . . . . . . . . . . . . . . . . . 602
13.7.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
13.7.2 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608
13.7.3 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610
13.8 Region-based CNNs (R-CNNs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
13.8.1 R-CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
13.8.2 Fast R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
13.8.3 Faster R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 617
13.8.4 Mask R-CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 618
13.9 Semantic Segmentation and the Dataset . . . . . . . . . . . . . . . . . . . . . . . 619
13.9.1 Image Segmentation and Instance Segmentation . . . . . . . . . . . . . . 619
13.9.2 The Pascal VOC2012 Semantic Segmentation Dataset . . . . . . . . . . . . 620
13.10 Transposed Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625
13.10.1 Basic 2D Transposed Convolution . . . . . . . . . . . . . . . . . . . . . . 625
13.10.2 Padding, Strides, and Channels . . . . . . . . . . . . . . . . . . . . . . . . 626
13.10.3 Analogy to Matrix Transposition . . . . . . . . . . . . . . . . . . . . . . . 627
13.11 Fully Convolutional Networks (FCN) . . . . . . . . . . . . . . . . . . . . . . . . . 628
13.11.1 Constructing a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629
13.11.2 Initializing the Transposed Convolution Layer . . . . . . . . . . . . . . . . 631
13.11.3 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
13.11.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 632
13.11.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633
13.12 Neural Style Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635
13.12.1 Technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636
13.12.2 Reading the Content and Style Images . . . . . . . . . . . . . . . . . . . . 637
13.12.3 Preprocessing and Postprocessing . . . . . . . . . . . . . . . . . . . . . . 638
13.12.4 Extracting Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
13.12.5 Defining the Loss Function . . . . . . . . . . . . . . . . . . . . . . . . . . 639
13.12.6 Creating and Initializing the Composite Image . . . . . . . . . . . . . . . 641
13.12.7 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642
13.13 Image Classification (CIFAR-10) on Kaggle . . . . . . . . . . . . . . . . . . . . . . 645
13.13.1 Obtaining and Organizing the Dataset . . . . . . . . . . . . . . . . . . . . 646
13.13.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 648
13.13.3 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649
13.13.4 Defining the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650
13.13.5 Defining the Training Functions . . . . . . . . . . . . . . . . . . . . . . . 651
13.13.6 Training and Validating the Model . . . . . . . . . . . . . . . . . . . . . . 652
13.13.7 Classifying the Testing Set and Submitting Results on Kaggle . . . . . . . . 652
13.14 Dog Breed Identification (ImageNet Dogs) on Kaggle . . . . . . . . . . . . . . . . 654
13.14.1 Obtaining and Organizing the Dataset . . . . . . . . . . . . . . . . . . . . 655
13.14.2 Image Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 656
13.14.3 Reading the Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
13.14.4 Defining the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657
13.14.5 Defining the Training Functions . . . . . . . . . . . . . . . . . . . . . . . 658
13.14.6 Training and Validating the Model . . . . . . . . . . . . . . . . . . . . . . 659
13.14.7 Classifying the Testing Set and Submitting Results on Kaggle . . . . . . . . 660
15.3 Sentiment Analysis: Using Convolutional Neural Networks . . . . . . . . . . . . . 723
15.3.1 One-Dimensional Convolutional Layer . . . . . . . . . . . . . . . . . . . . 724
15.3.2 Max-Over-Time Pooling Layer . . . . . . . . . . . . . . . . . . . . . . . . 726
15.3.3 The TextCNN Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 727
15.4 Natural Language Inference and the Dataset . . . . . . . . . . . . . . . . . . . . . 730
15.4.1 Natural Language Inference . . . . . . . . . . . . . . . . . . . . . . . . . 731
15.4.2 The Stanford Natural Language Inference (SNLI) Dataset . . . . . . . . . . 731
15.5 Natural Language Inference: Using Attention . . . . . . . . . . . . . . . . . . . . 735
15.5.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
15.5.2 Training and Evaluating the Model . . . . . . . . . . . . . . . . . . . . . . 740
15.6 Fine-Tuning BERT for Sequence-Level and Token-Level Applications . . . . . . . . 742
15.6.1 Single Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
15.6.2 Text Pair Classification or Regression . . . . . . . . . . . . . . . . . . . . 743
15.6.3 Text Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 744
15.6.4 Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 745
15.7 Natural Language Inference: Fine-Tuning BERT . . . . . . . . . . . . . . . . . . . 747
15.7.1 Loading Pretrained BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . 748
15.7.2 The Dataset for Fine-Tuning BERT . . . . . . . . . . . . . . . . . . . . . . 749
15.7.3 Fine-Tuning BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 750
16.7.2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 780
16.7.3 Sequential Dataset with Negative Sampling . . . . . . . . . . . . . . . . . 781
16.7.4 Load the MovieLens 100K Dataset . . . . . . . . . . . . . . . . . . . . . . 782
16.7.5 Train the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
16.8 Feature-Rich Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . 784
16.8.1 An Online Advertising Dataset . . . . . . . . . . . . . . . . . . . . . . . . 785
16.8.2 Dataset Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 785
16.9 Factorization Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 787
16.9.1 2-Way Factorization Machines . . . . . . . . . . . . . . . . . . . . . . . . 787
16.9.2 An Efficient Optimization Criterion . . . . . . . . . . . . . . . . . . . . . 788
16.9.3 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 788
16.9.4 Load the Advertising Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 789
16.9.5 Train the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 789
16.10 Deep Factorization Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 790
16.10.1 Model Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
16.10.2 Implementation of DeepFM . . . . . . . . . . . . . . . . . . . . . . . . . 792
16.10.3 Training and Evaluating the Model . . . . . . . . . . . . . . . . . . . . . . 793
18.3.2 Rules of Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
18.4 Multivariable Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
18.4.1 Higher-Dimensional Differentiation . . . . . . . . . . . . . . . . . . . . . 846
18.4.2 Geometry of Gradients and Gradient Descent . . . . . . . . . . . . . . . . 847
18.4.3 A Note on Mathematical Optimization . . . . . . . . . . . . . . . . . . . . 848
18.4.4 Multivariate Chain Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 849
18.4.5 The Backpropagation Algorithm . . . . . . . . . . . . . . . . . . . . . . . 851
18.4.6 Hessians . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 854
18.4.7 A Little Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . 856
18.5 Integral Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
18.5.1 Geometric Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . 861
18.5.2 The Fundamental Theorem of Calculus . . . . . . . . . . . . . . . . . . . 863
18.5.3 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865
18.5.4 A Comment on Sign Conventions . . . . . . . . . . . . . . . . . . . . . . . 866
18.5.5 Multiple Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 867
18.5.6 Change of Variables in Multiple Integrals . . . . . . . . . . . . . . . . . . 869
18.6 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 870
18.6.1 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . . . . 870
18.7 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 887
18.7.1 The Maximum Likelihood Principle . . . . . . . . . . . . . . . . . . . . . 888
18.7.2 Numerical Optimization and the Negative Log-Likelihood . . . . . . . . . 889
18.7.3 Maximum Likelihood for Continuous Variables . . . . . . . . . . . . . . . 891
18.8 Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
18.8.1 Bernoulli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
18.8.2 Discrete Uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895
18.8.3 Continuous Uniform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 896
18.8.4 Binomial . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
18.8.5 Poisson . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 900
18.8.6 Gaussian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 903
18.8.7 Exponential Family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 906
18.9 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 907
18.9.1 Optical Character Recognition . . . . . . . . . . . . . . . . . . . . . . . . 908
18.9.2 The Probabilistic Model for Classification . . . . . . . . . . . . . . . . . . 909
18.9.3 The Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 909
18.9.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 910
18.10 Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 914
18.10.1 Evaluating and Comparing Estimators . . . . . . . . . . . . . . . . . . . . 914
18.10.2 Conducting Hypothesis Tests . . . . . . . . . . . . . . . . . . . . . . . . . 918
18.10.3 Constructing Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . 922
18.11 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 925
18.11.1 Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 925
18.11.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 927
18.11.3 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 929
18.11.4 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . 933
18.11.5 Cross Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 935
19.2.1 Registering and Logging In . . . . . . . . . . . . . . . . . . . . . . . . . . 944
19.2.2 Creating a SageMaker Instance . . . . . . . . . . . . . . . . . . . . . . . . 945
19.2.3 Running and Stopping an Instance . . . . . . . . . . . . . . . . . . . . . . 946
19.2.4 Updating Notebooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 947
19.3 Using AWS EC2 Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 948
19.3.1 Creating and Running an EC2 Instance . . . . . . . . . . . . . . . . . . . . 948
19.3.2 Installing CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 953
19.3.3 Installing MXNet and Downloading the D2L Notebooks . . . . . . . . . . . 954
19.3.4 Running Jupyter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 955
19.3.5 Closing Unused Instances . . . . . . . . . . . . . . . . . . . . . . . . . . . 956
19.4 Using Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 956
19.5 Selecting Servers and GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 957
19.5.1 Selecting Servers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
19.5.2 Selecting GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
19.6 Contributing to This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
19.6.1 Minor Text Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
19.6.2 Propose a Major Change . . . . . . . . . . . . . . . . . . . . . . . . . . . 962
19.6.3 Adding a New Section or a New Framework Implementation . . . . . . . . 963
19.6.4 Submitting a Major Change . . . . . . . . . . . . . . . . . . . . . . . . . . 963
19.7 d2l API Document . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
Bibliography 989
Index 1001
Preface
Just a few years ago, there were no legions of deep learning scientists developing intelligent prod-
ucts and services at major companies and startups. When the youngest among us (the authors)
entered the field, machine learning did not command headlines in daily newspapers. Our parents
had no idea what machine learning was, let alone why we might prefer it to a career in medicine or
law. Machine learning was a forward-looking academic discipline with a narrow set of real-world
applications. And those applications, e.g., speech recognition and computer vision, required so
much domain knowledge that they were often regarded as separate areas entirely for which ma-
chine learning was one small component. Neural networks then, the antecedents of the deep
learning models that we focus on in this book, were regarded as outmoded tools.
In just the past five years, deep learning has taken the world by surprise, driving rapid progress
in fields as diverse as computer vision, natural language processing, automatic speech recogni-
tion, reinforcement learning, and statistical modeling. With these advances in hand, we can now
build cars that drive themselves with more autonomy than ever before (and less autonomy than
some companies might have you believe), smart reply systems that automatically draft the most
mundane emails, helping people dig out from oppressively large inboxes, and software agents that
dominate the world's best humans at board games like Go, a feat once thought to be decades away.
Already, these tools exert ever-wider impacts on industry and society, changing the way movies
are made and diseases are diagnosed, and playing a growing role in basic sciences, from astrophysics
to biology.
This book represents our attempt to make deep learning approachable, teaching you the concepts,
the context, and the code.
For any computing technology to reach its full impact, it must be well-understood, well-
documented, and supported by mature, well-maintained tools. The key ideas should be clearly
distilled, minimizing the onboarding time needed to bring new practitioners up to date. Mature
libraries should automate common tasks, and exemplar code should make it easy for practitioners
to modify, apply, and extend common applications to suit their needs. Take dynamic web appli-
cations as an example. Despite a large number of companies, like Amazon, developing successful
database-driven web applications in the 1990s, the potential of this technology to aid creative en-
trepreneurs has been realized to a far greater degree in the past ten years, owing in part to the
development of powerful, well-documented frameworks.
Testing the potential of deep learning presents unique challenges because any single application
brings together various disciplines. Applying deep learning requires simultaneously understand-
ing (i) the motivations for casting a problem in a particular way; (ii) the mathematics of a given
modeling approach; (iii) the optimization algorithms for fitting the models to data; and (iv) the
engineering required to train models efficiently, navigating the pitfalls of numerical computing
and getting the most out of available hardware. Teaching the critical thinking skills required
to formulate problems, the mathematics to solve them, and the software tools to implement those
solutions all in one place presents formidable challenges. Our goal in this book is to present a
unified resource to bring would-be practitioners up to speed.
At the time we started this book project, there were no resources that simultaneously (i) were
up to date; (ii) covered the full breadth of modern machine learning with substantial technical
depth; and (iii) interleaved exposition of the quality one expects from an engaging textbook with
the clean runnable code that one expects to find in hands-on tutorials. We found plenty of code
examples for how to use a given deep learning framework (e.g., how to do basic numerical com-
puting with matrices in TensorFlow) or for implementing particular techniques (e.g., code snip-
pets for LeNet, AlexNet, ResNets, etc.) scattered across various blog posts and GitHub repositories.
However, these examples typically focused on how to implement a given approach, but left out the
discussion of why certain algorithmic decisions are made. While some interactive resources have
popped up sporadically to address a particular topic, e.g., the engaging blog posts published on
the website Distill (http://distill.pub), or personal blogs, they only covered selected topics in deep learning, and
often lacked associated code. On the other hand, while several textbooks have emerged, most no-
tably (Goodfellow et al., 2016), which offers a comprehensive survey of the concepts behind deep
learning, these resources do not marry the descriptions to realizations of the concepts in code,
sometimes leaving readers clueless as to how to implement them. Moreover, too many resources
are hidden behind the paywalls of commercial course providers.
We set out to create a resource that could (i) be freely available for everyone; (ii) offer sufficient
technical depth to provide a starting point on the path to actually becoming an applied machine
learning scientist; (iii) include runnable code, showing readers how to solve problems in practice;
(iv) allow for rapid updates, both by us and also by the community at large; and (v) be comple-
mented by a forum (https://discuss.d2l.ai/) for interactive discussion of technical details and to answer questions.
These goals were often in conflict. Equations, theorems, and citations are best managed and laid
out in LaTeX. Code is best described in Python. And webpages are native in HTML and JavaScript.
Furthermore, we want the content to be accessible as executable code, as a physical book,
as a downloadable PDF, and on the Internet as a website. At present there exist no tools and no
workflow perfectly suited to these demands, so we had to assemble our own. We describe our
approach in detail in Section 19.6. We settled on GitHub to share the source and to allow for edits,
Jupyter notebooks for mixing code, equations and text, Sphinx as a rendering engine to generate
multiple outputs, and Discourse for the forum. While our system is not yet perfect, these choices
provide a good compromise among the competing concerns. We believe that this might be the
first book published using such an integrated workflow.
Learning by Doing
Many textbooks teach a series of topics, each in exhaustive detail. For example, Chris Bishop's
excellent textbook (Bishop, 2006) teaches each topic so thoroughly that getting to the chapter on
linear regression requires a non-trivial amount of work. While experts love this book precisely
for its thoroughness, for beginners this property limits its usefulness as an introductory text.
In this book, we will teach most concepts just in time. In other words, you will learn concepts at the
very moment that they are needed to accomplish some practical end. While we take some time at
the outset to teach fundamental preliminaries, like linear algebra and probability, we want you to
taste the satisfaction of training your first model before worrying about more esoteric probability
distributions.
Aside from a few preliminary notebooks that provide a crash course in the basic mathematical
background, each subsequent chapter introduces both a reasonable number of new concepts and
provides a single self-contained working example, using real datasets. This presents an organi-
zational challenge. Some models might logically be grouped together in a single notebook. And
some ideas might be best taught by executing several models in succession. On the other hand,
there is a big advantage to adhering to a policy of one working example, one notebook: This makes
it as easy as possible for you to start your own research projects by leveraging our code. Just copy
a notebook and start modifying it.
We will interleave the runnable code with background material as needed. In general, we will
often err on the side of making tools available before explaining them fully (and we will follow up
by explaining the background later). For instance, we might use stochastic gradient descent before
fully explaining why it is useful or why it works. This helps to give practitioners the necessary
ammunition to solve problems quickly, at the expense of requiring the reader to trust us with
some curatorial decisions.
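For readers who want a preview, the core of stochastic gradient descent is a simple update: nudge each parameter a small step against the gradient of the loss. The following is a minimal sketch of that update on a made-up one-parameter loss (plain gradient descent, without the minibatch sampling that makes it stochastic), using MXNet's autograd:
from mxnet import autograd, np, npx
npx.set_np()

w = np.array([2.0])   # a single made-up parameter
w.attach_grad()       # ask autograd to track gradients for w
lr = 0.1              # learning rate
for _ in range(20):
    with autograd.record():
        loss = (w - 3.0) ** 2     # a toy loss minimized at w = 3
    loss.backward()
    w[:] = w - lr * w.grad        # the gradient descent update
print(w)  # moves toward 3.0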
This book will teach deep learning concepts from scratch. Sometimes, we want to delve into fine
details about the models that would typically be hidden from the user by deep learning frame-
works' advanced abstractions. This comes up especially in the basic tutorials, where we want you
to understand everything that happens in a given layer or optimizer. In these cases, we will often
present two versions of the example: one where we implement everything from scratch, relying
only on the NumPy interface and automatic differentiation, and another, more practical exam-
ple, where we write succinct code using high-level APIs of deep learning frameworks. Once we
have taught you how some component works, we can just use the high-level APIs in subsequent
tutorials.
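To make this two-track approach concrete, here is a minimal sketch (the shapes and variable names are illustrative, not taken from a specific chapter) contrasting an affine transformation written from scratch with the NumPy interface against the equivalent layer from Gluon's high-level API:
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()

X = np.random.normal(size=(2, 5))    # a toy minibatch: 2 examples, 5 features

# From scratch: the affine transformation Xw + b written with the NumPy interface.
w = np.random.normal(size=(5, 1))
b = np.zeros(1)
def linear(X, w, b):
    return np.dot(X, w) + b
print(linear(X, w, b).shape)          # (2, 1)

# Concise: the same computation through Gluon's high-level API.
net = nn.Dense(1)
net.initialize()
print(net(X).shape)                   # (2, 1)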
Content and Structure
The book can be roughly divided into three parts, which are presented in different colors in Fig. 1:
• The first part covers basics and preliminaries. Chapter 1 offers an introduction to deep learn-
ing. Then, in Chapter 2, we quickly bring you up to speed on the prerequisites required for
hands-on deep learning, such as how to store and manipulate data, and how to apply various
numerical operations based on basic concepts from linear algebra, calculus, and probabil-
ity. Chapter 3 and Chapter 4 cover the most basic concepts and techniques of deep learning,
such as linear regression, multilayer perceptrons, and regularization.
• The next five chapters focus on modern deep learning techniques. Chapter 5 describes the
various key components of deep learning calculations and lays the groundwork for us to
subsequently implement more complex models. Next, in Chapter 6 and Chapter 7, we intro-
duce convolutional neural networks (CNNs), powerful tools that form the backbone of most
modern computer vision systems. Subsequently, in Chapter 8 and Chapter 9, we introduce
recurrent neural networks (RNNs), models that exploit temporal or sequential structure in
data, and are commonly used for natural language processing and time series prediction.
In Chapter 10, we introduce a new class of models that employ a technique called attention
mechanisms; these models have recently begun to displace RNNs in natural language processing.
These sections will get you up to speed on the basic tools behind most modern applications
of deep learning.
• Part three discusses scalability, efficiency, and applications. First, in Chapter 11, we dis-
cuss several common optimization algorithms used to train deep learning models. The next
chapter, Chapter 12, examines several key factors that influence the computational perfor-
mance of your deep learning code. In Chapter 13, we illustrate major applications of deep
learning in computer vision. In Chapter 14 and Chapter 15, we show how to pretrain lan-
guage representation models and apply them to natural language processing tasks.
Code
Most sections of this book feature executable code because of our belief in the importance of an
interactive learning experience in deep learning. At present, certain intuitions can only be devel-
oped through trial and error, tweaking the code in small ways and observing the results. Ideally,
an elegant mathematical theory might tell us precisely how to tweak our code to achieve a desired
result. Unfortunately, at present, such elegant theories elude us. Despite our best attempts, for-
mal explanations for various techniques are still lacking, both because the mathematics to char-
acterize these models can be so difficult and also because serious inquiry on these topics has only
just recently kicked into high gear. We are hopeful that as the theory of deep learning progresses,
future editions of this book will be able to provide insights in places the present edition cannot.
At times, to avoid unnecessary repetition, we encapsulate the frequently imported and referred-to
functions, classes, etc. in this book in the d2l package. For any block such as a function, a class,
or multiple imports to be saved in the package, we will mark it with #@save. We offer a detailed
overview of these functions and classes in Section 19.7. The d2l package is lightweight and only
requires the following packages and modules as dependencies:
#@save
import collections
import hashlib
import math
import os
import random
import re
import shutil
import sys
import tarfile
import time
import zipfile
from collections import defaultdict

import pandas as pd
import requests
from IPython import display
from matplotlib import pyplot as plt

d2l = sys.modules[__name__]
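As an illustration of this convention, the block below marks a function for saving; the function itself (add_numbers) is a hypothetical placeholder, not an actual member of the d2l package:
#@save
def add_numbers(a, b):
    """A hypothetical helper; the #@save marker copies it into the d2l package."""
    return a + b

# A later notebook can then reuse the saved function through the package,
# e.g., from d2l import mxnet as d2l, followed by d2l.add_numbers(1, 2).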
Most of the code in this book is based on Apache MXNet. MXNet is an open-source framework for
deep learning and the preferred choice of AWS (Amazon Web Services), as well as many colleges
and companies. All of the code in this book has passed tests under the newest MXNet version.
However, due to the rapid development of deep learning, some code in the print edition may not
work properly in future versions of MXNet. We will, however, keep the online version up to
date. In case you encounter any such problems, please consult Installation (page 9) to update your
code and runtime environment.
Here is how we import modules from MXNet.
#@save
from mxnet import autograd, context, gluon, image, init, np, npx
from mxnet.gluon import nn, rnn
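As a brief, illustrative aside (a minimal sketch rather than part of the book's saved code), np mirrors the familiar NumPy array interface on MXNet tensors, and npx switches MXNet into this NumPy-compatible mode:
from mxnet import np, npx
npx.set_np()   # enable NumPy-compatible behavior throughout MXNet

x = np.arange(12).reshape(3, 4)   # a 3-by-4 tensor
print(x.sum())                    # 66.0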
Target Audience
This book is for students (undergraduate or graduate), engineers, and researchers who seek a
solid grasp of the practical techniques of deep learning. Because we explain every concept from
scratch, no previous background in deep learning or machine learning is required. Fully explain-
ing the methods of deep learning requires some mathematics and programming, but we will only
assume that you come in with some basics, including (the very basics of) linear algebra, calcu-
lus, probability, and Python programming. Moreover, in the Appendix, we provide a refresher
on most of the mathematics covered in this book. Most of the time, we will prioritize intuition
and ideas over mathematical rigor. There are many terrific books which can lead the interested
reader further. For instance, Linear Analysis by Bela Bollobas (Bollobas, 1999) covers linear alge-
bra and functional analysis in great depth. All of Statistics (Wasserman, 2013) is a terrific guide to
statistics. And if you have not used Python before, you may want to peruse this Python tutorial (http://learnpython.org/).
Forum
Associated with this book, we have launched a discussion forum, located at https://discuss.d2l.ai/. When
you have questions on any section of the book, you can find the associated discussion page link at
the end of each chapter.
Acknowledgments
We are indebted to the hundreds of contributors for both the English and the Chinese drafts. They
helped improve the content and offered valuable feedback. Specifically, we thank every con-
tributor of this English draft for making it better for everyone. Their GitHub IDs or names are
(in no particular order): alxnorden, avinashingit, bowen0701, brettkoonce, Chaitanya Prakash
Bapat, cryptonaut, Davide Fiocco, edgarroman, gkutiel, John Mitro, Liang Pu, Rahul Agarwal,
Mohamed Ali Jamaoui, Michael (Stu) Stewart, Mike Müller, NRauschmayr, Prakhar Srivastav,
sad-, sfermigier, Sheng Zha, sundeepteki, topecongiro, tpdi, vermicelli, Vishaal Kapoor, Vish-
wesh Ravi Shrimali, YaYaB, Yuhong Chen, Evgeniy Smirnov, lgov, Simon Corston-Oliver, Igor
Dzreyev, Ha Nguyen, pmuens, Andrei Lukovenko, senorcinco, vfdev-5, dsweet, Mohammad
Mahdi Rahimi, Abhishek Gupta, uwsd, DomKM, Lisa Oakley, Bowen Li, Aarush Ahuja, Prasanth
Buddareddygari, brianhendee, mani2106, mtn, lkevinzc, caojilin, Lakshya, Fiete Lüer, Surbhi
Vijayvargeeya, Muhyun Kim, dennismalmgren, adursun, Anirudh Dagar, liqingnz, Pedro Lar-
roy, lgov, ati-ozgur, Jun Wu, Matthias Blume, Lin Yuan, geogunow, Josh Gardner, Maximilian
Böther, Rakib Islam, Leonard Lausen, Abhinav Upadhyay, rongruosong, Steve Sedlmeyer, Rus-
lan Baratov, Rafael Schlatter, liusy182, Giannis Pappas, ati-ozgur, qbaza, dchoi77, Adam Ger-
son, Phuc Le, Mark Atwood, christabella, vn09, Haibin Lin, jjangga0214, RichyChen, noelo,
hansent, Giel Dops, dvincent1337, WhiteD3vil, Peter Kulits, codypenta, joseppinilla, ahmaurya,
karolszk, heytitle, Peter Goetz, rigtorp, Tiep Vu, sfilip, mlxd, Kale-ab Tessera, Sanjar Adilov,
MatteoFerrara, hsneto, Katarzyna Biesialska, Gregory Bruss, Duy–Thanh Doan, paulaurel, gray-
towne, Duc Pham, sl7423, Jaedong Hwang, Yida Wang, cys4, clhm, Jean Kaddour, austinmw,
trebeljahr, tbaums, Cuong V. Nguyen, pavelkomarov, vzlamal, NotAnotherSystem, J-Arun-Mani,
jancio, eldarkurtic, the-great-shazbot, doctorcolossus, gducharme, cclauss, Daniel-Mietchen,
hoonose, biagiom, abhinavsp0730, jonathanhrandall, ysraell, Nodar Okroshiashvili, UgurKap,
Jiyang Kang, StevenJokes, Tomer Kaftan, liweiwp, netyster, ypandya, NishantTharani, heiligerl,
SportsTHU, Hoa Nguyen, manuel-arno-korfmann-webentwicklung, aterzis-personal, nxby, Xi-
aoting He, Josiah Yoder, mathresearch, mzz2017, jroberayalas, iluu, ghejc, BSharmi, vkramdev,
simonwardjones, LakshKD, TalNeoran, djliden, Nikhil95, Oren Barkan, guoweis, haozhu233,
pratikhack, 315930399, tayfununal, steinsag, charleybeller, Andrew Lumsdaine, Jiekui Zhang,
Deepak Pathak, Florian Donhauser, Tim Gates, Adriaan Tijsseling, Ron Medina, Gaurav Saha,
Murat Semerci, Lei Mao (https://github.com/leimao).
We thank Amazon Web Services, especially Swami Sivasubramanian, Raju Gulabani, Charlie Bell,
and Andrew Jassy for their generous support in writing this book. Without the available time,
resources, discussions with colleagues, and continuous encouragement, this book would not have
happened.
Summary
• Deep learning has revolutionized pattern recognition, introducing technology that now
powers a wide range of applications, including computer vision, natural language processing,
and automatic speech recognition.
• To successfully apply deep learning, you must understand how to cast a problem, the math-
ematics of modeling, the algorithms for fitting your models to data, and the engineering
techniques to implement it all.
• This book presents a comprehensive resource, including prose, figures, mathematics, and
code, all in one place.
• To answer questions related to this book, visit our forum at https://discuss.d2l.ai/.
• All notebooks are available for download on GitHub.
Exercises
Discussions: https://discuss.d2l.ai/t/18