Machine Learning with Python Cookbook
Practical Solutions from Preprocessing to Deep Learning

Kyle Gallatin and Chris Albon

Machine Learning with Python Cookbook
by Kyle Gallatin and Chris Albon
Copyright © 2023 Kyle Gallatin. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions
are also available for most titles. For more information, contact our
corporate/institutional sales department: 800-998-9938.
Acquisitions Editor: Nicole Butterfield
Development Editor: Jeff Bleiel
Production Editor: Christopher Faucher
Interior Designer: David Futato
Cover Designer: Karen Montgomery
April 2018: First Edition
October 2023: Second Edition
Revision History for the Early Release
2022-08-24: First Release
2022-10-05: Second Release
2022-12-08: Third Release
2023-01-18: Fourth Release
See for release details.
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Machine Learning with Python
Cookbook, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
The views expressed in this work are those of the authors and do not represent the publisher’s views.
While the publisher and the authors have used good faith efforts to ensure that the information and
instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility
for errors or omissions, including without limitation responsibility for damages resulting from the use of
or reliance on this work. Use of the information and instructions contained in this work is at your own
risk. If any code samples or other technology this work contains or describes is subject to open source
licenses or the intellectual property rights of others, it is your responsibility to ensure that your use
thereof complies with such licenses and/or rights.
Chapter 1. Working with Vectors, Matrices
and Arrays in NumPy


This will be the 1st chapter of the final book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the authors at

1.0 Introduction
NumPy is a foundational tool of the Python machine learning stack. NumPy allows for efficient
operations on the data structures often used in machine learning: vectors, matrices, and tensors. While
NumPy is not the focus of this book, it will show up frequently throughout the following chapters. This
chapter covers the most common NumPy operations we are likely to run into while working on machine
learning workflows.

1.1 Creating a Vector

You need to create a vector.

Use NumPy to create a one-dimensional array:

# Load library
import numpy as np

# Create a vector as a row

vector_row = np.array([1, 2, 3])

# Create a vector as a column

vector_column = np.array([[1],
                          [2],
                          [3]])

NumPy’s main data structure is the multidimensional array. A vector is just an array with a single
dimension. In order to create a vector, we simply create a one-dimensional array. Just like vectors, these
arrays can be represented horizontally (i.e., rows) or vertically (i.e., columns).
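
The difference between the two orientations shows up in the shape of each array. The following is a minimal sketch that reuses the variable names from the solution; note that NumPy stores the column version as a two-dimensional array with a single column:

```python
import numpy as np

# A row vector is a one-dimensional array with a single axis
vector_row = np.array([1, 2, 3])

# A column vector is stored as a two-dimensional array with one column
vector_column = np.array([[1],
                          [2],
                          [3]])

print(vector_row.shape)     # (3,)
print(vector_column.shape)  # (3, 1)
```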

See Also
Vectors, Math Is Fun
Euclidean vector, Wikipedia

1.2 Creating a Matrix

You need to create a matrix.

Use NumPy to create a two-dimensional array:

# Load library
import numpy as np

# Create a matrix
matrix = np.array([[1, 2],
[1, 2],
[1, 2]])

To create a matrix we can use a NumPy two-dimensional array. In our solution, the matrix contains
three rows and two columns (a column of 1s and a column of 2s).
NumPy actually has a dedicated matrix data structure:

# The np.mat alias was removed in NumPy 2.0; np.asmatrix builds the same object
matrix_object = np.asmatrix([[1, 2],
                             [1, 2],
                             [1, 2]])

matrix([[1, 2],
[1, 2],
[1, 2]])

However, the matrix data structure is not recommended for two reasons. First, arrays are the de facto
standard data structure of NumPy. Second, the vast majority of NumPy operations return arrays, not
matrix objects.

See Also
Matrix, Wikipedia
Matrix, Wolfram MathWorld
1.3 Creating a Sparse Matrix

Given data with very few nonzero values, you want to efficiently represent it.

Create a sparse matrix:

# Load libraries
import numpy as np
from scipy import sparse

# Create a matrix
matrix = np.array([[0, 0],
[0, 1],
[3, 0]])

# Create compressed sparse row (CSR) matrix

matrix_sparse = sparse.csr_matrix(matrix)

A frequent situation in machine learning is having a huge amount of data; however, most of the
elements in the data are zeros. For example, imagine a matrix where the columns are every movie on
Netflix, the rows are every Netflix user, and the values are how many times a user has watched that
particular movie. This matrix would have tens of thousands of columns and millions of rows! However,
since most users do not watch most movies, the vast majority of elements would be zero.
A sparse matrix is a matrix in which most elements are 0. Sparse matrices only store nonzero elements
and assume all other values will be zero, leading to significant computational savings. In our solution,
we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view
the sparse matrix we can see that only the nonzero values are stored:

# View sparse matrix
print(matrix_sparse)

(1, 1) 1
(2, 0) 3

There are a number of types of sparse matrices. In the compressed sparse row (CSR) matrix printed above,
(1, 1) and (2, 0) represent the (zero-indexed) indices of the nonzero values 1 and 3, respectively.
For example, the element 1 is in the second row and second column. We can see the advantage of sparse
matrices if we create a much larger matrix with many more zero elements and then compare this larger
matrix with our original sparse matrix:

# Create larger matrix

matrix_large = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[3, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Create compressed sparse row (CSR) matrix

matrix_large_sparse = sparse.csr_matrix(matrix_large)

# View original sparse matrix
print(matrix_sparse)

(1, 1) 1
(2, 0) 3

# View larger sparse matrix
print(matrix_large_sparse)

(1, 1) 1
(2, 0) 3

As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse
representation is exactly the same as our original sparse matrix. That is, the addition of zero elements
did not change the size of the sparse matrix.
As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list
of lists, and dictionary of keys. While an explanation of the different types and their implications is
outside the scope of this book, it is worth noting that while there is no “best” sparse matrix type, there
are meaningful differences between them and we should be conscious about why we are choosing one
type over another.
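
One way to check the claim that extra zeros do not add stored values is to compare the number of stored elements directly. The sketch below is an illustration rather than part of the recipe: the 1,000-column padded matrix is a hypothetical stand-in for a large dataset, and the final lines show a conversion to compressed sparse column (CSC) format using SciPy's tocsc method.

```python
import numpy as np
from scipy import sparse

# Small matrix with two nonzero values
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Hypothetical larger matrix: same nonzero values, many more zero columns
matrix_large = np.zeros((3, 1000), dtype=matrix.dtype)
matrix_large[:, :2] = matrix

matrix_sparse = sparse.csr_matrix(matrix)
matrix_large_sparse = sparse.csr_matrix(matrix_large)

# Both CSR matrices store exactly two values
print(matrix_sparse.nnz, matrix_large_sparse.nnz)

# The dense array grows with the zeros; the stored values do not
print(matrix_large.nbytes, matrix_large_sparse.data.nbytes)

# Convert to compressed sparse column (CSC) format
matrix_csc = matrix_large_sparse.tocsc()
print(matrix_csc.nnz)
```

Which format is preferable depends on the access pattern: as a rule of thumb, CSR favors row-oriented operations and CSC favors column-oriented ones.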

See Also
Sparse matrices, SciPy documentation
101 Ways to Store a Sparse Matrix

1.4 Pre-allocating NumPy Arrays

You need to pre-allocate arrays of a given size with some value.

NumPy has functions for generating vectors and matrices of any size using 0s, 1s, or values of your
choice:

# Load library
import numpy as np

# Generate a vector of shape (5,) containing all zeros

vector = np.zeros(shape=5)

# View the vector
vector

array([0., 0., 0., 0., 0.])

# Generate a matrix of shape (3,3) containing all ones
matrix = np.full(shape=(3,3), fill_value=1)

# View the matrix
matrix

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]])

Generating arrays prefilled with data is useful for a number of purposes, such as making code more
performant or having synthetic data to test algorithms with. In many programming languages, pre-
allocating an array of default values (such as 0s) is considered common practice.
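
As a small illustration of the pattern, the sketch below allocates an output array once and then fills it in place. The names n_steps and squares are hypothetical and chosen only for this example:

```python
import numpy as np

n_steps = 5  # hypothetical number of results to store

# Pre-allocate the output array once
squares = np.zeros(n_steps)

# Fill the pre-allocated array in place instead of growing a list
for i in range(n_steps):
    squares[i] = i ** 2

print(squares)
```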

1.5 Selecting Elements

You need to select one or more elements in a vector or matrix.

NumPy’s arrays make it easy to select elements in vectors or matrices:

# Load library
import numpy as np
# Create row vector
vector = np.array([1, 2, 3, 4, 5, 6])

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Select third element of vector
vector[2]

# Select second row, second column
matrix[1,1]

Like most things in Python, NumPy arrays are zero-indexed, meaning that the index of the first element
is 0, not 1. With that caveat, NumPy offers a wide variety of methods for selecting (i.e., indexing and
slicing) elements or groups of elements in arrays:

# Select all elements of a vector
vector[:]

array([1, 2, 3, 4, 5, 6])

# Select everything up to and including the third element
vector[:3]

array([1, 2, 3])

# Select everything after the third element
vector[3:]

array([4, 5, 6])

# Select the last element
vector[-1]

# Reverse the vector
vector[::-1]

array([6, 5, 4, 3, 2, 1])

# Select the first two rows and all columns of a matrix
matrix[:2,:]

array([[1, 2, 3],
       [4, 5, 6]])

# Select all rows and the second column
matrix[:,1:2]

array([[2],
       [5],
       [8]])

1.6 Describing a Matrix

You want to describe the shape, size, and dimensions of the matrix.

Use the shape, size, and ndim attributes of a NumPy object:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])

# View number of rows and columns
matrix.shape

(3, 4)

# View number of elements (rows * columns)
matrix.size

12

# View number of dimensions
matrix.ndim

2

This might seem basic (and it is); however, time and again it will be valuable to check the shape and
size of an array both for further calculations and simply as a gut check after some operation.

1.7 Applying Functions Over Each Element

You want to apply some function to all elements in an array.

Use NumPy’s vectorize method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Create function that adds 100 to something

add_100 = lambda i: i + 100

# Create vectorized function

vectorized_add_100 = np.vectorize(add_100)
# Apply function to all elements in matrix
vectorized_add_100(matrix)

array([[101, 102, 103],

[104, 105, 106],
[107, 108, 109]])

NumPy’s vectorize class converts a function into a function that can apply to all elements in an array
or slice of an array. It’s worth noting that vectorize is essentially a for loop over the elements and
does not increase performance. Furthermore, NumPy arrays allow us to perform operations between
arrays even if their dimensions are not the same (a process called broadcasting). For example, we can
create a much simpler version of our solution using broadcasting:

# Add 100 to all elements

matrix + 100

array([[101, 102, 103],

[104, 105, 106],
[107, 108, 109]])

Broadcasting does not work for all shapes and situations, but it is a common way of applying simple
operations over all elements of a NumPy array.
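
To make the shape rules more concrete, here is a short sketch of broadcasting with arrays of different shapes, including one case where it fails. The names row_offsets and column_offsets are illustrative only:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# A vector of shape (3,) is stretched across every row
row_offsets = np.array([10, 20, 30])
print(matrix + row_offsets)

# A column of shape (3, 1) is stretched across every column
column_offsets = np.array([[100], [200], [300]])
print(matrix + column_offsets)

# Shapes (3, 3) and (2,) are incompatible, so NumPy raises an error
try:
    matrix + np.array([1, 2])
except ValueError as error:
    print("Broadcasting failed:", error)
```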

1.8 Finding the Maximum and Minimum Values

You need to find the maximum or minimum value in an array.

Use NumPy’s max and min methods:

# Load library
import numpy as np
# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
# Return maximum element
np.max(matrix)

# Return minimum element
np.min(matrix)

Often we want to know the maximum and minimum value in an array or subset of an array. This can be
accomplished with the max and min methods. Using the axis parameter we can also apply the operation
along a certain axis:

# Find maximum element in each column

np.max(matrix, axis=0)

array([7, 8, 9])

# Find maximum element in each row

np.max(matrix, axis=1)
array([3, 6, 9])

1.9 Calculating the Average, Variance, and Standard Deviation

You want to calculate some descriptive statistics about an array.

Use NumPy’s mean, var, and std:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Return mean
np.mean(matrix)

# Return variance
np.var(matrix)

# Return standard deviation
np.std(matrix)


Just like with max and min, we can easily get descriptive statistics about the whole matrix or do
calculations along a single axis:

# Find the mean value in each column

np.mean(matrix, axis=0)

array([ 4., 5., 6.])
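
The same axis argument works for var and std. As a brief sketch, we can compute both for each row and confirm that the standard deviation is the square root of the variance (row_variance and row_std are names introduced here for illustration):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Variance and standard deviation of each row
row_variance = np.var(matrix, axis=1)
row_std = np.std(matrix, axis=1)

# The standard deviation is the square root of the variance
print(np.allclose(row_std, np.sqrt(row_variance)))  # True
```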

1.10 Reshaping Arrays

You want to change the shape (number of rows and columns) of an array without changing the element
values.

Use NumPy’s reshape:

# Load library
import numpy as np

# Create 4x3 matrix

matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]])
# Reshape matrix into 2x6 matrix
matrix.reshape(2, 6)

array([[ 1, 2, 3, 4, 5, 6],
[ 7, 8, 9, 10, 11, 12]])

reshape allows us to restructure an array so that we maintain the same data but it is organized as a
different number of rows and columns. The only requirement is that the shape of the original and new
matrix contain the same number of elements (i.e., the same size). We can see the size of a matrix using
size:

matrix.size

12

One useful argument in reshape is -1, which effectively means “as many as needed,” so reshape(1,
-1) means one row and as many columns as needed:

matrix.reshape(1, -1)

array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]])

Finally, if we provide one integer, reshape will return a 1D array of that length:

matrix.reshape(12)

array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
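
The size requirement is easy to see in practice. In this sketch (which builds the same 4x3 matrix with np.arange for brevity), -1 lets NumPy infer the number of rows, while a target shape with a different number of elements raises an error:

```python
import numpy as np

# Same values as the 4x3 matrix in the solution
matrix = np.arange(1, 13).reshape(4, 3)

# Let NumPy infer the number of rows needed for two columns
reshaped = matrix.reshape(-1, 2)
print(reshaped.shape)  # (6, 2)

# 12 elements cannot fill a 5x3 array, so reshape raises an error
try:
    matrix.reshape(5, 3)
except ValueError as error:
    print("Reshape failed:", error)
```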

1.11 Transposing a Vector or Matrix

You need to transpose a vector or matrix.

Use the T attribute:
# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Transpose matrix
matrix.T

array([[1, 4, 7],
[2, 5, 8],
[3, 6, 9]])

Transposing is a common operation in linear algebra where the column and row indices of each element
are swapped. One nuanced point that is typically overlooked outside of a linear algebra class is that,
technically, a vector cannot be transposed because it is just a collection of values:

# Transpose vector
np.array([1, 2, 3, 4, 5, 6]).T

array([1, 2, 3, 4, 5, 6])

However, it is common to refer to transposing a vector as converting a row vector to a column vector
(notice the second pair of brackets) or vice versa:

# Transpose row vector

np.array([[1, 2, 3, 4, 5, 6]]).T

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

1.12 Flattening a Matrix

You need to transform a matrix into a one-dimensional array.

Use flatten:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])

# Flatten matrix
matrix.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

flatten is a simple method to transform a matrix into a one-dimensional array. Alternatively, we can
use reshape to create a row vector:

matrix.reshape(1, -1)

array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

One more common method to flatten arrays is the ravel method. Unlike flatten, which always returns a copy
of the original array, ravel returns a view of the original data whenever possible and is therefore slightly
faster. NumPy’s ravel function also lets us flatten lists of arrays, which we can’t do with the flatten
method. This operation is useful for flattening very large arrays and speeding up code.

# Create one matrix

matrix_a = np.array([[1, 2],
[3, 4]])

# Create a second matrix

matrix_b = np.array([[5, 6],
[7, 8]])

# Create a list of matrices

matrix_list = [matrix_a, matrix_b]
# Flatten the entire list of matrices
np.ravel(matrix_list)

array([1, 2, 3, 4, 5, 6, 7, 8])
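
The practical consequence of copy versus view is easiest to see by modifying the results. This is a minimal sketch; np.shares_memory is used only to confirm which result points at the original data:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

flattened = matrix.flatten()  # always a copy
raveled = matrix.ravel()      # a view here, because the data is contiguous

# Changing the copy leaves the original untouched
flattened[0] = 100
print(matrix[0, 0])  # 1

# Changing the view changes the original
raveled[0] = -1
print(matrix[0, 0])  # -1

print(np.shares_memory(matrix, raveled))    # True
print(np.shares_memory(matrix, flattened))  # False
```

Because ravel can hand back a view, we should treat its result as read-only unless we intend to modify the original array.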

1.13 Finding the Rank of a Matrix

You need to know the rank of a matrix.

Use NumPy’s linear algebra method matrix_rank:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 1, 1],
[1, 1, 10],
[1, 1, 15]])
# Return matrix rank
np.linalg.matrix_rank(matrix)

The rank of a matrix is the dimension of the vector space spanned by its columns or rows. Finding the
rank of a matrix is easy in NumPy thanks to matrix_rank.
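
As a quick sanity check of the definition, the sketch below compares the matrix from the solution, whose first two columns are identical, with an identity matrix whose columns are all independent:

```python
import numpy as np

# The first two columns are identical, so only two columns are independent
matrix = np.array([[1, 1, 1],
                   [1, 1, 10],
                   [1, 1, 15]])
print(np.linalg.matrix_rank(matrix))  # 2

# The identity matrix has three independent columns
print(np.linalg.matrix_rank(np.eye(3)))  # 3
```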

See Also
The Rank of a Matrix, CliffsNotes

1.14 Getting the Diagonal of a Matrix

You need to get the diagonal elements of a matrix.

Use diagonal:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 2, 3],
[2, 4, 6],
[3, 8, 9]])
# Return diagonal elements
matrix.diagonal()

array([1, 4, 9])

NumPy makes getting the diagonal elements of a matrix easy with diagonal. It is also possible to get a
diagonal off from the main diagonal by using the offset parameter:

# Return diagonal one above the main diagonal
matrix.diagonal(offset=1)

array([2, 6])

# Return diagonal one below the main diagonal
matrix.diagonal(offset=-1)

array([2, 8])
1.15 Calculating the Trace of a Matrix

You need to calculate the trace of a matrix.

Use trace:

# Load library
import numpy as np
# Create matrix
matrix = np.array([[1, 2, 3],
[2, 4, 6],
[3, 8, 9]])
# Return trace
matrix.trace()


The trace of a matrix is the sum of the diagonal elements and is often used under the hood in machine
learning methods. Given a NumPy multidimensional array, we can calculate the trace using trace. We
can also return the diagonal of a matrix and calculate its sum:

# Return diagonal and sum elements
sum(matrix.diagonal())



See Also
The Trace of a Square Matrix

1.16 Calculating Dot Products

You need to calculate the dot product of two vectors.

Use NumPy’s dot:

# Load library
import numpy as np
# Create two vectors
vector_a = np.array([1,2,3])
vector_b = np.array([4,5,6])

# Calculate dot product
np.dot(vector_a, vector_b)


The dot product of two vectors, a and b, is defined as:

$$a \cdot b = \sum_{i=1}^{n} a_i b_i$$

where $a_i$ is the ith element of vector a and $b_i$ is the ith element of vector b. We can use NumPy’s dot
function to calculate the dot product. Alternatively, in Python 3.5+ we can use the @ operator:

# Calculate dot product

vector_a @ vector_b
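
To connect the definition to the code, we can also compute the sum of elementwise products by hand and compare it with np.dot. This is only a sketch for illustration (manual_dot is a name introduced here); in practice np.dot or @ is the better choice:

```python
import numpy as np

vector_a = np.array([1, 2, 3])
vector_b = np.array([4, 5, 6])

# Multiply matching elements and sum the products: 1*4 + 2*5 + 3*6
manual_dot = sum(a_i * b_i for a_i, b_i in zip(vector_a, vector_b))

print(manual_dot)                  # 32
print(np.dot(vector_a, vector_b))  # 32
```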


See Also
Vector dot product and vector length, Khan Academy
Dot Product, Paul’s Online Math Notes

1.17 Adding and Subtracting Matrices

You want to add or subtract two matrices.

Use NumPy’s add and subtract:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1, 1],
[1, 1, 1],
[1, 1, 2]])

# Create matrix
matrix_b = np.array([[1, 3, 1],
[1, 3, 1],
[1, 3, 8]])

# Add two matrices

np.add(matrix_a, matrix_b)

array([[ 2, 4, 2],
[ 2, 4, 2],
[ 2, 4, 10]])

# Subtract two matrices

np.subtract(matrix_a, matrix_b)

array([[ 0, -2, 0],

[ 0, -2, 0],
[ 0, -2, -6]])

Alternatively, we can simply use the + and - operators:

# Add two matrices

matrix_a + matrix_b

array([[ 2, 4, 2],
[ 2, 4, 2],
[ 2, 4, 10]])

1.18 Multiplying Matrices

You want to multiply two matrices.

Use NumPy’s dot:

# Load library
import numpy as np

# Create matrix
matrix_a = np.array([[1, 1],
[1, 2]])
# Create matrix
matrix_b = np.array([[1, 3],
[1, 2]])

# Multiply two matrices
np.dot(matrix_a, matrix_b)

array([[2, 5],
[3, 7]])

Alternatively, in Python 3.5+ we can use the @ operator:

# Multiply two matrices

matrix_a @ matrix_b

array([[2, 5],
[3, 7]])

If we want to do element-wise multiplication, we can use the * operator:

# Multiply two matrices element-wise

matrix_a * matrix_b

array([[1, 3],
[1, 4]])

See Also
Array vs. Matrix Operations, MathWorks

1.19 Inverting a Matrix

You want to calculate the inverse of a square matrix.

Use NumPy’s linear algebra inv method:

# Load library
import numpy as np

# Create matrix
matrix = np.array([[1, 4],
[2, 5]])
# Calculate inverse of matrix
np.linalg.inv(matrix)

array([[-1.66666667, 1.33333333],
[ 0.66666667, -0.33333333]])

The inverse of a square matrix, A, is a second matrix, $A^{-1}$, such that:

$$A A^{-1} = I$$

where I is the identity matrix. In NumPy we can use linalg.inv to calculate $A^{-1}$ if it exists. To see this
in action, we can multiply a matrix by its inverse and the result is the identity matrix:

# Multiply matrix and its inverse

matrix @ np.linalg.inv(matrix)

array([[ 1., 0.],

[ 0., 1.]])
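
Two practical caveats are worth a short sketch. First, floating-point rounding means the product may only be approximately the identity, so np.allclose is a safer check than exact equality. Second, not every square matrix has an inverse; the singular matrix below is a hypothetical example chosen so that its second row is twice its first:

```python
import numpy as np

matrix = np.array([[1, 4],
                   [2, 5]])

# Floating-point rounding means the product is only approximately I
product = matrix @ np.linalg.inv(matrix)
print(np.allclose(product, np.eye(2)))  # True

# The second row is twice the first, so this matrix has no inverse
singular = np.array([[1, 2],
                     [2, 4]])
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError as error:
    print("Inversion failed:", error)
```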

See Also
Inverse of a Matrix

1.20 Generating Random Values

You want to generate pseudorandom values.

Use NumPy’s random:

# Load library
import numpy as np

# Set seed
np.random.seed(0)

# Generate three random floats between 0.0 and 1.0
np.random.random(3)

array([ 0.5488135 , 0.71518937, 0.60276338])

NumPy offers a wide variety of means to generate random numbers, many more than can be covered
here. In our solution we generated floats; however, it is also common to generate integers:

# Generate three random integers between 0 and 10

np.random.randint(0, 11, 3)

array([3, 7, 9])

Alternatively, we can generate numbers by drawing them from a distribution (note this is not technically
random):

# Draw three numbers from a normal distribution with mean 0.0

# and standard deviation of 1.0
np.random.normal(0.0, 1.0, 3)

array([-1.42232584, 1.52006949, -0.29139398])

# Draw three numbers from a logistic distribution with mean 0.0 and scale of 1.0
np.random.logistic(0.0, 1.0, 3)

array([-0.98118713, -0.08939902, 1.46416405])

# Draw three numbers greater than or equal to 1.0 and less than 2.0
np.random.uniform(1.0, 2.0, 3)
array([ 1.47997717, 1.3927848 , 1.83607876])

Finally, it can sometimes be useful to return the same random numbers multiple times to get predictable,
repeatable results. We can do this by setting the “seed” (an integer) of the pseudorandom generator.
Random processes with the same seed will always produce the same output. We will use seeds
throughout this book so that the code you see in the book and the code you run on your computer
produce the same results.
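
The effect of a seed can be checked directly. The sketch below uses NumPy's newer Generator interface (np.random.default_rng), which is an alternative to the np.random functions used in this recipe and is not the approach taken in the solution; the values it draws differ from those shown above, but the reproducibility principle is the same:

```python
import numpy as np

# Two generators created with the same seed
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

draws_a = rng_a.random(3)
draws_b = rng_b.random(3)

# Identical seeds give identical draws
print(np.array_equal(draws_a, draws_b))  # True
```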
Chapter 2. Loading Data


With Early Release ebooks, you get books in their earliest form—the authors’ raw and unedited content as they write—so you can
take advantage of these technologies long before the official release of these titles.
This will be the 2nd chapter of the final book.
If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the authors at

2.0 Introduction
The first step in any machine learning endeavor is to get the raw data into our system. The raw data
might be a logfile, dataset file, database, or cloud blob store such as Amazon S3. Furthermore, often we
will want to retrieve data from multiple sources.
The recipes in this chapter look at methods of loading data from a variety of sources, including CSV
files and SQL databases. We also cover methods of generating simulated data with desirable properties
for experimentation. Finally, while there are many ways to load data in the Python ecosystem, we will
focus on using the pandas library’s extensive set of methods for loading external data, and using
scikit-learn (an open source machine learning library in Python) for generating simulated data.

2.1 Loading a Sample Dataset

You want to load a preexisting sample dataset from the scikit-learn library.

scikit-learn comes with a number of popular datasets for you to use:

# Load scikit-learn's datasets

from sklearn import datasets

# Load digits dataset

digits = datasets.load_digits()
# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# View first observation
features[0]

array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13.,
15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11.,
8., 0., 0., 4., 12., 0., 0., 8., 8., 0., 0.,
5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0.,
1., 12., 7., 0., 0., 2., 14., 5., 10., 12., 0.,
0., 0., 0., 6., 13., 10., 0., 0., 0.])

Often we do not want to go through the work of loading, transforming, and cleaning a real-world dataset
before we can explore some machine learning algorithm or method. Luckily, scikit-learn comes with
some common datasets we can quickly load. These datasets are often called “toy” datasets because they
are far smaller and cleaner than a dataset we would see in the real world. Some popular sample datasets
in scikit-learn are:

load_iris
    Contains 150 observations on the measurements of Iris flowers. It is a good dataset for exploring
    classification algorithms.
load_digits
    Contains 1,797 observations from images of handwritten digits. It is a good dataset for teaching
    image classification.

To see more details on any of the datasets above, you can print the DESCR attribute:

# Load scikit-learn's datasets

from sklearn import datasets

# Load digits dataset

digits = datasets.load_digits()

# Print the attribute
print(digits.DESCR)


.. _digits_dataset:

Optical recognition of handwritten digits dataset


**Data Set Characteristics:**

:Number of Instances: 1797

:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@'
:Date: July; 1998

See Also
scikit-learn toy datasets
The Digit Dataset

2.2 Creating a Simulated Dataset

You need to generate a dataset of simulated data.

scikit-learn offers many methods for creating simulated data. Of those, three methods are particularly
useful: make_regression, make_classification, and make_blobs.
When we want a dataset designed to be used with linear regression, make_regression is a good choice:

# Load library
from sklearn.datasets import make_regression

# Generate features matrix, target vector, and the true coefficients

features, target, coefficients = make_regression(n_samples = 100,
n_features = 3,
n_informative = 3,
n_targets = 1,
noise = 0.0,
coef = True,
random_state = 1)

# View feature matrix and target vector

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ 1.29322588 -0.61736206 -0.11044703]
[-2.793085 0.36633201 1.93752881]
[ 0.80186103 -0.18656977 0.0465673 ]]
Target Vector
[-10.37865986 25.5124503 19.67705609]

If we are interested in creating a simulated dataset for classification, we can use
make_classification:


# Load library
from sklearn.datasets import make_classification

# Generate features matrix and target vector

features, target = make_classification(n_samples = 100,
n_features = 3,
n_informative = 3,
n_redundant = 0,
n_classes = 2,
weights = [.25, .75],
random_state = 1)

# View feature matrix and target vector

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ 1.06354768 -1.42632219 1.02163151]
[ 0.23156977 1.49535261 0.33251578]
[ 0.15972951 0.83533515 -0.40869554]]
Target Vector
[1 0 0]

Finally, if we want a dataset designed to work well with clustering techniques, scikit-learn offers
make_blobs:

# Load library
from sklearn.datasets import make_blobs

# Generate feature matrix and target vector

features, target = make_blobs(n_samples = 100,
n_features = 2,
centers = 3,
cluster_std = 0.5,
shuffle = True,
random_state = 1)
# View feature matrix and target vector
print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
[[ -1.22685609 3.25572052]
[ -9.57463218 -4.38310652]
[-10.71976941 -4.20558148]]
Target Vector
[0 1 1]

As might be apparent from the solutions, make_regression returns a feature matrix of float values and
a target vector of float values, while make_classification and make_blobs return a feature matrix of
float values and a target vector of integers representing membership in a class.
scikit-learn’s simulated datasets offer extensive options to control the type of data generated. scikit-
learn’s documentation contains a full description of all the parameters, but a few are worth noting.
In make_regression and make_classification, n_informative determines the number of features
that are used to generate the target vector. If n_informative is less than the total number of features
(n_features), the resulting dataset will have redundant features that can be identified through feature
selection techniques.
In addition, make_classification contains a weights parameter that allows us to simulate datasets
with imbalanced classes. For example, weights = [.25, .75] would return a dataset with 25% of
observations belonging to one class and 75% of observations belonging to a second class.
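
We can verify the effect of weights by counting the observations in each class. This is a rough sketch: it uses 1,000 samples rather than the 100 in the solution so the proportions are easier to read, and because make_classification randomly flips a small fraction of labels by default, the split is close to, but not exactly, 25%/75%:

```python
import numpy as np
from sklearn.datasets import make_classification

features, target = make_classification(n_samples=1000,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[.25, .75],
                                       random_state=1)

# Count the observations in each class and convert to proportions
class_counts = np.bincount(target)
print(class_counts / class_counts.sum())
```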
For make_blobs, the centers parameter determines the number of clusters generated. Using the
matplotlib visualization library, we can visualize the clusters generated by make_blobs:

# Load library
import matplotlib.pyplot as plt

# View scatterplot
plt.scatter(features[:,0], features[:,1], c=target)
plt.show()

See Also
make_regression documentation
make_classification documentation
make_blobs documentation

2.3 Loading a CSV File

You need to import a comma-separated values (CSV) file.

Use the pandas library’s read_csv to load a local or hosted CSV file into a pandas DataFrame:

# Load library
import pandas as pd

# Create URL
url = ''

# Load dataset
dataframe = pd.read_csv(url)

# View first two rows
dataframe.head(2)


integer datetime category

0 5 2015-01-01 00:00:00 0
1 5 2015-01-01 00:00:01 0

There are two things to note about loading CSV files. First, it is often useful to take a quick look at the
contents of the file before loading. It can be very helpful to see how a dataset is structured beforehand
and what parameters we need to set to load in the file. Second, read_csv has over 30 parameters and
therefore the documentation can be daunting. Fortunately, those parameters are mostly there to allow it
to handle a wide variety of CSV formats.
CSV files get their names from the fact that the values are literally separated by commas (e.g., one row
might be 2,"2015-01-01 00:00:00",0); however, it is common for “CSV” files to use other separators, such
as tabs (termed “TSVs”). The pandas sep parameter allows us to define the delimiter used in the file.
Although it is not always the case, a common formatting issue with CSV files is that the first line of the
file is used to define column headers (e.g., integer, datetime, category in our solution). The header
parameter allows us to specify if or where a header row exists. If a header row does not exist, we set
header=None.
The read_csv function returns a pandas DataFrame: a common and useful object for working with
tabular data that we’ll cover in more depth throughout this book.
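
As an illustration of the sep and header parameters, the sketch below reads tab-separated text that has no header row. The in-memory string raw_data is a hypothetical stand-in for a file on disk, and the column names passed through the names parameter are supplied by us rather than read from the data:

```python
import io
import pandas as pd

# Hypothetical tab-separated data with no header row
raw_data = "5\t2015-01-01 00:00:00\t0\n5\t2015-01-01 00:00:01\t0\n"

# Tell pandas about the delimiter, the missing header, and the column names
dataframe = pd.read_csv(io.StringIO(raw_data),
                        sep="\t",
                        header=None,
                        names=["integer", "datetime", "category"])

print(dataframe.head(2))
```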

2.4 Loading an Excel File

You need to import an Excel spreadsheet.

Use the pandas library’s read_excel to load an Excel spreadsheet:

# Load library
import pandas as pd

# Create URL
url = ''

# Load data
dataframe = pd.read_excel(url, sheet_name=0, header=1)

# View the first two rows
dataframe.head(2)


5 2015-01-01 00:00:00 0
0 5 2015-01-01 00:00:01 0
1 9 2015-01-01 00:00:02 0

This solution is similar to our solution for reading CSV files. The main difference is the additional
parameter, sheet_name, that specifies which sheet in the Excel file we wish to load. sheet_name can
accept both strings containing the name of the sheet and integers pointing to sheet positions
(zero-indexed). If we need to load multiple sheets, we include them as a list. For example,
sheet_name=[0,1,2, "Monthly Sales"] will return a dictionary of pandas DataFrames containing the first,
second, and third sheets and the sheet named Monthly Sales.

2.5 Loading a JSON File

You need to load a JSON file for data preprocessing.

The pandas library provides read_json to convert a JSON file into a pandas object:

# Load library
import pandas as pd

# Create URL
url = ''
# Load data
dataframe = pd.read_json(url, orient='columns')

# View the first two rows
dataframe.head(2)


category datetime integer

0 0 2015-01-01 00:00:00 5
1 0 2015-01-01 00:00:01 5

Importing JSON files into pandas is similar to the last few recipes we have seen. The key difference is
the orient parameter, which indicates to pandas how the JSON file is structured. However, it might
take some experimenting to figure out which argument (split, records, index, columns, and values)
is the right one. Another helpful tool pandas offers is json_normalize, which can help convert
semistructured JSON data into a pandas DataFrame.
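
To give a sense of what json_normalize does, here is a minimal sketch using a small list of nested records. The records and their field names are hypothetical and only loosely modeled on the columns in the solution:

```python
import pandas as pd

# Hypothetical nested records, as they might appear in a JSON file
records = [
    {"integer": 5, "meta": {"category": 0, "source": "sensor_a"}},
    {"integer": 9, "meta": {"category": 1, "source": "sensor_b"}},
]

# Nested keys are flattened into dotted column names
dataframe = pd.json_normalize(records)
print(list(dataframe.columns))
```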

See Also
json_normalize documentation

2.6 Loading a Parquet File

You need to load a Parquet file.

The pandas read_parquet function allows us to read in Parquet files:

# Load library
import pandas as pd

# Create URL
url = ''

# Load data
dataframe = pd.read_parquet(url)

# View the first two rows
dataframe.head(2)


category datetime integer

0 0 2015-01-01 00:00:00 5
1 0 2015-01-01 00:00:01 5

Parquet is a popular data storage format in the large data space. It is often used with big data tools such
as Hadoop and Spark. While PySpark is outside the focus of this book, it’s highly likely that companies
operating at large scale will use an efficient data storage format such as Parquet, and it’s valuable to know
how to read it into a dataframe and manipulate it.

See Also
Apache Parquet Documentation

2.7 Loading an Avro File

You need to load an Avro file into a pandas dataframe.

Use the pandavro library’s read_avro method:

# Load library
import pandavro as pdx

# Create URL
url = ''

# Load data
dataframe = pdx.read_avro(url)

# View the first two rows
dataframe.head(2)


category datetime integer

0 0 2015-01-01 00:00:00 5
1 0 2015-01-01 00:00:01 5

Apache Avro is an open source, binary data format that relies on schemas for the data structure. At the
time of writing it is not as common as Parquet. However, large binary data formats such as Avro, Thrift,
and Protocol Buffers are growing in popularity due to the efficient nature of these formats. If you work
with large data systems, you’re likely to run into one of these formats (such as Avro) in the near future.

See Also
Apache Avro Docs

2.8 Loading a TFRecord file

You need to load a TFRecord file into a pandas dataframe.


category datetime integer

0 0 2015-01-01 00:00:00 5
1 0 2015-01-01 00:00:01 5

Like Avro, TFRecord is a binary data format (in this case based on Protocol Buffers); however, it is
specific to TensorFlow.

See Also
TFRecord Docs

2.9 Querying a SQLite Database

You need to load data from a database using the structured query language (SQL).

pandas’ read_sql_query allows us to make a SQL query to a database and load it:
# Load libraries
import pandas as pd
from sqlalchemy import create_engine

# Create a connection to the database

database_connection = create_engine('sqlite:///sample.db')

# Load data
dataframe = pd.read_sql_query('SELECT * FROM data', database_connection)

# View first two rows
dataframe.head(2)

first_name last_name age preTestScore postTestScore

0 Jason Miller 42 4 25
1 Molly Jacobson 52 24 94

SQL is the lingua franca for pulling data from databases. In this recipe we first use create_engine to
define a connection to a SQL database engine called SQLite. Next we use pandas’ read_sql_query to
query that database using SQL and put the results in a DataFrame.
SQL is a language in its own right and, while beyond the scope of this book, it is certainly worth
knowing for anyone wanting to learn machine learning. Our SQL query, SELECT * FROM data, asks the
database to give us all columns (*) from the table called data.
Note that this is one of a few recipes in this book that will not run without extra code. Specifically,
create_engine('sqlite:///sample.db') assumes that an SQLite database already exists.
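
One way to get a database to query, sketched here under the assumption that an in-memory SQLite database is an acceptable stand-in for sample.db, is to build it with Python’s built-in sqlite3 module. The table layout below is a trimmed, hypothetical version of the output shown above, and the sketch passes the sqlite3 connection directly to pandas rather than using create_engine.

```python
import sqlite3

import pandas as pd

# Build a small in-memory database (a hypothetical stand-in for sample.db)
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE data (first_name TEXT, last_name TEXT, age INTEGER)"
)
connection.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [("Jason", "Miller", 42), ("Molly", "Jacobson", 52)],
)
connection.commit()

# Query the table and load the result into a DataFrame
dataframe = pd.read_sql_query("SELECT * FROM data", connection)
connection.close()
```

Replacing ":memory:" with a file path such as "sample.db" would persist the table to disk, which is what the create_engine call in the solution expects to find.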

See Also
W3Schools SQL Tutorial

2.10 Querying a Remote SQL Database

You need to connect to, and read from, a remote SQL database.

Create a connection with pymysql and read it into a dataframe with pandas:

# Import libraries
import pymysql
import pandas as pd

# Create a DB connection
# Use the example below to start a DB instance
conn = pymysql.connect(
    host="localhost",
    password="",
)

# Read the SQL query into a dataframe

dataframe = pd.read_sql("select * from data", conn)

# View the first 2 rows
dataframe.head(2)

integer datetime category

0 5 2015-01-01 00:00:00 0
1 5 2015-01-01 00:00:01 0

Of all the recipes presented in this chapter, this is probably the one we will use most in the
real world. While connecting to and reading from an example SQLite database is useful, it’s likely not
representative of the tables you’ll need to connect to in an enterprise environment. Most SQL instances
that you’ll connect to will require you to connect to the host and port of a remote machine, specifying a
username and password for authentication. This example requires you to start a running SQL instance
locally that mimics a remote server (the host is actually your localhost) so that you can get a
sense for the workflow.

See Also
Pymysql Documentation
Pandas Read SQL

2.11 Loading Data from a Google Sheet

You need to read data in directly from a Google Sheet.

Use pandas read_csv and a URL that exports the Google Sheet as a CSV:

# Import libraries
import pandas as pd

# Google Sheet URL that downloads the sheet as a CSV

url = "

# Read the CSV into a dataframe

dataframe = pd.read_csv(url)

# View the first 2 rows
dataframe.head(2)


integer datetime category

0 5 2015-01-01 00:00:00 0
1 5 2015-01-01 00:00:01 0

While Google Sheets can also easily be downloaded, it’s sometimes helpful to be able to read them
directly into Python without any intermediate steps. The /export?format=csv suffix at the
end of the URL above creates an endpoint from which we can either download the file or read it directly
into pandas.
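
The export endpoint can be assembled from a sheet’s ID, as in the minimal sketch below. The helper name sheet_csv_url and the placeholder ID are hypothetical, and the sketch assumes the sheet is shared so that anyone with the link can view it.

```python
def sheet_csv_url(sheet_id: str) -> str:
    """Build the CSV export URL for a Google Sheet with the given ID."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"


# "YOUR_SHEET_ID" is a placeholder; use the ID from your own sheet's URL
url = sheet_csv_url("YOUR_SHEET_ID")
```

The resulting string can be passed to pd.read_csv in the same way as the URL in the solution.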

See Also
Google Sheets API

2.12 Loading Data from an S3 Bucket

You need to read a CSV file from an S3 bucket you have access to.

Add storage options to pandas giving it access to the S3 object:

# Import libraries
import pandas as pd

# S3 path to csv
s3_uri = "s3://machine-learning-python-cookbook/data.csv"

# Set AWS credentials (replace with your own)

ACCESS_KEY_ID = "xxxxxxxxxxxxx"
SECRET_ACCESS_KEY = "xxxxxxxxxxxxxxxx"

# Read the csv into a dataframe

dataframe = pd.read_csv(s3_uri, storage_options={
    "key": ACCESS_KEY_ID,
    "secret": SECRET_ACCESS_KEY,
})

# View first two rows
dataframe.head(2)


integer datetime category

0 5 2015-01-01 00:00:00 0
1 5 2015-01-01 00:00:01 0
Many enterprises now keep data in cloud provider blob stores such as Amazon S3 or
Google Cloud Storage (GCS). It’s common for machine learning practitioners to connect to these
sources in order to retrieve data. Although the S3 URI above
(s3://machine-learning-python-cookbook/data.csv) is public, it still requires you to provide your own
AWS access credentials in order to access it. It’s worth noting that public objects also have HTTP URLs
from which they can be downloaded, such as this one for the CSV file above.
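
For public objects, an HTTP URL can usually be derived from the S3 URI. The sketch below assumes the virtual-hosted-style address on the global s3.amazonaws.com endpoint; the helper name s3_uri_to_http_url is hypothetical, and buckets served from a region-specific endpoint may need a different host.

```python
from urllib.parse import urlparse


def s3_uri_to_http_url(s3_uri: str) -> str:
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parsed = urlparse(s3_uri)
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    return f"https://{bucket}.s3.amazonaws.com/{key}"


http_url = s3_uri_to_http_url("s3://machine-learning-python-cookbook/data.csv")
```

A URL of this form can be given to pd.read_csv without storage_options, provided the object really is publicly readable.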

See Also
Amazon S3
Setting up AWS access credentials

2.13 Loading Unstructured Data

You need to load in unstructured data like text or images.

Use the base Python open function to load the information:

# Import libraries
import requests

# URL to download the txt file from
txt_url = ""

# Get the txt file
r = requests.get(txt_url)

# Write it to text.txt locally
with open('text.txt', 'wb') as f:
    f.write(r.content)

# Read in the file
with open('text.txt', 'r') as f:
    text = f.read()

# Print the content
print(text)
Hello there!

While structured data can easily be read in from CSV, JSON, or various databases, unstructured data
can be more challenging and may require custom processing down the line. Sometimes, it’s helpful to
open and read in files using Python’s basic open function. This allows us to open a file and then read
its contents.
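
The mode passed to open decides what comes back: text modes return strings, while binary modes return bytes, which is what images and other non-text files need. The sketch below illustrates this with a scratch file in a temporary directory; the path and contents are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "text.txt")

# Write two lines of text
with open(path, "w", encoding="utf-8") as f:
    f.write("Hello there!\nSecond line\n")

# Text mode: iterate over the file line by line, getting str objects
with open(path, "r", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

# Binary mode: read the raw bytes, as we would for an image
with open(path, "rb") as f:
    raw = f.read()
```

Using with ensures each file is closed as soon as its block ends, even if an error occurs while reading.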
See Also
Python’s open function
Context managers in Python
Chapter 3. Data Wrangling



3.0 Introduction
Data wrangling is a broad term used, often informally, to describe the process of transforming raw data
to a clean and organized format ready for use. For us, data wrangling is only one step in preprocessing
our data, but it is an important step.
The most common data structure used to “wrangle” data is the data frame, which can be both intuitive
and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like
you would see in a spreadsheet. Here is a data frame created from data about passengers on the Titanic:

# Load library
import pandas as pd
# Create URL
url = ''

# Load data as a dataframe

dataframe = pd.read_csv(url)

# Show first 5 rows
dataframe.head(5)

Name PClass Age Sex Survived SexCode

0 Allen, Miss Elisabeth Walton 1st 29.00 female 1 1
1 Allison, Miss Helen Loraine 1st 2.00 female 0 1
2 Allison, Mr Hudson Joshua Creighton 1st 30.00 male 0 0
3 Allison, Mrs Hudson JC (Bessie Waldo Daniels) 1st 25.00 female 0 1
4 Allison, Master Hudson Trevor 1st 0.92 male 1 0

There are three important things to notice in this data frame.

First, in a data frame each row corresponds to one observation (e.g., a passenger) and each column
corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can
see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, was female, and survived
the disaster.
Second, each column contains a name (e.g., Name, PClass, Age) and each row contains an index number
(e.g., 0 for the lucky Miss Elisabeth Walton Allen). We will use these to select and manipulate
observations and features.
Third, two columns, Sex and SexCode, contain the same information in different formats. In Sex, a
woman is indicated by the string female, while in SexCode, a woman is indicated by using the integer
1. We will want all our features to be unique, and therefore we will need to remove one of these
columns.
In this chapter, we will cover a wide variety of techniques to manipulate data frames using the pandas
library with the goal of creating a clean, well-structured set of observations for further preprocessing.
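
As a small illustration of the third point, the sketch below checks that one column is just a re-encoding of another and then drops it. It uses a hypothetical two-row frame modeled on the sample above rather than the full Titanic file.

```python
import pandas as pd

# Two rows modeled on the sample above, trimmed to three columns
dataframe = pd.DataFrame({
    "Name": ["Allen, Miss Elisabeth Walton", "Allison, Mr Hudson Joshua Creighton"],
    "Sex": ["female", "male"],
    "SexCode": [1, 0],
})

# Check that SexCode is Sex re-encoded (female -> 1, male -> 0)
encoded_sex = dataframe["Sex"].map({"female": 1, "male": 0})
is_redundant = (encoded_sex == dataframe["SexCode"]).all()

# Drop the redundant column, keeping a single representation
deduplicated = dataframe.drop(columns=["SexCode"])
```

Which of the two columns to keep depends on what comes next; many learning algorithms expect the numeric form.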

3.1 Creating a Data Frame

You want to create a new data frame.

pandas has many methods of creating a new DataFrame object. One easy method is to instantiate a
DataFrame using a Python dictionary. In the dictionary, each key is a column name and the value is a
list, where each item corresponds to a row:

# Load library
import pandas as pd

# Create a dictionary
dictionary = {
    "Name": ['Jacky Jackson', 'Steven Stevenson'],
    "Age": [38, 25],
    "Driver": [True, False]
}

# Create DataFrame
dataframe = pd.DataFrame(dictionary)

# Show DataFrame
dataframe

Name Age Driver

0 Jacky Jackson 38 True
1 Steven Stevenson 25 False

It’s easy to add new columns to any dataframe using a list of values:

# Add a column for eye color

dataframe["Eyes"] = ["Brown", "Blue"]

# Show DataFrame
dataframe

Name Age Driver Eyes

0 Jacky Jackson 38 True Brown
1 Steven Stevenson 25 False Blue

pandas offers what can feel like an infinite number of ways to create a DataFrame. In the real world,
creating an empty DataFrame and then populating it will almost never happen. Instead, our DataFrames
will be created from real data we have loaded from other sources (e.g., a CSV file or database).
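
Data arriving from other sources often comes as a list of records rather than a dictionary of columns. A minimal sketch of that route is shown below; the third person is hypothetical and only demonstrates adding a row with concat.

```python
import pandas as pd

# Each dictionary is one row; keys become column names
records = [
    {"Name": "Jacky Jackson", "Age": 38, "Driver": True},
    {"Name": "Steven Stevenson", "Age": 25, "Driver": False},
]
dataframe = pd.DataFrame.from_records(records)

# Add a (hypothetical) new row by concatenating a one-row DataFrame
new_row = pd.DataFrame([{"Name": "Molly Mooney", "Age": 40, "Driver": True}])
dataframe = pd.concat([dataframe, new_row], ignore_index=True)
```

Passing ignore_index=True renumbers the rows so the new one receives index 2 instead of repeating 0.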

3.2 Getting Information about the Data

You want to view some characteristics of a DataFrame.

One of the easiest things we can do after loading the data is view the first few rows using head:

# Load library
import pandas as pd

# Create URL
url = ''
# Load data
dataframe = pd.read_csv(url)

# Show two rows
dataframe.head(2)


Name PClass Age Sex Survived SexCode

0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1
1 Allison, Miss Helen Loraine 1st 2.0 female 0 1

We can also take a look at the number of rows and columns:

# Show dimensions
dataframe.shape

(1313, 6)

We can get descriptive statistics for any numeric columns using describe:

# Show statistics
dataframe.describe()

Age Survived SexCode

count 756.000000 1313.000000 1313.000000
mean 30.397989 0.342727 0.351866
std 14.259049 0.474802 0.477734
min 0.170000 0.000000 0.000000
25% 21.000000 0.000000 0.000000
50% 28.000000 0.000000 0.000000
75% 39.000000 1.000000 1.000000
max 71.000000 1.000000 1.000000

Additionally, the info method can show some helpful information:

# Show info
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 1313 non-null object
1 PClass 1313 non-null object
2 Age 756 non-null float64
3 Sex 1313 non-null object
4 Survived 1313 non-null int64
5 SexCode 1313 non-null int64
dtypes: float64(1), int64(2), object(3)
memory usage: 61.7+ KB

After we load some data, it is a good idea to understand how it is structured and what kind of
information it contains. Ideally, we would view the full data directly. But in most real-world cases,
the data could have thousands to millions of rows and columns. Instead, we
have to rely on pulling samples to view small slices and calculating summary statistics of the data.
In our solution, we are using a toy dataset of the passengers of the Titanic on her last voyage. Using
head we can take a look at the first few rows (five by default) of the data. Alternatively, we can use
tail to view the last few rows. With shape we can see how many rows and columns our DataFrame
contains. And finally, with describe we can see some basic descriptive statistics for any numerical
columns.
It is worth noting that summary statistics do not always tell the full story. For example, pandas treats the
columns Survived and SexCode as numeric columns because they contain 1s and 0s. However, in this
case the numerical values represent categories. For example, if Survived equals 1, it indicates that the
passenger survived the disaster. For this reason, some of the summary statistics provided don’t make
sense, such as the standard deviation of the SexCode column (an indicator of the passenger’s gender).
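
For columns whose numbers are really category codes, counting values is often more informative than a mean or standard deviation. The sketch below shows one possible approach on a hypothetical four-row frame modeled on the first rows of the Titanic data.

```python
import pandas as pd

# Four rows modeled on the start of the Titanic data
dataframe = pd.DataFrame({
    "Age": [29.0, 2.0, 30.0, 25.0],
    "Survived": [1, 0, 0, 0],
})

# describe treats Survived as an ordinary number (mean, std, quartiles)
numeric_summary = dataframe.describe()

# Counting each code is usually more meaningful for a coded column
survived_counts = dataframe["Survived"].value_counts()

# With a category dtype, describe reports count, unique, top, and freq instead
categorical_summary = dataframe["Survived"].astype("category").describe()
```

Here survived_counts shows how many passengers fall under each code, which is the question a standard deviation cannot answer.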

3.3 Slicing DataFrames

