Course3 Notes

-To make a histogram, we can use any of three equivalent forms: df_can['2013'].plot.hist(), some_data.plot(kind='type_plot', ...), and some_data.plot.type_plot(...). The (...) stands for any additional plotting arguments; the y-axis of the histogram shows the frequency.

-We can also plot multiple histograms on the same plot. For example, select the data with

df_can.loc[['Denmark', 'Norway', 'Sweden'], years]

-To generate the histogram

df_can.loc[['Denmark', 'Norway', 'Sweden'], years].plot.hist()

The result of this code does not look right, because pandas plots one histogram per column (here, per year). The dataframe must therefore be transposed so that each country becomes a column:

df_t = df_can.loc[['Denmark', 'Norway', 'Sweden'], years].transpose()

df_t.head()

-To generate the histogram


df_t.plot(kind='hist', figsize=(10,6))

plt.title('Histogram of Immigration from Denmark, Norway, and Sweden')
plt.ylabel('Number of Years')
plt.xlabel('Number of Immigrants')

plt.show()

-Now, let's make some modifications to improve the visualization: increase the number of bins to 15 with the bins parameter, set the transparency to 60% with the alpha parameter, align the x-axis ticks with the bin edges through the xticks parameter, label the x-axis, and change the colors of the plots with the color parameter.

count, bin_edges = np.histogram(df_t, 15)

df_t.plot(kind='hist', figsize=(10,6), bins=15, alpha=0.6, xticks=bin_edges,
          color=['coral', 'darkslateblue', 'mediumseagreen'])

plt.title('……………..')
plt.ylabel('………….')
plt.xlabel('…………..')

plt.show()
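
To see what kind of values np.histogram returns for count and bin_edges, here is a minimal pure-Python sketch of equal-width binning. The helper name equal_width_histogram and the sample values are hypothetical and only for illustration; the sketch assumes the values are not all identical. Like NumPy, it produces bins + 1 edges and treats the last bin as closed on the right.

```python
def equal_width_histogram(values, bins):
    """Return (counts, bin_edges) for equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # bins + 1 edges: the left edge of every bin plus the final right edge
    bin_edges = [lo + i * width for i in range(bins)] + [hi]
    counts = [0] * bins
    for v in values:
        # the last bin is closed on the right, so the maximum lands in it
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts, bin_edges


# hypothetical yearly immigrant counts
sample = [31, 45, 58, 93, 110, 128, 150, 205, 250, 308]
count, bin_edges = equal_width_histogram(sample, 5)
print(count)
print(bin_edges)
```

Passing bin_edges as xticks, as in the plot call above, puts one tick at every bin boundary.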

Tip: for a full listing of the colors available in Matplotlib, run the following code in the Python shell:
import matplotlib
for name, hex in matplotlib.colors.cnames.items():
    print(name, hex)

-We can also stack the histograms using the stacked parameter, and adjust the minimum and maximum of the x-axis by passing a tuple to the xlim parameter,

count, bin_edges = np.histogram(df_t, 15)

xmin = bin_edges[0] - 10 # first bin value is 31.0, subtracting a buffer of 10 for aesthetic purposes
xmax = bin_edges[-1] + 10 # last bin value is 308.0, adding a buffer of 10 for aesthetic purposes

#stacked histogram
df_t.plot(kind='hist',
          figsize=(10,6),
          bins=15,
          xticks=bin_edges,
          color=['coral', 'darkslateblue', 'mediumseagreen'],
          stacked=True,
          xlim=(xmin, xmax))

plt.title('…….')
plt.ylabel('…….')
plt.xlabel('…….')

plt.show()

2.4 Bar Charts (Dataframe)

To create a bar plot, we can pass one of two arguments via the kind parameter in plot(): kind='bar' creates a vertical bar plot, and kind='barh' creates a horizontal bar plot.

-get the data

df_iceland = df_can.loc['Iceland', years]
df_iceland.head()

# plot data
df_iceland.plot(kind='bar', figsize=(10,6))
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic Immigrants to Canada from 1980 to 2013')

plt.show()

We can annotate the plot using the annotate method of the scripting layer (the pyplot interface). We will pass in the following parameters:

s: str, the text of the annotation (this parameter is named text in newer Matplotlib versions)
xy: Tuple specifying the (x,y) point to annotate (in this case, the end point of the arrow)
xytext: Tuple specifying the (x,y) point to place the text (in this case, the start point of the arrow)
xycoords: The coordinate system that xy is given in; 'data' uses the coordinate system of the object being annotated (default).
arrowprops: Takes a dictionary of properties to draw the arrow:
    arrowstyle: specifies the arrow style, '->' is a standard arrow.
    connectionstyle: specifies the connection type. arc3 is a straight line.
    color: specifies the color of the arrow.
    lw: specifies the line width.
additional parameters:
    rotation: rotation angle of the text in degrees (counterclockwise)
    va: vertical alignment of the text ['center'|'top'|'bottom'|'baseline']
    ha: horizontal alignment of the text ['center'|'right'|'left']

df_iceland.plot(kind='bar', figsize=(10, 6), rot=90)

plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.title('Icelandic Immigrants to Canada from 1980 to 2013')

# Annotate arrow
plt.annotate('', # s: str. will leave it blank for no text
             xy=(32, 70), # place head of the arrow at point (year 2012 , pop 70)
             xytext=(28, 20), # place base of the arrow at point (year 2008 , pop 20)
             xycoords='data', # will use the coordinate system of the object being annotated
             arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='blue', lw=2)
             )

# Annotate Text
plt.annotate('2008 - 2011 Financial Crisis', # text to display
             xy=(28, 30), # start the text at point (year 2008 , pop 30)
             rotation=72.5, # based on trial and error to match the arrow
             va='bottom', # want the text to be vertically 'bottom' aligned
             ha='left', # want the text to be horizontally 'left' aligned.
             )

plt.show()
Another example;

#Get the data pertaining to the top 15 countries

df_can.sort_values(['Total'], ascending=False, inplace=True)

df_top15=df_can['Total'].head(15)

df_top15

#Plot data using horizontal bar chart

df_top15.plot(kind='barh', figsize=(10,6))

plt.xlabel('Number of Immigrants')

plt.ylabel('Country')

plt.title('LALALALA')

#annotate each bar with its value
for index, value in enumerate(df_top15):
    label = format(int(value), ',') # format the value with a thousands separator
    plt.annotate(label, xy=(value-47000, index-0.10), color='white')

plt.show()

Pie charts, box plots, scatter plots, and bubble plots


Pie Chart

Uses kind=pie keyword

Let’s use a pie chart to explore the proportion (percentage) of new immigrants grouped by
continents for the entire time period from 1980 to 2013.

Step 1: Gather data


We will use pandas groupby method to summarize the immigration data by continent. The general
process of groupby involves the following steps:
1.Split: Splitting the data into groups based on some criteria.
2.Apply: Applying a function to each group independently:
.sum()
.count()
.mean()
.std()
.aggregate()
.apply()
.etc..
3.Combine: Combining the results into a data structure.
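
The three steps above can be sketched without pandas to show what groupby followed by sum() does conceptually. The records and their totals below are hypothetical, made up only for illustration; this is not how pandas implements groupby internally.

```python
from collections import defaultdict

# hypothetical (country, continent, total) records
records = [
    ("Denmark", "Europe", 120),
    ("Norway", "Europe", 80),
    ("Japan", "Asia", 300),
    ("Iceland", "Europe", 10),
]

# 1. Split: collect the totals of each continent into its own group
groups = defaultdict(list)
for country, continent, total in records:
    groups[continent].append(total)

# 2. Apply: run sum() on every group independently
# 3. Combine: store the per-group results in one dictionary
totals_by_continent = {continent: sum(totals) for continent, totals in groups.items()}
print(totals_by_continent)
```

In pandas, the split happens in groupby('Continent') and the apply and combine steps happen when a function such as sum() is called on the result.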

Example:

# group countries by continents and apply sum() function
df_continents = df_can.groupby('Continent', axis=0).sum()

# note: the output of the groupby method is a 'groupby' object.
# we cannot use it further until we apply a function (e.g. sum())
print(type(df_can.groupby('Continent', axis=0)))

df_continents.head()
Step 2: Plot the data. We will pass in the kind='pie' keyword, along with the following additional parameters:
autopct - a string or function used to label the wedges (slices) with their numeric value. The label will be placed inside the wedge. If it is a format string, the label will be fmt % pct.
startangle - rotates the start of the pie chart by angle degrees counterclockwise from the x-axis.
shadow - draws a shadow beneath the pie (to give a 3D feel).

df_continents['Total'].plot(kind='pie', figsize=(5,6), autopct='%1.1f%%', startangle=90, shadow=True)

#the .1 in '%1.1f%%' means one digit after the decimal point. With 1.2 there are two digits after the decimal point, with 1.3 three digits, and so on.

plt.title('Immigration to Canada by Continent [1980-2013]')
plt.axis('equal') #sets the pie chart to look like a circle

plt.show()
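
Because a format string passed to autopct is applied as fmt % pct, its effect can be checked with plain Python string formatting. The percentage value below is hypothetical.

```python
# when autopct is a format string, each wedge is labelled with `fmt % pct`
pct = 23.4567  # hypothetical wedge percentage

one_decimal = '%1.1f%%' % pct   # one digit after the decimal point
two_decimals = '%1.2f%%' % pct  # two digits after the decimal point
print(one_decimal, two_decimals)
```

The doubled %% at the end of the format string produces a literal percent sign in the label.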

The above visual is not very clear: the numbers and text overlap in some instances. Let's make a few modifications to improve the visuals:
legend or plt.legend places a legend anywhere we want, so the labels can be removed from the wedges
pctdistance sets the distance of the percentage labels from the center of the pie
colors changes the color of each wedge
explode emphasizes wedges by offsetting them (in this case, the lowest three continents, which are Africa, North America, and Latin America and Caribbean)

colors_list = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue', 'lightgreen', 'pink']
explode_list = [0.1, 0, 0, 0, 0.1, 0.1] # ratio for each continent with which to offset each wedge

df_continents['Total'].plot(kind='pie', figsize=(15,6), autopct='%1.1f%%', startangle=90, shadow=True,
                            labels=None, pctdistance=1.12, colors=colors_list, explode=explode_list)

plt.title('Immigration to Canada by Continent [1980-2013]', y=1.12) # y=1.12 moves the title up along the y-axis

plt.axis('equal')
plt.legend(labels=df_continents.index, loc='upper left')

plt.show()

To plot the pie chart for 2013 only, type,

df_continents['2013'].plot(kind='pie',……..) #the rest is the same as the previous one

Box Plots

A box plot is a way of statistically representing the distribution of the data through five main
dimensions:
minimum: smallest number in the dataset
first quartile: middle number between the minimum and the median
second quartile (median): middle number of the (sorted) dataset
third quartile: middle number between median and maximum
maximum: highest number in the dataset
To make a box plot, pass kind='box' to plot().
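
The five dimensions listed above can be computed directly with Python's statistics module. This is a minimal sketch with hypothetical data; the function name five_number_summary is made up for illustration, and quartile conventions differ slightly between libraries, so other tools may report marginally different quartiles on the same data.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    ordered = sorted(values)
    # method='inclusive' interpolates linearly between the observed values
    q1, median, q3 = statistics.quantiles(ordered, n=4, method='inclusive')
    return ordered[0], q1, median, q3, ordered[-1]


yearly_counts = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
summary = five_number_summary(yearly_counts)
print(summary)
```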

Step1: Get the dataset.

#to get a dataframe, place extra square brackets around 'Japan'.
df_japan = df_can.loc[['Japan'], years].transpose()
df_japan.head()

Step 2: plot by passing in kind='box'

df_japan.plot(kind='box', figsize=(8,6))

plt.title('Box plot of Japanese Immigrants from 1980-2013')
plt.ylabel('Number of Immigrants')

plt.show()

We can immediately make a few key observations from the plot above:
1. The minimum number of immigrants is around 200 (min), the maximum is around 1300 (max), and the median is around 900.
2. 25% of the years in the period 1980-2013 had an annual immigrant count of ~500 or fewer (first quartile).
3. 75% of the years in the period 1980-2013 had an annual immigrant count of ~1100 or fewer (third quartile).

To make sure, we can use describe()

df_japan.describe()

To make horizontal box plots, we can pass the vert parameter to the plot function and set it to False. For example, if the distributions of both China and India are analysed using a dataframe df_CI, the code is as follows:

df_CI.plot(kind='box', figsize=(10,7), color='blue', vert=False)

plt.title('Box plots of Immigrants from China and India (1980-2013)')
plt.xlabel('Number of Immigrants')

plt.show()

Subplots

Often times we might want to plot multiple plots and put them in the same figure.

To visualize multiple plots together, we can create a figure (overall canvas) and divide it into
subplots, each containing a plot. With subplots, we usually work with the artist layer instead of the
scripting layer.

Typical syntax:

fig = plt.figure() # create figure
ax = fig.add_subplot(nrows, ncols, plot_number) # create subplots

Where:
nrows and ncols are used to notionally split the figure into (nrows*ncols) sub-axes,
plot_number is used to identify the particular subplot (first, second, third, and so on), counting from 1.

Example:

fig=plt.figure() #create figure

ax0=fig.add_subplot(1,2,1) # add subplot 1 (1 row, 2 columns, first plot)


ax1=fig.add_subplot(1,2,2) # add subplot 2 (1 row, 2 columns, second plot)

# Subplot 1: Box plot


df_CI.plot(kind='box', color='blue', vert=False, figsize=(20, 6), ax=ax0) # add to subplot 1
ax0.set_title('Box Plots of Immigrants from China and India (1980 - 2013)')
ax0.set_xlabel('Number of Immigrants')
ax0.set_ylabel('Countries')

# Subplot 2: Line plot


df_CI.plot(kind='line', figsize=(20, 6), ax=ax1) # add to subplot 2
ax1.set_title ('Line Plots of Immigrants from China and India (1980 - 2013)')
ax1.set_ylabel('Number of Immigrants')
ax1.set_xlabel('Years')

plt.show()
Additional info: subplot(211) == subplot(2,1,1)

Scatter Plot
Step1: Get dataset

#we can use the sum() method to get the total population per year
df_tot=pd.DataFrame(df_can[years].sum(axis=0))

#change the years to type int (useful for regression later on)
df_tot.index=map(int, df_tot.index)

#reset the index to put it back in as a column in the df_tot dataframe
df_tot.reset_index(inplace=True)

#rename columns
df_tot.columns=['year','total']

#view the final dataframe
df_tot.head()

Step 2: Plot the data. With the pandas plot() method, a scatter plot is created by passing kind='scatter', and the x and y columns must be specified explicitly (they are not inferred automatically).

df_tot.plot(kind='scatter', x='year', y='total', figsize=(10,6), color='darkblue')

plt.title('Total Immigration to Canada from 1980-2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.show()
Now, let's try to plot a linear line of best fit, and use it to predict the number of immigrants in 2015.

Step 1: Get the equation of the line of best fit. We will use NumPy's polyfit() method, passing in the following:
x = x-coordinates of the data
y = y-coordinates of the data
deg = degree of the fitting polynomial. 1=linear, 2=quadratic, and so on.

x=df_tot['year']
y=df_tot['total']
fit=np.polyfit(x,y,deg=1)
fit

In this case the slope is about 5.567e+03, stored at position 0 of fit, and the intercept is about -1.0926e+07, stored at position 1.

Step2: Plot the regression line on the scatter plot.

df_tot.plot(kind='scatter', x='year', y='total', figsize=(10,6), color='darkblue')

plt.title('Total Immigration to Canada from 1980-2013')
plt.xlabel('Year')
plt.ylabel('Number of Immigrants')

plt.plot(x, fit[0]*x + fit[1], color='red') #recall that x is the years
plt.annotate('y={0:.0f} x + {1:.0f}'.format(fit[0], fit[1]), xy=(2000, 150000))

plt.show()

#print out the line of best fit
'No. Immigrants = {0:.0f}*Year + {1:.0f}'.format(fit[0], fit[1])

Now we can predict the number of immigrants in 2015 by substituting the year into the equation:

No. Immigrants = 5567*Year - 10926195
No. Immigrants = 5567*2015 - 10926195
No. Immigrants = 291,310
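
For a straight line, the fit and the prediction can also be worked out by hand. The sketch below is a pure-Python ordinary least-squares fit, which is the quantity a degree-1 polynomial fit estimates; the function name fit_line and the small data set are hypothetical. The last lines re-evaluate the rounded equation from these notes for 2015.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# hypothetical data lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
prediction = slope * 10 + intercept  # extrapolate to x = 10

# the rounded equation from the notes, evaluated for the year 2015
notes_prediction = 5567 * 2015 - 10926195
print(slope, intercept, prediction, notes_prediction)
```

Extrapolating beyond the observed years assumes the linear trend continues, so the 2015 figure is an estimate rather than a measurement.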
Another example: create a scatter plot of the total immigration from Denmark, Norway, and Sweden to Canada from 1980 to 2013.

#create df_countries dataframe


df_countries=df_can.loc[['Denmark','Norway','Sweden'],years].transpose()

#create df_total by summing across three countries for each year


df_total=pd.DataFrame(df_countries.sum(axis=1))

#reset index in place


df_total.reset_index(inplace=True)

#rename columns
df_total.columns=['year','total']

#change column year from string to int to create scatter plot


df_total['year']=df_total['year'].astype(int)

#show resulting dataframe


df_total.head()

#plot the scatter plot

df_total.plot(kind='scatter', x='year', y='total', figsize=(10,6), color='darkblue')

plt.title('Immigration from Denmark, Norway, and Sweden to Canada from 1980-2013')


plt.xlabel('Year')
plt.ylabel('Number of Immigrants')
plt.show()

Bubble Plot
Analyzing Argentina's great depression and comparing it with Brazil.

Step 1: Get the data for Brazil and Argentina. Like in the previous example, we will convert the years to type int and bring them into the dataframe as a column.

df_can_t = df_can[years].transpose() # transposed dataframe


# cast the Years (the index) to type int
df_can_t.index = map(int, df_can_t.index)

# let's label the index. This will automatically be the column name when we reset the index
df_can_t.index.name = 'Year'

# reset index to bring the Year in as a column


df_can_t.reset_index(inplace=True)

# view the changes


df_can_t.head()

Step 2: Create the normalized weights.

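There are several methods of normalization in statistics, each with its own use. In this case, we will use feature scaling to bring all values into the range [0,1]. The general formula is:

X' = (X - X_min) / (X_max - X_min)

where X is an original value, X' is the normalized value, and X_min and X_max are the smallest and largest values in the data.

Therefore: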

# normalize Brazil data


norm_brazil = (df_can_t['Brazil'] - df_can_t['Brazil'].min()) / (df_can_t['Brazil'].max() -
df_can_t['Brazil'].min())

# normalize Argentina data


norm_argentina = (df_can_t['Argentina'] - df_can_t['Argentina'].min()) / (df_can_t['Argentina'].max()
- df_can_t['Argentina'].min())
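
As a quick check of the scaling, here is a minimal pure-Python sketch of min-max normalization followed by the marker-size transform (weight * 2000 + 10) used for the bubble plot. The function name min_max_normalize and the counts are hypothetical, and the sketch assumes the values are not all equal.

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


counts = [10, 20, 30]  # hypothetical yearly immigrant counts
weights = min_max_normalize(counts)

# scale the weights up for plotting and add 10 so the minimum stays visible
sizes = [w * 2000 + 10 for w in weights]
print(weights, sizes)
```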

Step 3: Plot the data.


-To plot two different scatter plots in one plot, we can include the axes of one plot in the other by passing it via the ax parameter.
-We will also pass in the weights using the s parameter. Given that the normalized weights are between 0 and 1, they won't be visible on the plot. Therefore, we will:
    -multiply the weights by 2000 to scale them up on the graph, and,
    -add 10 to compensate for the min value (which has a weight of 0 and would therefore stay invisible even after scaling by 2000).

#Brazil
ax0=df_can_t.plot(kind='scatter', x='Year', y='Brazil', figsize=(14,8), alpha=0.5, color='green',
                  s=norm_brazil*2000+10, xlim=(1975,2015))

#Argentina
ax1=df_can_t.plot(kind='scatter', x='Year', y='Argentina', alpha=0.5, color='blue',
                  s=norm_argentina*2000+10, ax=ax0)

ax0.set_ylabel('Number of Immigrants')
ax0.set_title('Immigration from Brazil and Argentina from 1980-2013')
ax0.legend(['Brazil','Argentina'], loc='upper left', fontsize='x-large')

WEEK 3-Advanced Visualizations and Geospatial Data


WAFFLE CHARTS, WORD CLOUDS, and REGRESSION PLOTS

This code follows the pandas and NumPy imports and the data preprocessing.

-Import Matplotlib

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches #needed for waffle charts

mpl.style.use('ggplot') #optional: for ggplot-like style

#check for latest version of Matplotlib
print('Matplotlib version:', mpl.__version__) # >= 2.0.0
WAFFLE CHART

-revisit the previous case study about Denmark, Norway, and Sweden

#create a new dataframe for these three countries
df_dsn = df_can.loc[['Denmark','Norway','Sweden'], :]

#let's take a look at our dataframe
df_dsn

-Unfortunately, unlike R, waffle charts are not built into any of the Python visualization libraries. Therefore, we will learn how to create them from scratch.

Step 1. The first step in creating a waffle chart is determining the proportion of each category with respect to the total.

#compute the proportion of each category with respect to the total
total_values=sum(df_dsn['Total'])
category_proportions=[(float(value)/total_values) for value in df_dsn['Total']]

#print out proportions
for i, proportion in enumerate(category_proportions):
    print(df_dsn.index.values[i] + ': ' + str(proportion))

Step 2. The second step is defining the overall size of the waffle chart

width = 40 # width of chart
height = 10 # height of chart

total_num_tiles=width*height # total number of tiles

print('Total number of tiles is', total_num_tiles)

Step 3. The third step is using the proportion of each category to determine its respective number of tiles.

# compute the number of tiles for each category
tiles_per_category = [round(proportion*total_num_tiles) for proportion in category_proportions]

# print out number of tiles per category
for i, tiles in enumerate(tiles_per_category):
    print(df_dsn.index.values[i] + ': ' + str(tiles))

Based on the calculated proportions, Denmark will occupy 129 tiles of the waffle chart, Norway will
occupy 77 tiles, and Sweden will occupy 194 tiles.

Step 4. The fourth step is creating a matrix that resembles the waffle chart and populating it.

#initialize the waffle chart as an empty matrix
waffle_chart = np.zeros((height, width))

#define indices to loop through waffle chart
category_index=0
tile_index=0

#populate the waffle chart
for col in range(width):
    for row in range(height):
        tile_index += 1

        #if the number of tiles populated for the current category is equal to its corresponding allocated tiles...
        if tile_index > sum(tiles_per_category[0:category_index]):
            #...proceed to the next category
            category_index += 1

        # set the class value to an integer, which increases with class
        waffle_chart[row, col] = category_index

print('Waffle chart populated!')

waffle_chart #to see what the matrix looks like
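
The filling logic can be checked in isolation. The following is a minimal pure-Python sketch of the same column-by-column fill, using nested lists instead of a NumPy matrix and the tile counts quoted above (129, 77, and 194). The function name build_waffle is hypothetical, and the sketch assumes the tile counts add up to width*height.

```python
def build_waffle(tiles_per_category, width, height):
    """Fill a height x width grid column by column with category numbers 1, 2, 3, ..."""
    grid = [[0] * width for _ in range(height)]
    category_index = 0
    tile_index = 0
    for col in range(width):
        for row in range(height):
            tile_index += 1
            # move on once the current category has used up its allocated tiles
            if tile_index > sum(tiles_per_category[:category_index]):
                category_index += 1
            grid[row][col] = category_index
    return grid


tiles = [129, 77, 194]  # Denmark, Norway, Sweden
grid = build_waffle(tiles, width=40, height=10)

# count how many cells each category received
flat = [cell for row in grid for cell in row]
counts = [flat.count(category) for category in (1, 2, 3)]
print(counts)
```

Each category ends up with exactly its allocated number of cells, which is what makes the waffle chart proportional.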


Step 5. Map the waffle chart matrix into a visual

#instantiate a new figure object


fig = plt.figure()

#use matshow to display the waffle chart


colormap=plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

Step 6. Prettify the chart.

#instantiate a new figure object


fig = plt.figure()

#use matshow to display the waffle chart


colormap=plt.cm.coolwarm
plt.matshow(waffle_chart, cmap=colormap)
plt.colorbar()

#get the axis
ax = plt.gca()

#set minor ticks
ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-.5, (height), 1), minor=True)

#add gridlines based on minor ticks
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)

#hide the major ticks on both axes
plt.xticks([])
plt.yticks([])
Step 7. Create a legend and add it to the chart.

Now it would be very inefficient to repeat these seven steps every time we wish to create a waffle chart. So let's combine all seven steps into one function called create_waffle_chart. This function would take the following parameters as input:
Word Clouds

A Python package already exists for generating word clouds. The package, called word_cloud, was developed by Andreas Mueller.

-First, let's install the package.

#install wordcloud
!conda install -c conda-forge wordcloud==1.4.1 --yes

#import package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS

print('Wordcloud is installed and imported!')

Word clouds are commonly used to perform high-level analysis and visualization of text data. Now, let's digress from the immigration to Canada data and analyse a short novel written by Lewis Carroll titled Alice's Adventures in Wonderland.

-First, download a .txt file of the novel.

#download file and save as alice_novel.txt
!wget --quiet https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/...... .txt

#open the file and read it into a variable alice_novel
alice_novel=open('alice_novel.txt', 'r').read()

print('File downloaded and saved!')

-Next, let's use the stopwords that we imported from word_cloud. We use the function set to remove any redundant stopwords.

stopwords = set(STOPWORDS)

Create a word cloud object and generate a word cloud. For simplicity, let's limit the cloud to at most 2000 words with the max_words parameter.

#instantiate a word cloud object
alice_wc = WordCloud(
    background_color='white',
    max_words=2000,
    stopwords=stopwords)

#generate the word cloud
alice_wc.generate(alice_novel)

#display the word cloud
plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The bigger a word appears, the more frequently it occurs in the text. Now, let's resize the cloud so that we can see the less frequent words a little better.

fig=plt.figure()
fig.set_figwidth(14) #set width
fig.set_figheight(18) #set height

#display the cloud
plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

"said" isn't really an informative word. So let's add it to our stopwords and re-generate the cloud.

stopwords.add('said') #add the word said to stopwords

#re-generate the word cloud
alice_wc.generate(alice_novel)

#display the cloud
fig=plt.figure()
fig.set_figwidth(14) #set width
fig.set_figheight(18) #set height

plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

word_cloud also provides the option to superimpose the words onto a mask of any shape. For example, using a mask of Alice and her rabbit.

# download image in png
!wget --quiet https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Images/alice_mask.png

# save mask to alice_mask (Image comes from the Pillow package)
from PIL import Image
alice_mask = np.array(Image.open('alice_mask.png'))

#show the png
fig=plt.figure()
fig.set_figwidth(14)
fig.set_figheight(18)

plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis('off')
plt.show()
Shaping the word cloud

#instantiate a word cloud object
alice_wc = WordCloud(background_color='white', max_words=2000, mask=alice_mask,
                     stopwords=stopwords)

#generate the word cloud
alice_wc.generate(alice_novel)

#display the word cloud
fig=plt.figure()
fig.set_figwidth(14) #set width
fig.set_figheight(18) #set height

plt.imshow(alice_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Regression Plot

-Install seaborn

#install seaborn
!conda install -c anaconda seaborn --yes

#import library
import seaborn as sns

print('Seaborn installed and imported!')

-Create a new dataframe that stores the total number of landed immigrants to Canada per year from 1980 to 2013.

#using the sum() method to get the total population per year
df_tot = pd.DataFrame(df_can[years].sum(axis=0))

#change the years to type float(useful for regression later on)


df_tot.index = map(float, df_tot.index)

#reset the index to put in back in as a column in the df_tot dataframe


df_tot.reset_index(inplace=True)

#rename columns
df_tot.columns = ['year', 'total']

#view the final dataframe


df_tot.head()

-generate the regression plot

import seaborn as sns
ax = sns.regplot(x='year', y='total', data=df_tot)

-customize the color

import seaborn as sns
ax = sns.regplot(x='year', y='total', data=df_tot, color='green')

-customize the marker shape, so instead of circular markers, let's use '+'.

import seaborn as sns
ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')

-blow up the plot a little bit so that it is more appealing to the eye

plt.figure(figsize=(15,10))
ax = sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+')
-increase the size of the markers so they match the new size of the figure, and add a title and x- and y-labels.

plt.figure(figsize=(15,10))
ax=sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s':200})

ax.set(xlabel='Year', ylabel='Total Immigration') #add x- and y-labels
ax.set_title('Total Immigration to Canada from 1980-2013') #add title

-increase the font size of the tickmark labels, the title, and the x- and y-labels.

plt.figure(figsize=(15,10))

sns.set(font_scale=1.5)

ax=sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s':200})

ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980-2013')

-change the background to a plain white background

plt.figure(figsize=(15,10))

sns.set(font_scale=1.5)
sns.set_style('ticks') # change background to white background

ax=sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s':200})

ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980-2013')

-or to a white background with gridlines.

plt.figure(figsize=(15,10))

sns.set(font_scale=1.5)
sns.set_style('whitegrid')

ax=sns.regplot(x='year', y='total', data=df_tot, color='green', marker='+', scatter_kws={'s':200})

ax.set(xlabel='Year', ylabel='Total Immigration')
ax.set_title('Total Immigration to Canada from 1980-2013')
Another example: use seaborn to create a scatter plot with a regression line to visualize the total immigration from Denmark, Sweden, and Norway to Canada from 1980 to 2013.

WEEK 3-2 GENERATING MAPS WITH PYTHON
In this session, we will learn how to create maps for different objectives. We will use a Python visualization library, namely Folium, instead of Matplotlib. Folium was developed for the sole purpose of visualizing geospatial data. Other libraries, such as plotly, are available to visualize geospatial data, but they might have a cap on how many API calls you can make within a defined time frame. Folium, on the other hand, is completely free.

Two datasets used:


-San Francisco Police Department Incidents for the year 2016
-Immigration to Canada from 1980 to 2013

-Install Folium

!conda install -c conda-forge folium=0.5.0 --yes

import folium

print('Folium installed and imported!') #if this is printed, Folium was installed and imported successfully.

-Generating the world map is straightforward in Folium. You simply create Folium Map object and
then you display it. What is attractive about Folium maps is that they are interactive, so you can
zoom into any region of interest despite the initial zoom level.

#define the world map


world_map = folium.Map()

#display world map


world_map

All locations on a map are defined by their respective latitude and longitude values. So you can create a map and pass in a center as a pair of latitude and longitude values, for example [0, 0].

For a defined center, you can also define the initial zoom level into that location when the map is rendered. The higher the zoom level, the more the map is zoomed into the center.

#define the world map centered around Canada with a low zoom level
world_map=folium.Map(location=[56.130, -106.35], zoom_start=4)

#display world map


world_map
-let’s create the map again with a higher zoom level

# define the world map centered around Canada with a higher zoom level
world_map=folium.Map(location=[56.130, -106.35], zoom_start=8) #the first value is the latitude, and the second is the longitude.

#display world map


world_map

A. Stamen Toner Maps


These are high-contrast B+W (black and white) maps. They are perfect for data mashups and exploring river meanders and coastal zones.

#create a Stamen Toner map of the world centered around Canada
world_map = folium.Map(location=[56.130, -106.35], zoom_start=4, tiles='Stamen Toner')

#display map
world_map

B. Stamen Terrain Maps

These are maps that feature hill shading and natural vegetation colors. They showcase advanced
labelling and linework generalization of dual-carriageway roads.

#create a Stamen Terrain map of the world centered around Canada
world_map = folium.Map(location=[56.130, -106.35], zoom_start=4, tiles='Stamen Terrain')

#display map
world_map
C. Mapbox Bright Maps

These are maps that are quite similar to the default style, except that the borders are not visible at a low zoom level.

#create a world map with a Mapbox Bright style.
world_map=folium.Map(tiles='Mapbox Bright')

#display the map


world_map
MAPS WITH MARKERS

Let's download and import the data on police incidents using the pandas read_csv() method.

-download the dataset and read it into a pandas dataframe:

df_incidents = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Police_Department_Incidents_-_Previous_Year__2016_.csv')

print('Dataset downloaded and read into a pandas dataframe!')

-take a look at the first five items in our dataset.

df_incidents.head()

Each row consists of 13 features:

df_incidents.shape
(150500,13)

-So the dataframe consists of 150,500 crimes, which took place in the year 2016. In order to reduce computational cost, let's just work with the first 100 incidents in this dataset.

#get the first 100 crimes in the df_incidents dataframe


limit=100
df_incidents=df_incidents.iloc[0:limit, :]

df_incidents.shape

(100,13)

Now that we have reduced the data a little bit, let's visualize where these crimes took place in the city of San Francisco. We will use the default style and we will initialize the zoom level to 12.

#San Francisco latitude and longitude values


latitude = 37.77
longitude = -122.42

#create map and display it


sanfran_map=folium.Map(location=[latitude, longitude], zoom_start=12)

#display the map of San Francisco


sanfran_map

*additional: zip in Python

The zip() function returns a zip object, which is an iterator of tuples where the first items of the passed iterators are paired together, then the second items of the passed iterators are paired together, and so on.
Example:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica")

x=zip(a,b)

print(tuple(x))

Result:

(('John', 'Jenny'), ('Charles', 'Christy'), ('Mike', 'Monica'))
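
zip() also accepts more than two iterables, which is how the marker loops in this section walk through latitude, longitude, and label together. The sketch below uses hypothetical coordinates and labels, and it shows that zip() stops at the shortest input.

```python
# hypothetical coordinates and labels, for illustration only
latitudes = [37.77, 37.78, 37.76]
longitudes = [-122.42, -122.41, -122.43]
labels = ["category A", "category B"]  # one label short on purpose

# zip() pairs items by position and stops at the shortest iterable
rows = list(zip(latitudes, longitudes, labels))

for lat, lng, label in rows:
    print(lat, lng, label)
```

Only two rows are produced here, so it is worth checking that all columns have the same length before zipping them.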

-Now let's superimpose the locations of the crimes onto the map. The way to do that in Folium is to create a feature group with its own features and style and then add it to the sanfran map.

#instantiate a feature group for the incidents in the dataframe
incidents=folium.map.FeatureGroup()

#loop through the 100 crimes and add each to the incidents feature group
for lat, lng in zip(df_incidents.Y, df_incidents.X):
    incidents.add_child(
        folium.features.CircleMarker(
            [lat, lng],
            radius=5, #define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6))

# add incidents to map
sanfran_map.add_child(incidents)
-You can also add some pop-up text that would get displayed when you hover over a marker. Let’s
make each marker display the category of the crime when hovered over.

#instantiate a feature group for the incidents in the dataframe


incidents = folium.map.FeatureGroup()

#loop through the 100 crimes and add each to the incidents feature group
for lat, lng in zip(df_incidents.Y, df_incidents.X):
incidents.add_child(
folium.features.CircleMarker(
[lat, lng]
radius=5 #define how big you want the circle markers to be
color=’yellow’
fill=True
fill_color=’blue’,
fill_opacity=0.6))

#add pop-up text to each marker on the map


latitudes=list(df_incidents.Y)
longitudes=list(df_incidents.X)
labels=list(df_incidents.Category)

#to recap, Y, X, and Category are three columns in the dataframe

for lat, lng, label in zip(latitudes, longitudes, labels):
    folium.Marker([lat, lng], popup=label).add_to(sanfran_map)

#add incidents to map


sanfran_map.add_child(incidents)
The map may now look quite congested. There are two remedies for this
problem.

1. The simpler solution is to remove the location markers and add the pop-up text to the circle
markers themselves, as follows:

#create map and display it


sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)

#loop through the 100 crimes and add each to the map
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5, #define how big you want the circle markers to be
        color='yellow',
        fill=True,
        popup=label,
        fill_color='blue',
        fill_opacity=0.6).add_to(sanfran_map)

#show map
sanfran_map
2. The second, more robust approach is to group the markers into clusters. Each cluster
is then represented by the number of crimes in each neighbourhood. These clusters can be thought
of as pockets of San Francisco which you can then analyse separately.

To implement this, we start off by instantiating a MarkerCluster object and adding all the data points
in the dataframe to this object.

from folium import plugins

#starting again with a clean copy of the map of San Francisco


sanfran_map=folium.Map(location=[latitude, longitude], zoom_start=12)

#instantiate a marker cluster object for the incidents in the dataframe


incidents=plugins.MarkerCluster().add_to(sanfran_map)

#loop through the dataframe and add each data point to the marker cluster
for lat, lng, label in zip(df_incidents.Y, df_incidents.X, df_incidents.Category):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label).add_to(incidents)

#display map
sanfran_map
When you zoom out all the way, all markers are grouped into one cluster.
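
The idea of replacing many markers with a single count can be sketched without Folium. The example below bins hypothetical points into a coarse grid and counts them per cell. It only illustrates grouping by proximity; the MarkerCluster plugin applies its own clustering logic in the browser, and the coordinates, the cell size, and the helper cell_of are all made up for this sketch.

```python
from collections import Counter

# Hypothetical (lat, lng) points; not taken from the crime dataset
points = [
    (37.771, -122.421),
    (37.772, -122.423),
    (37.779, -122.428),
    (37.801, -122.401),
    (37.803, -122.402),
]

cell_size = 0.01  # grid cell width in degrees, chosen arbitrarily


def cell_of(lat, lng, size):
    """Return the integer grid cell that contains the point."""
    return (int(lat // size), int(lng // size))


# One counter entry per occupied cell, analogous to one cluster bubble per area
counts = Counter(cell_of(lat, lng, cell_size) for lat, lng in points)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

A larger cell_size merges more points into each cell, which loosely mirrors what happens to the clusters as you zoom out.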

Choropleth Maps

-Download the dataset and read it into a pandas dataframe. (N.B. if xlrd is not installed, install it
first by running: !conda install -c anaconda xlrd --yes)

df_can=pd.read_excel('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/Canada.xlsx',
    sheet_name='Canada by Citizenship',
    skiprows=range(20),
    skipfooter=2)

print('Data downloaded and read into a dataframe!')

-Take a look at the first five rows of the dataset.


df_can.head()

-Clean up and pre-process the data

#clean up the dataset to remove unnecessary columns (e.g. REG)


df_can.drop(['AREA', 'REG', 'DEV', 'Type', 'Coverage'], axis=1, inplace=True)
#let's rename the columns so that they make sense
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'},
    inplace=True)

#for the sake of consistency, let's also make all column labels of type string
df_can.columns=list(map(str, df_can.columns))

#add total column (numeric_only=True skips the text columns; recent pandas versions may raise an error without it)


df_can['Total']=df_can.sum(axis=1, numeric_only=True)

#years that we will be using in this lesson, useful for plotting later on


years=list(map(str, range(1980, 2014)))
print('data dimensions:', df_can.shape)

-Take a look at the first five rows of the cleaned dataframe.

df_can.head()

-In order to create a Choropleth map, we need a GeoJSON file that defines the areas/boundaries of
the state, county, or country that we are interested in. In our case, since we are endeavouring to
create a world map, we want a GeoJSON that defines the boundaries of all world countries. For our
convenience, the developer has provided such a file, which can be downloaded. Let’s name it
world_countries.json.

#download countries geojson file


!wget --quiet https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DV0101EN/labs/Data_Files/world_countries.json -O world_countries.json

print('GeoJSON file downloaded!')
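
The choropleth call further down matches dataframe rows to map shapes through a key such as feature.properties.name, so it helps to see the shape of a GeoJSON file first. The sketch below parses a tiny hand-written FeatureCollection (the two features and their empty geometries are placeholders, not the contents of world_countries.json) and follows that dotted path by hand. The helper resolve_key is hypothetical and only mimics the lookup for illustration; it is not Folium's implementation.

```python
import json

# Hand-written stand-in for a GeoJSON file; real files carry full polygon coordinates
geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"name": "Canada"},
     "geometry": {"type": "Polygon", "coordinates": []}},
    {"type": "Feature", "properties": {"name": "Denmark"},
     "geometry": {"type": "Polygon", "coordinates": []}}
  ]
}
"""
geo = json.loads(geojson_text)


def resolve_key(feature, key_on):
    """Hypothetical helper: follow a dotted path like 'feature.properties.name'."""
    value = feature
    for part in key_on.split(".")[1:]:  # skip the leading 'feature'
        value = value[part]
    return value


names = [resolve_key(f, "feature.properties.name") for f in geo["features"]]
print(names)
```

Each name found this way has to match a value in the dataframe's Country column for that country to be coloured on the map.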

-Now that we have the GeoJSON file, let’s create a world map, centered around [0,0] latitude and longitude
values, with an initial zoom level of 2, and using the Mapbox Bright style.

world_geo=r'world_countries.json' #geojson file

#create a plain world map


world_map=folium.Map(location=[0,0], zoom_start=2, tiles='Mapbox Bright')

-And now to create a Choropleth map, we will use the choropleth method with the following main
parameters:

#generate choropleth map using the total immigration of each country to Canada from 1980 to 2013

world_map.choropleth(
    geo_data=world_geo,
    data=df_can,
    columns=['Country', 'Total'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Immigration to Canada')

#display map
world_map

Notice how the legend is displaying a negative boundary or threshold. Let’s fix that by defining our
own thresholds and starting with 0 instead of -6,918!

world_geo=r'world_countries.json'

#create a numpy array of length 6 with linear spacing from the minimum total immigration to the
#maximum total immigration
threshold_scale=np.linspace(df_can['Total'].min(), df_can['Total'].max(), 6, dtype=int)
threshold_scale=threshold_scale.tolist() #change the numpy array to a list
threshold_scale[-1]=threshold_scale[-1]+1 #make sure that the last value of the list is greater than the maximum immigration

#let Folium determine the scale


world_map=folium.Map(location=[0,0], zoom_start=2, tiles='Mapbox Bright')
world_map.choropleth(
    geo_data=world_geo,
    data=df_can,
    columns=['Country', 'Total'],
    key_on='feature.properties.name',
    threshold_scale=threshold_scale,
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Immigration to Canada',
    reset=True)

world_map
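
The threshold logic can be checked on its own, without drawing a map. A minimal sketch in plain Python, assuming made-up totals (the numbers below are hypothetical) and integer arithmetic as a stand-in for np.linspace with dtype=int. It starts the scale at 0 and nudges the last threshold above the maximum:

```python
# Hypothetical totals per country; not the real immigration figures
totals = [1200, 58639, 3310, 691904, 15699]

n_thresholds = 6
low, high = 0, max(totals)

# Evenly spaced integer thresholds from low to high;
# integer arithmetic avoids float rounding at the end points
threshold_scale = [
    low + (high - low) * i // (n_thresholds - 1) for i in range(n_thresholds)
]

# The last threshold must exceed the maximum so the largest value falls inside a bin
threshold_scale[-1] = threshold_scale[-1] + 1

print(threshold_scale)
```

Starting from 0 rather than the data minimum is a choice made for this sketch so that the legend begins at a round number.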
