
Fundamentals of Data Visualization


Data Visualization through Microsoft Power BI
Marcelo Guerra Hahn

Fundamentals of Data Visualization: Data Visualization through Microsoft Power BI
© Marcelo Hahn 2024
ISBN 978-87-403-4909-2

FUNDAMENTALS OF DATA VISUALIZATION Contents

CONTENTS
About the Author

Preface

1 Visualization
1.1 Introduction
1.2 Why Do Data Visualizations Work?
1.3 What Are Data Visualizations Used For?
1.4 When Not to Use Visualizations

2 Data Fields
2.1 Introduction
2.2 Basic Field Types

3 Field Transformations
3.1 Introduction
3.2 Splitting
3.3 Concatenation
3.4 Simple Math

4 Bar Charts
4.1 Introduction
4.2 Line Charts
4.3 Maps

5 Single Variable Statistics
5.1 Introduction
5.2 Frequency Tables
5.3 Histograms
5.4 Central Tendency and Variability
5.5 Percentile Ranks
5.6 Reference Lines
5.7 Aggregate Functions
5.8 Box Plot

6 Correlations
6.1 Introduction
6.2 Scatterplots
6.3 Trend Lines
6.4 Measures of Fit

7 Time Series
7.1 Introduction
7.2 Formatting Dates
7.3 Using Cycle Plots to Depict Seasonality
7.4 Forecasting

8 Storytelling
8.1 Introduction
8.2 Aspects of a Good Story
8.3 Advantages of Storytelling
8.4 Creating a Story

Conclusion

Table of Figures

References


ABOUT THE AUTHOR


Marcelo Guerra Hahn currently leads the Bachelor of Science Department at Lake Washington
Institute of Technology. With over 19 years of experience in data analysis and software
development, Marcelo continues to work on what he’s most passionate about, helping
people see and understand data. Before joining LWIT, Marcelo was Director of Engineering
for SoundCommerce, a startup in the Big Data space, and a Senior Manager at Tableau
and Microsoft. He holds a master’s in computer science from Universidad de la República,
an MBA and master’s in applied mathematics from the University of Washington, and a
master’s in applied statistics from Texas A&M University.


PREFACE
Data Visualization simplifies analyzing data by presenting it in a format more accessible for
human brains to understand. Since our eyes are more drawn to colors and patterns than
words and numbers, this tool helps communicate information faster and more efficiently
using graphical representations such as charts, tables, and maps.

For example, an organization can choose to represent its sales data with a map in addition
to words and numbers. This map, color-coded by sales figures, would help the organization
understand where sales increase or decrease and how the trend moves over a certain period.

Data visualization is a helpful tool in every field in which data plays an important role,
in industries ranging from marketing to finance to tech. As a result, the ability to produce
visualizations has become an important, increasingly sought-after skill in today’s data
analysis. This demand has also led to the rise of visualization tools such as PowerBI,
Tableau, and Looker.

This book introduces the basic concepts behind visualizations and provides examples of each
in PowerBI, with step-by-step tutorials on how to create each visualization.


1 VISUALIZATION

1.1 INTRODUCTION
PowerBI is a Microsoft-owned data visualization tool. Microsoft’s official website says,
“PowerBI is a business analytics solution that allows you to visualize your data and share
insights across your organization, or embed them in your app or website.” Coming from a
different technology area, I was not aware of the full power of visualizations or how they
even worked, for that matter. As I started researching the subject, I came across Hans
Rosling’s TED Talk, “The best stats you’ve ever seen.”

If you have 20 minutes to spare, I recommend watching it. What Doctor Rosling did in this
presentation showed me the real power of visualizations. In short, he had come up with an
insight that, even though factually correct, goes against people’s perceptions. He then used
a very well-crafted visualization to explain that insight to the public. This ability to
change perceptions is the key reason visualizations are becoming increasingly prevalent. By
the end of this book, you should have acquired the tools needed to build the visualization
Hans showed in his talk.

Data visualization uses graphic elements to represent data in a more understandable and
applicable manner. The effective use of visualizations has become essential in this era
of massive amounts of complex data, mainly because it provides a practical way of
representing patterns and trends.

In the past, visualization tools were not as widespread as they are today; only in the last
few decades have they become common tools that anyone can use. Twenty years ago, data
analytics was mainly reserved for statisticians with the specialized skills to collect,
analyze, and interpret data and to create visualizations, often in support of companies
with massive supercomputers. The works of important figures such as Edward Tufte and John
Tukey paved the way for visualization techniques to reach more than just these statisticians.
As technology advanced, data visualization became widely available, facilitating
data exploration and analysis.


1.2 WHY DO DATA VISUALIZATIONS WORK?


The human brain has evolved to rely on shapes and colors as primary information indicators.
As a result, visual cues are more intuitively processed than text. Since our brain processes
color pre-attentively, meaning it does not need any conscious effort to recognize it, we can
process hue, saturation, and brightness more quickly than words and numbers alone.

For example, look at the following sequence of numbers and identify all instances of the
number 5.

16385024654420

39321856807581

93071210762917

71908091154386

36048914702605

Now look at the same sequence, this time with every 5 highlighted in color, and try again
to identify all instances of the number 5.

16385024654420

39321856807581

93071210762917

71908091154386

36048914702605

Notice how the brain instinctively picks out a visual characteristic such as color much more
rapidly than the shape of a number. Using color makes interpreting sequences and identifying
patterns easier and less effortful.

As a second example, look at the following table containing the World Population and
Annual Growth Rate between 2010 and 2020, and identify if these numbers are increasing
or decreasing.


Year World Population Annual Growth Rate

2010 6,956,823,603 1.22%

2011 7,041,194,301 1.21%

2012 7,125,828,059 1.20%

2013 7,210,581,976 1.19%

2014 7,295,290,765 1.17%

2015 7,379,797,139 1.16%

2016 7,464,022,049 1.14%

2017 7,547,858,925 1.12%

2018 7,631,091,040 1.10%

2019 7,713,468,100 1.08%

2020 7,794,798,739 1.05%

Table 1.1 World Population and Annual Growth Rate in The Last Decade

Then look at the following graph and identify if the population and rate are increasing or
decreasing. Notice how that data is more easily processed and the rate of increase more
quickly interpreted when the numbers are represented in a graph.

Figure 1.1 World Population and Annual Growth Rate in The Last Decade
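
The growth rates in Table 1.1 can also be recomputed from the population column. The Python sketch below is only an illustration and is not part of the PowerBI workflow used in this book. It assumes that the rate listed for a year is the relative change from the previous year's population, which is why no rate is computed for 2010.

```python
# World population by year, copied from Table 1.1.
population = {
    2010: 6_956_823_603,
    2011: 7_041_194_301,
    2012: 7_125_828_059,
    2013: 7_210_581_976,
    2014: 7_295_290_765,
    2015: 7_379_797_139,
    2016: 7_464_022_049,
    2017: 7_547_858_925,
    2018: 7_631_091_040,
    2019: 7_713_468_100,
    2020: 7_794_798_739,
}


def annual_growth_rates(pop):
    """Return {year: percent change from the previous year's population}."""
    years = sorted(pop)
    return {
        year: (pop[year] - pop[prev]) / pop[prev] * 100
        for prev, year in zip(years, years[1:])
    }


rates = annual_growth_rates(population)
for year, rate in rates.items():
    print(f"{year}: {rate:.2f}%")
```

The population grows every year while the growth rate shrinks every year, which is the pattern the graph makes visible at a glance.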


1.3 WHAT ARE DATA VISUALIZATIONS USED FOR?


Data visualization, as previously mentioned, facilitates the analysis of data. Consequently,
it proves crucial in every field in which data plays an important role, such as
finance, healthcare, and marketing. The list below shows some ways visualizations
can be applied in these fields.

Finance: In this field, data visualizations help reveal patterns, anomalies, fluctuations in the
price of assets, etc. Financial visualizations usually shift the emphasis from retrospective
insight to forecasting, helping to make data-driven decisions about, for example, entering
new markets, financing and investing, and analyzing assets and capital.

Healthcare: Some public health variables, such as mortality rates for a particular disease
across a specific area, can be represented through a colored map. Infographics, in
addition, are a pivotal device for educating the public about public health matters. Moreover,
visualization techniques can give health personnel insight into a patient’s care
coordination and help them track and monitor a patient’s health status, records, etc.

Marketing: Visualizations can help marketing teams and stakeholders tap into sales figures,
market research analytics, campaign statistics, marketing strategies, etc., aiding in their
decision-making process. Marketing teams could, for example, create visualizations as a tool
when crafting content for campaigns and sharing data with consumers.

1.4 WHEN NOT TO USE VISUALIZATIONS


Data visualizations are powerful tools that can also be dangerous. Because people take less
time to process visual cues than numbers, they can also arrive at a wrong conclusion more
quickly. As a result, visualizations tend to make viewers believe that they understand the
data presented better than they actually do.

This can happen intentionally, when unscrupulous individuals abuse visualizations to
deliver wrong information, or inadvertently, when data is misinterpreted. Inaccurate visuals,
incongruence between the data and the visual representation, logical fallacies, creative
misrepresentations, or improper techniques will result in the viewer being misled and
reaching false conclusions.


2 DATA FIELDS

2.1 INTRODUCTION
Choosing the best possible visualization for the data at hand is one of the critical aspects
of data visualization. Naturally, multiple factors play a role in this choice.
The first factor we must consider is the type of each data field.

2.2 BASIC FIELD TYPES


When we look at a data field, we can distinguish two dimensions: syntax and semantics.
Syntax refers to how a value is written, as a series of digits and letters (i.e., numbers
and words). Semantics, on the other hand, refers to what the value means (e.g., the United
States represents a particular part of the world, and October represents a particular
period in time). Fundamentally, every concept can be written as a number or a series of
characters.

2.2.1 NUMBERS

Mathematically, numbers can be classified in multiple ways. For creating visualizations,
however, two distinctions matter most: whole vs. decimal and discrete vs. continuous.

2.2.1.1 Whole Vs. Decimal


A whole number is a number that doesn’t have decimals. For example, if we measure the
number of customers, we will use a whole number: we cannot have 1.5 customers. On the
other hand, if we measure product sales, we will likely get a decimal number: we can sell
a product for $99.99.

2.2.1.2 Discrete Vs. Continuous


A discrete variable can only take specific, separate values, usually whole numbers within a
range. For example, numbered months are discrete since we can only choose from 1 to 12.
Likewise, if we ask a customer to rate the quality of a product from 1 to 10, that would
also be a discrete variable. Conversely, a continuous variable can take any value between
two others on the scale. For example, the price of a product can take any value between 0
and, in theory, an arbitrarily large number. Note that being unbounded does not make a
variable continuous: the number of customers in a restaurant can range from 0 to, in theory,
any value, but it is still a count of whole customers and therefore discrete.

2.2.1.3 Categorical Vs. Non-Categorical


In addition to the previous two separations, we can look at numbers as categorical and non-
categorical. A categorical value is a whole number that represents a non-numerical concept.
For example, we can assign countries a value of development, such as 1 (First World) and 3
(Third World). In this case, the numbers do not represent a numerical concept but describe
something completely different. On the other hand, non-categorical values can be used to
make mathematical calculations, such as sum, average, standard deviation, etc.

2.2.2 STRINGS

Strings are a sequence of characters that can include letters, numbers, and separators, which
can be either visible or invisible and may be repeated. Strings are categorical, meaning they
represent a concept (such as a name, a location, a product, and such). For example, “Apple”
is a string, and “221B Baker St.” is a string. In addition, a string can also be a constant or
a variable. We will look into two specific strings: dates and locations.

2.2.3 DATES

Dates are a particular combination of numbers and characters (i.e., the separators)
representing a specific moment. They usually include three components (year, month, and
day) and can be grouped by year, quarter, month, and day of the week. Additionally, dates
support mathematical and logical operations, for example:

2021-01-02 occurs “after” 2020-01-01

2021-01-02 + 3 days = 2021-01-05

2021-01-02 + 1 year = 2022-01-02
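
The three operations above can be tried out in a few lines of Python. This is a sketch for illustration only and is not how PowerBI implements dates; it uses Python's standard `datetime` module, and the variable names are assumptions.

```python
from datetime import date, timedelta

start = date(2021, 1, 2)

# Logical operation: a later date compares as greater than an earlier one.
is_after = start > date(2020, 1, 1)

# Mathematical operation: adding a number of days uses a timedelta.
plus_three_days = start + timedelta(days=3)

# The standard library has no "add one year" helper, so this sketch replaces
# the year directly. That shortcut would fail for a February 29 start date.
plus_one_year = start.replace(year=start.year + 1)

print(is_after, plus_three_days, plus_one_year)
```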


2.2.4 LOCATIONS

Locations are strings that contain semantics: they represent places in the world. Locations
can be countries, cities, zip codes, etc. Moreover, locations can be global coordinates.
Figure 2.1 shows locations in PowerBI.

Figure 2.1 Locations


3 FIELD TRANSFORMATIONS

3.1 INTRODUCTION
Now that we have discussed the purpose and basic types of data fields, it’s time to shed
light on the next step – field transformation.

Once the data is stored in appropriate fields, we may need to process it to ensure it is ready
for further data analysis. For example, an organization records daily sales in its system. Those
records may have dates as separate columns representing Day, Month, and Year, while for
analysis purposes, we need this information as only one field. Similarly, the system could
have customer addresses as one field when we need separate fields that include state and
zip code information.

When we process data from the data source, we need not replace the existing data with the
output. We can instead create new fields, referred to as Calculated Fields. We can
transform fields in the following ways: splitting string data, concatenating string
fields, performing simple math calculations, and transforming data from one data type to
another.
3.2 SPLITTING
When we extract data from a source, some fields may contain string data that combines
several values. For instance, payroll information about employees working at a retail store
may hold each employee’s first and last name in a single field, and we may want to separate
the two. For this purpose, we must specify a common separator, which tells the software
where to split the data into multiple fields.
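
The idea of splitting on a separator can be sketched in Python. The function name and the sample names below are hypothetical and only illustrate the concept; in PowerBI the split itself is done through the tool rather than in code like this.

```python
def split_full_name(full_name, separator=" "):
    """Split a combined name into two fields at the first separator."""
    first, _, second = full_name.partition(separator)
    return first.strip(), second.strip()


# Hypothetical payroll values, each holding two names in one field.
employees = ["Maria Lopez", "Steven Park"]

for full_name in employees:
    first_name, last_name = split_full_name(full_name)
    print(f"first={first_name!r} last={last_name!r}")
```

A different separator works the same way, for example `split_full_name("Lopez, Maria", ",")` for data stored as "Last, First".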

3.3 CONCATENATION
Along with splitting, PowerBI supports the opposite process, called concatenation, which
combines two strings into one. One way to do this is the ‘&’ operator, which
attaches the second string to the end of the first string.

The PowerBI CONCATENATE function is a DAX function that combines two text strings
into one. Text, integers, Boolean values displayed as text, or a mix of those elements can
be connected. If the column has appropriate values, you may utilize the following formula
to concatenate strings in PowerBI:

CONCATENATE("Hello ", "World")

Once we specify this formula, the tool combines the two strings and shows the result
in a new field. In the above example, a space is deliberately left after “Hello” to
ensure a space between the two arguments when they are concatenated. Had no space been
left, the output would have been “HelloWorld”.
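
The effect of the extra space is easy to see in a small Python sketch. Python's `+` operator is used here as a stand-in for concatenation; this illustrates the idea and is not DAX.

```python
first = "Hello"
second = "World"

# Joining the strings directly leaves no gap between them.
without_space = first + second

# The space has to be supplied explicitly, either as part of one of the
# strings or as its own piece in the middle.
with_space = first + " " + second

print(without_space)
print(with_space)
```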

3.4 SIMPLE MATH


Let’s take the example of a business. We may need to determine the profit margin ratio for
the company. While the data source includes information about total revenue generated and
expenses incurred, we need to determine the ratio for data analysis. Therefore, we need to
apply the profit margin ratio formula.

We can do so through a simple math formula which is as follows:

Profit Margin = DIVIDE([Profit], [Total Sales])

This formula can be applied to the existing data by creating a new data field and entering
this formula in the Calculation Editor after naming the field as ‘Profit Margin.’ The software
marks this new data field as a Measures data type since it includes numerical data.
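
The same calculation can be sketched in Python. The figures are hypothetical, and the sketch assumes that profit equals total sales minus expenses. DAX's `DIVIDE` returns a blank rather than an error when the denominator is zero; the sketch mimics that by returning `None`.

```python
def profit_margin(total_sales, expenses):
    """Return profit divided by total sales, or None if total sales is zero."""
    if total_sales == 0:
        return None  # stands in for the blank that DIVIDE would produce
    profit = total_sales - expenses
    return profit / total_sales


# Hypothetical figures: 200,000 in sales and 150,000 in expenses.
margin = profit_margin(200_000, 150_000)
print(f"Profit margin: {margin:.0%}")
```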


4 BAR CHARTS

4.1 INTRODUCTION
Bar charts are one of the most widely used methods of data visualization. They allow us to
scan information presented as vertical or horizontal bars. We can use bar charts to summarize
large data sets by grouping the data into categories and comparing a numerical value for each.

Imagine an ice cream parlor that surveys 1000 people about which of its experimental
flavors they prefer: nutty vanilla, chocolate smoothie, caramel brownie,
hazelnut cookie fudge, and strawberry swirl fudge. The survey results revealed the following:

Ice Cream Flavor          No. of People
Caramel Brownie           169
Chocolate Smoothie        248
Hazelnut Cookie Fudge     132
Nutty Vanilla             344
Strawberry Swirl Fudge    107

Table 4.1 Ice Cream Preference Table

A possible representation of this data in the form of a bar graph is:

Figure 4.1 Ice Cream preferences


As seen above, the data in the bar graph is plotted according to the frequency of each
category of ice cream. The height of each bar is proportional to the value it represents.
Thus, the X-axis shows the various flavors, and the corresponding number of people who like
a particular flavor is depicted on the Y-axis.
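
To make the idea of proportional bars concrete, the Python sketch below draws the survey from Table 4.1 as a rough text chart. It is an illustration only: the function name and the 40-character maximum width are assumptions, and PowerBI handles this scaling automatically.

```python
# Survey results from Table 4.1.
survey = {
    "Caramel Brownie": 169,
    "Chocolate Smoothie": 248,
    "Hazelnut Cookie Fudge": 132,
    "Nutty Vanilla": 344,
    "Strawberry Swirl Fudge": 107,
}


def bar_lengths(counts, max_width=40):
    """Scale each count so that the largest one is max_width characters long."""
    largest = max(counts.values())
    return {
        name: round(value / largest * max_width)
        for name, value in counts.items()
    }


for flavor, length in bar_lengths(survey).items():
    print(f"{flavor:<24}{'#' * length} {survey[flavor]}")
```

Because every bar is scaled by the same factor, the ratio between any two bar lengths matches the ratio between the underlying counts, apart from rounding.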

4.1.1 CHARACTERISTICS OF A BAR CHART

Bar Charts or bar graphs represent grouped data as vertical or horizontal bars. The
length of each bar is proportional to the value it represents. A Bar Chart possesses the
following characteristics:

1. The bars are of uniform width.
2. The distance between each bar is the same.
3. In a vertical Bar Chart, the categories are represented on the X-axis, and their
values are represented on the Y-axis.
4. Bar Charts enable you to compare values across groups easily.
5. They allow you to depict changes in data over time.
6. They can summarize large data sets in a compact visual form.

4.1.2 WHEN SHOULD A BAR CHART BE USED?

We use the following pointers to determine when to use a Bar Chart for data visualization.

A Bar Chart allows you to compare different categories and changes in these categories. So,
to plot profits and losses from different departments in a store, you will have a bar for
each department. This bar will extend up along the positive vertical axis to depict a profit
or down along the negative axis to depict a loss.

We may also use a Bar Chart to depict changes in data over time, for example, changes
from year to year or quarter to quarter. Continuing the above example, you can show a
trend over time with the help of bars representing each quarter for the whole store.

However, you should avoid using a Bar Chart when the number of groups you wish to compare
is more than 10. Also, avoid using a Bar Chart if the data you want to visualize is continuous.


4.1.3 TYPES OF BAR CHARTS

Bar Charts can be categorized into the following groups:

1. Vertical Bar Chart
2. Horizontal Bar Chart
3. Grouped Bar Chart
4. Stacked Bar Chart

4.1.3.1 Vertical Bar Chart


Also known as a Column graph, a Vertical Bar Chart is the most common type of bar chart.
The vertical bars represent the numerical values of variables, and the height of each bar is
proportional to the quantity it represents. For example, you can show the salary ranges
of people in a particular area with the help of the following Bar Chart.

Figure 4.2 Vertical Bar Chart

In the above diagram, the X-axis represents the income ranges. The corresponding number
of people falling in each range is depicted on the Y-axis.


4.1.3.2 Horizontal Bar Chart


Usually, Horizontal Bar Charts are used when you need more space to fit in long labels, which
would otherwise look cluttered on a Vertical Bar Chart. Have a look at the following example.

Figure 4.3 Horizontal Bar Chart

4.1.3.3 Grouped Bar Chart


Grouped Bar Charts are also known as Cluster Charts. These allow you to indicate
relationships between the different subcategories of a dataset. Grouped Bar Charts are ideal
for depicting several subgroups of each category. Consider the scenario where a doctor is
demonstrating ailments that are prevalent among senior citizens. Here is what the Grouped
Bar Chart shows.


Figure 4.4 Grouped Bar Chart

As the Grouped Bar Chart shows, we can read off the share of people with
cardiac and orthopedic issues for two age groups: 70-79 years and 80-100 years. The X-axis
represents the age groups, and the Y-axis represents the percentage of people
having these ailments.

4.1.3.4 Stacked Bar Chart


A Stacked Bar Chart displays relationships between subgroups, just like a Grouped Bar
Chart. However, in a Stacked Bar Chart, the subgroups are stacked on top of one another in
the same bar, so that the full height of each bar shows the total across the subcategories
that make up that category.

Let us assume that a departmental store tabulates the data showing sales made by various
salespeople.


Clothing Kitchen Shoes Bedding Toys

Kenneth 8.8 12.8 3.6 11.5 3.5

Edward 6.0 8.9 5.6 14.8 6.9

Maria 4.5 6.4 4.3 8.7 5.3

Richard 10.1 3.7 4.9 6.6 6.7

Susan 7.5 8.0 9.7 4.2 8.5

Steven 16.2 9.8 8.5 6.9 5.6

Table 4.2 Sales by Person Table

A Stacked Bar Chart created using PowerBI looks like this:

Figure 4.5 Stacked Bar Chart

The X-axis represents the salespeople’s names, while the Y-axis represents sales in dollars
(in thousands). Within each bar, you can see the sales value for Clothing, Kitchen, Shoes,
Bedding, and Toys.
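
How a stacked bar is put together can be sketched in Python using the figures from Table 4.2. The helper names are hypothetical, and the sketch only computes where each segment starts and ends; it does not draw anything.

```python
categories = ["Clothing", "Kitchen", "Shoes", "Bedding", "Toys"]

# Sales in thousands of dollars, copied from Table 4.2.
sales = {
    "Kenneth": [8.8, 12.8, 3.6, 11.5, 3.5],
    "Edward": [6.0, 8.9, 5.6, 14.8, 6.9],
    "Maria": [4.5, 6.4, 4.3, 8.7, 5.3],
    "Richard": [10.1, 3.7, 4.9, 6.6, 6.7],
    "Susan": [7.5, 8.0, 9.7, 4.2, 8.5],
    "Steven": [16.2, 9.8, 8.5, 6.9, 5.6],
}


def stacked_segments(values):
    """Return the (bottom, top) position of each segment in one stacked bar."""
    segments = []
    bottom = 0.0
    for value in values:
        segments.append((bottom, bottom + value))
        bottom += value
    return segments


# The full height of each bar is the sum of its segments.
totals = {name: sum(values) for name, values in sales.items()}

for name, values in sales.items():
    print(f"{name:<8} total={totals[name]:.1f}")
    for category, (bottom, top) in zip(categories, stacked_segments(values)):
        print(f"    {category:<9} from {bottom:.1f} to {top:.1f}")
```

Each segment starts where the previous one ends, so only the first category shares a common baseline across bars. This is why totals and the bottom segment are easy to compare in a stacked chart, while the upper segments are harder to compare.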


4.1.4 ADVANTAGES AND DISADVANTAGES OF BAR CHARTS

4.1.4.1 Advantages
Using Bar Charts as a tool of data visualization offers the following benefits. Bar Charts:

1. Summarize large sets of data in visual form
2. Depict trends in data better than tables
3. Help to make estimates quickly and accurately
4. Display each category of data in a frequency distribution
4.1.4.2 Disadvantages
The disadvantages of using bar charts are listed below. Bar Charts:

1. Often need additional explanation
2. Fail to reveal patterns, causes, effects, etc.
3. Can be manipulated to provide false information

4.2 LINE CHARTS


A Line Chart or Line Graph is used to depict the change in information over time.
This implies that the horizontal axis, i.e., the X-axis, usually represents a time scale
in minutes, hours, days, weeks, months, quarters, or years.

Line graphs typically aid in analyzing the trend, allowing you to gauge whether a quantity
on the Y-axis increases or decreases over time.

The table below shows Norah’s height (in feet), measured at intervals of 2 years.

Age 2 4 6 8 10 12

Height (in feet) 1.4 1.9 2.6 2.9 3.2 3.8

Table 4.3 Height by age


This data can be shown with the help of a line graph in the following manner.

Figure 4.6 Line Chart

Note that the X-axis represents age, and the Y-axis shows a change in height over the defined
time intervals. In a Line Chart, an upward slope indicates that the values have increased
(like in the above example), and a downward slope indicates that the values have decreased.

4.2.1 COMPONENTS OF A LINE CHART

Let us understand the various components of a Line Chart with the help of the above
example.


Figure 4.7 Line Chart Components

Title: The title reveals what the graph is about. So, this lets you know what information
is depicted in the Line Chart.

Labels: Both axes have labels, which help you gauge the data shown in the graph. Here,
the X-axis is labeled as age, and the Y-axis is labeled as height.

Scale: The Scale of the Line Chart tells you how many units each interval on an axis
represents.

Points: The points represent the x and y coordinates. As seen above, the data on the X-axis
shows an independent variable, and the information on the Y-axis is the dependent variable.

Lines: The lines connecting the points estimate the value between each point. We can
conclude that the line is the actual graph, while other parts of the chart are guides that
help you to understand the sequence.
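
The estimate that a straight line gives between two points is called linear interpolation. The Python sketch below applies it to Norah's heights from Table 4.3; the function name is an assumption, and the result is only an estimate, since her real height at an unmeasured age is unknown.

```python
# Norah's age (years) and height (feet), copied from Table 4.3.
ages = [2, 4, 6, 8, 10, 12]
heights = [1.4, 1.9, 2.6, 2.9, 3.2, 3.8]


def interpolate(x, xs, ys):
    """Estimate y at x from the straight line joining the two nearest points."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the plotted range")
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Age 5 was never measured; the line between ages 4 and 6 gives an estimate.
print(f"Estimated height at age 5: {interpolate(5, ages, heights):.2f} feet")
```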

4.2.2 TYPES OF LINE CHARTS

Line Charts can be in the form of a Simple Line Chart or Multiple Line Chart.


4.2.2.1 Simple Line Chart


A Simple Line Chart is plotted with a single line only. Usually, one of the variables is
independent, whereas the other is the dependent variable.

Let us assume that Robin bought a car in 2015. The table below shows its depreciated
value in the subsequent years.

Year Value ($)

2015 54,795

2016 49,316

2017 44,384

2018 39,946

2019 35,951

2020 32,356

2021 29,120

Table 4.4 Car Values per Year

Figure 4.8 Simple Line Chart

Observe that the line graph has a downward slope, which tells you that the car’s value has
decreased.
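
The direction of a line can also be checked numerically by looking at the slope between consecutive points. The Python sketch below does this for the car values in Table 4.4; the function names are hypothetical and the code is an illustration only.

```python
# Car value in dollars by year, copied from Table 4.4.
car_values = {
    2015: 54_795,
    2016: 49_316,
    2017: 44_384,
    2018: 39_946,
    2019: 35_951,
    2020: 32_356,
    2021: 29_120,
}


def slopes(series):
    """Change in value per year between each pair of consecutive points."""
    years = sorted(series)
    return [
        (series[later] - series[earlier]) / (later - earlier)
        for earlier, later in zip(years, years[1:])
    ]


def direction(series):
    """Classify the line as upward, downward, or mixed."""
    changes = slopes(series)
    if all(change > 0 for change in changes):
        return "upward"
    if all(change < 0 for change in changes):
        return "downward"
    return "mixed"


print(direction(car_values))
print(slopes(car_values))
```

Every slope is negative, so the line falls throughout. The yearly drops also get smaller in dollar terms, which is why the line flattens slightly toward 2021.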


4.2.2.2 Multiple Line Chart


A Multiple Line Chart is plotted using two or more lines. It represents two or more variables
that change over the same period.

The table below shows the number of students who enrolled in college and opted for Economics
as their major from 2014 to 2021. The states being represented are Ohio and Illinois.

Year Ohio Illinois

2014 300 400

2015 450 460

2016 180 340

2017 400 450

2018 620 470

2019 580 500

2020 460 620

2021 650 670

Table 4.5 Economics Enrollment by State

Figure 4.9 Multiple Line Chart

In the Multiple Line Chart above, the yellow line represents the students of Ohio, and the
blue line represents those of Illinois.


4.3 MAPS
Maps are used to analyze and display geographically related data. Maps provide a visual
representation of each region’s distribution or proportion of data. This allows for deciphering
deeper information to make better decisions.

Using Maps for data visualization may offer you the following benefits:

1. A better understanding of how your organization’s presence is distributed
across continents, countries, states, and districts
2. The ability to compare activities across various locations
3. A way to depict performance across multiple geographies, which is why governments
use maps widely

Consider the map below. It demonstrates region-wise profit for a superstore. Notice that the
regions are divided into Central, East, South, and West. Using maps for data visualization
provides a clear-cut picture of the company’s performance in terms of geography.

Figure 4.10 Profit Map


The map below shows the state-wise sales, which are represented by dots. This map has
been created using PowerBI, and when the cursor is placed over a particular state, the sales
value is displayed, as shown in the figure below.

Figure 4.11 Sales Map


5 SINGLE VARIABLE STATISTICS

5.1 INTRODUCTION
Single variable or univariate statistics refers to data given as a list of numbers. You have often
made lists when you go grocery shopping. That is an example of single-variable statistics.

However, when analyzing the data, you may need techniques to summarize and display it
effectively. The following methods for single variables can be of great help:

1. Frequency tables and histograms
2. The measures of central tendency and variability (mean and standard deviation)
3. Percentile ranks
4. Reference lines that can be included in charts
5. Box plots


5.2 FREQUENCY TABLES


Every variable has a distribution that indicates how scores are distributed across the levels
of that variable. For example, a health practitioner studied the dietary habits of 100 people
and found that 10 are vegan, 30 are vegetarian, 50 are non-vegetarian, and 10 are
eggetarian.

In the same sample, the distribution of the variable ‘sex’ indicates that 45 people have a
score of ‘male’ and 55 have a score of ‘female.’

Frequency tables offer an efficient way to show the distribution of a variable. The following
table shows the test scores of 20 students.

Marks Obtained  Number of Students (Frequency)

8 2

12 1

15 2

19 2

20 3

22 2

23 1

25 3

27 1

29 2

30 1

Table 5.1 Test Score Frequency


The frequency distribution table above is an ungrouped frequency distribution table. Such
tables are apt for representing smaller data sets. For larger data sets, however, greater
clarity is achieved by grouping the data into class intervals. Thus, the above table can be
grouped into class intervals as follows.

Marks Obtained  Number of Students (Frequency)

5-10 2

10-15 1

15-20 4

20-25 6

25-30 6

30-35 1

Table 5.2 Grade bins

Here, it is worth noting that the upper limit of each class interval repeats as the lower
limit of the next class interval. A value equal to an upper limit is therefore counted in
the next class interval. For example, two students obtained 15 marks; they are
counted in the class interval 15-20, not 10-15.
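
Both tables can be reproduced with a short Python sketch. The list of scores below is expanded from Table 5.1, and the grouping rule follows the convention just described: a value equal to a class limit is counted in the interval that starts at that limit. The function name and the use of `collections.Counter` are choices made for this illustration.

```python
from collections import Counter

# The 20 test scores summarized in Table 5.1.
scores = [8, 8, 12, 15, 15, 19, 19, 20, 20, 20,
          22, 22, 23, 25, 25, 25, 27, 29, 29, 30]

# Ungrouped frequency table: how many students obtained each mark.
ungrouped = Counter(scores)


def grouped_frequencies(values, width=5):
    """Count values per class interval, keyed by the interval's lower limit.

    Each interval includes its lower limit and excludes its upper limit.
    """
    counts = Counter((value // width) * width for value in values)
    return dict(sorted(counts.items()))


for lower, count in grouped_frequencies(scores).items():
    print(f"{lower}-{lower + 5}: {count}")
```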

5.3 HISTOGRAMS
A histogram represents the same information as a frequency table, but graphically. A histogram
always groups numbers into ranges, and the height of each bar shows how many values
fall into each range.

The X-axis represents the variable, and the Y-axis represents the frequency. For example,
Jonathan has pear trees with varying heights in his orchard. The heights and their corresponding
frequencies are listed in the table below.


Height Range (Feet)  Number of Trees (Frequency)

12-18 4

18-24 6

24-30 7

30-36 8

36-42 10

42-48 13

48-54 5

Table 5.3 Range Bins

The following histogram represents this data.

Figure 5.1 Range Histogram

The X-axis represents the variable, and the Y-axis represents the corresponding frequency. Each
vertical bar represents the number of trees within a particular range.

Note that there should be no gaps between the bars of a histogram.
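As a rough illustration of how the bars relate to the frequencies, the following Python sketch renders the data of Table 5.3 as a text histogram, with one row per height range. The function name `render_histogram` is hypothetical, and the output is a stand-in for a chart, not what Power BI draws.

```python
# Height ranges (feet) and tree counts from Table 5.3.
tree_bins = [((12, 18), 4), ((18, 24), 6), ((24, 30), 7), ((30, 36), 8),
             ((36, 42), 10), ((42, 48), 13), ((48, 54), 5)]


def render_histogram(bins):
    """Return one text row per bin; the bar length equals the frequency."""
    return [f"{low:>2}-{high:<2} | {'#' * count} {count}" for (low, high), count in bins]


rows = render_histogram(tree_bins)
print("\n".join(rows))

total_trees = sum(count for _, count in tree_bins)
tallest_bar = max(tree_bins, key=lambda item: item[1])[0]  # range with the most trees
```

The rows are printed on consecutive lines, mirroring the rule that histogram bars touch each other.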


5.3.1 HISTOGRAM SHAPES

A histogram can have six main shapes, depicting different distribution types. These shapes are:

5.3.1.1 Bell Shaped Histogram


A bell-shaped histogram is a roughly symmetric histogram with a single peak. It represents an approximately normal distribution.

Figure 5.2 Bell Shaped Histogram

5.3.1.2 Bimodal Histogram


A bimodal histogram is a histogram with two peaks.

Figure 5.3 Bi-Modal Histogram


5.3.1.3 Right Skewed Histogram


Some histograms show a distribution whose longer tail extends towards the right. Such a
distribution is also called a positively skewed distribution.

Figure 5.4 Right Skewed Histogram

5.3.1.4 Left Skewed Histogram


Some histograms show a distribution whose longer tail extends towards the left. Such a
distribution is also called a negatively skewed distribution.

Figure 5.5 Left Skewed Histogram


5.3.1.5 Uniform Histogram


In a uniform histogram, each class has approximately the same number of elements.

Figure 5.6 Uniform Histogram

5.3.1.6 Random Histogram


A Random Histogram represents a distribution that does not have a specific pattern.

Figure 5.7 Random Histogram

5.4 CENTRAL TENDENCY AND VARIABILITY


The characteristics of a distribution can be defined more precisely by using two essential
concepts: its central tendency and its variability.


The central tendency of a distribution represents its middle. This is the point around which
the scores of the distribution tend to cluster.

Here, we discuss the mean as a measure of central tendency and the standard deviation
as a measure of variability.

5.4.1 MEAN

The mean (also known as the average) of a dataset can be determined by adding all the
values in the dataset and then dividing this sum by the number of values in that set.

5.4.2 STANDARD DEVIATION

Standard deviation is a measure of variability. Variability, also called dispersion, indicates
how spread out a set of data is. Variability allows you to ascertain how much the values in a
data set vary. It also lets you compare the spread of different data sets.

The two charts below show that the lower distribution has lower variability than
the upper one. The scores of the upper distribution are spread across a much greater range,
while those of the lower distribution are relatively closer to the center.

Figure 5.8 Higher Variability Distribution


Figure 5.9 Lower Variability Distribution

As a numerical example, consider the following two sets of data, which have the same
mean of 12:

10, 10, 10, 14, 16 and 2, 8, 19, 15, 16

However, the second one is more spread out.

The most widely used measure of variability is the standard deviation. Standard deviation
is a measure that indicates the dispersion of a set of data from its mean. The higher the
variability, the greater the standard deviation.
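The two data sets above can be checked with Python's standard `statistics` module. This is a minimal sketch; it uses the population standard deviation (`pstdev`), which is an assumption, since the text does not say whether the values are a sample or a whole population.

```python
import statistics

first = [10, 10, 10, 14, 16]
second = [2, 8, 19, 15, 16]

for name, data in (("first", first), ("second", second)):
    mean = statistics.mean(data)
    # pstdev treats the data as the whole population; stdev would treat it as a sample.
    spread = statistics.pstdev(data)
    print(f"{name}: mean = {mean}, standard deviation = {spread:.2f}")
```

Both sets have the same mean, but the second has the larger standard deviation, which agrees with it being more spread out.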

5.5 PERCENTILE RANKS


Percentile ranks indicate the percentage of scores that fall below a particular value. For
example, consider Table 5.1, which shows students’ marks. Notice that three students have
received 25 marks. In this distribution, 13 out of 20 scores (65%) are lower than 25.
Therefore, each of these students has a percentile rank of 65. You can express this by
saying these students scored at the 65th percentile.
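The calculation can be reproduced with a short Python sketch. The function name `percentile_rank` is hypothetical, and it follows the definition used above (percentage of scores strictly below the value); other definitions of percentile rank exist.

```python
# Test scores from Table 5.1, stored as {mark: frequency} and expanded to one entry per student.
frequencies = {8: 2, 12: 1, 15: 2, 19: 2, 20: 3, 22: 2, 23: 1, 25: 3, 27: 1, 29: 2, 30: 1}
scores = [mark for mark, count in frequencies.items() for _ in range(count)]


def percentile_rank(value, data):
    """Percentage of values in data that are strictly below value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


rank_25 = percentile_rank(25, scores)
print(f"Percentile rank of 25 marks: {rank_25:.0f}")
```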


Percentile ranks are typically used to report scores on standardized ability or achievement
tests such as the GRE. On its own, a student’s total GRE score of 1380 conveys little meaning.
However, if a total score of 1380 is at approximately the 90th percentile, this student performed
better than about 90% of the students who took the GRE.

5.6 REFERENCE LINES


You can add lines of different orientations, such as horizontal, vertical, or diagonal, to emphasize
a specific value in your chart. For example, you might want to set sales goals for people in your
team to ensure that the business is headed in the right direction. Alternatively, you might
want to highlight the average sales. A reference line can be a good data visualization aid
in these cases.

Suppose the sales goal of your team is $90,000. This can be shown as a reference line along
with the sales data of your team.

Figure 5.10 Reference Line in PowerBI

The reference line allows you to see which team members have achieved their sales
targets and which have not.
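The comparison that a reference line makes visible can be sketched in Python. The team members and their sales figures below are hypothetical values invented for illustration; they are not the data behind Figure 5.10. Only the $90,000 goal comes from the text.

```python
SALES_GOAL = 90_000  # sales goal from the text, in dollars

# Hypothetical sales figures for illustration only; not taken from Figure 5.10.
team_sales = {"Ana": 104_000, "Ben": 87_500, "Chen": 90_000, "Dee": 61_200}

met_goal = sorted(name for name, sales in team_sales.items() if sales >= SALES_GOAL)
missed_goal = sorted(name for name, sales in team_sales.items() if sales < SALES_GOAL)

# The average is another value that is often drawn as a reference line.
average_sales = sum(team_sales.values()) / len(team_sales)

print("Met the goal:", met_goal)
print("Missed the goal:", missed_goal)
print(f"Average sales: ${average_sales:,.0f}")
```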


5.7 AGGREGATE FUNCTIONS


Aggregate functions are used to perform various calculations on your data. These functions
return a single value calculated from the values in a column. For example, the SUM function
adds values. In spreadsheet software such as Excel, you can add individual values, cell
references, or a combination of the two.

Figure 5.11 Aggregate Functions in Excel

Figure 5.12 Aggregate Functions in Excel (Result of aggregate formula)


The above graphic shows the usage of the SUM function in Excel. You may use different
types of aggregate functions to make specific calculations.

Figure 5.13 Aggregate Functions in PowerBI


Below is a brief description of these aggregate functions.

Function            Description

SUM                 Returns the sum of the specified values
AVERAGE             Returns the average (SUM÷COUNT)
MEDIAN              Returns the middle value
COUNT               Provides the count of all rows
COUNT (DISTINCT)    Returns the number of distinct values in a dataset
MINIMUM             Provides the lowest value in a dataset
MAXIMUM             Provides the highest value in a dataset
STD. DEV            Returns the standard deviation of all values in a dataset based on a sample of the population
VARIANCE            Returns the variance of all values in a dataset based on a sample of the population

Table 5.4 Description of aggregate functions
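To make the descriptions in Table 5.4 concrete, the following Python sketch computes each aggregate for a small column of values. It uses Python's standard library rather than Power BI or Excel, and the column reuses the first data set from Section 5.4 purely as an example.

```python
import statistics

values = [10, 10, 10, 14, 16]  # example column, reused from Section 5.4

aggregates = {
    "SUM": sum(values),
    "AVERAGE": sum(values) / len(values),
    "MEDIAN": statistics.median(values),
    "COUNT": len(values),
    "COUNT (DISTINCT)": len(set(values)),
    "MINIMUM": min(values),
    "MAXIMUM": max(values),
    "STD. DEV": statistics.stdev(values),     # sample standard deviation
    "VARIANCE": statistics.variance(values),  # sample variance
}

for name, result in aggregates.items():
    print(f"{name}: {result}")
```

Note that each function reduces the whole column to a single value, which is what distinguishes an aggregate from a row-by-row calculation.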

5.8 BOX PLOT


A box plot is also called a box-and-whisker plot. It displays a five-number summary of a
dataset: minimum, first quartile, median, third quartile, and maximum.

The box plot draws a box from the first to the third quartile. A vertical line passes through
the box at the median (the middle value). The whiskers extend from each quartile to the
maximum or minimum.

Figure 5.14 Box Plot


Let us study the following example.

Take two car manufacturers, Honda and Toyota. We use a box plot to compare the miles per
gallon (MPG) of each manufacturer’s models. The data that will be used is given below:

Make      Model         MPG

Honda     Accord        27
Honda     Civic         34
Honda     CRV           27
Honda     CR-Z          33
Honda     Fit           36
Honda     Odyssey       22
Honda     Pilot         23
Toyota    4Runner       18
Toyota    Avalon        24
Toyota    Camry         28
Toyota    Corolla       31
Toyota    Highlander    21
Toyota    RAV4          25
Toyota    Sienna        21

Table 5.5 Miles per gallon by car
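The five-number summary behind a box plot can be computed for the data in Table 5.5 with a short Python sketch. The function name `five_number_summary` is hypothetical. Quartile conventions differ between tools, so the quartiles computed here (Python's default "exclusive" method) are an assumption and may differ slightly from what a particular Power BI visual shows.

```python
import statistics

# MPG values from Table 5.5, grouped by make.
mpg = {
    "Honda": [27, 34, 27, 33, 36, 22, 23],
    "Toyota": [18, 24, 28, 31, 21, 25, 21],
}


def five_number_summary(data):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(data, n=4)
    return (min(data), q1, median, q3, max(data))


summaries = {make: five_number_summary(values) for make, values in mpg.items()}

for make, summary in summaries.items():
    print(make, summary)
```

Under this convention the Honda box spans 23 to 34 MPG with a median of 27, while the Toyota box spans 21 to 28 MPG with a median of 24.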


The box plot created using PowerBI looks as follows.

Figure 5.15 Box Plot in PowerBI


6 CORRELATIONS

6.1 INTRODUCTION
Quite often, in statistics, you might be required to study the relationship between two or
more quantitative variables. You can do so with the help of correlation.

Correlation implies a relationship pattern between the values of two or more variables. For
example, there is a correlation between the sales of hot beverages and coats. As the sales of
hot drinks increase, the sales of coats also increase.

Figure 6.1 Looking at correlations

6.1.1 CAUSATION

While talking about correlation, we need to remember another vital concept: causation.
Causation goes a step further than correlation. It states that a change in the value of
one variable will cause a change in the value of another variable. This implies that one
event is a result of the occurrence of another event. So, you may refer to it as cause and
effect. For example, smoking leads to an increase in the risk of developing lung cancer.


The distinction between correlation and causation is that correlation does not automatically
imply that a change in one variable will lead to a change in the value of another variable.
For example, smoking is correlated with alcoholism, but it does not cause alcoholism.

6.1.2 TYPES OF CORRELATION

The following types of correlations exist:

1. Positive correlation – When one variable increases, so does the other. For
example, the number of calories burnt increases with the hours put into
exercise.
2. Negative correlation – When one variable increases, the other decreases. For
example, when the price of commodities increases, their demand decreases.
3. No correlation – The two variables show no statistical relationship. For
example, there exists no correlation between height and GPA.

Figure 6.2 Potential Correlations

6.2 SCATTERPLOTS
Using scatterplots (also called scatter graphs, scatter charts, or scatter diagrams), you can
visually express a correlation. Scatter plots provide a graphical view of the relationship
between two numerical variables. Each pair of values is marked as a dot.

Observe the scatter diagram shown in section 6.1.1. When the dots lie close to each
other along a particular direction, they show a higher degree of correlation. When the dots are
scattered and show no common direction, they indicate a low degree of correlation.

However, remember that scatter diagrams show only an approximation of the
relationship or closeness of data. They do not offer a precise measurement of the relationship.


Linda wanted to ascertain if there exists a correlation between the diameter of a tree and
its height. She has the following data:

Diameter 4.1 5.2 3.3 6.8 2.4 7.2

Height 3.1 3.7 2.6 4.2 1.2 4.8

Table 6.1 Height by Diameter

To make a scatter plot, Linda needs to do the following.

1. Draw the axes. She represents the diameter on the X-axis and the height on the
Y-axis.
2. Mark a dot at the point whose x-coordinate is the diameter and whose y-coordinate
is the height. Thus, she uses the coordinates (4.1, 3.1) to draw the first dot.
3. Repeat this process for all the pairs of values; now, she has a scatter plot.

Figure 6.3 Scatter Plot in PowerBI with Correlation


As evident from the above diagram, a positive correlation exists between the diameter and
height.
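The visual impression can be backed by a number. The following Python sketch computes the Pearson correlation coefficient for the data in Table 6.1. The function name `pearson_r` is hypothetical, and the calculation is written out by hand so that it does not depend on any particular library version.

```python
import math

# Tree measurements from Table 6.1.
diameter = [4.1, 5.2, 3.3, 6.8, 2.4, 7.2]
height = [3.1, 3.7, 2.6, 4.2, 1.2, 4.8]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


r = pearson_r(diameter, height)
print(f"r = {r:.2f}")
```

A value of r close to 1 is consistent with the strong positive pattern seen in the scatter plot.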

Consider another example where Steven wants to determine the correlation between age
and the number of pets people own. His scatter plot looks as follows.

Figure 6.4 Scatter Plot in PowerBI without Correlation

It can be seen that there is no correlation between age and the number of pets a person owns.

6.3 TREND LINES


A trend line is also called the line of best fit. It is added to a graph to depict the general
direction in which the points of a scatter plot seem to be going. It provides a clearer view of
the pattern, which helps you determine whether the trend is positive, negative, or absent.
Trend lines further help to predict the future course of data.


Below is a scatter plot that depicts a particular store’s sales and profit data.

Figure 6.5 Scatter Plot in PowerBI Profit vs. Sales


From this data, you can see that the general pattern of the graph slopes upwards.
Consequently, a trend line can be drawn to depict this trend, as shown in the diagram below.

Figure 6.6 Scatter Plot in PowerBI Profit vs. Sales with Trendline

A trend line may go through some points but need not go through all. Looking at this trend
line, you can see a positive trend in the data. Furthermore, you can estimate a profit of
about $9,000 when sales reach $20,000.
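One common way to compute a line of best fit is ordinary least squares; the text does not state which method Power BI uses, so treat this as an assumption. Because the sales and profit values behind Figure 6.6 are not listed, the sketch below reuses the tree data from Table 6.1. The function name `fit_line` and the 6.0 diameter used for the prediction are made up for illustration.

```python
# Tree measurements from Table 6.1.
diameter = [4.1, 5.2, 3.3, 6.8, 2.4, 7.2]
height = [3.1, 3.7, 2.6, 4.2, 1.2, 4.8]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(diameter, height)

# Use the line to estimate the height of a tree with a diameter of 6.0.
predicted_height = slope * 6.0 + intercept
print(f"height = {slope:.3f} * diameter + {intercept:.3f}")
print(f"Estimated height at diameter 6.0: {predicted_height:.2f}")
```

Reading a value off the fitted line, as done here for a diameter of 6.0, is the same step as estimating profit from sales on the trend line above.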

Consider another example where a teacher surveyed her students to determine the correlation
between the number of hours they watched television and their test scores. She got the
following scatter plot and the corresponding trend line, which shows a negative trend.


Figure 6.7 Scatter Plot in PowerBI with Negative Trend

A trend line is only meaningful when a positive or negative correlation exists. For data
where no correlation exists, a fitted line does not describe the data.

6.4 MEASURES OF FIT


Correlation is denoted by r, and it measures the strength of the linear
association between two variables. The value of r lies between -1 and 1 inclusive.

The R-squared value, denoted by R^2, is the square of the correlation.
It measures the proportion of variation in the dependent variable (usually
represented on the Y-axis) that can be attributed to the independent variable (usually
represented on the X-axis).

Thus, the R-squared value, also known as the coefficient of determination, tells you how
strongly the independent and dependent variables are linearly related. If the R-squared value
is close to 1, the independent and dependent variables are closely correlated. If
the R-squared value is close to 0, the independent and dependent
variables show little or no linear correlation.


You might often hear various examples of correlations, such as:

1. Eating certain kinds of fish may improve your health.
2. Taller people tend to weigh more.
3. An employee’s paycheck increases proportionately with the number of hours they
work.
4. A student with many absences has a decrease in grades.
5. When the speed of a train increases, the time it takes to reach the destination
decreases.
6. The birth rate tends to decrease in wealthier countries.

In the above correlations, the R-squared value will allow you to ascertain how correlated
your independent and dependent variables are.

Look at the following table, which gives the values of r and R^2 and the associations displayed
by the variables. It also indicates how the variables appear on the trend line of a scatter plot.

Value of r    Value of R^2    Type of association                    Location of variables on the trend line

1             1.00            Perfect positive linear association    The points lie exactly on the trend line.
0.9           0.81            Large positive linear association      The points lie close to the linear trend line.
0.45          0.2025          Small positive linear association      The points lie far from the trend line.
0.00          0.0             No association                         There exists no association between the variables.
-0.3          0.09            Small negative association             The points lie far from the trend line.
-0.95         0.9025          Large negative association             The points lie close to the linear trend line.
-1            1.00            Perfect negative association           The points lie exactly on the trend line.

Table 6.2 Values of R^2

The value of R^2 always lies between 0 and 1 inclusive.
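The R^2 column of Table 6.2 can be reproduced by squaring each value of r, as in this short Python sketch. Rounding to four decimal places is only there to hide floating-point noise.

```python
# Values of r from Table 6.2.
r_values = [1, 0.9, 0.45, 0.0, -0.3, -0.95, -1]

# Squaring removes the sign, so R^2 is never negative.
r_squared = [round(r ** 2, 4) for r in r_values]

for r, r2 in zip(r_values, r_squared):
    print(f"r = {r:>5}  ->  R^2 = {r2}")
```

Because the sign is lost when squaring, R^2 alone tells you how strong a linear association is, but not whether it is positive or negative.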


Notice that in the scatter plot shown below, the value of R^2 is displayed. This value is 0.82, which
indicates a large positive linear association. So, the points lie close to the linear trend line.

Figure 6.8 R^2 in PowerBI


7 TIME SERIES

7.1 INTRODUCTION
A time series is a set of data presented in chronological order. For this purpose, the
statistical data is collected over time, usually at equal intervals (hourly, daily, weekly, monthly,
quarterly, annually, etc.). The examples listed below represent data that is chronological:

1. The annual unemployment rate in a state for the past 25 years.
2. Monthly SUV sales of an automaker for the last four years.
3. The quarterly sales results for Amazon.

Using a time-series graph, you can plot repeated measurements over regular intervals. The
time is displayed on the X-axis, and the dependent variable is on the Y-axis. The data points
are joined, usually with straight lines. Suppose you want to analyze the number of views
on a particular YouTube channel over six months.

Month       Views

January     180,000
February    90,000
March       140,000
April       220,000
May         252,000
June        280,000

Table 7.1 YouTube views data


This data can be depicted with the help of a time series graph shown below.

Figure 7.1 Time Series in PowerBI

7.1.1 COMPONENTS OF TIME SERIES GRAPH

Changes in time-series graphs result from various factors: natural, economic, social,
industrial, or political. These factors are known as components of time series. The components
of a time series are listed below.

1. Secular trend or long-term trend: This depicts the general tendency of data to
increase, decrease, or stagnate over a long period of time. It can be seen with the
help of peaks and troughs in the time-series graph. For example, time series relating
to the business may show an upward tendency, whereas time series about death
rates may show a downward trend.
2. Seasonal variations: Such variations include changes that take place due to
rhythmic forces which occur in a regular periodic manner. Seasonal variations
can be measured when you record data at regular intervals such as weeks or months.
For example, the sale of ice cream increases in the summer season. Also, sales in
department stores are higher during the festive seasons than on regular days.
3. Cyclical variations: These refer to the ups and downs recurring over
a period of time. Cyclical variations are of a longer duration and may not follow
precisely similar patterns after equal intervals of time. For example, cyclical
variations can be seen in a business cycle. These cycles entail intervals of
prosperity, recession, depression, and recovery. The usual period of a business
cycle may range between 5 and 11 years.
4. Random or irregular variations: These arise from unforeseen and unpredictable
circumstances. For example, variations caused by famines, floods, strikes,
landslides, wars, etc.

7.2 FORMATTING DATES


While representing your data in a time-series graph, you might be required to display the
dates in different formats to suit your requirements. For example, the date July 18, 2021,
can be represented in various forms, which include:

18-07-2021, July 18, 2021, 18.7.21, 2021-07-18, and so on.
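For comparison, the same four representations can be produced outside Power BI with Python's standard `datetime` module. This is only an illustration of format patterns; Power BI has its own format options, shown in the figure that follows. The month name produced by `%B` depends on the locale, and an English locale is assumed here.

```python
from datetime import date

d = date(2021, 7, 18)

day_first = d.strftime("%d-%m-%Y")         # 18-07-2021
long_form = d.strftime("%B %d, %Y")        # July 18, 2021 (in an English locale)
short_form = f"{d.day}.{d.month}.{d:%y}"   # 18.7.21 (no zero padding on day or month)
iso_form = d.isoformat()                   # 2021-07-18

for text in (day_first, long_form, short_form, iso_form):
    print(text)
```

All four strings describe the same underlying date; only the presentation changes.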

PowerBI provides various options for date formatting.

Figure 7.2 Date Formatting in PowerBI


7.3 USING CYCLE PLOTS TO DEPICT SEASONALITY


You have observed that time series graphs are represented as line graphs. Line
charts are often used to show periodic effects, such as the quarter of the year, the month
of the year, or the day of the week. This helps us understand the impact of the month of the
year or day of the week on the data we are analyzing.

You can represent time series data with a multiple-line chart or a single-line chart. Consider the
following sales data:

Year    Quarter    Value (in $1,000)

2017    Q1         78
2017    Q2         56
2017    Q3         62
2017    Q4         66
2018    Q1         85
2018    Q2         36
2018    Q3         46
2018    Q4         96
2019    Q1         96
2019    Q2         37
2019    Q3         76
2019    Q4         65
2020    Q1         56
2020    Q2         83
2020    Q3         28
2020    Q4         20

Table 7.2 Sales by Person Table


Let us represent this data with the help of a multiple-line chart.

Figure 7.3 Time Series in PowerBI

The representation of quarters is as follows:

1. Quarter 1 – Blue line
2. Quarter 2 – Orange line
3. Quarter 3 – Red line
4. Quarter 4 – Green line

As is evident from the multiple-line chart, the sales peak was observed for quarter 1 in
2019. The value of quarter-one sales dipped in 2020.


However, the limitation here is that you cannot see the general trend for the data. Now,
let us try to plot this data using a single-line chart.

Figure 7.4 Aggregated Time Series in PowerBI

The data plotted on a single-line chart shows the presence or absence of trends or cycles. Here,
the sales value increased gradually between 2017 and 2019 and declined in 2020.

However, in the above single-line chart, it isn’t easy to see the effect of each quarter on
sales. This is where cycle plots can prove to be helpful.

7.3.1 WHAT MAKES CYCLE PLOTS USEFUL?

Cycle plots allow you to incorporate both types of data – the quarter-of-the-year effect and
the trend/cycle data.


Figure 7.5 Cycle Plot in PowerBI

This cycle plot allows you to view yearly sales and the quarter-of-the-year effect. Here, you
can see the annual sales value for individual quarters. Observe the peak of sales achieved
for quarter 1 in 2019 and quarter 4 in 2018.
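The regrouping that a cycle plot relies on can be sketched in Python using the data from Table 7.2. The variable names are hypothetical, and the sketch only prepares the numbers; it does not draw the chart.

```python
# Quarterly sales from Table 7.2 (in $1,000), listed as Q1 to Q4 for each year.
sales = {
    2017: [78, 56, 62, 66],
    2018: [85, 36, 46, 96],
    2019: [96, 37, 76, 65],
    2020: [56, 83, 28, 20],
}

years = sorted(sales)

# One series per quarter, ordered by year: each series becomes one panel of the cycle plot.
by_quarter = {f"Q{q + 1}": [sales[year][q] for year in years] for q in range(4)}

# The mean of each quarter is often drawn as a reference line within its panel.
quarter_means = {quarter: sum(values) / len(values) for quarter, values in by_quarter.items()}

# Yearly totals correspond to the aggregated single-line view.
annual_totals = {year: sum(values) for year, values in sales.items()}

for quarter, values in by_quarter.items():
    print(quarter, values, f"mean = {quarter_means[quarter]:.2f}")
print(annual_totals)
```

Grouping by quarter first and by year second is what lets one chart show both the quarter-of-the-year effect and the change across years.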

7.4 FORECASTING
The time series graphs can be used for making predictions since these graphs depict time-
based data. The process of making predictions is also called forecasting or extrapolation.
Forecasting involves considering historical data to predict future observations. Below are a
few examples that provide a better picture of this concept.

1. Forecasting the rice yield of each state for each year.
2. Forecasting the birth rate in each city for each year.
3. Forecasting weather conditions in each city for each month.
4. Forecasting electricity consumption by each household for each month.
5. Forecasting sales of each product for each day.
6. Forecasting the number of passengers traveling by train each day.

While forecasting time series data, the primary aim is to estimate how the sequence of
observations will move into the future.
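As a very simple illustration of extrapolation, the following Python sketch fits a straight line to the monthly views from Table 7.1 and extends it by one month. This is a naive linear trend chosen for clarity; it is an assumption of this example and is not the forecasting method that Power BI applies. The function name `fit_line` is hypothetical.

```python
# Monthly views from Table 7.1, January to June.
views = [180_000, 90_000, 140_000, 220_000, 252_000, 280_000]
months = list(range(1, len(views) + 1))  # 1 = January, ..., 6 = June


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(months, views)

# Extend the fitted line one step beyond the observed data (month 7 = July).
july_forecast = slope * 7 + intercept
print(f"Forecast for July: about {july_forecast:,.0f} views")
```

A straight line ignores seasonality and irregular variations, so a forecast like this is only a rough first estimate.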


7.4.1 USING A GRAPH TO DISPLAY FORECASTED DATA

Automated software like PowerBI allows you to forecast data. Observe that the graph below
predicts the profits for 2021 and 2022 based on the profits earned up to 2020.

Figure 7.6 Forecasting a Time Series in PowerBI


8 STORYTELLING

8.1 INTRODUCTION
Data plays a significant role in business operations and their related decisions. Data
visualization allows you to generate various types of charts and tables. It thus enables
you to present the substance of your metrics visually.

However, to communicate effectively with your customers, employees, or other stakeholders,
you may need to highlight the value of your products and services by emphasizing certain
aspects. The audience will benefit more from your data if it is presented in an effective, engaging,
and convincing way. This can be achieved with the help of a compelling story that supports your position.

Storytelling with data links data with human communication to create an exciting narrative
supported by facts. Storytelling uses data visualization techniques like charts, tables, and
graphs. Data-driven stories are tailored to the specific audience and the context to which
they cater. This renders cognitive clarity to data, and the audience can better absorb the
message in the data.

Figure 8.1 Narrative Added to a Visualization

Observe the narrative along with the data. It allows better data comprehension by providing
a visual aid to the audience.


Two common approaches can be used for storytelling: explorative and narrative.

1. Explorative approach: This approach encourages the viewers to draw their own
conclusions by exploring the data. They are led to pay attention to stories relevant to
the data. For example, a story might depict a chronological event. The viewers might be
encouraged to find out why the event occurred.
2. Narrative approach: In this approach, the viewers are led through a narrative
providing a specific conclusion. For example, a story might narrate why two
items, A and B, differ.

8.2 ASPECTS OF A GOOD STORY


Good storytelling has various aspects, such as those listed below.

1. Quality of data: Your company’s data should be accurate and of good quality.
2. Message: A good story should provide clear messages to the audience.
3. Narrative: Data insights should be translated into a story.
4. Definition of objectives: The story’s purpose should be clear to you. This
provides various advantages, allowing you to focus on specific trends, data
subsets, and information categorization for better understanding.
5. Definition of target audience: While framing a data story, you should consider
the profile of the audience to which it is presented. For example, what is essential to
the management might not be relevant to the end customers.
6. Using appropriate data visualization tools: For compelling storytelling and
visualization, you must familiarize yourself with tools like Tableau and
Microsoft Power BI.
7. Eliminate clutter: The story should provide the proper context to the analyzed
data. For this purpose, you should include only those visuals that increase and
aid the audience’s understanding. Avoid displaying unnecessary information.
8. Color selection: Colors play a vital role in storytelling. Usually, the visualization
tools provide built-in color palettes from which you can choose. Be prudent in
the usage of colors. For example, use appropriate contrast between shades and more
than one color to distinguish different aspects of the data correctly.


8.3 ADVANTAGES OF STORYTELLING


1. Provides meaning to data: Storytelling provides meaning and value in a world
surrounded by data. For example, a story might assist in ascertaining why two
items, say, A and B, are different. How can B perform like A? Which area
needs greater attention, and which site serves well?
2. Incorporates versatility in data visualization: This can be achieved by
combining different data visualizations with appropriate insights. For example,
for storytelling, you may use techniques like infographics, case studies,
presentations, reports, motion graphics, etc.
3. Creates credibility: People become more inclined to trust your product offering
when you use credible data in your story.
4. Helps people to retain your message: With the help of a story, people tend to
remember your message due to increased comprehension, which results from
combining visuals with text. As a result, people remember your message and are
encouraged to engage with your business.

8.4 CREATING A STORY


To build a story, you must first create data visualization charts. Here, we have used PowerBI
to create various charts. PowerBI allows the creation of a story for multiple charts located
in different sheets (like Excel).

In the example below, we have created a story using three sheets. Each sheet forms a story
point. The graphic below demonstrates this.

Figure 8.2 Story points in PowerBI


Figure 8.3 Story point Narrative

The first story point shows the quarterly sales trend for 2014, 2015, 2016, and 2017. The
accompanying narrative supports the chart in the story.

The following story point demonstrates sales made in the Consumer, Corporate, and Home Office
segments. The Consumer segment contributes the highest percentage of sales, which is
supported by the narrative at the bottom.

Figure 8.4 Story point Flow


The third and last story point explains the profits earned in different states.

Figure 8.5 Story point Conclusion

To summarize, supporting your data visualizations with appropriate stories allows viewers
to understand and engage with data better. This will enable businesses to communicate
effectively with their audience and deliver the desired message.

PowerBI provides a dynamic narrative capability that allows changing the narrative according
to the selected date label without changing the sheet.

Figure 8.6 Dynamic Narrative PowerBI before


Figure 8.7 Dynamic Narrative PowerBI after


CONCLUSION
Power BI is a powerful data visualization tool that finds extensive usage across various industries
and sectors. One key application is in business intelligence and analytics. Organizations
use Power BI to transform raw data into visually appealing and interactive reports and
dashboards. This enables decision-makers to gain insights quickly and make data-driven
choices. Whether it’s tracking sales performance, monitoring key performance indicators, or
analyzing market trends, Power BI provides a versatile platform to create custom visualizations
tailored to specific business needs. This not only enhances data comprehension but also
fosters collaboration among teams by sharing real-time, interactive dashboards.

Another vital use of Power BI is in data storytelling. In today’s data-driven world,
communicating insights effectively is crucial. Power BI enables users to create compelling
narratives through visuals, making it easier to convey complex data findings to a broader
audience. This is especially valuable in presentations, board meetings, and client interactions,
where clear and impactful data visualization can influence decisions and strategies. Moreover,
Power BI allows integration with various data sources, enabling organizations to consolidate
data from multiple platforms and gain a holistic view of their operations. Overall, Power
BI’s versatility and user-friendly interface make it an invaluable tool for data visualization
and storytelling in both business and academic settings.


TABLE OF FIGURES
Figure 1.1 World Population and Annual Growth Rate in The Last Decade
Figure 2.1 Locations
Figure 4.1 Ice Cream preferences
Figure 4.2 Vertical Bar Chart
Figure 4.3 Horizontal Bar Chart
Figure 4.4 Grouped Bar Chart
Figure 4.5 Stacked Bar Chart
Figure 4.6 Line Chart
Figure 4.7 Line Chart Components
Figure 4.8 Simple Line Chart
Figure 4.9 Multiple Line Chart
Figure 4.10 Profit Map
Figure 4.11 Sales Map
Figure 5.1 Range Histogram
Figure 5.2 Bell Shaped Histogram
Figure 5.3 Bi-Modal Histogram
Figure 5.4 Right Skewed Histogram
Figure 5.5 Left Skewed Histogram
Figure 5.6 Uniform Histogram
Figure 5.7 Random Histogram
Figure 5.8 Higher Variability Distribution
Figure 5.9 Lower Variability Distribution
Figure 5.10 Reference Line in PowerBI
Figure 5.11 Aggregate Functions in Excel
Figure 5.12 Aggregate Functions in Excel (Result of aggregate formula)
Figure 5.13 Aggregate Functions in PowerBI
Figure 5.14 Box Plot
Figure 5.15 Box Plot in PowerBI
Figure 6.1 Looking at correlations
Figure 6.2 Potential Correlations
Figure 6.3 Scatter Plot in PowerBI with Correlation
Figure 6.4 Scatter Plot in PowerBI without Correlation
Figure 6.5 Scatter Plot in PowerBI Profit vs. Sales
Figure 6.6 Scatter Plot in PowerBI Profit vs. Sales with Trendline
Figure 6.7 Scatter Plot in PowerBI with Negative Trend
Figure 6.8 R^2 in PowerBI


Figure 7.1 Time Series in PowerBI
Figure 7.2 Date Formatting in PowerBI
Figure 7.3 Time Series in PowerBI
Figure 7.4 Aggregated Time Series in PowerBI
Figure 7.5 Cycle Plot in PowerBI
Figure 7.6 Forecasting a Time Series in PowerBI
Figure 8.1 Narrative Added to a Visualization
Figure 8.2 Story points in PowerBI
Figure 8.3 Story point Narrative
Figure 8.4 Story point Flow
Figure 8.5 Story point Conclusion
Figure 8.6 Dynamic Narrative PowerBI before
Figure 8.7 Dynamic Narrative PowerBI after

70
FUNDAMENTALS OF DATA VISUALIZATION References

REFERENCES
Chapter 1

1. https://learn.microsoft.com/en-us/power-bi/fundamentals/power-bi-overview
2. https://www.ted.com/talks/hans_rosling_the_best_stats_you_ve_ever_seen?language=en
3. https://www.techtarget.com/searchbusinessanalytics/definition/data-visualization
4. https://www.globaldata.com/data-insights/macroeconomic/world-population-from/
5. https://www.nobledesktop.com/learn/data-visualization/industries-and-professions

Chapter 2

6. https://www.geeksforgeeks.org/difference-between-syntax-and-semantics/
7. https://amplitude.com/blog/data-types

Chapter 3

1. https://www.tibco.com/reference-center/what-is-data-transformation
2. https://learn.microsoft.com/en-us/dax/concatenate-function-dax
3. https://www.techrepublic.com/article/calculate-profit-margin-power-bi-calculated-column/

Chapter 4

1. https://www.fe.training/free-resources/power-bi-data-visualization/bar-and-column-charts-in-power-bi/
2. https://www.indeed.com/career-advice/career-development/types-of-bar-graphs
3. https://www.techquintal.com/advantages-and-disadvantages-of-bar-diagram/
4. https://www.investopedia.com/terms/l/linechart.asp
5. https://www.geeksforgeeks.org/power-bi-format-line-chart/
6. https://powerbidocs.com/2020/10/26/small-multiple-line-chart-visual-in-power-bi/
7. https://spreadsheeto.com/power-bi-map/

Chapter 5

1. https://www.hellovaia.com/explanations/math/statistics/single-variable-data/
2. https://statisticsbyjim.com/basics/frequency-table/
3. https://www.statology.org/describe-shape-of-histogram/

71
FUNDAMENTALS OF DATA VISUALIZATION References

4. https://soc.utah.edu/sociology3112/central-tendency-variability.php
5. https://web.mnstate.edu/malonech/Psy%20230/Notes/Percentiles%20GW2.htm
6. https://www.tableau.com/drive/reference-lines-as-visual-statistics
7. https://www.wallstreetmojo.com/power-bi-aggregate/
8. https://www.khanacademy.org/math/statistics-probability/summarizing-quantitative-data/box-whisker-plots/a/box-plot-review

Chapter 6

1. https://amplitude.com/blog/causation-correlation
2. https://www.simplypsychology.org/correlation.html
3. https://visme.co/blog/scatter-plot/
4. https://zebrabi.com/guide/how-to-add-trendline-in-power-bi-2/
5. https://corporatefinanceinstitute.com/resources/data-science/r-squared/

Chapter 7

1. https://study.com/academy/lesson/time-series-plots-definition-features.html
2. https://www.toppr.com/guides/business-mathematics-and-statistics/time-series-analysis/components-of-time-series/
3. https://simplexct.com/to-find-seasonality-use-cycle-plots
4. https://thirdspacelearning.com/gcse-maths/statistics/time-series-graph/

Chapter 8

1. https://www.techtarget.com/searchcio/definition/data-storytelling
2. https://www.analyticsvidhya.com/blog/2020/05/art-storytelling-analytics-data-science/
