
Business Intelligence and

Data Warehousing

Dr. Atul Garg


Data

• Data are the raw building blocks of information.


• Data are raw, individual, and unarguable facts.
• Information, often in the form of facts or figures obtained from
experiments or surveys, used as a basis for making calculations or
drawing conclusions
• Information, for example, numbers, text, images, and sounds, in a
form that is suitable for storage in or processing by a computer
Information
• Information is the combination of data into a form that can answer an
everyday question.
• Definite knowledge acquired or supplied about something or somebody
• The collected facts and data about a particular subject
• A telephone service that supplies telephone numbers to the public on
request.
• The communication of facts and knowledge
• Computer data that has been organized and presented in a systematic
fashion to clarify the underlying meaning
• A formal accusation of a crime brought by a prosecutor, as opposed to an
indictment brought by a grand jury
Knowledge
• Familiarity or understanding on a specific topic gained through
experience or study
• Knowledge is used in terms of a person's skills or expertise in a given
area.
• Knowledge typically reflects an empirical, experience-based understanding of a subject.
• General awareness or possession of information, facts, ideas, truths, or
principles
• Clear awareness or explicit information, for example, of a situation or
fact
• All the information, facts, truths, and principles learned throughout time
Types of Knowledge
• Explicit knowledge is knowledge covering topics that are easy to
systematically document (in writing), and share out at scale: what we
think of as structured information. Explicit knowledge includes things
like FAQs, instructions, raw data and related reports, diagrams, one-
sheets, and strategy slide decks.
• Implicit knowledge is, essentially, learned skills or know-how. It is
gained by taking explicit knowledge and applying it to a specific
situation. Implicit knowledge is what is gained when you learn the best
way to do something.
• Tacit knowledge is intangible information that can be difficult to explain
in a straightforward way, such as things that are often “understood”
without necessarily being said, and are often personal or cultural.
Intelligence
• Intelligence is the combination of information into a form that tells a
story and informs decisions. An example might be that COVID cases may
increase during festival seasons.
• Intelligence is decision-support. It is a tool for making intelligent
predictions about the future, based on a solid understanding of the
present, in order to take a course of action that improves outcomes.
• Intelligence combines information to form a predictive narrative that
enables better decision-making.
Wisdom

• The knowledge and experience needed to make sensible decisions


and judgments, or the good sense shown by the decisions and
judgments made
• Accumulated knowledge of life or in a particular sphere of activity
that has been gained through experience
• An opinion that almost everyone seems to share or express
• Ancient teachings or sayings
Interrelationships
Data Warehousing
• A data warehouse is aptly named after a physical warehouse: it operates
as storage for data that has been extracted from other sources.
• The concept of the data warehouse has existed since the 1980s, when it
was developed to help transition data from merely powering
operations to fueling decision support systems that reveal business
intelligence.
• Many organizations have proprietary data warehouses that store
information on performance metrics, sales quotas, lead generation stats
and a variety of other information.
• A data warehouse is a large collection of business data used to help
an organization make decisions.
Data Warehousing
• The large amount of data in data warehouses comes from different
places, such as internal applications for marketing, sales, finance, and
customers, and external partner systems, among others.
• Data warehouses can perform some analytics capabilities: using the
extract, transform, load (ETL) process, data warehouses can
perform the complex queries that transactional databases cannot
handle.
• Once data has entered a warehouse, it cannot be altered. Data
warehouses only perform analysis of historical data.
Characteristics of Data Warehouse
• Uses large historical data sets
• Allows both planned and ad hoc queries
• Controls data load
• Makes an organization's information easily accessible
• Retrieves large volumes of data
• Manages user schemas such as tables, indexes, etc.
• Generates reports
• Backs up data
Advantages of Data Warehouse
• Saves time
• Enhances data quality and consistency
• Generates a high Return on Investment (ROI)
• Provides competitive advantage
• Improves the decision-making process
• Enables organizations to forecast with confidence
• Streamlines the flow of information
• Increases data quality
• Increases the likelihood of finding more relevant information
Need & evolution of Data Warehouse
• Evolution of Web
• During the 1990s major cultural and technological changes were taking place.
The internet was surging in popularity.
• Competition had increased due to new free trade agreements, computerization,
globalization, and networking.
• During this time, the use of application systems exploded.
• By the year 2000, many businesses discovered with the expansion of
databases and application systems, their systems had been badly integrated
and that their data was inconsistent.
• Data Warehouses were developed by businesses to consolidate the data they
were taking from a variety of databases, and to help support their strategic
decision-making efforts
• Use of NoSQL
Business Intelligence
• BI(Business Intelligence) is a set of processes, architectures and technologies that
convert raw data into meaningful information that drives profitable business actions. It is
a suite of software and services to transform data into actionable intelligence and
knowledge.
• Business intelligence is defined by Gartner as “an umbrella term that includes
the applications, infrastructure and tools, and best practices that enable access to and
analysis of information to improve and optimize decisions and performance.”
• BI has a direct impact on organization’s strategic, tactical and operational business
decisions. BI supports fact-based decision making using historical data rather than
assumptions and gut feeling.
• BI is a category of intelligence systems that gather proprietary data then organize, analyze
and visualize it to help users draw business insights. It can blend data from a variety of
sources, discover data trends or patterns, and suggest best practices for visualizations and
next actions.
Importance of BI

• Measurement: creating Key Performance Indicators (KPIs) based on historic data.
• Identify and set benchmarks for varied processes.
• Identify market trends and spot business problems that need to be addressed.
• Data visualization that enhances the data quality and thereby the quality of
decision making.
• BI systems can be used not just by large enterprises but also by small and
medium enterprises (SMEs)
Implementation of BI

• Raw data from corporate databases is extracted. The data could be
spread across multiple heterogeneous systems.
• The data is cleaned, transformed, and loaded into the data warehouse.
Tables can be linked and data cubes formed.
• Using the BI system, users can run queries, request ad hoc reports, or
conduct any other analysis. A minimal code sketch of this
extract-transform-load (ETL) flow follows.
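Below is a minimal, illustrative sketch of that ETL flow in Python, using sqlite3 as a stand-in for both the operational source and the warehouse. The database files, table names, and columns (orders, fact_orders, etc.) are hypothetical, not from these slides, and the sketch assumes the source database already contains an orders table.

import sqlite3

# Illustrative ETL sketch: extract rows from an operational source,
# transform (clean) them, and load them into a warehouse table.
source = sqlite3.connect("sales_app.db")      # operational source system
warehouse = sqlite3.connect("warehouse.db")   # the data warehouse

# Extract: pull raw rows from the source system.
rows = source.execute("SELECT order_id, amount, region FROM orders").fetchall()

# Transform: drop incomplete rows and standardize region labels.
clean = [(order_id, amount, region.strip().upper())
         for (order_id, amount, region) in rows
         if amount is not None and region]

# Load: write the cleaned rows into a warehouse fact table.
warehouse.execute("CREATE TABLE IF NOT EXISTS fact_orders "
                  "(order_id INTEGER, amount REAL, region TEXT)")
warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", clean)
warehouse.commit()

In practice, dedicated ETL tools perform these steps at scale, but the extract-transform-load order is the same.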
Example of BI

Reference: https://www.guru99.com/business-intelligence-definition-example.html
Other Examples
• Example 2:
• A hotel owner uses BI analytical applications to gather statistical information
regarding average occupancy and room rate. It helps to find aggregate revenue
generated per room.
• It also collects statistics on market share and data from customer surveys from
each hotel to decide its competitive position in various markets.
• Analyzing these trends year by year, month by month, and day by day helps
management offer discounts on room rentals.
• Example 3:
• A bank gives branch managers access to BI applications. It helps branch managers
determine which customers are the most profitable and which customers they
should work on.
• The use of BI tools frees information technology staff from the task of generating
analytical reports for the departments. It also gives department personnel access to
a richer data source.
• E-commerce: AMAZON, Flipkart etc.
Pros and Cons of BI
Pros                             Cons
Boosts productivity              Cost
Improves visibility              Complexity
Fixes accountability             Limited use
Bird's-eye view                  Time-consuming implementation
Streamlines business processes   Less suited to small-scale businesses
Allows for easy analytics        Requires data literacy among users

Business Intelligence & Data Warehousing
Data warehousing and Business Intelligence often go hand in hand, because the
data made available in data warehouses is central to the use of Business
Intelligence tools.
• BI tools such as Tableau, Sisense, Chartio, and Looker are used to retrieve and
analyse data from data warehouses for purposes such as querying, reporting,
analytics, and data mining.
• In any enterprise, Business Intelligence plays a central role in smooth and
cost-effective functioning. Thus, BI is helpful in operational efficiency,
which includes reporting, risk management, product profitability, costing,
logistics, etc.
• BI also helps in customer interaction, which includes sales analysis, sales
forecasting, segmentation, campaign planning, customer profitability, etc.
References
• https://www.guru99.com/business-intelligence-definition-example.html
• https://datawarehouseinfo.com/data-warehouse/benefits-of-a-data-
warehouse/
• https://www.youtube.com/watch?v=7qcdcBfxuH0
• https://www.passionned.com/nine-reasons-to-build-a-data-warehouse/
• https://www.c-sharpcorner.com/blogs/goals-of-a-data-warehouse1
• https://www.talend.com/resources/what-is-data-warehouse/
• https://www.diyotta.com/data-warehouse-definition-history-and-evolution
Thanks
Digital Data

Dr. Atul Garg

Digital data

• Digital data is information represented in a binary format. Information is

converted into a machine-readable digital format so that a computer can process it.
• Digital data is data that represents other forms of data using specific machine
language systems that can be interpreted by various technologies.
• The most fundamental of these systems is a binary system, which simply
stores complex audio, video or text information in a series of binary
characters, traditionally ones and zeros, or "on" and "off" values.
• These days, digital data is everywhere. Whenever you send an email, read a
social media post, or take pictures with your digital camera, you are working
with digital data.

Types of Digital data
Digital data can be classified into three forms:

• Unstructured data: This is data which does not conform to a data model or is not
in a form which can be used easily by a computer program. About 80–90% of an
organization's data is in this format; for example, memos, chat rooms, PowerPoint
presentations, images, videos, letters, research papers, the body of an email, etc.
• Semi-structured data: This is the data which does not conform to a data model but
has some structure. However, it is not in a form which can be used easily by a
computer program; for example XML, mark-up languages like HTML, etc. Metadata
for this data is available but is not sufficient.
• Structured data: This is the data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist between
entities of data, such as classes and their objects. Data stored in databases is an
example of structured data.

Formats of Digital Data

• Usually, data is in the unstructured format which makes


extracting information from it difficult.
• According to Merrill Lynch, 80–90% of business data is
either unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes
80% of the whole enterprise data.
Characteristics of Un-structured Data

Sources of Un-structured Data
Broadly speaking, anything in a non-database form is unstructured data.

Managing Un-structured Data
A few generic tasks must be performed to enable storage and search of unstructured data:
Indexing: Recall the Relational Database Management System (RDBMS), in which data is indexed to enable faster
search and retrieval. An index is defined on the basis of some value in the data; it is an identifier that represents a
larger record in the data set. In the absence of an index, the whole data set or document must be scanned to retrieve
the desired information. In the case of unstructured data too, indexing helps in searching and retrieval. The
unstructured data is indexed based on text or some other attribute, e.g., the file name. Indexing unstructured data is
difficult because this data has no predefined attributes and follows no pattern or naming conventions. Text can be
indexed based on a text string, but for non-text files, e.g., audio and video, indexing depends on file names. A
minimal sketch of text indexing follows.
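The following is a small, illustrative Python sketch of text indexing: an inverted index that maps each word to the set of documents containing it. The document names and contents are made up for illustration.

# Build an inverted index over a few hypothetical text documents.
docs = {
    "memo1.txt": "quarterly sales review meeting",
    "memo2.txt": "sales targets for the new quarter",
}

index = {}
for name, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(name)   # word -> documents containing it

print(index["sales"])   # {'memo1.txt', 'memo2.txt'}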
Tags/Metadata: Using metadata, data in a document can be tagged. This enables search and retrieval. But in
unstructured data this is difficult, as little or no metadata is available. The structure of the data has to be
determined, which is very hard because the data itself has no particular format and comes from more than one source.
Classification/Taxonomy: Taxonomy is classifying data on the basis of the relationships that exist between data. Data
can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. However,
classifying unstructured data is difficult, as identifying relationships between data is not an easy task. In the absence
of any structure, metadata, or schema, identifying accurate relationships and classifying data is not easy. Since the
data is unstructured, naming conventions and standards are not consistent across an organization, which makes
classification harder still.
CAS (Content Addressable Storage): CAS stores data based on its metadata. It assigns a unique name to every object
stored in it, and an object is retrieved based on its content rather than its location. It is used extensively to store emails, etc.
Challenges to store Un-structured Data

Possible Solutions to Store Un-structured Data

Challenges to extract Information

Solutions to extract Information

XOLAP (extended online analytic processing)

Semi-structured Data

• Semi-structured data does not conform to a data model: it is difficult to determine the meaning of the data,
and the data cannot be stored in rows and columns as in a database. However, semi-structured data has tags
and markers which help to group data and describe how it is stored. These give some metadata, but not
enough for the management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy. The attributes or properties within a
group may or may not be the same. For example, two addresses may or may not contain the same number of
properties, as in:
Address 1: <house number><street name><area name><city>
Address 2: <house number><street name><city>
• As another example, an e-mail follows a standard format:
To: <Name>
From: <Name>
CC: <Name>
Subject: <Text>
Body: <Text, Graphics, Images etc.>
• The tags give us some metadata, but the body of the e-mail has no format that conveys the meaning of the
data it contains.
• There is a very fine line between unstructured and semi-structured data.
Semi-Structured Data

Sources of Semi-structured Data

Managing Semi-structured Data
Some ways in which semi-structured data is managed and stored:
• Schemas: describe the structure and content of data to some extent, assigning meaning to data and hence
allowing automatic search and indexing.
• Graph-based data models: contain data on the leaves of the graph (also known as "schema-less"); used for
data exchange among heterogeneous sources.
• XML: models the data using tags and elements; schemas are not tightly coupled to the data.
Challenges to store Semi-structured Data

Possible Solutions to Store Semi-structured Data

Object Exchange Model

Challenges to extract Semi-structured Data

Solutions to extract Semi-structured Data

Object Exchange Model

XML: to manage Semi-structured Data

XML: Extensible Markup Language

What is XML? An open-source markup language written in plain text. It is
hardware- and software-independent.

What does it do? It is designed to store and transport data over the Internet.

How? It allows data to be stored in a hierarchical/nested structure, and it
allows the user to define tags to store the data.
XML: to manage Semi-structured Data
XML has no predefined tags

<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>

The words in the <> (angular brackets) are user-defined tags


XML is known as self-describing, as data can exist without a schema and a schema can
be added later.
A schema can be described in XML Schema (XSD) or a DTD. A short parsing sketch follows.
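As an illustration, the <message> example above can be parsed with Python's standard library; the user-defined tags become element names that can be queried directly. This is a minimal sketch, not part of the original slides.

import xml.etree.ElementTree as ET

# Parse the user-defined-tag example from the slide above.
xml_text = """
<message>
  <to> XYZ </to>
  <from> ABC </from>
  <subject> Greetings </subject>
  <body> Hello! How are you? </body>
</message>
"""

root = ET.fromstring(xml_text)
print(root.find("subject").text.strip())   # Greetings
print(root.find("body").text.strip())      # Hello! How are you?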
Structured Data

• Structured data is organized in semantic chunks (entities)


• Similar entities are grouped together (relations or classes)
• Entities in the same group have the same descriptions
(attributes)
• Descriptions for all entities in a group (schema) have the same defined
format, have a predefined length, are all present, and follow the same order

Structured Data

Structured data:
• Conforms to a data model
• Similar entities are grouped
• Data is stored in the form of rows and columns (e.g., a relational database)
• Data resides in fixed fields within a record or file
• Attributes in a group are the same
• Definition, format, and meaning of data are explicitly known
Sources of Structured Data

• Databases (e.g., Access)
• Spreadsheets
• SQL
• Online Transaction Processing (OLTP) systems
Managing Structured Data

• Fully described datasets
• Clearly defined categories and sub-categories
• Data neatly placed in rows and columns
• Data that goes into the records is regulated by a well-defined structure
• Indexing can be easily done, either by the DBMS itself or manually
Storing Structured Data

Retrieving Structured Data

Difference b/w types of Data

1. Level of organizing
• Structured data, as the name suggests, is well organized, so its level of organization is the highest.
• Semi-structured data is organized only up to some extent, with the rest unorganized, so its level of organization is lower than that of structured data and higher than that of unstructured data.
• Unstructured data is fully unorganized, so its level of organization is the lowest.
2. Means of organization
• Structured data is organized by means of a relational database.
• Semi-structured data is partially organized by means of XML/RDF.
• Unstructured data is based on simple character and binary data.
3. Transaction management
• In structured data, transaction management and data concurrency are present, so it is preferred for multitasking processes.
• In semi-structured data, transactions are not supported by default but can be adapted from a DBMS; data concurrency is not present.
• In unstructured data, there is no transaction management and no concurrency.
Difference b/w types of Data

4. Versioning
• Structured data is supported by relational databases, so versioning can be done over tuples, rows, and tables.
• In semi-structured data, versioning is possible only over tuples or graphs, as only a partial database is supported.
• In unstructured data, versioning is possible only on the data as a whole, as there is no database support at all.
5. Flexibility and scalability
• Structured data is based on a relational database, so it is schema-dependent and less flexible as well as less scalable.
• Semi-structured data is more flexible than structured data, but less flexible and scalable than unstructured data.
• Unstructured data has no dependency on any database, so it is more flexible and scalable than structured and semi-structured data.
Difference b/w types of Data

6. Performance
• On structured data we can run structured queries that allow complex joins, so performance is the highest of the three.
• On semi-structured data, only queries over anonymous nodes are possible, so performance is lower than for structured data but higher than for unstructured data.
• On unstructured data, only textual queries are possible, so performance is lower than for both structured and semi-structured data.
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”,
Wiley India Publishers.
• http://www.punjabiuniversity.ac.in/Pages/Images/elearn/DigitalData.pdf
• https://www.tutorialspoint.com/difference-between-structured-semi-
structured-and-unstructured-data
• https://www.michael-gramlich.com/what-is-structured-semi-structured-and-
unstructured-data/
• https://www.datamation.com/big-data/structured-vs-unstructured-data/

Data Mining

Dr. Atul Garg


Index
• Introduction
• Advantages / disadvantages
• Applications of Data Mining
• Knowledge Discovery Process
• Data Mining vs Query Tools
• What kind of data can be mined?
Data Mining
• Mining is the process of extracting some valuable material from the
earth, e.g., coal mining, diamond mining, etc. In the context of computer
science, "Data Mining" can be referred to as knowledge mining from
data, knowledge extraction, data/pattern analysis, and data
dredging. It is basically the process carried out to extract
useful information from a bulk of data or data warehouses.
or
• The process of extracting information from huge sets of data to identify
patterns, trends, and useful data that allow the business to make
data-driven decisions is called Data Mining.
Advantages of Data Mining
• The Data Mining technique enables organizations to obtain
knowledge-based data.
• Compared with other statistical data applications, data mining is
cost-efficient.
• Data Mining helps the decision-making process of an organization.
• It Facilitates the automated discovery of hidden patterns as well as the
prediction of trends and behaviors.
• It can be implemented in new systems as well as existing platforms.
• It is a quick process that makes it easy for new users to analyze
enormous amounts of data in a short time.
Disadvantages of Data Mining
• There is a probability that the organizations may sell useful data of
customers to other organizations for money.
• Much data mining analytics software is difficult to operate and needs
advanced training to work with.
• The selection of the right data mining tools is a very challenging
task.
• Data mining techniques are not 100% precise, so they may lead to
serious consequences in certain conditions.
Applications of Data Mining

Customer Relationship Management
Data Mining and Knowledge Discovery
Data mining and knowledge discovery in databases (KDD) are frequently treated as synonyms, but data
mining is actually part of the knowledge discovery process.

Knowledge Discovery Process


Data Mining and Knowledge Discovery
The iterative process consists of the following steps:
• Data cleaning: also known as data cleansing, it is a phase in which noise data and irrelevant data are removed
from the collection.
• Data integration: at this stage, multiple data sources, often heterogeneous, may be combined in a common
source.
• Data selection: at this step, the data relevant to the analysis is decided on and retrieved from the data
collection.
• Data transformation: also known as data consolidation, it is a phase in which the selected data is transformed
into forms appropriate for the mining procedure.
• Data mining: It is the crucial step in which clever techniques are applied to extract potentially useful patterns.
• Pattern evaluation: In this step, strictly interesting patterns representing knowledge are identified based on
given measures.
• Knowledge representation: is the final phase in which the discovered knowledge is visually represented to the
user. This essential step uses visualization techniques to help users understand and interpret the data mining
results.
Data cleaning and data integration can be performed together as a pre-processing phase to generate a data
warehouse.
Data selection and data transformation can also be combined where the consolidation of the data is the result of
the selection, or, as for the case of data warehouses, the selection is done on transformed data.
Query Tools vs Data Mining
• Query tools: software that allows end users to access information stored in a
database. Data mining: a process used to extract usable data from a larger set of raw data.
• Query tools: can be used to easily build and input queries to databases. Data
mining: a technique or concept in computer science.
• Query tools: make it very easy to build queries without even having to learn a
database-specific query language. Data mining: deals with extracting useful and
previously unknown information from raw data.
• Query tools: the users need to know exactly what they are looking for. Data
mining: used mostly when the user has only a vague idea about what they are looking for.
• Data miners can use the existing functionality of query tools to pre-process raw
data before the data mining process.
Kinds of data that can be mined

1. Flat Files
2. Relational Databases
3. Data Warehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)
9. Medical and personal data
10. Satellite sensing
11. Games
12. Text reports / Memos / Email-messages / chats
etc
Kinds of data that can be mined
1. Flat Files
• Flat files are data files in text or binary form with a structure that can be easily
extracted by data mining algorithms.
• Data stored in flat files have no relationships or paths among themselves.
• Flat files are represented by a data dictionary, e.g., a CSV file.
• Application: used in data warehousing to store data, used in carrying data to and from servers,
etc.
2. Relational Databases
• A Relational database is defined as the collection of data organized in tables with rows and
columns.
• Physical schema in Relational databases is a schema which defines the structure of tables.
• Logical schema in Relational databases is a schema which defines the relationship among tables.
• Standard API of relational database is SQL.
• Application: Data Mining, Relational Online Analytical Processing (ROLAP) model, etc.
Kinds of data that can be mined

3. Data Warehouse
• A data warehouse is a collection of data integrated from multiple sources that
supports queries and decision making.
• Two approaches can be used to update data in a data warehouse: the query-driven
approach and the update-driven approach.
• Application: business decision making, etc.
4. Transactional Databases
• A transactional database is a collection of data organized by time stamps, dates, etc.,
to represent transactions in databases.
• This type of database can roll back or undo its operations when a transaction is
not completed or committed.
• It is a highly flexible system where users can modify information without changing any
sensitive information.
• Application: banking, distributed systems, object databases, etc.
Kinds of data that can be mined
5. Multimedia Databases
• Multimedia databases consist of audio, video, image, and text media.
• They can be stored in object-oriented databases.
• They are used to store complex information in pre-specified formats.
• Application: digital libraries, video-on-demand, news-on-demand, musical databases, etc.
6. Spatial Databases
• Store geographical information.
• Store data in the form of coordinates, topology, lines, polygons, etc.
• Application: maps, global positioning, etc.
7. Time-series Databases
• Time-series databases contain data such as stock exchange data and user-logged activities.
• They handle arrays of numbers indexed by time, date, etc.
• They require real-time analysis.
• Application: eXtremeDB, InfluxDB, etc.
Kinds of data that can be mined
8. World Wide Web (WWW): The WWW is a collection of documents and resources such as audio, video, and
text, identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessible through web
browsers via the Internet.
• It is the most heterogeneous repository, as it collects data from multiple resources.
• It is dynamic in nature, as the volume of data is continuously increasing and changing.
Application: online shopping, job search, research, studying, etc.
9. Medical and personal data: From government censuses to personnel and customer files, very large collections
of information are continuously gathered about individuals and groups. Governments, companies, and
organizations such as hospitals are stockpiling very large quantities of personal data to help them manage
human resources, better understand a market, or simply assist clientele.
Applications: Hospitals, Social media etc
10. Satellite sensing: There are countless satellites around the globe: some are geostationary above a
region, and some are orbiting around the Earth, but all send a non-stop stream of data to the surface.
NASA, which controls a large number of satellites, receives more data every second than all NASA
researchers and engineers can cope with.
Applications: space institutions, etc.
Kinds of data that can be mined
11. Games: Our society collects a tremendous amount of data and statistics about games, players, and
athletes. From hockey scores, basketball passes, and car-racing laps to swimming times, boxers' punches,
and chess positions, all the data are stored. Commentators and journalists use this information for
reporting, but trainers and athletes want to exploit this data to improve performance and better
understand opponents.
Applications: BCCI, etc.
12. Text reports and memos (e-mail messages): Most of the communications within and between companies,
research organizations, or even private people are based on reports and memos in textual form, often
exchanged by e-mail. These messages are regularly stored in digital form for future use and reference,
creating formidable digital libraries.
Applications: e-commerce, hospitals, libraries, etc.
References
• https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapte
r1/index.html
• https://www.javatpoint.com/data-mining
• https://www.differencebetween.com/difference-between-data-mining-
and-vs-query-tools/
• https://vspages.com/data-mining-vs-query-tools-1897/
• https://www.geeksforgeeks.org/types-of-sources-of-data-in-data-
mining/
Data Pre-processing

Dr. Atul Garg


Index
• Introduction
• Why pre-process the data,
• Major tasks,
• ETL and Data Cleaning
• Missing values
• Noisy data
• Data cleaning as a process
Data Pre-processing
Data preprocessing is a data mining technique used to transform raw
data into a useful and efficient format.
Data Quality: Why Preprocess the Data?
• Measures for data quality: A multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling, …
• Timeliness: timely update?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily the data can be understood?

Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Reduction
• Dimensionality reduction: reduce/compress data, e.g., by removing unimportant attributes
• Numerosity reduction: Replaced by alternative, smaller representations
• Data compression
• Data Transformation and Data Discretization
• Normalization
• Concept hierarchy generation

Data Cleaning/Cleansing

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly


formatted, duplicate, or incomplete data within a dataset. When combining multiple data
sources, there are many opportunities for data to be duplicated or mislabelled. If data is
incorrect, outcomes and algorithms are unreliable, even though they may look correct.
There is no one absolute way to prescribe the exact steps in the data cleaning process
because the processes will vary from dataset to dataset. But it is crucial to establish a
template for your data cleaning process so you know you are doing it the right way
every time.

Data Cleaning

Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”, Birthday=“03-07-2010”,
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income
in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?

• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill it in automatically with
• a global constant: e.g., "unknown", a new class?!
• the attribute mean for all samples belonging to the same class: smarter
(see the sketch below)
• the most probable value: inference-based, such as a Bayesian formula or
a decision tree
*Bayes' theorem describes the probability of occurrence of an event related to any
condition.
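The following is a small sketch of the class-wise mean strategy (it requires pandas; the class labels and income values are illustrative, not from the slides):

import pandas as pd

# Fill missing 'income' values with the mean income of the same class.
df = pd.DataFrame({
    "class":  ["A", "A", "A", "B", "B", "B"],
    "income": [50000, None, 70000, 30000, 34000, None],
})

# Replace each missing value with the mean of its own class
# (class A mean = 60000, class B mean = 32000).
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(df)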
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning
• smooth a sorted data value by consulting its neighborhood
• sort the data and partition it into (equal-frequency) buckets/bins
• smooth by bin means: each value is replaced with the bin mean (see the sketch below)
• smooth by bin medians: each value is replaced with the bin median
• smooth by bin boundaries: each value is replaced with the closest of the bin's
minimum and maximum values
• Regression
• smooth by fitting the data to regression functions: a technique that conforms data
values to a function. Linear regression involves finding the best line to fit two
attributes
• Clustering/Outlier Analysis
• detect and remove outliers: similar values are organized in groups/clusters, and
values falling outside the clusters are treated as outliers
• Combined computer and human inspection
• detect suspicious values and check them by human inspection (e.g., deal with possible outliers)
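As an illustration, here is a short sketch of equal-frequency binning with smoothing by bin means; the nine data values are a common textbook example, and the three-bin split is an assumption:

# Equal-frequency binning with smoothing by bin means.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
n_bins = 3
size = len(data) // n_bins          # 3 values per bin

smoothed = []
for i in range(0, len(data), size):
    bin_vals = data[i:i + size]
    bin_mean = sum(bin_vals) / len(bin_vals)
    smoothed.extend([bin_mean] * len(bin_vals))   # replace each value by its bin mean

print(smoothed)   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]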

Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel): integrates discrepancy detection and
transformation
References
• https://www.slideshare.net/ankurh/data-preprocessing-31623210
• https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
• https://www.tableau.com/learn/articles/what-is-data-cleaning
Data Preprocessing-
Data Integration, Data
Reduction, Clustering, Data
Discretization

Dr. Atul Garg


Data Preprocessing: Data Integration

Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Can reduce and avoid redundancies and inconsistencies in resulting data set
• Can improve accuracy and speed

• Entity identification problem:
• Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Schema integration: e.g., A.cust-id ≡ B.cust-# (or roll no, or student-id)
• Integrate metadata from different sources
• Detecting and resolving data value conflicts
• For the same real-world entity, attribute values from different sources are different
• Possible reasons: different representations, different scales, e.g., metric vs. British units
A short schema-integration sketch in code follows.
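Below is a minimal, illustrative sketch of the schema-integration step (it requires pandas; the column names cust_id/cust_no and the sample data are hypothetical): two sources name the customer key differently, so one column is renamed before merging.

import pandas as pd

# Two sources use different names for the same customer key.
source_a = pd.DataFrame({"cust_id": [1, 2], "name": ["Bill Clinton", "Ada"]})
source_b = pd.DataFrame({"cust_no": [1, 2], "city": ["NY", "London"]})

# Rename, then merge on the unified key.
merged = source_a.merge(source_b.rename(columns={"cust_no": "cust_id"}),
                        on="cust_id")
print(merged)   # one row per customer, with name and city combined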
Handling Redundancy in Data Integration

• Redundant data occur often when integration of multiple databases


• Object identification: The same attribute or object may have different names in
different databases
• Derivable data: One attribute may be a “derived” attribute in another table, e.g.,
annual revenue
• Careful integration of the data from multiple sources may help reduce/avoid
redundancies and inconsistencies and improve mining speed and quality

Data Preprocessing:
Data Reduction

Data Reduction Strategies

• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Numerosity reduction (some simply call it: Data Reduction)
• Data compression

Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis, becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms: transform a data vector into a numerically different vector (of wavelet coefficients)
• Principal Component Analysis (PCA): often used to reduce the dimensionality of large data sets by transforming
a large set of variables into a smaller one (see the sketch below)
• Supervised and nonlinear techniques (e.g., feature selection)
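Below is a minimal PCA sketch (it requires scikit-learn; the random data merely stands in for a real dataset):

import numpy as np
from sklearn.decomposition import PCA

# Reduce 4-dimensional points to their two principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured per component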

Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods
• Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models

• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of multidimensional
feature vector
• Log-linear model
• Approximates discrete multidimensional probability distributions

Histogram Analysis

• Divide data into buckets and store average (sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth)

Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering and be stored in multi-dimensional index tree
structures

Sampling

• Sampling: obtaining a small sample s to represent the whole data set N


• Allow a mining algorithm to run in complexity that is potentially sub-linear to
the size of the data
• Key principle: Choose a representative subset of the data
• Simple random sampling may have very poor performance in the presence
of skew
• Develop adaptive sampling methods, e.g., stratified sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
Types of Sampling
• Simple random sampling
• There is an equal probability of selecting any particular item
• Sampling without replacement
• Once an object is selected, it is removed from the data warehouse
• Sampling with replacement
• A selected object is not removed from the population
• Stratified sampling:
• Partition the data set, and draw samples from each partition (proportionally, i.e.,
approximately the same percentage of the data)
• Used in conjunction with skewed data
A short sampling sketch follows.
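The following sketch illustrates the three schemes on a toy population of 100 items (the strata and the 10% sampling rate are assumptions for the example):

import random

population = list(range(100))

srs_wor = random.sample(population, 10)                   # without replacement
srs_wr = [random.choice(population) for _ in range(10)]   # with replacement

# Stratified sampling: draw proportionally (10%) from each stratum.
strata = {"A": list(range(80)), "B": list(range(80, 100))}   # skewed groups
stratified = [x for group in strata.values()
              for x in random.sample(group, len(group) // 10)]
print(len(stratified))   # 8 items from stratum A + 2 from stratum B = 10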

Sampling: With or without Replacement

Sampling: Cluster or Stratified Sampling
Stratified random sampling is a sampling method that involves taking samples of a population
subdivided into smaller groups called strata.

Data Cube Aggregation

A data cube enables data to be modeled and viewed in multiple dimensions.


• The lowest level of a data cube (base cuboid)
• The aggregated data for an individual entity of interest
• E.g., a customer in a phone calling data warehouse
• Multiple levels of aggregation in data cubes
• Further reduce the size of data to deal with
• Reference appropriate levels
• Use the smallest representation which is enough to solve the task
• Queries regarding aggregated information should be answered using data cube, when
possible
Data Reduction 3: Data Compression
• String compression
• There are extensive theories and well-tuned algorithms
• Typically lossless, but only limited manipulation is possible without expansion
• Audio/video compression
• Typically lossy compression, with progressive refinement
• Sometimes small fragments of signal can be reconstructed without reconstructing the
whole
• Time sequences are not audio: they are typically short and vary slowly with time
• Dimensionality and numerosity reduction may also be considered as forms of data
compression

Data Compression

[Diagram: lossless compression turns the original data into compressed data and
reconstructs it exactly; lossy compression reconstructs only an approximation of
the original data.]
Data Preprocessing:
Data Transformation and Data Discretization

Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values, such that each old value can be identified with one of the new
values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: concept hierarchy climbing
Normalization
Normalization is used to scale the data of an attribute so that it falls in a smaller range.
Normalization is generally required when we are dealing with attributes on different scales;
otherwise, an equally important attribute (on a lower scale) may be diluted in effectiveness
because another attribute has values on a larger scale. In simple words, when multiple
attributes have values on different scales, this may lead to poor data models while
performing data mining operations.
• Min-Max Normalization – In this technique of data normalization, linear transformation
is performed on the original data.
• Z-score normalization – In this technique, values are normalized based on mean and
standard deviation of the data A.
• Decimal Scaling Method For Normalization – It normalizes by moving the decimal
point of values of the data. To normalize the data by this technique, we divide each value
of the data by the maximum absolute value of data.
Min-Max Normalization

v' = ((v − min(A)) / (max(A) − min(A))) × (new_max(A) − new_min(A)) + new_min(A)

where A is the attribute data;
min(A) and max(A) are the minimum and maximum values of A, respectively;
v is the old value of each entry in the data;
v' is the new value of each entry in the data;
new_min(A) and new_max(A) are the min and max values of the required range
(i.e., its boundary values), respectively.

Suppose that the minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively, and we would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to:

((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0 = 0.716
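The same computation in a short Python sketch (the helper name min_max is ours, for illustration):

# Min-max normalization of the income example: map $73,600 from
# [$12,000, $98,000] onto the range [0.0, 1.0].
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))   # 0.716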
Z-score Normalization

v' = (v − μ_A) / σ_A

where v and v' are the old and new values of each entry in the data,
respectively, and μ_A and σ_A are the mean and standard deviation of
attribute A, respectively.

Suppose the mean of the dataset is 21.2 and the standard deviation is 29.8.
To perform a z-score normalization on the first value in the dataset (3), we can use the
following formula:
• New value = (x − μ) / σ
• New value = (3 − 21.2) / 29.8
• New value = −0.61
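The worked example as a short Python sketch (the helper name z_score is ours):

# Z-score normalization with the mean and standard deviation from the text.
def z_score(v, mean, std):
    return (v - mean) / std

print(round(z_score(3, 21.2, 29.8), 2))     # -0.61
print(round(z_score(134, 21.2, 29.8), 2))   # 3.79 (the outlier discussed below)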
Z-score Normalization

The mean of the normalized values is 0 and the standard deviation of the normalized
values is 1.
The normalized values represent the number of standard deviations that the original
value is from the mean.
For example:
•The first value in the dataset is 0.61 standard deviations below the mean.
•The second value in the dataset is 0.54 standard deviations below the mean.
•…
•The last value in the dataset is 3.79 standard deviations above the mean.
The benefit of performing this type of normalization is that the clear outlier in
the dataset (134) has been transformed in such a way that it’s no longer a
massive outlier.
Decimal Scaling Method for Normalization
It normalizes by moving the decimal point of the values of the data. Each data
value v_i is normalized to v_i' by dividing by a power of ten:

v_i' = v_i / 10^j

where j is the smallest integer such that max(|v_i'|) < 1.

Let the input data be: −10, 201, 301, −401, 501, 601, 701
To normalize the above data:
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide each value by 1000 (i.e., j = 3).
Result: the normalized data is −0.01, 0.201, 0.301, −0.401, 0.501, 0.601, 0.701.
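The same example as a short Python sketch (determining j from the digit count of the largest absolute value is our simplification):

# Decimal scaling: divide every value by 10^j, where j is the smallest
# integer that makes every |v'| < 1 (here j = 3, since max |v| = 701).
data = [-10, 201, 301, -401, 501, 601, 701]

j = len(str(max(abs(v) for v in data)))   # digits in the largest |value|
scaled = [v / 10 ** j for v in data]
print(scaled)   # [-0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701]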


Data Discretization
Data Discretization techniques can be used to divide the range of continuous attribute
into intervals. Numerous continuous attribute values are replaced by small interval
labels. This leads to a concise, easy-to-use, knowledge-level representation of mining
results.
• Top-down discretization: If the process starts by first finding one or a few points
(called split points or cut points) to split the entire attribute range, and then repeats this
recursively on the resulting intervals, then it is called top-down discretization or
splitting.
• Bottom-up discretization: If the process starts by considering all of the continuous
values as potential split-points, removes some by merging neighbourhood values to
form intervals, then it is called bottom-up discretization or merging.

Data Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification

Data Discretization Methods
Typical methods: All the methods can be applied recursively
• 1 Binning: Binning is a top-down splitting technique based on a specified number
of bins. Binning is an unsupervised discretization technique.
• 2 Histogram Analysis: Because histogram analysis does not use class
information so it is an unsupervised discretization technique. Histograms partition
the values for an attribute into disjoint ranges called buckets.
• 3 Cluster Analysis: Cluster analysis is a popular data discretization method. A
clustering algorithm can be applied to discretize a numerical attribute A by
partitioning the values of A into clusters or groups.
• Each initial cluster or partition may be further decomposed into several
subclusters, forming a lower level of the hierarchy
• Decision-tree analysis (supervised, top-down split)
Concept Hierarchy Generation
• Discretization can be performed recursively on an attribute to provide a
hierarchical partitioning of the attribute values, known as a concept
hierarchy.
• Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple
dimensions, and each dimension contains multiple levels of abstraction
defined by concept hierarchies.
• This organization provides users with the flexibility to view data from
different perspectives.
• Examples include – geographic location, – job category and item type, etc
Data Preprocessing : Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley
India Publishers.
• Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Third edition, Morgan Kaufman Publishers.
• https://www.geeksforgeeks.org/data-normalization-in-data-mining/
• http://www.lastnightstudy.com/Show?id=45/Data-Discretization-and-Concept-
Hierarchy-Generation
• http://dataminingzone.weebly.com/uploads/6/5/9/4/6594749/ch_7discretization_an
d_concept_hierarchy_generation.pdf
• http://webpages.iust.ac.ir/yaghini/Courses/Application_IT_Fall2008/DM_02_07_D
ata%20Discretization%20and%20Concept%20Hierarchy%20Generation.pdf

Business Analytics

Dr. Atul Garg


Business Analytics

Business analytics, a data management solution and business intelligence


subset, refers to the use of methodologies such as
• Data mining,
• Predictive analytics, and
• Statistical analysis
in order to analyze and transform data into useful information, identify and
anticipate trends and outcomes, and ultimately make smarter, data-driven
business decisions.
Components of Business Analytics

• Data Aggregation: prior to analysis, data must first be gathered,


organized, and filtered, either through volunteered data or
transactional records
• Data Mining: data mining for business analytics sorts through large
datasets using databases, statistics, and machine learning to identify
trends and establish relationships
• Association and Sequence Identification: the identification of
predictable actions that are performed in association with other actions
or sequentially
• Text Mining: explores and organizes large, unstructured text datasets
for the purpose of qualitative and quantitative analysis
Components of Business Analytics
cont…
• Forecasting: analyzes historical data from a specific period in order to
make informed estimates that are predictive in determining future events
or behaviors
• Predictive Analytics: predictive business analytics uses a variety of
statistical techniques to create predictive models, which extract
information from datasets, identify patterns, and provide a predictive
score for an array of organizational outcomes
• Optimization: once trends have been identified and predictions have
been made, businesses can engage simulation techniques to test out best-
case scenarios
• Data Visualization: provides visual representations such as charts and
graphs for easy and quick data analysis
Business Analytics vs Data Analytics

• Data analytics is a broad umbrella term that refers to the science of analyzing
raw data in order to transform that data into useful information from which
trends and metrics can be revealed.
• While both business analytics and data analytics aim to improve operational
efficiency, business analytics is specifically oriented to business uses and data
analytics has a broader focus.
• Both business intelligence and reporting fall under the data analytics umbrella.
• Data scientists, data analysts, and data engineers work together in the data
analytics process to collect, integrate, and prepare data for the development,
testing, and revision of analytical models, ensuring accurate results.
• Data analytics for business purposes is characterized by its focus on specific
business-operations questions.
Business Intelligence vs Business Analytics
• While business intelligence and business analytics serve similar purposes, and the
terms may be used interchangeably, these practices differ in their fundamental focus.
• Business intelligence analytics focuses on descriptive analytics, combining data
gathering, data storage, and knowledge management with data analysis to evaluate past
data and provide new perspectives on currently known information.
• Business analytics focuses on prescriptive analytics, using data mining, modeling, and
machine learning to determine the likelihood of future outcomes.
• Essentially, business intelligence answers the questions, “What happened?” and “What
needs to change?” and
• Business analytics answers the questions, “Why is this happening?”, “What if this trend
continues?”, “What will happen next?”, and “What will happen if we change
something?”
• Business analytics and business intelligence solutions tend to overlap in structure and
purpose.
References
• https://www.omnisci.com/technical-glossary/business-analytics
• https://ptgmedia.pearsoncmg.com/images/9780133552188/samplepage
s/0133552187.pdf
Thanks
Business Intelligence

Dr. Atul Garg


Index
• BI Technology
• BI Roles & Responsibilities
• BI Component Framework
Business Intelligence
Business Intelligence (BI) is a rapidly expanding area that is becoming
increasingly important to companies looking for ways to improve how they
use the data they gather.
• BI enables companies to use the data they hold efficiently by
analyzing it and then using it to inform business decisions.
• Business Intelligence refers to the insights gained from analyzing the
business information that companies hold.
• Businesses hold vast data, all of which can potentially provide them
with detailed insights to inform critical business decisions.
• This data needs to be mined and analyzed in the right way to realize its
full potential and to provide valuable insights for the business.
Roles of a Business Intelligence Analyst

• Writing data collection and processing procedures.

• Ensuring that data is being correctly gathered, stored, and analyzed.
• Reporting data findings to management.
• Continually monitoring data collection.
• Developing methodologies to improve data analysis.
Responsibilities of a BI Analyst
Job descriptions will vary by company, but these are some of general responsibilities:
• Review and validate customer data as it’s collected
• Oversee the deployment of data to the data warehouse
• Develop policies and procedures for the collection and analysis of data
• Create or discover new data procurement and processing programs
• Cooperate with IT department to deploy software and hardware upgrades that make
it possible to leverage big data use cases
• Monitor analytics and metrics results
• Implement new data analysis methodologies
• Review customer files to ensure integrity of data collection and utilization
• Perform data profiling to identify and understand anomalies
BI Component
Framework
BI Component Framework
• Business Layer
• Administration and Operation Layer
• Implementation Layer
Business Layer

This layer consists of four components –

1. Business Requirements
• Business drivers
• Business goals
• Business strategies

2. Business Value
• Return on Investment (ROI)
• Return on Asset (ROA)
• Total Cost of Ownership (TCO)
• Total Value of Ownership (TVO)

3. Program Management

4. Development
Business Layer – Business Requirements

Business requirements: The requirements are the product of a three-step process that includes:

 Business drivers - the impulses that initiate the need to act.


Examples: changing workforce, changing labor laws, changing economy,
changing technology, etc.
 Business goals- the targets to be achieved in response to the business drivers.
Examples: increased productivity, improved market share, improved profit
margins, improved customer satisfaction, cost reduction, etc.
 Business strategies- the planned course of action that will help achieve the set
goals.
Examples: outsourcing, global delivery model, partnerships, customer retention
programs, employee retention programs, competitive pricing, etc.
Business Layer- Business Value

When a strategy is implemented against certain business goals, then certain costs (monetary, time, effort, information
produced by data integration and analysis, application of knowledge from past experience, etc.) are involved.
The business value can be measured in the terms of ROI (Return on Investment), ROA (Return on Assets), TCO
(Total Cost of Ownership), TVO(Total Value of Ownership), etc. Let us understand these terms with the help of a
few examples –
Return on Investment (ROI): We take the example of “AMAZON”, an e-commerce company which has been using
social media (mainly Twitter and Facebook) to help get new clients and to increase the number of prospects/leads.
They attribute 10% of their daily revenue to social media. Now, that is an ROI from social media!
Return on Asset (ROA): Suppose a company, “Electronics Today”, has a net income of $1 million and total
assets of $5 million. Then its ROA is 20%, i.e. net income divided by total assets. So, ROA is the earning from invested capital (assets). (A short computational sketch follows these definitions.)
Total Cost of Ownership (TCO): Let us understand TCO in the context of a vehicle. TCO defines the cost of
owning a vehicle from the time of purchase by the owner, through its operation and maintenance to the time it
leaves the possession of the owner.
Total Value of Ownership (TVO): TVO has replaced the simple concept of Owner's Equity in some companies. It
could include a variety of subcategories such as stock, undistributed dividends, retained earnings or profit, or excess
capital contributed.
Business Layer- Program Management

This component of the business layer ensures that people, projects,


and priorities work in a manner in which individual processes are
compatible with each other so as to ensure seamless integration and
smooth functioning of the entire program. It should attend to each of
the following:
• Business priorities
• Mission and goals
• Strategies and risks
• Multiple projects
• Dependencies
• Cost and value
• Business rules
• Infrastructure
Business Layer- Development

The process of development consists of


• database/data-warehouse development (consisting of data
profiling, data cleansing and database tools),
• data integration system development (consists of data
integration tools and data quality tools)
• business analytics development (about processes and
various technologies used).
BI Component Framework
• Business Layer
• Administration and Operation Layer
• Implementation Layer
Administration and Operation Layer

• This layer consists of four components-


• 1. BI Architecture
a. Data
b. Integration
c. Information
d. Technology
e. Organization
• 2. BI and DW Operations
a. Backup and restore
b. Security
c. Configuration and Management
d. Database Management
3. Data Resource Management
a. Data Governance
b. Metadata management
4. Business Applications
Administration and Operations Layer - BI Architecture
Administration and Operations Layer – BI and DW Operations

Data Warehouse (DW) administration requires the usage of


various tools to monitor the performance and usage of the
warehouse, and perform administrative tasks on it. Some of these
tools would be:
• Backup and restore
• Security
• Configuration management
• Database management
Administration and Operations Layer –Data Resource Management

Data resource administration: Involves data governance


and metadata management.
Data governance is a technique for controlling data quality,
which is used to assess, improve, manage and maintain
information. It helps to define standards that are required to
maintain data quality. The distribution of roles for
governance of data is as follows:
• Data ownership
• Data stewardship
• Data custodianship
Administration and Operations Layer –Data Resource Management

Metadata management: Metadata is data about data.


Consider CD/DVD of music. There is the date of recording, the name of the artist,
the genre of music, the songs in the album, copyright information, etc. All this
information constitutes the metadata for the CD/DVD of music. In the context of a
camera, the data is the photographic image. The metadata then is the date and time
when the photo was taken. In simple words, metadata is data about data. Metadata
management involves tracking, assessment, and maintenance of metadata.
Metadata can be divided into four groups:
– Business metadata
– Process metadata
– Technical metadata
– Application metadata
Administration and Operations Layer –Data Resource Management
Administration and Operations Layer –
Business Applications

The application of technology to produce value for the


business refers to the generation of information or
intelligence from data assets like data warehouses/data
marts. Using BI tools, we can generate strategic, financial,
customer, or risk intelligence. This information can be
obtained through various BI applications, such as DSS (decision support system), EIS (executive information system), OLAP (on-line analytical processing), data mining and discovery, etc.
BI Component Framework
• Business Layer
• Administration and Operation Layer
• Implementation Layer
BI Component – Implementation Layer

The implementation layer of the BI component framework


consists of technical components that are required for data
capture, transformation and cleansing, turning data into information, and
finally delivering that information to leverage business goals
and produce value for the organization.
1. Data Warehousing
• Data Sources
• Data Acquisition, Cleaning, and Integration
• Data Stores
2. Information Services
• Information Delivery
• Business Analytics
BI Component - Implementation Layer
Implementation Layer – Data Warehousing

It is the process which prepares the basic repository of data


(called data warehouse) that becomes the data source where
we extract information from.
Data Warehouse: A data warehouse is a data store. It is
structured on the dimensional model schema, which is
optimized for data retrieval rather than update.
Data warehousing must play the following five distinct roles:
– Intake
– Integration
– Distribution
– Delivery
– Access
Implementation Layer
Implementation Layer – Information Services

• It is not only the process of producing information; rather, it involves


ensuring that the information produced is aligned with business
requirements and can be acted upon to produce value for the
company.

• Information is delivered in the form of Key Performance Indicator


(KPI’s), reports, charts, dashboards or scorecards, etc., or in the form
of analytics.

• Data mining is a practice used to increase the body of knowledge.

• Applied analytics is generally used to drive action and produce


outcomes.
Who is BI for?
It is a misconception to believe that BI is only for managers or the executive class. True, it
is used more often by them. But does that mean that BI can be used only for
management and control? The answer is: No!
BI Applications
BI applications can be divided into:
• Technology solutions
– DSS (Decision Support Systems)
– EIS (Executive Information Systems)
– OLAP (Online Analytical Processing and multidimensional analysis)
– Managed Query and Reporting
– Data Mining
• Business Solutions
– Performance Analysis
– Customer Analysis
– Market Place Analysis
– Productivity Analysis
– Sales Channel Analysis
– Behavioral Analysis
– Supply Chain Analysis
BI Roles and Responsibilities

Program Roles:
• BI Program Manager
• BI Data Architect
• BI ETL Architect (executes data integration projects)
• BI Technical Architect
• Metadata Manager
• BI Administrator
• Data Administrator

Project Roles:
• Business Manager
• BI Business Specialist
• BI Project Manager
• Business Requirements Analyst
• Decision Support Analyst
• BI Designer
• ETL Specialist
Open Source BI Tools

• RDBMS: MySQL, Firebird
• ETL Tools: Pentaho Data Integration (formerly called Kettle), SpagoBI
• Analysis Tools: Weka, RapidMiner, SpagoBI
• Reporting Tools/Ad Hoc Querying/Visualization: Pentaho, BIRT, Actuate, Jaspersoft
References
• https://www.fieldengineer.com/skills/business-intelligence
• https://searchbusinessanalytics.techtarget.com/Ultimate-guide-to-
business-intelligence-in-the-enterprise
• https://expert360.com/resources/articles/role-responsibilities-business-
intelligence-analyst
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• http://www.punjabiuniversity.ac.in/Pages/Images/elearn/DigitalData.
pdf
Thanks
On-Line Transaction
Processing(OLTP) and On-Line
Analytical Processing

Dr. Atul Garg


Index
• OLTP
• OLAP
OLTP Understanding

Online Transaction Processing: Online transaction processing, commonly known as OLTP, supports
transaction-oriented applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an
organization.
Consider a point-of-sale (POS) system in a supermarket store. You have picked a bar of chocolate
and wait for your turn in the queue to get it billed. The cashier scans the chocolate bar's bar code.
Consequent to the scanning of the bar code, some activities take place in the background —
the database is accessed;
the price and product information is retrieved and displayed on the computer screen;
the cashier feeds in the quantity purchased;
the application then computes the total, generates the bill, and prints it. You pay the cash and
leave.
The application has just added a record of your purchase in its database. This was an On-Line
Transaction Processing (OLTP) system designed to support on-line transactions and query processing.
 In other words, the POS of the supermarket store was an OLTP system.
OLTP Understanding

 OLTP systems refer to a class of systems that manage transaction-oriented applications.


 These applications are mainly concerned with the entry, storage and retrieval of data.
 They are designed to cover most of the day-to-day operations of an organization such as
purchasing, inventory, manufacturing, payroll, accounting, etc.
 OLTP systems are characterized by a large number of short on-line transactions such as
INSERT (a record of final purchase by a customer was added to the database), UPDATE
(the price of a product has been raised from Rs10 to Rs10.5), and DELETE (a product
has gone out of demand and therefore the store removes it from the shelf as well as from
its database).
 Almost all industries today (including airlines, mail-order, supermarkets, banking, etc.)
use OLTP systems to record transactional data. The data captured by OLTP systems is
usually stored in commercial relational databases.
Online Transactional Processing (OLTP) System

Traditional database application is focused on Online Transactional Processing (OLTP),


– Short, simple queries and frequent updates involving a relatively small number of tuples e.g.,
recording sales at cash-registers, selling airline tickets.
ONLINE TRANSACTION PROCESSING SYSTEM

• Used for transaction oriented applications


• Used by lower-level employees
• Quick updates and retrievals
• Many users accessing the same data
• Users are not technical persons
• Response rate is very fast
• Single transaction (one application) at a time
• Stores routine data
• Follows client server model
Applications
– Banks
– Retail stores
– Airline reservation
OLTP
(ONLINE TRANSACTION PROCESSING SYSTEM)

User gets instant update


on the account balance
after withdrawing the
money
TRANSACTIONS

• Single event that changes something


• Different types of transactions
– Customer orders
– Receipts
– Invoices
– Payments

• Processing of transactions includes the storage and editing of data

– When a transaction is completed, the records of the organization are updated
TRANSACTIONS

[Figure: multiple users concurrently issuing INSERT, UPDATE, and RETRIEVE operations against the same database]
OLTP Segmentation

• They can be segmented into:


– Real-time Transaction Processing
– Batch Processing
Real-time Transaction processing

• Multiple users can fetch the information


• Very fast response rate
• Transactions processed immediately
• Everything is processed in real time
Batch Processing

• Where information is required in batch


• Offline access to information
• Presorting (sequence) is applied
• Takes time to process information

[Figure: daily transactions from Day 1 to Day 30 accumulated and processed as one batch, e.g. the monthly purchases of a retail store]
Characteristics of OLTP Model
• Online connectivity
• LAN,WAN
• Availability
– Available 24 hours a day
• Response rate
– Rapid response rate
– Load balancing by prioritizing the transactions
• Cost
– Cost of transactions is less
• Update facility
– Less lock periods
– Instant updates
– Use the full potential of hardware and software
Limitations of Relational Models

• Create and maintain large number of tables for the voluminous data
• For new functionalities, new tables are added
• Unstructured data cannot be stored in relational databases
• Very difficult to manage and relate data across tables without a common denominator (keys)
Queries that an OLTP System can Process

• Search for a particular customer’s record.


• Retrieve the product description and unit price of a
particular product.
• Filter all products with a unit price equal to or above a given value, e.g., Rs. 25.
• Filter all products supplied by a particular supplier.
• Search and display the record of a particular supplier.
Advantages and Challenges of an OLTP System

Advantages of an OLTP System


• Simplicity – It is designed typically for use by clerks, cashiers, clients, etc.
• Efficiency – It allows its users to read, write and delete data quickly.
• Fast query processing – It responds to user actions immediately and also supports transaction
processing on demand.

Challenges of an OLTP System


• Security – An OLTP system requires concurrency control (locking) and recovery mechanisms
(logging).
• OLTP system data content not suitable for decision making – A typical OLTP system manages the
current data within an enterprise/organization. This current data is far too detailed to be easily used
for decision making.
The Queries that OLTP Cannot Answer

• The super market store is deciding on introducing a new product. The key questions they
are debating are: “Which product should they introduce?” and “Should it be specific to a
few customer segments?”
• The super market store is looking at offering some discount on their year- end sale. The
questions here are: “How much discount should they offer?” and “Should it be different
discounts for different customer segments?”
• The supermarket is looking at rewarding its most consistent salesperson. The question
here is: “How to zero in on its most consistent salesperson (consistent on several
parameters)?”
• All the queries stated above have more to do with analysis than simple reporting

Ideally these queries are not meant to be solved by an OLTP system.


OLAP - Online Analytical Processing
Online Analytical Processing, a category of software tools which provide analysis
of data for business decisions. OLAP systems allow users to analyze database
information from multiple database systems at one time.

• OLAP differs from traditional databases in the way data is conceptualized and stored.
• In OLAP data is held in the dimensional form rather than the relational form.
• OLAP’s life blood is multi-dimensional data.
• OLAP tools are based on the multi-dimensional data model. The multi-dimensional data
model views data in the form of a data cube.
• Online Analytical Processing (OLAP) is a technology that is used to organize large business
databases and support business intelligence.
• OLAP databases are divided into one or more cubes. The cubes are designed in such a way that
creating and viewing reports becomes easy: each cube is organized and designed by a cube
administrator to fit the way that you retrieve and analyze data, so that it is easier to create
and use the PivotTable reports and PivotChart reports that you need.
OLAP (Online Analytical Processing)

• OLAP is a category of software that allows users to analyze information from


multiple database systems at the same time. It is a technology that enables analysts
to extract and view business data from different points of view
• Analysts frequently need to group, aggregate and join data. These operations in
relational databases are resource intensive. With OLAP, data can be pre-calculated
and pre-aggregated, making analysis faster.
• Provides multidimensional view of data
• Used for analysis of data
• Data can be viewed from different perspectives
• Determine why data appears the way it does
• Drill down approach is used to further dig down deep into the data
OLAP - Example

 Let us consider the data of a supermarket store, “AllGoods” store, for the year “2020”.
 This data as captured by the OLTP system is under the following column headings:
Section, Product-CategoryName, YearQuarter, and SalesAmount. E.g. we have a total of
32 records/rows.
 The Section column can have one value from amongst “Men”, “Women”, “Kid”, and
“Infant”.
 The ProductCategory Name column can have either the value “Accessories” or the
value “Clothing”.
 The YearQuarter column can have one value from amongst “Q1”, “Q2”, “Q3”, and
“Q4”.
The SalesAmount column records the sales figures for each Section, ProductCategoryName, and YearQuarter.
OLAP - Example
Characteristics of OLAP

• Multidimensional analysis
• Support for complex queries
• Advanced database support
– Support large databases
– Access different data sources
– Access aggregated data and detailed data
• Easy-to-use End-user interface
– Easy to use graphical interfaces
– Familiar interfaces with previous data analysis tools
• Client-Server Architecture
– Provides flexibility
– Can be used on different computers
– More machines can be added
One Dimensional
Consider the table shown in the earlier slide - It displays “AllGoods” store’s sales
data by Section, which is one-dimensional .
Although Figure 3.4 shows data in two dimensions (horizontal and vertical), in OLAP
it is considered to be one-dimensional, as we are looking at the SalesAmount
from one particular perspective, i.e. by Section.

Table 3.5 presents the sales data of the “AllGoods” stores


by ProductCategoryName. This data is again in one
dimension as we are looking at the SalesAmount from
one particular perspective, i.e. ProductCategoryName.
Table 3.6 presents the“AllGoods” sales data by yet
another dimension, i.e. YearQuarter. However, this data is
yet another example of one-dimensional data as we are
looking at the SalesAmount from one particular
perspective, i.e. by YearQuarter.
Two Dimensional
One-dimensional data was easy. What if the requirement was to view the company's data by calendar
quarters and product categories? Here, two-dimensional data comes into play. The two-dimensional
depiction of data allows one the liberty to think about dimensions as a kind of coordinate system.
Table 3.7 gives you a clear idea of the two-dimensional data. In this table, two dimensions (YearQuarters
and ProductCategoryName) have been combined.

In Table 3.7, data has been plotted along two dimensions as we can now look at the SalesAmount from two perspectives, i.e. by
YearQuarter and ProductCategoryName. The calendar quarters have been listed along the vertical axis and the product categories
have been listed across the horizontal axis. Each unique pair of values of these two dimensions corresponds to a single point of
SalesAmount data. For example, the Accessories sales for Q2 add up to $9680.00 whereas the Clothing sales for the same
quarter total up to $12366.00. Their sales figures correspond to a single point of SalesAmount data, i.e. $22046.
Three Dimensional

What if the company’s analyst wishes to view the data — all of it — along all the three dimensions (Year-Quarter,
ProductCategoryName, and Section) and all on the same table at the same time? For this the analyst needs a
three-dimensional view of data as arranged in Table 3.8. In this table, one can now look at the data by all the three
dimensions/ perspectives, i.e. Section, ProductCategoryName, YearQuarter. If the analyst wants to look for the
section which recorded maximum Accessories sales in Q2, then by giving a quick glance to Table 3.8, he can
conclude that it is the Kid section.
Can we go beyond Three Dimensional

Well, if the question is “Can you go beyond the third dimension?” the answer is YES!
If at all there is any constraint, it is because of the limits of your software. But if the question is “Should you
go beyond the third dimension?” we will say it depends entirely on what data has been captured by your operational
transactional systems and what kind of queries you wish your OLAP system to respond to.
Now that we understand multi-dimensional data, it is time to look at the functionalities and characteristics
of an OLAP system. OLAP systems are characterized by a low volume of transactions that involve very
complex queries. Some typical applications of OLAP are: budgeting, sales forecasting, sales reporting, and business
process management.
Example: Assume a financial analyst reports that the sales by the company have gone up. The next question is
“Which Section is most responsible for this increase?” The answer to this question is usually followed by a
barrage of questions such as “Which store in this Section is most responsible for the increase?” or “Which
particular product category or categories registered the maximum incréase?” The answers to these are provided
by multidimensional analysis or OLAP;
Can we go beyond Three Dimensional

 Let us go back to our example of a company’s (“AllGoods”) sales data


viewed along three dimensions:
Section, ProductCategoryName, and YearQuarter.
Given below are a set of queries, related to example, that a typical
OLAP system is capable of responding to:
• What will be the future sales trend for “Accessories” in the “Kid’s” Section?
• Given the customers buying pattern, will it be profitable to launch product “XYZ” in the
“Kid's” Section?
• What impact will a 5% increase in the price of products have on the customers?
Advantages of an OLAP System

• Multi-dimensional data representation.


• Consistency of information.
• “What if ” analysis.
• Provides a single platform for all information and business needs– planning, budgeting,
forecasting, reporting and analysis.
• Fast and interactive ad hoc exploration.
Difference between OLTP and OLAP

• Basic: OLTP is an online transactional system and manages database modification. OLAP is an online data retrieving and data analysis system.
• Focus: OLTP inserts, updates, and deletes information from the database. OLAP extracts data for analysis that helps in decision making.
• Data: OLTP and its transactions are the original source of data. Different OLTP databases become the source of data for OLAP.
• Transaction: OLTP has short transactions. OLAP has long transactions.
• Time: The processing time of a transaction is comparatively less in OLTP and comparatively more in OLAP.
• Queries: OLTP uses simpler queries; OLAP uses complex queries.
Difference between OLTP and OLAP

• Source of data: OLTP uses operational/transactional data. OLAP uses data extracted from various operational data sources, transformed and loaded into the data warehouse.
• Purpose of data: OLTP manages (controls and executes) basic business tasks. OLAP assists in planning, budgeting, forecasting and decision making.
• Data contents: OLTP holds current data, far too detailed to be suitable for decision making. OLAP holds historical data with support for summarization and aggregation; it stores and manages data at various levels of granularity, making it suitable for decision making.
Difference between OLTP and OLAP

• Normalization: Tables in an OLTP database are normalized (3NF). Tables in an OLAP database are not normalized.
• Integrity: An OLTP database must maintain data integrity constraints. An OLAP database does not get frequently modified, hence data integrity is not affected.
• Database design: OLTP uses typically normalized tables and adopts the ER (Entity Relationship) model. OLAP uses typically de-normalized tables with a star or snowflake schema.
• Operations: OLTP is read/write; OLAP is mostly read.
• Backup and recovery: In OLTP, regular backups of operational data are mandatory, and concurrency control (locking) and recovery mechanisms (logging) are required. In OLAP, instead of regular backups, the data warehouse is refreshed periodically using data from operational data sources.
Difference between OLTP and OLAP

• Derived data and aggregates: Rare in OLTP; common in OLAP.
• Data structures: Complex in OLTP; multi-dimensional in OLAP.
• Few sample queries:
– OLTP: Search & locate student(s); print student scores; filter students above 90% marks.
– OLAP: Which courses have productivity impact on-the-job? How much training is needed on future technologies for non-linear growth in BI? Why consider investing in a DSS experience lab?
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• https://techdifferences.com/difference-between-oltp-and-olap.html
• https://www.guru99.com/oltp-vs-olap.html
• http://www.punjabiuniversity.ac.in/Pages/Images/elearn/OLTPandOL
AP.pdf
Thanks
Different On-Line Analytical
Processing (OLAP)
Architectures

Dr. Atul Garg


Index
• Different OLAP Architectures
• MOLAP
• ROLAP
• HOLAP
Different OLAP Architectures
Multidimensional OLAP (MOLAP)

Multidimensional OLAP (MOLAP) is a classical OLAP that facilitates data analysis by
using a multidimensional data cube. Data is pre-computed, pre-summarized, and stored in a
MOLAP. Using MOLAP, a user can view multidimensional data from different facets.
Multidimensional data analysis is also possible if a relational database is used. On the
contrary, MOLAP has all possible combinations of data already stored in a multidimensional
array, and can access this data directly.
MOLAP Architecture

MOLAP architecture includes the following components:
• Database server
• MOLAP server
• Front-end tool

1. The user requests reports through the interface.
2. The application logic layer of the multi-dimensional database (MDDB) retrieves the stored data from the database.
3. The application logic layer forwards the result to the client/user.

MOLAP architecture mainly reads pre-compiled data. It has limited capabilities to dynamically create aggregations or to calculate results that have not been pre-calculated and stored.

[Figure: MOLAP architecture]
MOLAP Advantages

1. MOLAP can manage, analyze and store considerable amounts of


multidimensional data.
2. Fast Query Performance due to optimized storage, indexing, and caching.
3. Smaller sizes of data as compared to the relational database.
4. Automated computation of higher-level aggregates of the data.
5. Helps users to analyze larger, less-defined data.
6. MOLAP is easier for users; hence it is a suitable model for inexperienced users.
7. MOLAP cubes are built for fast data retrieval and are optimal for slicing
and dicing operations.
8. All calculations are pre-generated when the cube is created.
Disadvantages of MOLAP

1. One major weakness of MOLAP is that it is less scalable than ROLAP as it


handles only a limited amount of data.
2. MOLAP also introduces data redundancy, as it is resource intensive.
3. MOLAP solutions may be lengthy, particularly on large data volumes.
4. MOLAP products may face issues while updating and querying models
when dimensions are more than ten.
5. MOLAP is not capable of containing detailed data.
6. The storage utilization can be low if the data set is highly scattered.
7. It can handle the only limited amount of data therefore, it’s impossible to
include a large amount of data in the cube itself.
Relational OLAP (ROLAP)

ROLAP is an extended RDBMS along with multidimensional data mapping to


perform the standard relational operation.
In ROLAP, data is stored in relational database. In essence, each action of
slicing and dicing is equivalent to adding a “WHERE” clause in the SQL
statement.

Data stored in relational database (ROLAP)


ROLAP Model
There are three main components in a ROLAP model:
1. Database server: This exists in the data layer. It consists of the data that is loaded into the ROLAP server.
2. ROLAP server: This consists of the ROLAP engine that exists in the application layer.
3. Front-end tool: This is the client desktop that exists in the presentation layer.

Working of ROLAP:
When a user makes a query (complex), the
ROLAP server will fetch data from the RDBMS
server. The ROLAP engine will then create data
cubes dynamically. The user will view data from a
multi-dimensional point.

Unlike in MOLAP, where the multi-dimensional


view is static, ROLAP provides a dynamic multi-
dimensional view. This explains why it is slower
when compared to MOLAP.
ROLAP Advantages

1. It can handle huge volumes of data.


2. ROLAP utilizes a relational database. This enables the model to integrate
the ROLAP server with an RDBMS (relational database management
system).
3. High data efficiency: It offers high data efficiency because query
performance and access language are optimized particularly for the
multidimensional data analysis.
4. Scalability: This type of OLAP system offers scalability for managing
large volumes of data, and even when the data is steadily increasing.
Disadvantages of ROLAP

1. Demand for higher resources: ROLAP needs high utilization of


manpower, software, and hardware resources.
2. Aggregate data limitations: ROLAP tools use SQL for all calculations of aggregate
data. However, SQL is not well suited to every such computation, which limits
complex aggregate calculations.
3. Slow query performance: Query performance in this model is slow when
compared with MOLAP
Hybrid Online Analytical Processing (HOLAP)

This type of analytical processing solves the limitations of MOLAP and


ROLAP and combines their attributes. Hybrid OLAP is a mixture of both
ROLAP and MOLAP. It offers fast computation of MOLAP and higher
scalability of ROLAP. Data in the database is divided into two parts:
specialized storage and relational storage. Integrating these two aspects
addresses issues relating to performance and scalability. HOLAP stores huge
volumes of data in a relational database and keeps aggregations in a MOLAP
server.

HOLAP
HOLAP Model

Working of HOLAP:
The HOLAP model consists of a server that can
support ROLAP and MOLAP. It consists of a
complex architecture that requires frequent
maintenance. Queries made in the HOLAP model
involve the multi-dimensional database and the
relational database. The front-user tool presents
data from the database management system
(directly) or through the intermediate MOLAP.

HOLAP Model
HOLAP Advantages

1. It improves performance and scalability because it combines multi-dimensional and


relational attributes of online analytical processing.
2. It is a resourceful analytical processing tool if we expect the size of data to increase.
3. Its processing ability is higher than the other two analytical processing tools.
4. This kind of OLAP helps to economize the disk space, and it also remains compact
which helps to avoid issues related to access speed and convenience.
5. Hybrid HOLAP’s uses cube technology which allows faster performance for all types of
data.
6. ROLAP data is instantly updated, and HOLAP users have access to this real-time data,
while MOLAP brings cleaning and conversion of data, thereby improving data relevance.
This brings the best of both worlds.
Disadvantages of HOLAP

The model uses a huge storage space because it consists of data from two
databases.
The model requires frequent updates because of its complex nature.
Greater complexity level: The major drawback in HOLAP systems is that it
supports both ROLAP and MOLAP tools and applications. Thus, it is very
complicated.
Potential overlaps: There are higher chances of overlap, especially in their functionalities.
Other OLAP

Web OLAP (WOLAP): An OLAP system accessible via the web browser. WOLAP has a three-tiered architecture consisting of three components: client, middleware, and a database server.

Mobile OLAP: Helps users to access and analyze OLAP data using their mobile devices.

Spatial OLAP (SOLAP): Created to facilitate management of both spatial and non-spatial data in a Geographic Information System (GIS).
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• https://www.guru99.com/multidimensional-online-analytical-
processing.html
• https://www.guru99.com/online-analytical-processing.html
• https://www.section.io/engineering-education/molap-vs-rolap-vs-
holap/
Thanks
OLAP Cube and Analytical
operations of OLAP

Dr. Atul Garg


Index
• OLAP Cubes
• Roll up
• Drill down
• Slice and Dice
• Pivot Table
OLAP Cube

Online Analytical Processing (OLAP) is a category of software that allows


users to analyze information from multiple database systems at the same
time. It is a technology that enables analysts to extract and view business data
from different points of view.
Analysts frequently need to group, aggregate and join data. These OLAP
operations in data mining are resource intensive. With OLAP data can be pre-
calculated and pre-aggregated, making analysis faster.
OLAP databases are divided into one or more cubes. The cubes are designed
in such a way that creating and viewing reports become easy.
OLAP Cube
At the core of the OLAP concept, is an OLAP Cube. The OLAP cube is a data
structure optimized for very quick data analysis.
OLAP Cube

The OLAP Cube consists of numeric facts called measures which are categorized by
dimensions. OLAP Cube is also called the hypercube.
Usually, data operations and analysis are performed using the simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional
data. However, OLAP contains multidimensional data, with data usually obtained from
different and unrelated sources, so using a spreadsheet is not an optimal option. The cube
can store and analyze multidimensional data in a logical and orderly manner.
How does it work?
A Data warehouse would extract information from multiple data sources and formats
like text files, excel sheet, multimedia files, etc.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or
OLAP cube) where information is pre-calculated in advance for further analysis.
Basic analytical operations of OLAP

Four types of analytical OLAP operations are:


• Roll-up
• Drill-down
• Slice and dice
• Pivot (rotate)
Roll-up

Roll-up is also known as “consolidation” or “aggregation.” The Roll-up operation can be


performed in 2 ways
1. Reducing dimensions
2. Climbing up concept hierarchy. Concept hierarchy is a system of grouping things based
on their order or level.
Roll-up operations in OLAP

• In this example, the cities New Jersey and Los Angeles are rolled up into the country USA.
• The sales figures of New Jersey and Los Angeles are 440 and 1560 respectively; they become 2000 after roll-up.
• In this aggregation process, the location hierarchy moves up from city to country.
• In the roll-up process, at least one dimension needs to be removed. In this example, the City dimension is removed.
Drill Down

In drill-down data is fragmented into smaller parts. It is the opposite


of the rollup process. It can be done via
1. Moving down the concept hierarchy
2. Increasing a dimension
Drill-down operations in OLAP

Consider the diagram:


• Quarter Q1 is drilled down to the months January, February, and March. The corresponding sales are also registered.
• In this example, the dimension Month is added.
Slice Operations

Here, one dimension is selected, and a new sub-cube is created.
• The dimension Time is sliced with Q1 as the filter.
• A new (sub-)cube is created altogether.

[Figure: Slice operation in OLAP]
Dice operations
This operation is similar to a slice. The difference is that in dice you select 2 or more dimensions, which results in the creation of a sub-cube.
Pivot operations in OLAP
This term refers to a new view of data available within a Slice of a multidimensional OLAP Cube.
As an example: a financial analyst might want to view or “pivot” data in various ways, such as
displaying all the cities down the page and all the products across a page.

It is also known as rotation operation as it


rotates the current view to get a new
view of the representation. In the sub-
cube obtained after the slice operation,
performing pivot operation gives a new
view of it.
In pivot, analysts rotate the data axes to provide an alternative presentation of the data. In this example, the pivot is based on item types.
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• https://www.guru99.com/online-analytical-processing.html
• https://olap.com/learn-bi-olap/olap-bi-definitions/pivot/
Thanks
Data model for OLTP and OLAP

Dr. Atul Garg


Index
• Data Model for OLTP
ER Data Model
• Data Model for OLAP
Star Model
Snowflake Data Model
Data Model for OLTP

OLTP (Online Transactional Processing) is a type of data processing that


executes transaction-focused tasks. It involves inserting, deleting, or updating
small quantities of database data. It is often used for financial transactions,
order entry, retail sales etc.
OLTP system usually adopts an Entity Relationship (ER) model.
Data Model for OLTP

Three entities are considered in this data model:
a. Employee (A): EmployeeID is the primary key.
b. EmployeeAddress (B): EmployeeID is a foreign key referencing the EmployeeID attribute of the Employee entity (A).
c. EmployeePayHistory (C): EmployeeID is a foreign key referencing the EmployeeID attribute of the Employee entity (A).

Two relationships are considered:
a. There is 1:M cardinality between A and B, i.e. an instance of A can be related to multiple instances of B.
b. There is 1:M cardinality between A and C, i.e. an instance of A can be related to multiple instances of C.

[Figure: Entity Relationship (ER) data model for OLTP]


Multi-Dimensional Data

• Measures - numerical data being tracked


• Dimensions - business parameters that define a transaction
• Example: Analyst may want to view sales data (measure) by geography, by time,
and by product (dimensions)
• Dimensional modeling is a technique for structuring data around the business
concepts
• ER models describe “entities” and “relationships”
• Dimensional models describe “measures” and “dimensions”

Data Model for OLAP

Need to Understand
a. Dimension
b. Facts/measure
Dimension: It is a perspective or entity with respect to which an organization wants to
keep records. For example a store wants to keep records of the store’s sale with respect
to “time”, “product”, “customer”, “employee”.
These dimensions allow the store to keep track of things such as the quarterly
sales of products and the customers to whom the products were sold. Each of these
dimensions may have a table associated with it, called the dimension table.
Facts/measures: these are the numerical measures/quantities by which analysts want to
analyze the relationship between dimensions, e.g.
Total (sales amount in dollars), Quantity (number of units), Discount (amount in dollars)
Star Model for OLAP

Star schema is widely used by all


OLAP systems to design OLAP cubes
efficiently. In fact, major OLAP
systems deliver a ROLAP mode of
operation which can use a star
schema as a source without
designing a cube structure.
In this example, a central fact table is
connected to four dimensions. And,
each dimension is represented by one
table and each table has a set of
attributes.
Star Model for OLAP

Example of Star Model for OLAP


Snowflake Model for OLAP

Snowflake Schema in data


warehouse is a logical arrangement
of tables in a multidimensional
database such that the ER
diagram resembles a snowflake
shape. A Snowflake Schema is an
extension of a Star Schema, and it
adds additional dimensions. The
dimension tables are normalized
which splits data into additional
tables. Snowflake Model for OLAP
Star vs Snowflake Model for OLAP

• Hierarchies: In a star schema, hierarchies for the dimensions are stored in the dimension table. In a snowflake schema, hierarchies are divided into separate tables.
• Structure: A star schema contains a fact table surrounded by dimension tables. A snowflake schema has one fact table surrounded by dimension tables, which are in turn surrounded by further dimension tables.
• Joins: In a star schema, only a single join creates the relationship between the fact table and any dimension table. A snowflake schema requires many joins to fetch the data.
• DB design: Star is a simple DB design; snowflake is a very complex DB design.
• Normalization: Star uses a denormalized data structure, and queries also run faster; snowflake uses a normalized data structure.
• Redundancy: Star has a high level of data redundancy; snowflake has very low-level data redundancy.
• Dimension tables: In star, a single dimension table contains aggregated data; in snowflake, data is split into different dimension tables.
Role of OLAP in BI

OLAP (Online Analytical Processing) is the technology behind many Business


Intelligence (BI) applications. OLAP is a powerful technology for data discovery,
including capabilities for limitless report viewing, complex analytical calculations, and
predictive “what if” scenario (budget, forecast) planning.
Role of OLAP in BI
Role of OLAP in BI

• High Speed of Data Processing: The main advantage of OLAP is the speed of query execution. A correctly designed cube usually processes a typical user query within 5 seconds. The data will always be right at your fingertips to refer to when there is a necessity to rapidly take an important decision. The users don't have to spend much time on calculations and composing complex heavyweight reports. Transactional data from scattered points of sale, data about every particular customer and supplier, data about all the employees in the company and their performance: all that is stored in one place, absolutely ready to be operated with. Even if the warehouse contains tons of information, you can make a complex report using any kind of data from the warehouse in just a few minutes.

• Aggregated and Detailed Data: When working with OLAP, users first see the consolidated data. All the data is stored in tables connected to the star schema in the center. The tables organize a cube with multiple dimensions, which makes it easy and fast to navigate through tons of information. The users can detail the data down to separate facts through the “drill down” function and do the opposite using the “drill up” function.
Role of OLAP in BI

• Multidimensional Data Representation: OLAP data is represented by cubes. Each edge of the cube contains certain attributes of an analyzed business process. Measures and dimensions define the cube axes in a multidimensional coordinate system. Such a data structure allows users to see information from different points of view (slices). A cube slice is, in fact, a two-dimensional table, which is a clear and familiar way of data representation.

• Using Familiar Business Expressions: OLAP dimensions in the cube reflect certain aspects of the company's fiscal and economic activities. Instead of manipulating database table fields, the end user interacts with common business categories such as products, customers, salesmen, employees, territory, date, etc. That is why OLAP-based tools are very simple to use even for non-technical users.
Role of OLAP in BI

• “What-if” Scenarios: If the cubes you use support the write-back function, you can analyze not only actual data but also create different “what-if” scenarios and change the data you work with, while also ensuring the actual cube data is not overridden or lost. This function of OLAP lets users replace values to see what other outcomes may take place if changes are introduced into the business. Through this BI tool it is possible to deeply analyze an ongoing business state, foresee losses, and prevent them.

• Flat Learning Curve: Working with OLAP data is available to users without any technical background. Usually, end users don't need any special training, which means saving money for the company. Furthermore, OLAP vendors commonly provide their customers with extensive documentation, tutorials, and prompt technical support, especially in terms of web-based OLAP clients. The users are always free to turn to a team of tech professionals without having to manage all the issues tied to the software themselves.
Role of OLAP in BI

• High cost: In the majority of cases, it is not cheap to implement such a system. That is why not every organization can afford it. However, for big companies it is a really great investment, as the opportunities offered by an OLAP system can not only pay off but bring much more profit in the future.

• OLAP is relational: The main problem with this kind of system is that the structure must be defined in advance. It means the number of columns in the table and the data types should be determined before table creation. For quick results, this can cause some difficulties.

• Computation capability: Some systems lack computational power, which greatly reduces the flexibility of the OLAP tool. Analysts are limited to a narrow and small area, unable to analyze freely, and even have to resort to a third party to perform this kind of calculation. In such business computing, OLAP is often left in an awkward situation.

• Some potential risk: The above-mentioned problem may cause the next one, a possibility of risk. It may not be possible to provide huge amounts of data, and there is great difficulty in providing valuable links to the decision maker. However, this depends on the system type and vendor; modern OLAP software can be rather powerful in this regard.
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• https://onlinecourses.swayam2.ac.in/cec19_cs01/preview
• https://www.guru99.com/star-snowflake-data-warehousing.html
• https://galaktika-soft.com/blog/advantages-of-using-olap-for-
business-intelligence.html
• https://www.researchpublishers.org/pdf/26/Importance-of-OLAP-in-
Business-Intelligence-by-Debashis-Rout.pdf
Thanks
Data Warehousing

Dr. Atul Garg

Index
• Data Warehouse
• Characteristics of Data Warehouse
• Advantages of Data Warehouse
• Principles of Data warehouse
• Architecture of Data Warehouse
• Meta Data
• Data Warehouse Models



Data Warehousing
• Data warehouse is perfectly named from physical warehouse, it
operates as storage for data that has been extracted from another
source.
• The concept of the data warehouse has existed since the 1980s, when it
was developed to help transition data from merely powering
operations to fueling decision support systems that reveal business
intelligence.
• Many organizations have proprietary data warehouses that store
information on performance metrics, sales quotas, lead generation stats
and a variety of other information.
• A data warehouse is a large collection of business data used to help
an organization make decisions.
Data Warehousing
• The large amount of data in data warehouses comes from different
places such as internal applications such as marketing, sales,
finance, customers and external partner systems, among others.
• Data warehouses can perform some analytics capabilities: using the
extract, transform, load (ETL) process, data warehouses can
perform the complex queries that transactional databases cannot
handle.
• Once data has entered a warehouse, it cannot be altered. Data
warehouses only perform analysis of historical data.



Characteristics of Data Warehouse
• Uses large historical data sets
• Allows both planned and ad hoc queries
• Controls data load
• Make an organization's information easily accessible
• Retrieves large volumes of data
• Manage user schema like tables, indexes, etc.
• Generate reports
• Backs up data
Advantages of Data Warehouse
• Saves time
• Enhances data quality and consistency
• Generates a high Return on Investment (ROI)
• Provides competitive advantage
• Improves the decision-making process
• Enables organizations to forecast with confidence
• Streamlines the flow of information
• Increases data quality
• Increases the likelihood of finding more relevant information



Principles of Data Warehouse
Load Processing
In a data warehouse, several phases and steps must be taken to load new data and process
it including filtering, reformatting, data conversion, indexing and metadata updates.
Load Performance
Data warehouses need to increase the loading of new data from time to time within narrow
time windows.
Load performance should be measured in gigabytes per hour and in hundreds of millions
of rows, and should not artificially constrain the volume of business data.
Data Quality Management
The data warehouse verifies contextual integrity, local stability and global consistency
despite "dirty" sources and large-scale database size.
This is only useful and valuable to the extent if business stakeholders trust the data and its
resources as a fact-based management system requires the highest quality of the data.
Principles of Data Warehouse

Strategic Adaptability
Adaptability is critical to the development of business requirements. The business
intelligence tools available in the market must be taken into account to adapt to
often unexpected changes in business demands.
In data warehouses, adaptability requires a principle and method to use alternative
BI tools in the future such as various back-end or visualization tools.
Query Performance
Massive and complex queries should be completed in seconds, not hours or days.
Terabyte Scalability
Today, the size of the data warehouses is evolving at staggering rates. This ranges
from a few bytes to hundreds of terabytes and gigabytes sized data warehouses.
Architecture of Data Warehouse

Generally speaking, data warehouses have a three-tier architecture, which consists of:
• Bottom tier: The bottom tier consists of a data warehouse server, usually a relational
database system, which collects, cleanses, and transforms data from multiple data sources
through a process known as Extract, Transform, and Load (ETL) or a process known as
Extract, Load, and Transform (ELT) (see the ETL sketch after this list).
• Middle tier: The middle tier consists of an OLAP (i.e. online analytical processing)
server which enables fast query speeds. Three types of OLAP models can be used in this
tier, which are known as ROLAP, MOLAP and HOLAP. The type of OLAP model used
is dependent on the type of database system that exists.
• Top tier: The top tier is represented by some kind of front-end user interface or reporting
tool, which enables end users to conduct ad-hoc data analysis on their business data.



Architecture of Data Warehouse



Architecture of Data Warehouse

Operational System: An operational system is a method used in data


warehousing to refer to a system that is used to process the day-to-day
transactions of an organization.
Flat Files: A Flat file system is a system of files in which transactional
data is stored, and every file in the system must have a different name.



Meta Data
 Meta Data: A set of data that defines and gives information about other data. Meta data is used
in a Data Warehouse for a variety of purposes, including:
• Meta Data summarizes necessary information about data, which can make finding and
working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.
• Metadata is used to direct a query to the most appropriate data source.
Database that describes various aspects of data in the warehouse
 Administrative Metadata: Source database and contents, Transformations required, History
of Migrated data
 End User Metadata:
• Definition of warehouse data
• Descriptions of it
• Consolidation Hierarchy



Meta Data types



Data Warehouse Modeling
Data warehouse modeling is the process of designing the schemas of the
detailed and summarized information of the data warehouse. The goal of data
warehouse modeling is to develop a schema describing the reality, or at least
a part of the fact, which the data warehouse is needed to support.

Main reasons of Data warehouse modelling:


1. Through the schema, data warehouse clients can visualize the
relationships among the warehouse data, to use them with greater ease.
2. A well-designed schema allows an effective data warehouse structure to
emerge, to help decrease the cost of implementing the warehouse and
improve the efficiency of using it.
Types of Data Warehouse Modeling



Enterprise Warehouse
An enterprise warehouse collects all of the information about subjects spanning the entire organization:
 It provides enterprise-wide data integration.
 The data is integrated from operational systems and external information providers.
 It generally contains detailed as well as summarized information and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.

An enterprise data warehouse may be implemented on traditional mainframes, UNIX super
servers, or parallel architecture platforms. It requires extensive business modeling and may
take years to develop and build.
Data Mart
 Data mart contains a subset of organization-wide data.
 This subset of data is valuable to specific groups of an organization.
 Data marts are confined to subjects.
 In other words, we can claim that data marts contain data specific to a particular group. For
example, the marketing data mart may contain data related to items, customers, and sales.
Points to remember about data marts −
• Window-based or Unix/Linux-based servers are used to implement data marts. They are
implemented on low-cost servers.
• The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks
rather than months or years.
• The life cycle of a data mart may be complex in long run, if its planning and design are not
organization-wide.
• Data marts are small in size.
• Data marts are customized by department.
• The source of a data mart is departmentally structured data warehouse.
• Data marts are flexible.
Data Mart

Data Marts is divided into two parts:


Independent Data Mart: An independent data mart is sourced from data captured
from one or more operational systems or external data providers, or from data
generated locally within a particular department or geographic area.
Dependent Data Mart: Dependent data marts are sourced directly from
enterprise data warehouses.



Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It
is easy to build a virtual warehouse. Building a virtual warehouse requires excess
capacity on operational database servers.

• A set of views over operational databases


• Only some of the possible summary views may be materialized



References
• https://datawarehouseinfo.com/data-warehouse/benefits-of-a-data-warehouse/
• https://www.ibm.com/cloud/learn/data-warehouse
• https://www.javatpoint.com/data-warehouse-architecture
• https://www.c-sharpcorner.com/blogs/goals-of-a-data-warehouse1
• https://www.talend.com/resources/what-is-data-warehouse/
• https://www.scientificworldinfo.com/2019/10/data-warehouse-characteristics-and-
principles-of-data-warehousing.html
• https://courses.cs.washington.edu/courses/csep573/01sp/lectures/class1/tsld046.htm
• https://www.tutorialspoint.com/dwh/dwh_architecture.htm



Thanks



Data Integration

Dr. Atul Garg

Index

• Data Integration
• Challenges in Data Integration
• Technologies in Data Integration
• Need for Data Integration
• Advantages of Data Integration
• Common Data Integration Approaches



What Is Data Integration?
Process of coherent merging of data from various data sources and
presenting a cohesive/consolidated view to the user

• Involves combining data residing at different sources and providing users with a unified
view of the data.

• Significant in a variety of situations; both

Commercial (e.g., two similar companies trying to merge their databases)

Scientific (e.g., combining research results from different bioinformatics research


repositories)



Challenges in Data Integration
• Development challenges
Translation of relational databases to object-oriented applications
Consistent and inconsistent metadata
Handling redundant and missing data
Normalization of data from different sources

• Technological challenges
Various formats of data
Structured and unstructured data
Huge volumes of data

• Organizational challenges
Unavailability of data
Risk and failure of manual integration
Main Approaches in Data Integration
Integration is divided into two main approaches:
 Schema integration – reconciles schema elements
Multiple data sources may provide data on the same entity type. The main goal is to
allow applications to transparently view and query this data as one uniform data
source, and this is done using various mapping rules to handle structural differences.



Main Approaches in Data Integration
Instance integration – matches tuples and attribute values
Data integration from multiple heterogeneous data sources has become a high-priority
task in many large enterprises. Hence, to obtain accurate semantic information on the
data content, the information is retrieved directly from the data. Instance integration
identifies and integrates all the instances of the data items that represent the same
real-world entity, distinct from schema integration.
The same record as held in several source tables, and its integrated form:

Table 1: Roll No 123, Name "Ram Kumar", Id 345
Table 2: Roll No 123, Name "R Kumar", Id 345
Table 3: Roll No 123, Name "Kumar Ram", Id 345
Table 4: Roll No 123, Name "RK", Id 345

Integrated Table 1, Integrated Table 2, and so on: Roll No 123, Name "Ram Kumar", Id 345

A minimal sketch of such instance matching is given below.
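By way of illustration only (the matching and survivorship rules here are simplifying assumptions, not a prescribed algorithm), a minimal Python sketch of instance integration:

# Records sharing the common key attributes (RollNo, Id) are treated as
# the same real-world entity; one surviving Name value is then chosen.
table1 = [{"RollNo": 123, "Name": "Ram Kumar", "Id": 345}]
table2 = [{"RollNo": 123, "Name": "R Kumar",   "Id": 345}]
table3 = [{"RollNo": 123, "Name": "Kumar Ram", "Id": 345}]

integrated = {}
for record in table1 + table2 + table3:
    key = (record["RollNo"], record["Id"])       # entity identification
    best = integrated.get(key)
    # Attribute-value conflict resolution: keep the longest name variant.
    if best is None or len(record["Name"]) > len(best["Name"]):
        integrated[key] = record

for row in integrated.values():
    print(row)   # {'RollNo': 123, 'Name': 'Ram Kumar', 'Id': 345}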



Technologies in Data Integration
Entity Identification (EI) and Attribute-Value Conflict Resolution (AVCR) comprise the
instance-integration task. When common key attributes are not available across different data
sources, the rules for EI and the rules for AVCR are expressed as combinations of constraints
on their attribute values.
• Electronic Data Interchange (EDI) :
– It refers to the structured transmission of data between organizations by electronic
means. It is used to transfer electronic documents from one computer system to another,
i.e., from one trading partner to another.
– It is more than mere e-mail; for instance, organizations might replace bills of lading
and even checks with appropriate EDI messages.
• Object Brokering/Object Request Broker (ORB):
– An ORB is a piece of middleware software that allows programmers to make program
calls from one computer to another via a network.
– It handles the transformation of in-process data structures to and from the byte sequence.
Need for Data Integration
What does it mean?

 It is done for providing data in a specific view, as requested by users, applications, etc.
 The bigger the organization gets, the more data there is and the more the data needs integration.
 Increases with the need for data sharing.

[Figure: multiple sources (DB2, SQL, Oracle) feeding a unified view of the data.]


Advantages of Using Data Integration

 It is of benefit to decision-makers, who have access to important information from past studies.
 Reduces cost, overlaps and redundancies; reduces exposure to risks.
 Helps to monitor key variables like trends and consumer behaviour, etc.

[Figure: multiple sources (DB2, SQL, Oracle) feeding a unified view of the data.]


Common Approaches to Data Integration

The most popular ones are:


Federated databases
Memory-mapped data structure
Data warehousing



Data Integration Approaches
• Federated database (virtual database):
 Type of meta-database management system which transparently integrates multiple
autonomous databases into a single federated database
 The constituent databases are interconnected via a computer network,
geographically decentralized.
 The federated database is the fully integrated, logical composite of all constituent
databases in a federated database management system.
• Memory-mapped data structure:
 Useful when you need to do in-memory data manipulation and the data structure is large. It is
mainly used on the .NET platform and is usually programmed with C# or VB.NET.
 It is a much faster way of accessing the data than using a MemoryStream; a small sketch follows.
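As an illustration only (the slide describes the C#/.NET setting, but to stay consistent with the other sketches in this deck, here is the same idea in Python; the file name is hypothetical):

import mmap

# Create a small data file to map (hypothetical file name).
with open("example.dat", "wb") as f:
    f.write(b"point-of-sale records ...")

# Map the file into memory and work on it without copying it through an
# intermediate in-memory stream.
with open("example.dat", "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as mm:
        print(mm[:13])         # slice directly over the mapped bytes
        mm[0:5] = b"POINT"     # in-place update through the mapping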



Data Integration Approaches

Data Warehousing
The various primary concepts used in data warehousing would be:
ETL (Extract Transform Load)
Component-based (Data Mart)
Dimensional Models and Schemas
Metadata driven



References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley
India Publishers.
• https://www.omnisci.com/technical-glossary/data-integration
• https://www.xenonstack.com/blog/data-integration-tools

Data Warehouse Approaches-
Ralph Kimball’s &
Inmon’s Approach

Dr. Atul Garg

Index

• Ralph Kimball’s Approach


• Advantages of Kimball Method
• Dis-advantages of Kimball Method
• Inmon Method
• Advantages of Inmon Method
• Dis-advantages of Inmon Method
• Ralph Kimball’s vs Inmon Method



Ralph Kimball’s Approach

The Kimball data model follows a bottom-up approach to data warehouse (DW)
architecture design in which data marts are first formed based on the business
requirements.
The primary data sources are then evaluated, and an Extract, Transform and Load (ETL) tool
is used to fetch different types of data formats from several sources and load them into a
staging area of the relational database server. Once data is uploaded to the staging area, the
next phase includes loading it into a dimensional data warehouse model that is de-normalized
by nature. This model partitions data into the fact table, which holds numeric transactional
data, and dimension tables, which hold the reference information that supports the facts.
Star schema is the fundamental element of the dimensional data warehouse model.
Kimball dimensional modelling allows users to construct several star schemas to fulfill
various reporting needs.
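As a small illustrative sketch (not from the source; the table and column names are made up), the fact/dimension partition and a star-style query can be seen with pandas:

import pandas as pd

# Dimension table: descriptive reference data.
dim_product = pd.DataFrame({
    "ProductID": [1, 2],
    "Category":  ["Laptop", "Phone"],
})

# Fact table: numeric transactional data keyed by the dimension ID.
fact_sales = pd.DataFrame({
    "ProductID":   [1, 1, 2],
    "SalesAmount": [900.0, 850.0, 400.0],
})

# Join the fact to its dimension, then aggregate by a dimension attribute.
report = (fact_sales.merge(dim_product, on="ProductID")
                    .groupby("Category")["SalesAmount"].sum())
print(report)   # SalesAmount summed per product category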
Ralph Kimball’s Approach

Basic Kimball Data Warehouse (DW) architecture (Source: Zentut)



Advantages of Kimball Method

• Kimball dimensional modeling is fast to construct, as no normalization is involved.


• An advantage of star schema is that most data operators can easily comprehend it because of
its de-normalized structure, which simplifies querying and analysis.
• Data warehouse system footprint is trivial because it focuses on individual business areas and
processes rather than the whole enterprise. So, it takes less space in the database, simplifying
system management.
• It enables fast data retrieval from the data warehouse, as data is segregated into fact tables
and dimensions.
• A smaller team of designers and planners is sufficient for data warehouse management
because data source systems are stable, and the data warehouse is process-oriented.
• Query optimization is straightforward, predictable, and controllable.



Dis-advantages of Kimball Method

• Data isn't entirely integrated before reporting; the idea of a 'single source of truth' is lost.
• Irregularities can occur when data is updated in the Kimball DW architecture. This is because,
with the de-normalization used in this data warehouse, redundant data is added to database tables.
• In the Kimball DW architecture, performance issues may occur due to the addition of
columns in the fact table, as these tables are quite in-depth. The addition of new columns can
expand the fact table dimensions, affecting its performance.
• The dimensional data warehouse model becomes difficult to alter with any change in
business needs.
• As the Kimball model is business process-oriented, instead of focusing on the enterprise as a
whole, it cannot handle all the BI reporting requirements.



Inmon Method

Bill Inmon, the father of data warehousing, came up with the concept of developing a data
warehouse that starts with designing the corporate data model, which identifies the
main subject areas and entities the enterprise works with, such as customers, products, vendors,
and so on.
Bill Inmon’s definition of a data warehouse is that it is a “subject-oriented, nonvolatile,
integrated, time-variant collection of data in support of management’s decisions.”
It is based on Top-down approach.
The model then creates a thorough, logical model for every primary entity. For instance, a
logical model is constructed for products with all the attributes associated with that entity.
This logical model could include ten diverse entities under product, including all the details,
such as business drivers, aspects, relationships, dependencies, and affiliations.



Inmon Method

• The Inmon design approach uses the normalized form for building entity structure,
avoiding data redundancy as much as possible.
• This results in clearly identifying business requirements and preventing any data update
irregularities.
• Moreover, the advantage of this top-down approach in database design is that it is robust to
business changes and contains a dimensional perspective of data across data mart.
• Next, the physical model is constructed, which follows the normalized structure.
• This Inmon model creates a single source of truth for the whole business.
• Data loading becomes less complex due to the normalized structure of the model.
• This arrangement for querying is challenging as it includes numerous tables and links.



Inmon Method

Basic Inmon data warehousing architecture explained



Advantages of Inmon Method

• Data warehouse acts as a unified source of truth for the entire business, where all data is
integrated.
• This approach has very low data redundancy. So, there’s less possibility of data update
irregularities, making the ETL data warehouse process more straightforward and less
susceptible to failure.
• It simplifies business processes, as the logical model represents detailed business objects.
• This approach offers greater flexibility, as it’s easier to update the data warehouse in case
there’s any change in the business requirements or source data.
• It can handle diverse enterprise-wide reporting requirements.



Dis-advantages of Inmon Method

• Complexity increases as multiple tables are added to the data model with time.
• Resources skilled in data warehouse data modeling are required, which can be expensive and
challenging to find.
• The preliminary setup and delivery are time-consuming.
• Additional ETL operation is required since data marts are created after the creation of the
data warehouse.
• This approach requires experts to manage a data warehouse effectively.



Inmon top down vs Kimball’s bottom-up approach

Inmon data warehousing architecture explained Kimball’s data warehousing architecture explained



Inmon top down vs Kimball’s bottom-up approach

Parameters compared: Kimball vs Inmon

• Introduced by. Kimball: introduced by Ralph Kimball. Inmon: introduced by Bill Inmon.
• Approach. Kimball: bottom-up approach for implementation. Inmon: top-down approach for implementation.
• Data Integration. Kimball: focuses on individual business areas. Inmon: focuses on enterprise-wide areas.
• Building Time. Kimball: efficient and takes less time. Inmon: complex and consumes a lot of time.
• Cost. Kimball: has iterative steps and is cost-effective. Inmon: initial cost is huge; development cost is low.
• Skills Required. Kimball: does not need specialized skills; a generic team will do the job. Inmon: needs specialized skills to make it work.
• Maintenance. Kimball: maintenance is difficult. Inmon: maintenance is easy.
• Data Model. Kimball: prefers data in a de-normalized model. Inmon: prefers data in a normalized model.
• Data Store Systems. Kimball: source systems are highly stable. Inmon: source systems have a high rate of change.
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley
India Publishers.
• https://www.astera.com/type/blog/data-warehouse-concepts/
• https://www.computerweekly.com/tip/Inmon-or-Kimball-Which-approach-is-
suitable-for-your-data-warehouse
• https://www.geeksforgeeks.org/difference-between-kimball-and-inmon/

Multidimensional data model

Dr. Atul Garg



Index

• About basics of data modelling


• How to go about designing a data model at the
conceptual and logical levels?
• Pros and Cons of the popular modelling techniques
such as ER modelling and dimensional modelling

Case Study – “TenToTen Retail Stores”
• A new range of cosmetic products has been introduced by a leading brand, which TenToTen wants
to sell through its various outlets.
• In this regard TenToTen wants to study the market and the consumer’s choice of cosmetic
products.
• As a promotional strategy the group also wants to offer attractive introductory offers like
discounts, buy one get one free, etc.
• To have a sound knowledge of the cosmetics market, TenToTen Stores has to carry out a detailed
study of the buying pattern of consumers by geography, the sales of cosmetic products by
preferred brand, etc., and then decide on a strategy to promote the product. To take the right decisions
on various aspects of business expansion, product promotion, preferences, etc., TenToTen Stores
has decided to go in for an intelligent decision support system.
• TenToTen Retail Stores has taken the help of "AllSolutions" (one of the leading consulting firms of the world).
• After studying the requirements of TenToTen Stores, AllSolutions decided to build a data
warehouse application. To construct a data model that would meet the business requirements put
forth by TenToTen Stores, AllSolutions identified the following concerns that need to be
addressed:
What are the entities involved in this business process and how are they related to each other?
What tables associated with those entities must be included in the data warehouse?
What columns have to be included into each table?
What are the primary keys for the tables that have been identified?
What are the relations that the tables have with each other and which is the column on which the relationship has to be
made?
What should be the column definitions for the columns that have been identified?
What are the other constraints to be added into the tables?
Thus, AllSolutions has zeroed in on the requirements of TenToTen Stores. Now the steps for building the
data model can proceed.
Recap of some basics of Data Modelling-
 Entity
 Attribute
 Cardinality of Relationship

Data Model
 A data model is a diagrammatic representation of the data and the relationship
between its different entities. It assists in identifying how the entities are
related through a visual representation of their relationships and thus helps
reduce possible errors in the database design. It helps in building a robust
database/data warehouse.
Types of Data Model
 Conceptual Data Model
 Logical Data Model
 Physical Data Model

Conceptual Data Model

The conceptual data model is designed by identifying the various entities and
the highest- level relationships between them as per the given requirements.
Let us look at some features of a conceptual data model:
• It identifies the most important entities.
• It identifies relationships between different entities.
• It does not support the specification of attributes.
• It does not support the specification of the primary key.
The entities can be identified as
• Category (to store the category details of products).
• SubCategory (to store the details of sub-categories that belong to different categories)
• Product (to store product details).
• ProductOffer (to map the promotion offer to a product).
• Date (to keep track of the sale date and also to analyze sales in different time periods)
• OperatorType (to store the details of types of operator, viz. company-operated or franchise)
• Outlet (to store the details of various stores distributed over various locations).
• Sales (to store all the daily transactions made at various stores)

Logical Data Model

The logical data model is used to describe data in as much detail as possible.
While describing the data, no consideration is given to the physical
implementation aspect.

Let us look at some features of a logical data model:


• It identifies all entities and the relationships among them.
• It identifies all the attributes for each entity.
• It specifies the primary key for each entity.
• It specifies the foreign keys (keys identifying the relationship between different
entities).
• Normalization of entities is performed at this stage.

Normalization:
1NF
2NF
3NF, and so on
Outcome of Logical Data Model

[Figures: entity tables of the logical data model for the TenToTen case study.]
To Conclude about Conceptual Data Model

• We have identified the various entities from the requirements specification.


• We have identified the various attributes for each entity.
• We have also identified the relationship that the entities share with each
other (Primary key-Foreign Key).
Logical vs Conceptual Data Model
• All attributes for each entity are specified in a logical data model, whereas
no attributes are specified in a conceptual data model.
• Primary keys are present in a logical data model, whereas no primary key is
present in a conceptual data model.
• In a logical data model, the relationships between entities are specified
using primary and foreign keys, whereas in a conceptual data model, the
relationships are simply stated without specifying attributes. It means that in a
conceptual data model, we only know that two entities are related; we don't know which
attributes are used for establishing the relationship between these two entities.

Physical Model

• Specification of all tables and columns.


• Foreign keys are used to identify relationships between tables.
• While logical data model is about normalization, physical data model may
support de-normalization based on user requirements.
• Physical considerations (implementation concerns) may cause the physical data
model to be quite different from the logical data model.
• Physical data model will be different for different RDBMS. For example, data
type for a column may be different for MySQL, DB2, Oracle, SQL Server, etc.

The steps for designing a physical data model are as follows (a small sketch follows the list):

• Convert entities into tables/relations.
• Convert relationships into foreign keys.
• Convert attributes into columns/fields.
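Purely as an illustration (the table and column names are assumptions drawn from the case study, not the book's actual design), the three steps can be sketched with Python's sqlite3:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# Entity -> table; attributes -> columns with RDBMS-specific data types.
con.execute("""
    CREATE TABLE Category (
        CategoryID   INTEGER PRIMARY KEY,
        CategoryName TEXT NOT NULL
    )""")

# Relationship -> foreign key: each Product row references its Category.
con.execute("""
    CREATE TABLE Product (
        ProductID   INTEGER PRIMARY KEY,
        ProductName TEXT NOT NULL,
        UnitPrice   REAL,
        CategoryID  INTEGER REFERENCES Category(CategoryID)
    )""")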

Outcome of Physical Data Model

[Figures: table definitions of the physical data model for the TenToTen case study.]
Logical vs Physical Data Model

• The entity names of the logical data model are table names in the physical data
model.
• The attributes of the logical data model are column names in the physical
data model.
• In the physical data model, the data type for each column is specified.
However, data types differ depending on the actual database (MySQL, DB2,
SQL Server 2008, Oracle, etc.) being used. In a logical data model, only the
attributes are identified, without going into the details of the data type
specifications.

Data Modeling Techniques – Normalization
(Entity relationship) Modeling

An industry service provider, “InfoMechanists”, has several


Business Units (BUs) such as
- Financial Services(FS)
- Insurance Services (IS)
- Life Science Services (LSS)
- Communication Services (CS)
- Testing Services (TS) etc.
Each BU has
- a Head as a manager
- Many employees reporting to him.
- Each employee has a current residential address.

Data Modeling Techniques – Normalization
(Entity relationship) Modeling

• There are cases where a couple (both husband and wife) are employed either in the same BU
or a different one.
• In such a case, they (the couple) have same address.
• An employee can be on a project, but at any given point in time, he or she can be working on a
single project only.
• Each project belongs to a client. There could be chances where a client has awarded more than
one project to the company (either to the same BU or different BUs).
• A project can also be split into modules which can be distributed to BUs according to their
field of specifications.
• For example, in an insurance project, the development and maintenance work is with Insurance
Services (IS) and the testing task is with Testing Services (TS). Each BU usually works on
several projects at a time.

Data Modeling Techniques –
Normalization (Entity relationship) Modeling

Given the specifications mentioned in the last slide, let us see


how we will proceed to design an ER model.
• Enumerated below is a list of steps to help you arrive at the ER
diagram:
1. Identify all the entities.
2. Identify the relationships among the entities along with
cardinality and participation type(total/partial participation).
3. Identify the key attribute or attributes.
4. Identify all other relevant attributes.
5. Plot the ER diagram with all attributes including key attribute(s).
6. The ER diagram is then reviewed with the business users.
ER Model for InfoMechanists

Data Modeling Techniques – Normalization
(Entity relationship) Modeling

Pros -
• The ER diagram is easy to understand and is represented in a language that the
business users can understand.
• It can also be easily understood by a non-technical domain expert.
• It is intuitive and helps in the implementation on the chosen database
platform.
• It helps in understanding the system at a higher level.
Cons –
•The physical designs derived using ER model may have some amount of
redundancy.
•There is scope for misinterpretation because of the limited information available
in the diagram.

Data Modeling Techniques –
Dimensional Modeling :

Need of Dimensional Modeling


Picture this situation –
Consider you have just reached the Bangalore International Airport.
• You are an Indian national due to fly to London, Heathrow International Airport. You have
collected your boarding pass.
• You have two bags that you would like checked in. The person at the counter asks for your
boarding pass, weighs the bags, pastes the label with details about your flight number, your name,
your travel date, source airport code, and destination airport code, etc.
• He then pastes a similar label at the back of your boarding pass. This done, you proceed to the
immigration counter, passport and boarding pass in hand. The immigration officer stamps a seal
with the current date on your passport.
• Your next stop is the security counter. The security personnel scrutinize your boarding pass,
passport, etc. And you find yourself in the queue to board the aircraft.
• Again, quick, careful rounds of verification by the aircraft crew before you find yourself ensconced
in your seat.

Data Modeling Techniques –
Dimensional Modeling :
Need of Dimensional Modeling
Picture this situation –
You must be wondering what has all this got to do with multidimensional
modelling.
Well, we are trying to understand multidimensional perspectives of the same data.
The data here is our “boarding pass”.
Your boarding pass is looked at by different personnel for different reasons:
• The person at the check-in counter needs your boarding pass to book your check-in.
• The immigration personnel looked at your boarding pass to ascertain the source and destination
of your itinerary.
• The security personnel scrutinized your boarding pass for security reasons, to verify that you are an
eligible traveller.
• The aircraft crew looked at your boarding pass to onboard you and guide you to your
seat.
This is nothing but multidimensional perspectives of the same data. To put it
simply, "Multiple Perspectives".
To help with this multidimensional view of the data, we rely on dimensional
modeling.
Data Modeling Techniques –
Dimensional Modeling- Definition :
• Dimensional modeling is a logical design technique for structuring data so that it is intuitive to
business users and delivers fast query performance.
• Dimensional modeling is the first step towards building a dimensional database, i.e. a data
warehouse.
• It allows the database to become more understandable and simpler. In fact, the dimensional
database can be viewed as a cube having three or more dimensional/perspectives for analyzing the
given data.
• Dimensional modelling divides the database into two parts:
(a) Measurement/fact (b) Context/dimension
To better understand the fact (measurement)—dimension (context) link, let us take the
example of booking an airlines ticket. In this case, the facts and dimensions are as given
below:
Facts — Number of tickets booked, amount paid, etc.
Dimensions — Customer details, airlines, time of booking, time of travel, origin city, etc.
Benefits of Dimensional Modeling:
1. Comprehensibility:
• Data presented is more subjective as compared to the objective nature of a relational model.
• Data is arranged in coherent categories or dimensions to enable better comprehension.
2. Improved query performance.
3. Trended for data analysis scenarios.
Data Modeling Techniques – Dimensional
Modeling- Fact Table:

Fact Table: A fact table consists of various measurements. It stores the measures of business
processes and points to the lowest detail level of each dimension table. The measures are factual or
quantitative in representation and are generally numeric in nature. They represent the how much or
how many aspects of a question. For example, price, product sales, product inventory, etc.

Types of Fact:
Additive facts: These are the facts that can be summed up/aggregated across all dimensions in a fact
table. For example, discrete numerical measures of activity — quantity sold, dollars sold, etc.
Consider a scenario where a retail store “Northwind Traders” wants to analyze the revenue
generated. The revenue generated can be by the employee who is selling the products; or it can be
in terms of any combination of multiple dimensions. Products, time, region, and employee are the
dimensions in this case.
The revenue, which is a fact, can be aggregated along any of the above dimensions to give the
total revenue along that dimension. Such scenarios, where the fact can be aggregated along all the
dimensions, make the fact a fully additive or just an additive fact. Here revenue is the additive fact.

Fact Types

[Figure: the four types of facts (Additive, Semi-Additive, Non-Additive, Factless).]
Data Modeling Techniques – Dimensional
Modeling- Fact Table:

This figure depicts the "SalesFact" fact table along with its corresponding
dimension tables.
This fact table has one measure, "SalesAmount", and three dimension keys,
"DateID", "ProductID", and "StoreID".
The purpose of the "SalesFact" table is to record the sales amount for each
product in each store on a daily basis. In this table, "SalesAmount" is an additive
fact, because we can sum up this fact along any of the three dimensions present in
the fact table, i.e. "DimDate", "DimStore", and "DimProduct". For example, the
sum of "SalesAmount" for all 7 days in a week represents the total sales amount
for that week; a small sketch of this follows.
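A tiny illustrative sketch (the values are made up) of summing this additive fact along different dimensions in plain Python:

from collections import defaultdict

# Daily fact rows: (DateID, ProductID, StoreID, SalesAmount).
fact_rows = [
    ("2011-05-01", 1, 10, 250.0),
    ("2011-05-02", 1, 10, 300.0),
    ("2011-05-02", 2, 10, 120.0),
]

# Roll the additive fact up along the product dimension ...
by_product = defaultdict(float)
for date_id, product_id, store_id, amount in fact_rows:
    by_product[product_id] += amount

# ... or along the date dimension; any dimension works for an additive fact.
by_date = defaultdict(float)
for date_id, product_id, store_id, amount in fact_rows:
    by_date[date_id] += amount

print(dict(by_product))   # {1: 550.0, 2: 120.0}
print(dict(by_date))      # {'2011-05-01': 250.0, '2011-05-02': 420.0}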
Data Modeling Techniques – Dimensional Modeling- Semi-Additive Facts:
Semi Additive facts: These are the facts that can be summed up for some dimensions in the fact table, but not
all. For example, account balances, inventory level, distinct counts etc.

Consider a scenario where the “Northwind Traders” warehouse manager needs to find the total number of
products in the inventory. One inherent characteristic of any inventory is that there will be incoming products to
the inventory from the manufacturing plants and outgoing products from the inventory to the distribution
centres or retail outlets.
So if the total products in the inventory need to be found out, say, at the end of a month, it cannot be a simple
sum of the products in the inventory of individual days of that month. Actually, it is a combination of addition of
incoming products and subtraction of outgoing ones. This means the inventory level cannot be aggregated
along the “time” dimension.
But if a company has warehouses in multiple regions and would like to find the total products in inventory
across those warehouses, a meaningful number can be arrived at by aggregating inventory levels across those
warehouses. This simply means inventory levels can be aggregated along the “region” dimension. Such
scenarios where a fact can be aggregated along some dimensions but not along all dimensions give rise to
semi-additive facts. In this case, the number of products in inventory or the inventory level is the semi-
additive fact.
Let us discuss another example of semi-additive facts.
The figure depicts the "AccountsFact" fact table along with its
corresponding dimension tables. The "AccountsFact" fact table has
two measures, "CurrentBalance" and "ProfitMargin", and two
dimension keys, "DateID" and "AccountID". "CurrentBalance" is a semi-
additive fact. It makes sense to add up current balances for all
accounts to get the information on "what's the total current balance
for all accounts in the bank?" However, it does not make sense to add
up current balances through time, i.e., to add up the current balance
of a given account for each day of the month. Similarly, "ProfitMargin"
is a non-additive fact, as it does not make sense to add profit margins
at the account level or at the day level.
Data Modeling Techniques – Dimensional Modeling-
Non-Additive Facts:
Non-Additive facts: These are the facts that cannot be summed up for any of the dimensions present in the fact table. For
example, measurement of room temperature, percentages, ratios, factless facts, etc. Non-additive facts cannot be added
meaningfully across any dimension. In other words, non-additive facts are facts where the SUM operator cannot be used to
produce any meaningful result. The following illustration will help you understand why room temperature is a non-additive fact.
Date                Temperature
5th May (7 AM)      27
5th May (12 AM)     33
5th May (5 PM)      10
Sum                 70    (non-meaningful result)
Average             23.3  (meaningful result)
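The same point in two lines of Python (readings as in the table above):

from statistics import mean

readings = [27, 33, 10]            # room temperature through the day
print(sum(readings))               # 70   -> non-meaningful for temperature
print(round(mean(readings), 1))    # 23.3 -> meaningful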

Examples of non-additive facts are:


Textual facts: Adding textual facts does not result in any number. However, counting textual facts may result in a sensible number.
Per-unit prices: Adding unit prices does not produce any meaningful number. For example, the unit sales price or unit cost is strictly non-additive. But these prices can be multiplied with the number of products sold and depicted as total sales amount or total product cost in the fact table.
Percentages and ratios: A ratio, such as gross margin, is non-additive. Non-additive facts are usually the result of ratios or other calculations, such as percentages.
Measures of intensity: Measures of intensity such as the room temperature are non-additive across all dimensions. Summing the room temperature across different times of the day produces a totally non-meaningful number.
Averages: Facts based on averages are non-additive. For example, average sales price is non-additive. Adding all the average unit prices produces a meaningless number.
Factless facts (event-based fact tables): Event fact tables are tables that record events. For example, event fact tables are used to record events such as Web page clicks and employee or student attendance. In an attendance-recording scenario, attendance can be recorded in terms of "yes" or "no", or with pseudo-facts like "1" or "0". In such scenarios, we can count the values, but adding them will give invalid values.
Factless facts are generally used to model many-to-many relationships or to track events that did or did not happen.
Data Modeling Techniques – Dimensional Modeling- Non-Additive Facts -
Example:
The following figure is an example of a "factless fact table", "EventFact". This factless fact
table has four dimension keys: "EventID", "SpeakerID", "ParticipantID", and "DateID". It
does not have any measures or facts. This table can be queried to get details on the events
that are the most popular. It can further be used to track events that did not happen. We can
also use this table to elicit information about events that were the least popular or that were
not attended.

What Are Dimensions/Dimension Tables?

• Dimension tables consist of dimension attributes which describe the dimension


elements to enhance comprehension.
• Dimension attributes (descriptive) are typically static values containing discrete
numbers which behave as text values.
• Main functionalities :
 Query filtering/constraining
 Query result set labeling
• The dimension attributes must be:
 Complete: Dimension attributes must not contain missing values.
 Verbose: Labels must consist of full words.
 Descriptive: The dimension attribute names must be able to convey the purpose of the
dimension element in as few and simple words as possible.
 Discrete values: Dimension attributes must contain only one value per row in the dimension table.
 Quality assured: Dimension attributes must not contain misspelt values or impossible values.

Dimension Hierarchies

• A dimension hierarchy is a cascaded series of many-to-one relationships and consists


of different levels. Each level in a hierarchy corresponds to a dimension attribute.
Hierarchies document the relationship between different levels in a dimension.
• A dimension hierarchy may also be described as a set of parent-child relationships between
attributes present within a dimension. These hierarchy attributes, also known as
levels, roll up from child to parent. For example, Customer totals can roll up to Sub-region
totals, which can further roll up to Region totals. A better example: daily sales could roll up
to weekly sales, which further roll up to monthly, quarterly, and yearly sales. Let us understand
the concept of hierarchy through an example. Here, the Product hierarchy is:

Department → Category → Brand → Product Name

Dimension Hierarchies - Example

Similarly, the Date hierarchy is depicted as

Year → Quarter → Month
Example: 2011 → Q1 → April

For a better idea of dimension hierarchy, let us assume a product store, "ProductsForAll". The store has
several departments such as "Confectionary", "Electronics", "Travel Goods", "Home Appliances", "Dairy
Products", etc. Each department is further divided into categories; for example, "Dairy Products" is further
classified into "Milk", "Butter", "Cottage Cheese", "Yogurt", etc. Each product class offers several brands
such as "Amul", "Nestle", etc. And, finally, each brand has specific product names; for example, "Amul cheese"
has names such as "Amul Slim Cheese", "Amul EasySpread", etc.

Types of Dimensions

[Figure: types of dimensions: Degenerate dimension, Junk (garbage) dimension, Slowly Changing dimension, Rapidly Changing dimension, and Role-playing dimension.]
Dimension Tables – Degenerate Dimension
A degenerate dimension is data that is dimensional in nature but is present in a fact table. It is a dimension without
any attributes. Usually, a degenerate dimension is a transaction-based number. There can be more than one degenerate
dimension in a fact table.
Degenerate dimensions often cause confusion as they don’t feel or look like normal dimensions. They act as dimension
keys in fact tables; however, they are not joined to corresponding dimensions in other dimension tables as all their
attributes are already present in other dimension tables.
Degenerate dimensions can also be called textual facts, but they are not facts as the primary key for the fact table is often
a combination of dimensional foreign keys and degenerate dimensions. As already stated, a fact table can have more than
one degenerate dimension. For example, an insurance claim line fact table typically includes both claim and policy
numbers as degenerate dimensions. A manufacturer can include degenerate dimensions for the quote, order, and bill of
lading numbers in the shipments fact table.

This figure depicts a "PointOfSaleFact" table along with other
dimension tables. The "PointOfSaleFact" has two measures,
AmountTransacted and QuantitySold, and the following
dimension keys: DateKey, which links the "PointOfSaleFact" to
"DimDate"; ProductID, which links the "PointOfSaleFact" to
"DimProduct"; and StoreID, which links the
"PointOfSaleFact" to "DimStore". Here, TransactionNo is a
degenerate dimension, as it is a dimension key without a
corresponding dimension table. All information pertaining to the
transaction is extracted and stored in the "PointOfSaleFact" table
itself; therefore, there is no need for a separate dimension table
to store the attributes of the transaction.
Dimension Tables – Slowly Changing Dimension (SCD)
In a dimension model, dimension attributes are not fixed, as their values can change slowly over a period of
time. Here comes the role of a slowly changing dimension. A slowly changing dimension is a dimension whose
attribute or attributes for a record (row) change slowly over time, rather than on a regular, timely basis.
Let us assume a company sells car-related accessories. The company decides to assign a new sales territory, Los
Angeles, to its sales representative, Bret Watson, who earlier operated from Chicago. How can you record the
change without making it appear that Watson never operated from Chicago?
Let us take a look at the original record of Bret Watson:

SalesRepID   SalesRepName   SalesTerritory
1001         Bret Watson    Chicago

Now the original record has to be changed, as Bret Watson has been assigned "Los Angeles" as his sales territory,
effective May 1, 2011. This would be done through a slowly changing dimension. Given below are the approaches
for handling a slowly changing dimension:
Type-I (Overwriting the History)
In this approach, the existing dimension attribute is overwritten with new data, and hence no history is preserved.
This approach is used when correcting data errors present in a field, such as a word spelled incorrectly.

SalesRepID   SalesRepName   SalesTerritory
1001         Bret Watson    Los Angeles

Type-II (Preserving the History)
A new row is added into the dimension table with a new primary key every time a change occurs to any of the
attributes in the dimension table. Therefore, both the original values as well as the newly updated values are
captured (a small sketch of this approach follows the Type-III example).

SalesRepID   SalesRepName   SalesTerritory
1001         Bret Watson    Chicago
1006         Bret Watson    Los Angeles

Type-III (Preserving One or More Versions of History)
This approach is used when it is compulsory for the data warehouse to track historical changes, and when these
changes will happen only a finite number of times. Type-III SCDs do not increase the size of the table as compared
to Type-II SCDs, since old information is updated by adding new information.

SalesRepID   SalesRepName   OriginalSalesTerritory   CurrentSalesTerritory   EffectiveFrom
1001         Bret Watson    Chicago                  Los Angeles             01-05-2011
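A minimal Python sketch of the Type-II approach, using the Bret Watson example above (the surrogate-key handling and the "Current" flag are simplifying assumptions, not the book's design):

# Type-II SCD: instead of overwriting, expire the current row and add a
# new row with a new surrogate key, so both territories are preserved.
dim_sales_rep = [
    {"SalesRepID": 1001, "SalesRepName": "Bret Watson",
     "SalesTerritory": "Chicago", "Current": True},
]

def scd_type2_update(dim, name, new_territory, new_key):
    for row in dim:
        if row["SalesRepName"] == name and row["Current"]:
            row["Current"] = False                     # expire old version
    dim.append({"SalesRepID": new_key, "SalesRepName": name,
                "SalesTerritory": new_territory, "Current": True})

scd_type2_update(dim_sales_rep, "Bret Watson", "Los Angeles", 1006)
for row in dim_sales_rep:
    print(row)   # history row (Chicago) plus current row (Los Angeles)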

Dimension Tables – Slowly Changing Dimension (SCD)

Type-I (Overwriting the History)
Advantages
• It is the easiest and simplest approach to implement.
• It is very effective in those situations requiring the correction of bad data.
• No change is needed to the structure of the dimension table.
Disadvantages
• All history may be lost in this approach if used inappropriately.
• It is typically not possible to trace history.
• All previously made aggregated tables need to be rebuilt.

Type-II (Preserving the History)
Advantages
• This approach enables us to accurately keep track of all historical information.
Disadvantages
• This approach will cause the size of the table to grow fast.
• Storage and performance can become a serious concern, especially in cases where the number of rows for the table is
very high to start with.
• It complicates the ETL process too.

Type-III (Preserving One or More Versions of History)
Advantages
• Since only old information is updated with new information, this does not increase the size of the table. It allows us
to keep some part of history.
Disadvantages
• Type-III SCDs will not be able to keep all history where an attribute is changed more than once.
For example, if Bret Watson is later assigned "Washington" on December 1, 2012, the Los Angeles information will be lost.

Dimension Tables – Slowly Changing Dimension (SCD)-
Comparison of the three types of handling of SCD

Dimension Tables – Rapidly Changing Dimension (RCD)

We have seen how to handle very slow changes in a dimension, but what would happen if changes occur
more frequently?
A dimension is considered to be a fast changing dimension, also called a rapidly changing dimension, if one or
more of its attributes change frequently and in several rows. For example, consider a customer table
having 1,00,000 rows. Assuming that on an average 10 changes occur in the dimension every year, then
in one year the number of rows will increase to 1,00,000 x 10 = 10,00,000.
To identify a fast changing dimension, look for attributes having continuously variable values. Some of
the fast changing dimension attributes have been identified as:
• Age
• Income
• Test score
• Rating
• Credit history score
• Customer account status
• Weight
One method of handling fast changing dimensions is to break off a fast changing dimension into one or
more separate dimensions known as mini-dimensions. The fact table would then have two separate
foreign keys — one for the primary dimension table and another for the fast changing attribute.

Dimension Tables – Junk Garbage Dimension (JGD)
The garbage dimension is a dimension that contains low-cardinality columns/attributes such as indicators, codes, and status flags. The garbage
dimension is also known as a junk dimension. The attributes in a garbage dimension are not associated with any hierarchy.
We recommend going for a junk/garbage dimension only if the cardinality of each attribute is relatively low, there are only a few attributes, and
the cross-join of the source tables is too big. The option here will be to create a junk dimension based on the actual attribute combinations found
in the source data for the fact table. The resulting junk dimension will include only combinations that actually occur, thereby keeping the size
significantly smaller.
A junk dimension will combine several low cardinality flags and attributes into a single table rather than modeling them as separate dimensions.
This will help reduce the size of the fact table and make dimensional modeling easier to work with.
Let us look at the following example from the healthcare domain. There are two source tables and a fact table:
In our example, each of the source tables
[CaseType (CaseTypeID, CaseTypeDescription) and TreatmentLevel
(TreatmentTypeID, TreatmentTypeDescription)]
has only two attributes each. The cardinality of each attribute is also low.
One way to build the junk dimension will be to perform a cross-join of the
source tables. This will create all possible combinations of attributes, even if
they do not or might never exist in the real world. The other way is to build
the junk dimension based on the actual attribute combinations found in the
source tables for the fact table. This will most definitely keep the junk
dimension table significantly smaller since it will include only those
combinations that actually occur. Based on this explanation, we redesign the
fact table along with the junk dimension table as shown below:

Fact table:
SurrogateKeyID   CountOfPatients
1                2
2                3
3                5

Junk dimension (only the combinations that actually occur; a small sketch of building it follows):
SurrogateKeyID   CaseTypeID   CaseTypeDescription            TreatmentTypeID   TreatmentTypeDescription
1                4            Transferred by a branch        1                 ICU
2                1            Referred by another hospital   3                 Orthopaedic
3                3            Consultation                   4                 Ophthalmology
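An illustrative Python sketch (the codes and descriptions follow the healthcare example above; the fact rows are assumptions) of building the junk dimension from only the combinations that actually occur:

from itertools import product

case_types = {4: "Transferred by a branch", 1: "Referred by another hospital"}
treatments = {1: "ICU", 3: "Orthopaedic"}

# Source fact rows carrying the two low-cardinality attributes directly.
fact_source = [(4, 1), (1, 3), (4, 1)]

# Option 1: cross-join every possible combination (can be wastefully big).
all_combos = list(product(case_types, treatments))

# Option 2: keep only the combinations observed in the source data.
actual_combos = sorted(set(fact_source))

# Assign surrogate keys to the observed combinations only.
junk_dim = {combo: key for key, combo in enumerate(actual_combos, start=1)}

print(len(all_combos), len(actual_combos))   # 4 possible vs 2 actual rows
print(junk_dim)                              # {(1, 3): 1, (4, 1): 2}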
Dimension Tables – Role Playing Dimension (RPD)
A single dimension that is expressed differently in a fact table with the usage of views is called a role-playing
dimension.
Consider an on-line transaction involving the purchase of a laptop. The moment an order is placed, an order date and
a delivery date are generated. It should be observed that both dates are attributes of the same time
dimension. Whenever two separate analyses of the sales performance are required, one in terms of the order date
and the other in terms of the delivery date, two views of the same time dimension are created to perform the
analyses. In this scenario, the time dimension is called a role-playing dimension, as it plays the role of both
the order and delivery dates.
Another example of the role-playing dimension is the broker dimension. The broker can play the role of both sell
broker and buy broker in a share-trading scenario. The figure below will help you gain a better understanding of the
role-playing dimension.

"Shipping" is a fact table with three measures: "Total",
"Quantity", and "Discount".
It has five dimension keys:
"ProductID", which links the fact table "Shipping" with the
"DimProduct" dimension table;
"DateID", which links "Shipping" with the "DimTime"
dimension table;
"ShipperID", which links "Shipping" with the "DimShipper"
dimension table;
and the remaining two keys, "ToCityID" and
"FromCityID", which link the "Shipping" fact table with the same
dimension table, i.e. "DimCity".
The two cities, as identified by the respective CityIDs,
would have the same dimension table (DimCity) but would mean two
completely different cities when used to signify FromCity
and ToCity. This is a case of a role-playing dimension.

References

R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley India
Publishers.
http://www.punjabiuniversity.ac.in/Pages/Images/elearn/MultidimensionalDataModelin
g.pdf

Schemas for Multidimensional
Database

Dr. Atul Garg



Typical Dimension Models

As discussed earlier, the Entity Relationship (ER) data
model is a commonly used data model for relational databases. Here, the
database schema is represented by a set of entities and the relationships
between them. It is an ideal data model for On-Line Transaction Processing
(OLTP).
Let us now look at a data model that is considered apt for on-line data
analysis. Multidimensional data modeling is the most popular data model
when it comes to designing a data warehouse.

Dimensional modeling is generally represented by one of the following schemas:
1. Star Schema
2. Snowflake Schema
3. Fact Constellation Schema

Typical Dimension Models – Star Schema

• It is the simplest of data warehousing schema.


• It consists of a large central table (called the fact table) with no redundancy.
• The central table is referred to by a number of dimension tables. The schema graph
looks like a starburst (see figure below).
• The dimension tables form a radial pattern around the large central fact table.
• The star schema is always very effective for handling queries.
• In the star schema, the fact table is usually in 3NF or higher form of normalization.
• All the dimension tables are usually in a denormalized manner, and the highest form of
normalization they are usually present in is 2NF.
• The dimension tables are also known as look up or reference tables.

Example - Star Schema for sales of “ElectronicsForAll”
The basic star schema contains four components.
These are:

Fact table, Dimension tables, Attributes and Dimension hierarchies
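A sketch of how such a star schema might be declared and queried in SQL (via Python's sqlite3); the table and column names are illustrative assumptions, not the book's exact design:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE DimDate    (DateID INTEGER PRIMARY KEY, Month TEXT, Year INTEGER);
    CREATE TABLE DimProduct (ProductID INTEGER PRIMARY KEY, ProductName TEXT, Brand TEXT);
    CREATE TABLE DimStore   (StoreID INTEGER PRIMARY KEY, City TEXT);
    -- Central fact table referencing every dimension: the 'star'.
    CREATE TABLE SalesFact (
        DateID      INTEGER REFERENCES DimDate(DateID),
        ProductID   INTEGER REFERENCES DimProduct(ProductID),
        StoreID     INTEGER REFERENCES DimStore(StoreID),
        SalesAmount REAL
    );
""")

# A typical star-join query: label and group by dimension attributes,
# aggregate the fact.
query = """
    SELECT d.Month, p.Brand, SUM(f.SalesAmount)
    FROM SalesFact f
    JOIN DimDate d    ON f.DateID = d.DateID
    JOIN DimProduct p ON f.ProductID = p.ProductID
    GROUP BY d.Month, p.Brand
"""
print(con.execute(query).fetchall())   # empty until fact rows are loaded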

Snow Flake Schema

• The Snowflake schema is a variant of the Star schema.


• Here, the centralized fact table is connected to multiple dimensions.
• In the Snowflake schema, dimensions are present in a normalized form in multiple related
tables (Figure below).
• A snowflake structure materializes when the dimensions of a star schema are detailed and
highly structured, having several levels of relationship, and the child tables have multiple
parent tables.
• This “snowflaking” effect affects only the dimension tables and does not affect the fact table.

Snow Flake Schema

• Normalization and expansion of the dimension tables in a star schema result in the
implementation of a snowflake design.
• A dimension table is said to be snow flaked when the low-cardinality attributes in the
dimension have been removed to separate normalized tables and these normalized
tables are then joined back into the original dimension table.

Snow Flake Schema

As seen in the example of "ElectronicsForAll", the main difference between the Star and
Snowflake schemas is that the dimension tables of the Snowflake schema are maintained in
normalized form to reduce redundancy. The advantage is that such normalized tables are
easy to maintain and they save storage space. However, it also means that more joins will be
needed to execute a query. This will adversely impact system performance.
Identifying Dimensions to be Snowflaked
In this section, we will observe the practical implementation of the dimensional design.
What is snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension table.
In other words, a dimension table is said to be snowflaked if the low-cardinality attributes of the
dimensions have been divided into separate normalized tables. These tables are then joined to
dimension table with referential constraints (foreign key constraints).
Generally, snowflaking is not recommended in the dimension table, as it hampers the
understandability and performance of the dimensional model as more tables would be required to
satisfy the queries.
When do we snowflake?
The dimensional model is snowflaked under the following two conditions:
• The dimension table consists of two or more sets of attributes which define information at
different grains.
• The sets of attributes of the same dimension table are being populated by different source systems.
Snow Flake Schema
For understanding why and when we snowflake, consider the "Product" dimension table shown in the figure.
Conversion to Snowflaked Schema

Snow Flaking Example

• Consider the normalized form of the Region dimension:

Region (RegionID, Country Code, State Code, City Code)
Country (Country Code, Country Name)
State (State Code, State Name)
City (City Code, City Name, ZIP)

• Snowflaking like this decreases performance, because more tables will need to be joined to satisfy queries.

Why not to Snowflake?

Normally, you should avoid snowflaking or normalization of a
dimension table, unless required and appropriate. Snowflaking
reduces the space consumed by dimension tables, but compared
with the entire data warehouse the saving is usually insignificant.
Do not snowflake hierarchies of one dimension table into
separate tables. Hierarchies should belong to the dimension
table only and should never be snowflaked. Multiple
hierarchies can belong to the same dimension if the dimension
has been designed at the lowest possible level of detail.

Data Model for Fact Constellation Schema

The constellation schema is shaped like a constellation of stars (i.e., Star schemas). It is more complex than the Star or Snowflake schema variations, as it contains multiple fact tables. This allows the dimension tables to be shared among the various fact tables. It is also called a “Galaxy schema”. The main disadvantage of the fact constellation is its more complicated design, because multiple aggregations must be taken into consideration (Figure below).

Dimensional Modeling Life Cycle
Phases of Dimensional Modeling Life Cycle:
1. Requirements gathering
2. Identifying the grain
3. Identifying the dimensions
4. Identifying the facts
5. Designing the dimensional model

Understanding Dimension – Cube

• A cube is an extension of the two-dimensional table.
• For example, in the previous scenario, the CEO wants a report on the revenue generated by different services across regions during each quarter.

Understanding Dimension – Cube (contd.)

[Cube diagram: the three dimensions are Service (Testing, Consulting, Production Support), Region (N. America, Europe, Asia Pacific), and Quarter (Q1–Q4); each cell holds the fact (revenue). The diagram also labels the dimension hierarchy and the grain.]
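A minimal Python/pandas sketch of that report (the revenue figures are invented for illustration): slicing the cube on the Quarter dimension and pivoting the other two dimensions yields one face of the cube.

import pandas as pd

# Fact rows at the grain Service x Region x Quarter (sample figures).
rows = [("Testing", "N. America", "Q1", 120),
        ("Consulting", "Europe", "Q1", 90),
        ("Production Support", "Asia Pacific", "Q1", 150),
        ("Testing", "Europe", "Q2", 60)]
fact = pd.DataFrame(rows, columns=["Service", "Region", "Quarter", "Revenue"])

# Slice on Quarter = Q1, then pivot Service x Region: one face of the cube.
q1 = fact[fact["Quarter"] == "Q1"]
print(q1.pivot_table(index="Service", columns="Region",
                     values="Revenue", aggfunc="sum", fill_value=0))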

References

R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley India Publishers.
http://www.punjabiuniversity.ac.in/Pages/Images/elearn/MultidimensionalDataModeling.pdf

Clustering Techniques: K-Means

Dr. Atul Garg



Clustering

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
• Partition the data set into clusters based on similarity, and store only a cluster representation (e.g., centroid and diameter)
• Can be very effective if the data is clustered, but not if the data is “smeared”
• Clustering can be hierarchical, and clusters can be stored in multi-dimensional index tree structures

Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
Clustering Methods

Clustering methods can be classified into the following categories −


• Partitioning Method
• Hierarchical Method
• Density-based Method
• Grid-Based Method
• Model-Based Method
• Constraint-based Method

Clustering Methods

Partitioning Method: Suppose we are given a database of ‘n’ objects, and the partitioning method constructs ‘k’ partitions of the data. Each partition represents a cluster, and k ≤ n. It means that the data is classified into k groups, which satisfy the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Hierarchical Methods: This method creates a hierarchical decomposition of the
given set of data objects. We can classify hierarchical methods on the basis of how
the hierarchical decomposition is formed.
Density-based Method: This method is based on the notion of density. The basic
idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within a given
cluster, the radius of a given cluster has to contain at least a minimum number of
points.

Clustering Methods

Grid-based Method: In this, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension in the quantized
space.
Model-based methods: In this method, a model is hypothesized for each cluster to
find the best fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data points.
Constraint-based Method: In this method, the clustering is performed by the
incorporation of user or application-oriented constraints. A constraint refers to the
user expectation or the properties of desired clustering results.

Applications of Cluster Analysis
Cluster analysis is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
• Clustering can also help marketers discover distinct groups in their customer
base.
• In the field of biology, it can be used to derive plant and animal taxonomies,
categorize genes with similar functionalities.
• Clustering also helps in identification of areas of similar land use in an earth
observation database.
• It also helps in the identification of groups of houses in a city according to house
type, value, and geographic location.
• Clustering also helps in classifying documents on the web for information
discovery.
• Clustering is also used in outlier detection applications such as detection of credit
card fraud.
• As a data mining function, cluster analysis serves as a tool to gain insight into the
distribution of data to observe characteristics of each cluster.
Requirements of clustering in data mining

The following are the key requirements of clustering in data mining.
Scalability: Clustering algorithms should be highly scalable in order to work with large databases.
Ability to deal with different kinds of attributes: Algorithms should be able to work with different types of data, such as categorical, numerical, and binary data.
Discovery of clusters with arbitrary shape: The algorithm should be able to detect clusters of arbitrary shape and should not be bounded to distance measures.
Interpretability: The results should be comprehensive, usable, and interpretable.
High dimensionality: The algorithm should be able to handle high dimensional
space instead of only handling low dimensional data.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used
to solve the clustering problems in machine learning or data science.

The algorithm takes the unlabelled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k must be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an
iterative process.
2. Assigns each data point to its closest k-center. Those data points
which are near to the particular k-center, create a cluster.
Hence each cluster has data points with some commonalities, and each cluster is away from the other clusters.
K-Means Clustering Algorithm
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select random K points or centroids. (It can be other from the
input dataset).
Step-3: Assign each data point to their closest centroid, which will
form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to
FINISH.
Step-7: The model is ready.
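The steps above translate almost line for line into code. Below is a minimal NumPy sketch (an illustrative toy implementation, not production code: it assumes Euclidean distance and does not handle the corner case of a cluster becoming empty).

import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Toy K-Means following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and pick K random points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: place the new centroid of each cluster at its mean
        # (the "center of gravity" of the cluster's points).
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        # Steps 5-6: if no centroid moved, no reassignment occurred -- finish.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example with two variables M1 and M2, as in the scatter-plot walkthrough.
data = np.array([[1.0, 1.2], [1.1, 0.9], [5.0, 5.1], [4.9, 5.3]])
labels, centroids = k_means(data, k=2)
print(labels, centroids)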

K-Means Clustering Algorithm
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:

Let's take the number of clusters k = 2, to identify the dataset and to put the points into different clusters. It means here we will try to group this dataset into two different clusters.

K-Means Clustering Algorithm
We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting two points as k points which are not part of our dataset (image 2).
Now we will assign each data point of the scatter plot to its closest k-point or centroid. We will compute this by calculating the distance between two points. So, we will draw a median between both the centroids (image 3).

K-Means Clustering Algorithm

From the above image, it is clear that the points on the left side of the line are nearer to the K1 (blue) centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.

K-Means Clustering Algorithm

As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity (mean) of the points in each cluster, and will find the new centroids as below:

K-Means Clustering Algorithm
Next, we will reassign each
datapoint to the new centroid.
From the previous image, we can see, one
For this, we will repeat the
yellow point is on the left side of the line,
same process of finding a
and two blue points are right to the line.
median line. The median will
So, these three points will be assigned to
be like below image:
new centroids.

K-Means Clustering Algorithm

We will repeat the process by finding the center of gravity of each cluster, so the new centroids will be as shown in the below image.
As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

K-Means Clustering Algorithm

As we got the new centroids, we will again draw the median line and reassign the data points; the image will be as shown. We can see in the image that there are no dissimilar data points on either side of the line, which means our model is formed.

K-Means Clustering Algorithm

As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image.

Text Mining

Text mining
Application of data mining to non-structured or less structured text files. It
entails the generation of meaningful numerical indices from the
unstructured text and then processing these indices using various data
mining algorithms
Text mining helps organizations:
Find the “hidden” content of documents, including additional useful relationships
Relate documents across previously unnoticed divisions
Group documents by common themes
Text Mining

Applications of text mining


Automatic detection of e-mail spam or phishing through analysis of the document content
Automatic processing of messages or e-mails to route a message to the most appropriate party to
process that message
Analysis of warranty claims, help desk calls/reports, and so on to identify the most common problems
and relevant responses
Analysis of related scientific publications in journals to create an automated summary view of a
particular discipline
Creation of a “relationship view” of a document collection
Qualitative analysis of documents to detect deception
Text Mining

How to mine text


1. Eliminate commonly used words (stop-words)
2. Replace words with their stems or roots (stemming algorithms)
3. Consider synonyms and phrases
4. Calculate the weights of the remaining terms
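A minimal Python sketch of steps 1, 2, and 4 above. The tiny stop-word list and the naive suffix-stripping “stemmer” are deliberate stand-ins for real resources (a full stop-word list, a Porter-style stemmer); step 3 (synonyms and phrases) is omitted, and the weights here are raw term frequencies, where a real system would typically use TF-IDF over a document collection.

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def crude_stem(word):
    # Naive suffix stripping; a real system would use a stemmer such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def term_weights(document):
    # Step 1: tokenize and eliminate commonly used words (stop-words).
    tokens = [t for t in re.findall(r"[a-z]+", document.lower())
              if t not in STOP_WORDS]
    # Step 2: replace words with their stems or roots.
    stems = [crude_stem(t) for t in tokens]
    # Step 4: weight the remaining terms (raw frequency here).
    return Counter(stems)

print(term_weights("Mining texts reveals the hidden content of documents"))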
Web Mining

Web mining
The discovery and analysis of interesting and useful information from the
Web, about the Web, and usually through Web-based tools
Types of Web Mining
Web Mining

Web content mining


The extraction of useful information from Web pages
Web structure mining
The development of useful information from the links included in the Web documents
Web usage mining
The extraction of useful information from the data generated through webpage visits, transactions, etc.
Web Mining

Uses for Web mining:


Determine the lifetime value of clients
Design cross-marketing strategies across products
Evaluate promotional campaigns
Target electronic ads and coupons at user groups
Predict user behavior
Present dynamic information to users
Example of Customization Using Web Usage Mining
References

• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Third edition, Morgan Kaufmann Publishers.
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley India Publishers.
• https://www.tutorialspoint.com/data_mining/dm_cluster_analysis.htm
• https://www.geeksforgeeks.org/clustering-in-data-mining/
• https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
• https://www.edureka.co/blog/k-means-clustering/
• https://www.lifewire.com/k-means-clustering-1019648
• https://www.geeksforgeeks.org/k-means-clustering-introduction/?ref=lbp

Measures, Metrics, KPIs and
Performance Management

Dr. Atul Garg

Index

Key Performance Indicator

References

R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley India
Publishers.
Knowledge Management:
Introduction, purpose and
strategies

Dr. Atul Garg



Knowledge Management

• Knowledge management is the systematic management of an


organization's knowledge assets for creating value and meeting
tactical & strategic requirements. It consists of the initiatives,
processes, strategies, and systems that sustain and enhance the
storage, assessment, sharing, refinement, and creation of knowledge.
• Each enterprise should define knowledge management in terms of its own business objectives. Knowledge management is all about applying knowledge in new or novel situations.



Need of Knowledge Management
Applications of Knowledge Management (KM) lie in the below four key areas:
• Globalization of Business − Organizations today are more universal, i.e., they operate in multiple sites and are multilingual and multicultural in nature.
• Leaner Organizations − Organizations are adopting a lean strategy where they understand customer value and focus on key processes to continuously increase it. The ultimate goal is to provide perfect value to the customer through a perfect value creation process that has zero waste.
• Corporate Amnesia − We are freer as a workforce, which creates issues regarding knowledge continuity for the organization and places continuous learning demands on knowledge workers. We no longer expect to spend our entire work life with the same organization.
• Technological Advances − The world is more connected with the advent of websites, smartphones and other latest gadgets. Advancements in technology have not only helped in better connectivity but also changed expectations. Companies are expected to have an online presence round the clock, providing required information as per customer needs.



Knowledge Management Factors

Successful knowledge management initiatives depend on a few key factors. Ensure you take
the following elements into account when designing your knowledge management strategy:
• People and Culture — Knowledge management is not a stand-alone function in an
organization. It should include a cross-functional, culture-driven approach to how the
organization operates and needs to be made a top priority to be successful.
• Process — Your company needs to develop a plan for how your knowledge management
becomes part of your everyday business operations.
• Technology — Successful knowledge management initiatives depend heavily on
technology. You will need a technology infrastructure that supports your knowledge
management plan.
• Strategy — Your knowledge management process should focus on identifying and
eliminating knowledge and process gaps.



Strategic Framework
• A strategic framework for knowledge management should be relevant for an organization aiming to sustain competitive advantage. The framework should help an organization create a clear association between its competitive circumstances and its knowledge management strategy. Competitive knowledge can be categorized, by degree of innovation relative to the rest of the particular industry, into three groups as below:
• Core knowledge- It is the fundamental degree of knowledge needed by all members of a specific industry. It does not stand for a competitive advantage; rather, it is merely the knowledge required to be able to operate in that domain of work.
• Advanced knowledge- It provides an organization a competitive upper hand. It is specialized knowledge which distinguishes an organization from its competitors. It can be achieved by being more knowledgeable than a competitor or by harnessing knowledge in diverse ways.
• Innovative knowledge- It is the knowledge that can enable an organization to be a market leader. It permits an organization to modify the working approach of an industry. It stands for a noteworthy differentiating feature from other organizations.
How to Build Knowledge Management Strategy

A knowledge management strategy is a written plan of action that outlines your company’s steps to
implement a knowledge management strategy and system. A strategy will help you identify what
knowledge you need to manage and keep your project on track.
1. Build Your Knowledge Management Team: To build a comprehensive strategy, gather team
members who understand the value of managing your company’s knowledge. Members of your
knowledge management team should become role models and influencers when it comes time for
employees to use your system.
2. Identify Your Goals: Identify your company’s business goals and create goals for your knowledge
management system that align. Next, figure out how your knowledge management system will benefit
employees, customers, and your organization as a whole. This will help you get buy-in from
leadership as you move through the strategy and implementation process and provide a solid road map
you can refer back to at any time.
3. Perform a Knowledge Audit: A knowledge audit takes a look at your company’s information to
understand how you are currently managing that information. Unlike a content audit, a knowledge
audit takes a step back to look at the overall amount of content you are storing.
How to Build Knowledge Management Strategy

4. Choose Your Technology: Choose the primary tool you’ll use to build your knowledge management system.
Knowledge management tools provide a central location for all of your knowledge, making it easy to store and
retrieve your information. Examples of knowledge management tools include customer relationship
systems, knowledge base software, internal wikis, and learning management systems.
5. Create a Communication Plan: Create a plan for sharing your new knowledge management system with
your employees to make sure they know and understand how it works. This plan should include the messaging
you’ll use and the channels you will use to distribute communications.
6. Establish Milestones: Create specific milestones to keep your project on track. Be specific when designing
your milestones so they can be easily measured and managed. For example, a proper milestone will include
specific dates so you can set delivery expectations. A milestone looks like “select a knowledge base by April
27th” instead of “find a knowledge base to use.”
7. Build a Roadmap: As soon as you have put all the pieces in place, you can begin constructing your
implementation roadmap. The roadmap should describe the complete picture of your implementation, broken up
into stages, and include your objectives, milestones, and timelines. Describe each step clearly so stakeholders
can easily understand it.



Knowledge Management - Strategy

A good Knowledge Management strategy possesses the following components −


• A Stated Business Strategy and Objectives − It should cover products or services, target customers, preferred distribution or delivery channels, characterization of the regulatory environment, and the mission or vision statement.
• A Description of Knowledge-Based Business Issues − Need for collaboration, need to
level performance variance, need for innovation, and need to address information overload.
• An Inventory of Available Knowledge Resources − Knowledge capital, social capital,
infrastructure capital.
• An Analysis of Recommended Knowledge Leverage − Points that briefs what can be
done with the above-identified knowledge and knowledge artifacts and lists Knowledge
management projects that can be undertaken with the intent to maximize ROI and business
value.



References

• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Third edition, Morgan Kaufmann Publishers.
• https://whatfix.com/blog/knowledge-management/
• https://www.managementstudyguide.com/knowledge-management-strategy.htm
• https://www.kmworld.com/About/What_is_Knowledge_Management
• https://www.tutorialspoint.com/knowledge_management/knowledge_management_introduction.htm



Thanks

