Business Intelligence and Data Warehousing-Merged
Data Warehousing
Reference: https://www.guru99.com/business-intelligence-definition-example.html
Other Examples
• Example 2:
• A hotel owner uses BI analytical applications to gather statistical information
regarding average occupancy and room rate. It helps to find aggregate revenue
generated per room.
• It also collects statistics on market share and data from customer surveys from each hotel to decide its competitive position in various markets.
• Analyzing these trends year by year, month by month and day by day helps management to offer discounts on room rentals.
• Example 3:
• A bank gives branch managers access to BI applications. This helps each branch manager determine who the most profitable customers are and which customers they should work on.
• The use of BI tools frees information technology staff from the task of generating
analytical reports for the departments. It also gives department personnel access to
a richer data source.
• E-commerce: AMAZON, Flipkart etc.
Pros and Cons of BI
Digital data
Types of Digital data
Digital data can be classified into three forms:
• Unstructured data: This is data which does not conform to a data model or is not in a form that can be used easily by a computer program. About 80–90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, the body of an email, etc.
• Semi-structured data: This is data which does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program; for example, XML and mark-up languages like HTML. Metadata for this data is available but is not sufficient.
• Structured data: This is the data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist between
entities of data, such as classes and their objects. Data stored in databases is an
example of structured data.
Formats of Digital Data
Sources of Un-structured Data
Broadly speaking, anything in a non-database form is unstructured data.
Managing Un-structured Data
A few generic tasks must be performed to enable storage and search of unstructured data:
Indexing: Recall how a Relational Database Management System (RDBMS) works. In an RDBMS, data is indexed to enable faster search and retrieval. An index is defined on the basis of some value in the data; it is simply an identifier that stands for a larger record in the data set. In the absence of an index, the whole data set/document has to be scanned to retrieve the desired information. In the case of unstructured data too, indexing helps in searching and retrieval. Unstructured data is indexed based on text or on some other attribute, e.g. the file name. Indexing unstructured data is difficult because the data neither has predefined attributes nor follows any pattern or naming convention. Text can be indexed on a text string, but for non-text files, e.g. audio/video, indexing depends on file names.
Tags/Metadata: Using metadata, the data in a document can be tagged, which enables search and retrieval. For unstructured data this is difficult, as little or no metadata is available. The structure of the data has to be determined, which is hard because the data has no particular format and comes from more than one source.
Classification/Taxonomy: Taxonomy is classifying data on the basis of the relationships that exist between data. Data can
be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. However, classifying
unstructured data is difficult as identifying relationships between data is not an easy task. In the absence of any structure or
metadata or schema, identifying accurate relationships and classifying is not easy. Since the data is unstructured, naming
conventions or standards are not consistent across an organization, thus making it difficult to classify data.
CAS (Content Addressable Storage): CAS stores data based on its metadata. It assigns a unique name to every object stored in it, and the object is retrieved based on its content rather than its location. It is used extensively to store emails, etc. A minimal illustration follows.
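To make the idea concrete, here is a minimal Python sketch of content-addressable storage: objects are named and retrieved by a digest of their content. The class and field names are illustrative, not part of any particular CAS product.

```python
# Minimal sketch of content-addressable storage (CAS): objects are stored and
# retrieved by a digest of their content rather than by a file path.
# The in-memory dict stands in for a real object store.
import hashlib

class ContentStore:
    def __init__(self):
        self._objects = {}      # digest -> raw bytes
        self._metadata = {}     # digest -> free-form tags

    def put(self, data: bytes, **tags) -> str:
        digest = hashlib.sha256(data).hexdigest()   # unique, content-derived name
        self._objects[digest] = data
        self._metadata[digest] = tags
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]                # retrieval by content address

store = ContentStore()
addr = store.put(b"Quarterly report body ...", source="email", sender="abc@example.com")
print(addr, store.get(addr)[:20])
```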
Challenges to store Un-structured Data
Possible Solutions to Store Un-structured Data
Challenges to extract Information
Solutions to extract Information
Semi-structured Data
• Semi-structured data does not conform to any data model, i.e. it is difficult to determine the meaning of the data, nor can the data be stored in rows and columns as in a database. However, semi-structured data has tags and markers which help to group data and describe how it is stored. These give some metadata, but it is not sufficient for management and automation of the data.
• Similar entities in the data are grouped and organized in a hierarchy. The attributes or the properties within a
group may or may not be the same. For example two addresses may or may not contain the same number of
properties as in
Address 1
<house number><street name><area name><city> Address 2
<house number><street name><city>
• For example an e-mail follows a standard format
To: <Name> From: <Name> Subject: <Text> CC:
<Name>
Body: <Text, Graphics, Images etc. >
• The tags give us some metadata, but the body of the e-mail has no format, nor anything that conveys the meaning of the data it contains.
• There is a very fine line between unstructured and semi-structured data.
Sources of Semi-structured Data
Managing Semi-structured Data
Some ways in which semi-structured data is managed and stored
Challenges to store Semi-structured Data
Possible Solutions to Store Semi-structured Data
Challenges to extract Semi-structured Data
Solutions to extract Semi-structured Data
XML: to manage Semi-structured Data
XML has no predefined tags; the tags below are user-defined:
<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>
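As a small illustration, the message above can be parsed with Python's standard xml.etree.ElementTree module; the tag names are the user-defined ones from the example.

```python
# Parsing the <message> document with Python's standard library.
# Nothing about these tags is predefined by XML itself.
import xml.etree.ElementTree as ET

xml_text = """
<message>
  <to>XYZ</to>
  <from>ABC</from>
  <subject>Greetings</subject>
  <body>Hello! How are you?</body>
</message>
"""

root = ET.fromstring(xml_text)
print(root.tag)                        # message
for child in root:
    print(child.tag, "=", child.text)  # tags give structure; the body stays free-form
```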
Structured Data
• Conforms to a data model
• Data is stored in the form of rows and columns (e.g., a relational database)
• Similar entities are grouped
• Definition, format and meaning of data are explicitly known
Sources of Structured Data
Examples include spreadsheets and SQL (relational) databases.
Managing Structured Data
Storing Structured Data
Retrieving Structured Data
Difference b/w types of Data
Sr. No. Key Structured Data Semi Structured Data Unstructured Data
Level of Structured Data as name On other hand in case of Semi Structured In last the data is fully non
organizing suggest this type of data Data the data is organized up to some organized in case of
is well organized and extent only and rest is non organized Unstructured Data and
1
hence level of organizing hence the level of organizing is less than hence level of organizing is
is highest in this type of that of Structured Data and higher than lowest in case of
data. that of Unstructured Data. Unstructured Data.
Means of Data Structured Data is get While in case of Semi Structured Data is On other hand in case of
Organization organized by the means of partially organized by the means of Unstructured Data is based
2
Relational Database. XML/RDF. on simple character and
binary data.
Transaction In Structured Data In Semi Structured Data transaction is not While in Unstructured Data
Management management and by default but is get adapted from DBMS no transaction management
concurrency of data is but data concurrency is not present. and no concurrency are
3
present and hence mostly present.
preferred in multitasking 29
process.
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”,
Wiley India Publishers.
• http://www.punjabiuniversity.ac.in/Pages/Images/elearn/DigitalData.pdf
• https://www.tutorialspoint.com/difference-between-structured-semi-structured-and-unstructured-data
• https://www.michael-gramlich.com/what-is-structured-semi-structured-and-unstructured-data/
• https://www.datamation.com/big-data/structured-vs-unstructured-data/
Data Mining
(Figure: application areas of data mining, e.g., Customer Relationship Management.)
Data Mining and Knowledge Discovery
Data mining and knowledge discovery in databases (KDD) are frequently treated as synonyms, data
mining is actually part of the knowledge discovery process.
Query Tools vs. Data Mining
• Query tools can be used to easily build and input queries to databases. Data mining is a technique or concept in computer science.
• Query tools make it very easy to build queries without even having to learn a database-specific query language. Data mining deals with extracting useful and previously unknown information from raw data.
• With query tools, users need to know exactly what they are looking for, while data mining is used mostly when the user has only a vague idea of what they are looking for.
• Data miners can use the existing functionality of query tools to pre-process raw data before the data mining process.
Kinds of data that can be mined
1. Flat Files
2. Relational Databases
3. Data Warehouse
4. Transactional Databases
5. Multimedia Databases
6. Spatial Databases
7. Time Series Databases
8. World Wide Web(WWW)
9. Medical and personal data
10. Satellite sensing
11. Games
12. Text reports / Memos / Email-messages / chats
etc
Kinds of data that can be mined
1. Flat Files
• Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
• Data stored in flat files have no relationships or paths among themselves.
• Flat files are described by a data dictionary, e.g. a CSV file.
• Application: Used in Data Warehousing to store data, Used in carrying data to and from server,
etc.
2. Relational Databases
• A Relational database is defined as the collection of data organized in tables with rows and
columns.
• Physical schema in Relational databases is a schema which defines the structure of tables.
• Logical schema in Relational databases is a schema which defines the relationship among tables.
• Standard API of relational database is SQL.
• Application: Data Mining, Relational Online Analytical Processing (ROLAP) model, etc.
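A minimal sketch of structured, relational data accessed through SQL, using Python's built-in sqlite3 module; the table, columns, and values are invented for illustration.

```python
# Relational data: rows and columns with an explicit schema, queried via SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("pen", 10, 50.0), ("notebook", 3, 90.0), ("pen", 5, 25.0)])

# The schema makes aggregation straightforward for later mining steps.
for row in conn.execute("SELECT product, SUM(amount) FROM sales GROUP BY product"):
    print(row)
conn.close()
```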
Kinds of data that can be mined
3. Data Warehouse
• A data warehouse is a collection of data integrated from multiple sources that supports queries and decision making.
• Two approaches can be used to update data in Data Warehouse: Query-driven Approach
and Update-driven Approach.
• Application: Business decision making etc.
4. Transactional Databases
• A transactional database is a collection of data organized by time stamps, dates, etc. to represent transactions in databases.
• This type of database has the capability to roll back or undo its operation when a transaction is
not completed or committed.
• Highly flexible system where users can modify information without changing any sensitive
information.
• Application: Banking, Distributed systems, Object databases, etc.
Kinds of data that can be mined
5. Multimedia Databases
• Multimedia databases consist of audio, video, image and text media.
• They can be stored on object-oriented databases.
• They are used to store complex information in pre-specified formats.
• Application: Digital libraries, video-on demand, news-on demand, musical database, etc.
6. Spatial Database
• Store geographical information.
• Stores data in the form of coordinates, topology, lines, polygons, etc.
• Application: Maps, Global positioning, etc.
7. Time-series Databases
• Time-series databases contain stock exchange data and user-logged activities.
• They handle arrays of numbers indexed by time, date, etc.
• They require real-time analysis.
• Application: eXtremeDB, InfluxDB, etc.
Kinds of data that can be mined
8. World Wide Web (WWW): The WWW is a collection of documents and resources such as audio, video and text, which are identified by Uniform Resource Locators (URLs), linked by HTML pages, accessed through web browsers, and available via the Internet.
•It is the most heterogeneous repository as it collects data from multiple resources.
•It is dynamic in nature as Volume of data is continuously increasing and changing.
Application: Online shopping, Job search, Research, studying, etc
9. Medical and personal data: From government census to personnel and customer files, very large collections
of information are continuously gathered about individuals and groups. Governments, companies and
organizations such as hospitals, are stockpiling very important quantities of personal data to help them manage
human resources, better understand a market, or simply assist clientele.
Applications: Hospitals, Social media etc
10. Satellite sensing: There is a countless number of satellites around the globe: some are geo-stationary above a
region, and some are orbiting around the Earth, but all are sending a non-stop stream of data to the surface.
NASA, which controls a large number of satellites, receives more data every second than what all NASA
researchers and engineers can cope with.
Applications: Space institutions etc
Kinds of data that can be mined
11. Games: Our society is collecting a tremendous amount of data and statistics about games, players and athletes. From hockey scores, basketball passes and car-racing laps to swimming times, boxers' punches and chess positions, all the data are stored. Commentators and journalists use this information for reporting, but trainers and athletes want to exploit this data to improve performance and better understand opponents.
Applications: BCCI, etc
12. Text reports and memos (e-mail messages): Most of the communications within and between companies
or research organizations or even private people, are based on reports and memos in textual forms often
exchanged by e-mail. These messages are regularly stored in digital form for future use and reference
creating formidable digital libraries.
Applications: e-commerce, hospitals, library etc
References
• https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/notes/Chapter1/index.html
• https://www.javatpoint.com/data-mining
• https://www.differencebetween.com/difference-between-data-mining-and-vs-query-tools/
• https://vspages.com/data-mining-vs-query-tools-1897/
• https://www.geeksforgeeks.org/types-of-sources-of-data-in-data-mining/
Data Pre-processing
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Reduction
• Dimensionality reduction: remove unimportant attributes or compress the data into a reduced representation
• Numerosity reduction: replace the data by alternative, smaller representations
• Data compression
• Data Transformation and Data Discretization
• Normalization
• Concept hierarchy generation
Data Cleaning/Cleansing
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g., instrument faulty,
human or computer error, transmission error
• incomplete: lacking attribute values, lacking certain attributes of interest, or
containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”, Birthday=“03-07-2010”,
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
• E.g., many tuples have no recorded value for several attributes, such as customer income
in sales data
• Missing data may be due to
• equipment malfunction
• inconsistent with other recorded data and thus deleted
• data not entered due to misunderstanding
• certain data may not be considered important at the time of entry
• not register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
• a global constant : e.g., “unknown”, a new class?!
• the attribute mean for all samples belonging to the same class: smarter
• the most probable value: inference-based such as Bayesian formula or
decision tree
*Bayes’ theorem describes the probability of occurrence of an event related to any
condition.
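A small sketch of two of the options above (a global constant and the class-wise attribute mean), assuming pandas is available; the column names and values are made up.

```python
# Filling missing values: global constant vs. class-wise attribute mean.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [30_000, None, 52_000, None, 48_000],
})

# Global constant: mark missing income with a placeholder value.
filled_const = df["income"].fillna(-1)

# Class-wise mean: fill each missing value with the mean of its own class.
filled_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(filled_mean.tolist())
```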
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data
How to Handle Noisy Data?
• Binning: smooths a sorted data value by consulting its neighborhood (see the sketch after this list)
• sort the data and partition it into (equal-frequency) buckets/bins
• smooth by bin means: each value is replaced with the mean of its bin
• smooth by bin medians: each value is replaced with the median of its bin
• smooth by bin boundaries: each value is replaced with the nearest bin boundary (the min or max value of the bin)
• Regression
• smooth by fitting the data to regression functions: a technique that conforms data values to a function. Linear regression involves finding the best line to fit two attributes
• Clustering/Outlier Analysis
• detect and remove outliers: Where similar values are organized in groups/clusters
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with possible outliers)
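A minimal sketch of smoothing by bin means with equal-frequency bins, as referenced in the binning item above; the sample values are illustrative.

```python
# Equal-frequency binning followed by smoothing with the bin mean.
import statistics

def smooth_by_bin_means(values, n_bins):
    data = sorted(values)
    size = len(data) // n_bins                      # equal-frequency (equal-depth) bins
    smoothed = []
    for i in range(n_bins):
        bin_vals = data[i * size:(i + 1) * size] if i < n_bins - 1 else data[i * size:]
        mean = statistics.mean(bin_vals)
        smoothed.extend([round(mean, 2)] * len(bin_vals))   # replace each value by its bin mean
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]         # sample values, 3 bins of 3
print(smooth_by_bin_means(prices, 3))               # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```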
Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter's Wheel): integrates discrepancy detection and transformation
References
• https://www.slideshare.net/ankurh/data-preprocessing-31623210
• https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/
• https://www.tableau.com/learn/articles/what-is-data-cleaning
Data Preprocessing: Data Integration, Data Reduction, Clustering, Data Discretization
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Can reduce and avoid redundancies and inconsistencies in resulting data set
• Can improve accuracy and speed
Data Preprocessing: Data Reduction
Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much
smaller in volume but yet produces the same (or almost the same) analytical
results
• Why data reduction? — A database/data warehouse may store terabytes of
data. Complex data analysis may take a very long time to run on the
complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Numerosity reduction (some simply call it: Data Reduction)
• Data compression
Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms: transform a data vector into a numerically different vector of wavelet coefficients, which can be truncated to a compressed approximation
• Principal Component Analysis: often used to reduce the dimensionality of large data sets, by transforming
a large set of variables into a smaller one
• Supervised and nonlinear techniques (e.g., feature selection)
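A brief sketch of dimensionality reduction with Principal Component Analysis (listed above), assuming NumPy and scikit-learn are available; the data is synthetic.

```python
# PCA: project 10 attributes onto the 5 strongest principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                            # 100 samples, 10 attributes
X[:, 5:] = X[:, :5] + 0.01 * rng.normal(size=(100, 5))    # make half the columns redundant

pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                                    # (100, 5)
print(pca.explained_variance_ratio_.sum())                # most variance kept with fewer columns
```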
Data Reduction 2: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods
• Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
• Ex.: Log-linear models
• Non-parametric methods
• Do not assume models
• Major families: histograms, clustering, sampling, …
Parametric Data Reduction: Regression and Log-Linear Models
• Linear regression
• Data modeled to fit a straight line
• Often uses the least-square method to fit the line
• Multiple regression
• Allows a response variable Y to be modeled as a linear function of multidimensional
feature vector
• Log-linear model
• Approximates discrete multidimensional probability distributions
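A tiny illustration of parametric numerosity reduction: fit a least-squares line and keep only its two parameters instead of all the points (synthetic data, NumPy assumed).

```python
# Store the model parameters (slope, intercept) rather than every data point.
import numpy as np

x = np.arange(50, dtype=float)
y = 3.0 * x + 7.0 + np.random.default_rng(1).normal(scale=2.0, size=50)

w, b = np.polyfit(x, y, deg=1)        # least-squares line fit
print(round(w, 2), round(b, 2))       # close to the true slope 3 and intercept 7
```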
Histogram Analysis
• Divide data into buckets and store average (sum) for each bucket
• Partitioning rules:
• Equal-width: equal bucket range
• Equal-frequency (or equal-depth)
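A short sketch contrasting equal-width and equal-frequency bucket boundaries with NumPy; the values are illustrative.

```python
# Equal-width vs. equal-frequency (equal-depth) bucketing of one attribute.
import numpy as np

values = np.array([5, 7, 8, 8, 9, 11, 13, 20, 21, 25, 40, 41])

counts, edges = np.histogram(values, bins=4)        # equal-width buckets
print(edges)                                        # boundaries with equal range each
print(counts)                                       # how many values fall in each bucket

quartile_edges = np.quantile(values, [0, 0.25, 0.5, 0.75, 1.0])
print(quartile_edges)                               # equal-frequency boundaries
```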
Clustering
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering and be stored in multi-dimensional index tree
structures
Sampling
Sampling: With or without Replacement
Sampling: Cluster or Stratified Sampling
Stratified random sampling is a sampling method that involves taking samples of a population
subdivided into smaller groups called strata.
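A minimal sketch of sampling with and without replacement, plus a tiny stratified draw, using Python's random module; the population and strata are invented.

```python
# Simple random sampling and a small stratified sample.
import random

random.seed(42)
population = list(range(1, 101))

without_repl = random.sample(population, k=10)    # each element chosen at most once
with_repl = random.choices(population, k=10)      # elements may repeat

strata = {"north": list(range(1, 41)), "south": list(range(41, 101))}
stratified = {name: random.sample(group, k=3) for name, group in strata.items()}
print(without_repl, with_repl, stratified, sep="\n")
```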
Data Cube Aggregation
Data Compression
(Figure: original data and its approximated, compressed representation.)
Data Preprocessing: Data Transformation and Data Discretization
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of
replacement values. E.g. each old value can be identified with one of the new
values
• Methods
• Smoothing: Remove noise from data
• Attribute/feature construction
• New attributes constructed from the given ones
• Aggregation: Summarization, data cube construction
• Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
• Discretization: concept hierarchy climbing
Normalization
Normalization is used to scale the data of an attribute so that it falls in a smaller range. Normalization is generally required when we are dealing with attributes on different scales; otherwise an equally important attribute (on a lower scale) may be diluted in effectiveness because another attribute has values on a larger scale. In simple words, when multiple attributes have values on different scales, this may lead to poor data models while performing data mining operations.
• Min-Max Normalization – In this technique of data normalization, linear transformation
is performed on the original data.
• Z-score normalization – In this technique, values are normalized based on mean and
standard deviation of the data A.
• Decimal Scaling Method For Normalization – It normalizes by moving the decimal
point of values of the data. To normalize the data by this technique, we divide each value
of the data by the maximum absolute value of data.
Min-Max Normalization
Suppose that the minimum and maximum values for the attribute income
are $12,000 and $98,000, respectively. We would like to map income to the
range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is
transformed to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 = 0.716
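The same computation expressed as a small Python function, using the bounds from the example.

```python
# Min-max normalization: rescale a value from [old_min, old_max] to [new_min, new_max].
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
```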
Z-score Normalization
v’, v is the new and old of each
entry in data respectively.
σA, A is the standard deviation
and mean of A respectively.
Suppose the mean of the dataset is 21.2 and the standard deviation is 29.8.
To perform a z-score normalization on the first value in the dataset, we can use the
following formula:
•New value = (x – μ) / σ
•New value = (3 – 21.2) / 29.8
•New value = -0.61
Z-score Normalization
The mean of the normalized values is 0 and the standard deviation of the normalized
values is 1.
The normalized values represent the number of standard deviations that the original
value is from the mean.
For example:
•The first value in the dataset is 0.61 standard deviations below the mean.
•The second value in the dataset is 0.54 standard deviations below the mean.
•…
•The last value in the dataset is 3.79 standard deviations above the mean.
The benefit of performing this type of normalization is that the clear outlier in
the dataset (134) has been transformed in such a way that it’s no longer a
massive outlier.
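A small Python sketch of the same calculation, reusing the mean (21.2) and standard deviation (29.8) quoted above for the first and last values mentioned in the example (3 and 134).

```python
# Z-score normalization: express a value as standard deviations from the mean.
def z_score(x, mean, std):
    return (x - mean) / std

mean, std = 21.2, 29.8
for x in (3, 134):
    print(x, "->", round(z_score(x, mean, std), 2))   # 3 -> -0.61, 134 -> 3.79
```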
Decimal Scaling Method for Normalization
It normalizes by moving the decimal point of the values of the data. A data value v_i is normalized to v_i' = v_i / 10^j, where j is the smallest integer such that max(|v_i'|) < 1, i.e., we divide by the power of 10 just above the maximum absolute value in the data.
Let the input data be: -10, 201, 301, -401, 501, 601, 701
To normalize the above data,
Step 1: Maximum absolute value in given data(m): 701
Step 2: Divide the given data by 1000 (i.e j=3)
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401,
0.501, 0.601, 0.701
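A short Python sketch of decimal scaling on the values listed above; it derives j = 3 from the maximum absolute value 701.

```python
# Decimal scaling: divide every value by 10^j so that all |values| fall below 1.
import math

data = [-10, 201, 301, -401, 501, 601, 701]

j = math.ceil(math.log10(max(abs(v) for v in data)))   # smallest power of 10 above 701 -> 3
normalized = [v / 10 ** j for v in data]
print(j, normalized)    # 3, [-0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701]
```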
Data Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
• 1 Binning: Binning is a top-down splitting technique based on a specified number
of bins. Binning is an unsupervised discretization technique.
• 2 Histogram Analysis: Because histogram analysis does not use class
information so it is an unsupervised discretization technique. Histograms partition
the values for an attribute into disjoint ranges called buckets.
• 3 Cluster Analysis: Cluster analysis is a popular data discretization method. A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups.
• Each initial cluster or partition may be further decomposed into several subclusters, forming a lower level of the hierarchy
• Decision-tree analysis (supervised, top-down split)
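A brief sketch of unsupervised discretization, replacing numeric values with interval labels via equal-width and equal-frequency binning; pandas is assumed and the ages are invented.

```python
# Replace continuous ages with interval labels (unsupervised discretization).
import pandas as pd

ages = pd.Series([3, 11, 18, 24, 33, 41, 57, 62, 70])

equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])   # equal-width bins
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])             # equal-frequency bins
print(pd.DataFrame({"age": ages, "width_bin": equal_width, "freq_bin": equal_freq}))
```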
Concept Hierarchy Generation
• Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
• Concept hierarchies can be used to reduce the data by collecting and
replacing low-level concepts with higher-level concepts.
• In the multidimensional model, data are organized into multiple
dimensions, and each dimension contains multiple levels of abstraction
defined by concept hierarchies.
• This organization provides users with the flexibility to view data from
different perspectives.
• Examples include – geographic location, – job category and item type, etc
Data Preprocessing : Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
• Entity identification problem
• Remove redundancies
• Detect inconsistencies
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley
India Publishers.
• Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber,
Third edition, Morgan Kaufman Publishers.
• https://www.geeksforgeeks.org/data-normalization-in-data-mining/
• http://www.lastnightstudy.com/Show?id=45/Data-Discretization-and-Concept-Hierarchy-Generation
• http://dataminingzone.weebly.com/uploads/6/5/9/4/6594749/ch_7discretization_and_concept_hierarchy_generation.pdf
• http://webpages.iust.ac.ir/yaghini/Courses/Application_IT_Fall2008/DM_02_07_Data%20Discretization%20and%20Concept%20Hierarchy%20Generation.pdf
Business Analytics
• Data analytics is a broad umbrella term that refers to the science of analyzing
raw data in order to transform that data into useful information from which
trends and metrics can be revealed.
• While both business analytics and data analytics aim to improve operational
efficiency, business analytics is specifically oriented to business uses and data
analytics has a broader focus.
• Both business intelligence and reporting fall under the data analytics umbrella.
Business Analytics vs Data Analytics
• Data scientists, data analysts, and data engineers work together in the data analytics process to collect, integrate, and prepare data for the development, testing, and revision of analytical models, ensuring accurate results.
• Data analytics for business purposes is characterized by its focus on specific business operations questions.
Business Intelligence vs Business Analytics
• While business intelligence and business analytics serve similar purposes, and the
terms may be used interchangeably, these practices differ in their fundamental focus.
• Business intelligence focuses on descriptive analytics, combining data gathering, data storage, and knowledge management with data analysis to evaluate past data and provide new perspectives on currently known information.
• Business analytics focuses on prescriptive analytics, using data mining, modeling, and
machine learning to determine the likelihood of future outcomes.
• Essentially, business intelligence answers the questions, “What happened?” and “What
needs to change?” and
• Business analytics answers the questions, “Why is this happening?”, “What if this trend
continues?”, “What will happen next?”, and “What will happen if we change
something?”
• Business analytics and business intelligence solutions tend to overlap in structure and
purpose.
References
• https://www.omnisci.com/technical-glossary/business-analytics
• https://ptgmedia.pearsoncmg.com/images/9780133552188/samplepage
s/0133552187.pdf
Business Intelligence – Business Layer
• Business Goals
• Business Strategies
• Program Management
• Development
Business Layer – Business Requirements
When a strategy is implemented against certain business goals, then certain costs (monetary, time, effort, information
produced by data integration and analysis, application of knowledge from past experience, etc.) are involved.
The business value can be measured in the terms of ROI (Return on Investment), ROA (Return on Assets), TCO
(Total Cost of Ownership), TVO(Total Value of Ownership), etc. Let us understand these terms with the help of a
few examples –
Return on Investment (ROI): We take the example of “AMAZON”, an e-commerce which has been using
social media (mainly Twitter and Facebook) to help get new clients and to increase the number of prospects/leads.
They attribute 10% of their daily revenue to social media. Now, that is an ROI from social media!
Return on Asset (ROA): Suppose a company, "Electronics Today", has a net income of $1 million and has total assets of $5 million. Then its ROA is 20%. So, ROA is the earning from invested capital (assets).
Total Cost of Ownership (TCO): Let us understand TCO in the context of a vehicle. TCO defines the cost of
owning a vehicle from the time of purchase by the owner, through its operation and maintenance to the time it
leaves the possession of the owner.
Total Value of Ownership (TVO): TVO has replaced the simple concept of Owner's Equity in some companies. It
could include a variety of subcategories such as stock, undistributed dividends, retained earnings or profit, or excess
capital contributed.
Business Layer- Program Management
Online Transaction Processing: Online transaction processing, commonly known as OLTP, supports transaction-oriented applications in a 3-tier architecture. OLTP administers the day-to-day transactions of an organization.
Consider a point-of-sale (POS) system in a supermarket store. You have picked a bar of chocolate
and await your chance in the queue for getting it billed. The cashier scans the chocolate bar's bar code.
Consequent to the scanning of the bar code, some activities take place in the background —
the database is accessed;
the price and product information is retrieved and displayed on the computer screen;
the cashier feeds in the quantity purchased;
the application then computes the total, generates the bill, and prints it. You pay the cash and
leave.
The application has just added a record of your purchase in its database. This was an On-Line
Transaction Processing (OLTP) system designed to support on-line transactions and query processing.
In other words, the POS of the supermarket store was an OLTP system.
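A toy version of this point-of-sale flow, with sqlite3 standing in for the store's operational database; the schema, barcode, and prices are illustrative.

```python
# OLTP in miniature: look up a product, compute the bill, and record the sale
# as a single committed transaction.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (barcode TEXT PRIMARY KEY, name TEXT, price REAL)")
conn.execute("CREATE TABLE sale (barcode TEXT, quantity INTEGER, total REAL)")
conn.execute("INSERT INTO product VALUES ('8901234', 'Chocolate bar', 2.50)")

def bill_item(barcode: str, quantity: int) -> float:
    price, = conn.execute("SELECT price FROM product WHERE barcode = ?", (barcode,)).fetchone()
    total = price * quantity
    with conn:                                   # commit (or roll back) as one transaction
        conn.execute("INSERT INTO sale VALUES (?, ?, ?)", (barcode, quantity, total))
    return total

print(bill_item("8901234", 2))   # 5.0
```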
OLTP Understanding
(Figure: multiple users concurrently issuing INSERT, UPDATE and RETRIEVE operations against the operational database.)
OLTP Segmentation
(Figure: daily transactions for Day 1, Day 2, Day 3 ... Day 30 rolling up into the monthly purchases of a retail store.)
Characteristics of OLTP Model
• Online connectivity
• LAN,WAN
• Availability
– Available 24 hours a day
• Response rate
– Rapid response rate
– Load balancing by prioritizing the transactions
• Cost
– Cost of transactions is less
• Update facility
– Less lock periods
– Instant updates
– Use the full potential of hardware and software
Limitations of Relational Models
• Create and maintain large number of tables for the voluminous data
• For new functionalities, new tables are added
• Unstructured data cannot be stored in relational databases
• Very difficult to manage the data with common denominator (keys)
Queries that an OLTP System Cannot Easily Answer
• The super market store is deciding on introducing a new product. The key questions they
are debating are: “Which product should they introduce?” and “Should it be specific to a
few customer segments?”
• The super market store is looking at offering some discount on their year- end sale. The
questions here are: “How much discount should they offer?” and “Should it be different
discounts for different customer segments?”
• The supermarket is looking at rewarding its most consistent salesperson. The question
here is:“How to zero in on its most consistent salesperson (consistent on several
parameters)?"
• All the queries stated above have more to do with analysis than simple reporting
• OLAP differs from traditional databases in the way data is conceptualized and stored.
• In OLAP data is held in the dimensional form rather than the relational form.
• OLAP’s life blood is multi-dimensional data.
• OLAP tools are based on the multi-dimensional data model. The multi-dimensional data
model views data in the form of a data cube.
• Online Analytical Processing (OLAP) is a technology that is used to organize large business
databases and support business intelligence.
• OLAP databases are divided into one or more cubes. The cubes are designed in such a way that
creating and viewing reports become easy.
• OLAP databases are divided into one or more cubes, and each cube is organized and
designed by a cube administrator to fit the way that you retrieve and analyze data so that it is
easier to create and use the PivotTable reports and PivotChart reports that you need.
OLAP (Online Analytical Processing)
Let us consider the data of a supermarket store, “AllGoods” store, for the year “2020”.
This data, as captured by the OLTP system, is under the following column headings: Section, ProductCategoryName, YearQuarter, and SalesAmount. In this example we have a total of 32 records/rows.
The Section column can have one value from amongst "Men", "Women", "Kid", and "Infant".
The ProductCategoryName column can have either the value "Accessories" or the value "Clothing".
The YearQuarter column can have one value from amongst "Q1", "Q2", "Q3", and "Q4".
The SalesAmount column records the sales figures for each Section, ProductCategoryName, and YearQuarter.
OLAP - Example
Characteristics of OLAP
• Multidimensional analysis
• Support for complex queries
• Advanced database support
– Support large databases
– Access different data sources
– Access aggregated data and detailed data
• Easy-to-use End-user interface
– Easy to use graphical interfaces
– Familiar interfaces with previous data analysis tools
• Client-Server Architecture
– Provides flexibility
– Can be used on different computers
– More machines can be added
One Dimensional
Consider the table shown in the earlier slide - It displays “AllGoods” store’s sales
data by Section, which is one-dimensional .
Figure 3.4 shows data in two dimensions (horizontal and vertical), in OLAP
it is considered to be one dimension as we are looking at the SalesAmount
from one particular perspective, i.e. by Section.
Two Dimensional
In Table 3.7, data has been plotted along two dimensions, as we can now look at the SalesAmount from two perspectives, i.e. by YearQuarter and ProductCategoryName. The calendar quarters have been listed along the vertical axis and the product categories across the horizontal axis. Each unique pair of values of these two dimensions corresponds to a single point of SalesAmount data. For example, the Accessories sales for Q2 add up to $9680.00, whereas the Clothing sales for the same quarter total $12366.00. Together these correspond to a single point of SalesAmount data, i.e. $22046.
Three Dimensional
What if the company’s analyst wishes to view the data — all of it — along all the three dimensions (Year-Quarter,
ProductCategoryName, and Section) and all on the same table at the same time? For this the analyst needs a
three-dimensional view of data as arranged in Table 3.8. In this table, one can now look at the data by all the three
dimensions/ perspectives, i.e. Section, ProductCategoryName, YearQuarter. If the analyst wants to look for the
section which recorded maximum Accessories sales in Q2, then by giving a quick glance to Table 3.8, he can
conclude that it is the Kid section.
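A rough pandas sketch of the one-, two-, and three-dimensional views just described; the rows below are invented sample records, not the 32 actual rows of the example.

```python
# Slicing a small sales table by its dimensions with pivot tables.
import pandas as pd

sales = pd.DataFrame({
    "Section":             ["Men", "Men", "Women", "Kid", "Kid", "Infant"],
    "ProductCategoryName": ["Clothing", "Accessories", "Clothing", "Accessories", "Clothing", "Clothing"],
    "YearQuarter":         ["Q1", "Q2", "Q2", "Q2", "Q3", "Q4"],
    "SalesAmount":         [1200.0, 450.0, 2100.0, 760.0, 980.0, 640.0],
})

# Two-dimensional view: SalesAmount by YearQuarter and ProductCategoryName.
two_d = sales.pivot_table(index="YearQuarter", columns="ProductCategoryName",
                          values="SalesAmount", aggfunc="sum", fill_value=0)
print(two_d)

# Adding Section as a third dimension gives the cube-style view discussed above.
three_d = sales.pivot_table(index=["Section", "YearQuarter"], columns="ProductCategoryName",
                            values="SalesAmount", aggfunc="sum", fill_value=0)
print(three_d)
```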
Can we go beyond Three Dimensional
Well, if the question is “Can you go beyond the third dimension?” the answer is YES!
If at all there is any constraint, it is because of the limits of your software. But if the question is “Should you
go beyond the third dimension?” we will say it is entirely on what data has been captured by your operational
transactional systems and what kind of queries you wish your OLAP system to respond to.
Now that we understand multi-dimensional data, it is time to look at the functionalities and characteristics
of an OLAP system. OLAP systems are characterized by a low volume of transactions that involve very complex queries. Some typical applications of OLAP are budgeting, sales forecasting, sales reporting, and business process management.
Example: Assume a financial analyst reports that the sales by the company have gone up. The next question is
“Which Section is most responsible for this increase?” The answer to this question is usually followed by a
barrage of questions such as “Which store in this Section is most responsible for the increase?” or “Which
particular product category or categories registered the maximum increase?" The answers to these are provided by multidimensional analysis, or OLAP.
Difference between OLTP and OLAP
• Focus: OLTP — insert, update, and delete information in the database. OLAP — extract data for analysis that helps in decision making.
• Data: OLTP — OLTP and its transactions are the original source of data. OLAP — different OLTP databases become the source of data for OLAP.
• Source of data: OLTP — operational/transactional data. OLAP — data extracted from various operational data sources, transformed and loaded into the data warehouse.
• Purpose of data: OLTP — manage (control and execute) basic business tasks. OLAP — assists in planning, budgeting, forecasting and decision making.
• Data contents: OLTP — current data, far too detailed to be suitable for decision making. OLAP — historical data, with support for summarization and aggregation; stores and manages data at various levels of granularity, making it suitable for decision making.
Few Sample Queries
• OLTP-style queries: search and locate student(s); print student scores; filter students above 90% marks.
• Analytical (BI) queries: Which courses have a productivity impact on the job? How much training is needed on future technologies for non-linear growth in BI? Why consider investing in a DSS experience lab?
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• https://techdifferences.com/difference-between-oltp-and-olap.html
• https://www.guru99.com/oltp-vs-olap.html
• http://www.punjabiuniversity.ac.in/Pages/Images/elearn/OLTPandOLAP.pdf
Different On-Line Analytical
Processing (OLAP)
Architectures
Working of ROLAP:
When a user makes a (complex) query, the ROLAP server fetches data from the RDBMS server. The ROLAP engine then creates data cubes dynamically, and the user views the data from a multi-dimensional perspective.
HOLAP
HOLAP Model
Working of HOLAP:
The HOLAP model consists of a server that can
support ROLAP and MOLAP. It consists of a
complex architecture that requires frequent
maintenance. Queries made in the HOLAP model
involve the multi-dimensional database and the
relational database. The front-user tool presents
data from the database management system
(directly) or through the intermediate MOLAP.
HOLAP Model
HOLAP Disadvantages
The model uses a huge storage space because it consists of data from two
databases.
The model requires frequent updates because of its complex nature.
Greater complexity level: The major drawback in HOLAP systems is that it
supports both ROLAP and MOLAP tools and applications. Thus, it is very
complicated.
Potential overlaps: There are higher chances of overlapping especially into
their functionalities.
Other OLAP
The OLAP Cube consists of numeric facts called measures which are categorized by
dimensions. OLAP Cube is also called the hypercube.
Usually, data operations and analysis are performed using the simple spreadsheet, where
data values are arranged in row and column format. This is ideal for two-dimensional
data. However, OLAP contains multidimensional data, with data usually obtained from
a different and unrelated source. Using a spreadsheet is not an optimal option. The cube
can store and analyze multidimensional data in a logical and orderly manner.
How does it work?
A Data warehouse would extract information from multiple data sources and formats
like text files, excel sheet, multimedia files, etc.
The extracted data is cleaned and transformed. Data is loaded into an OLAP server (or
OLAP cube) where information is pre-calculated in advance for further analysis.
Basic analytical operations of OLAP
Data Model for OLAP
Need to Understand
a. Dimension
b. Facts/measure
Dimension: It is a perspective or entity with respect to which an organization wants to
keep records. For example a store wants to keep records of the store’s sale with respect
to “time”, “product”, “customer”, “employee”.
These dimensions allow the store to keep track of things such as the quarterly sales of products and the customers to whom the products were sold. Each of these dimensions may have a table associated with it, called the dimension table.
Facts/measures: these are numerical measures/quantities by which analysts want to analyse relationships between dimensions, e.g. Total (sales amount in dollars), Quantity (number of units), Discount (amount in dollars).
Star Model for OLAP
Disadvantages of using OLAP
• High cost: In the majority of cases it is not cheap to implement such a system, which is why not every organization can afford it. However, for big companies it is a really great investment, as the opportunities offered by an OLAP system can not only pay off but bring much more profit in the future.
• OLAP is relational: The main problem of this kind of system is that the structure must be defined in advance; the number of columns in the table and the data types have to be decided before table creation. For quick results, this can cause some difficulties.
• Computation capability: Some systems lack computational power, which greatly reduces the flexibility of the OLAP tool. Analysts are limited to a narrow area, unable to analyze freely, and may even have to resort to a third party to perform such calculations. In such business computing, OLAP is often left in an awkward situation.
• Some potential risk: The above-mentioned problems may lead to risk — it may not be possible to handle huge amounts of data, and there is great difficulty in providing valuable links to the decision maker. However, this depends on the system type and vendor; modern OLAP software can be rather powerful in this regard.
References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business
Analytics”, Wiley India Publishers.
• https://onlinecourses.swayam2.ac.in/cec19_cs01/preview
• https://www.guru99.com/star-snowflake-data-warehousing.html
• https://galaktika-soft.com/blog/advantages-of-using-olap-for-business-intelligence.html
• https://www.researchpublishers.org/pdf/26/Importance-of-OLAP-in-Business-Intelligence-by-Debashis-Rout.pdf
Data Warehousing
Strategic Adaptability
Adaptability is critical to the development of business requirements. The business
intelligence tools available in the market must be taken into account to adapt to
often unexpected changes in business demands.
In data warehouses, adaptability requires a principle and method to use alternative
BI tools in the future such as various back-end or visualization tools.
Query Performance
Massive and complex queries should be completed in seconds, not hours or days.
Terabyte Scalability
Today, the size of data warehouses is growing at staggering rates, ranging from a few gigabytes to hundreds of gigabytes or even terabytes.
Architecture of Data Warehouse
Generally speaking, data warehouses have a three-tier architecture, which consists of:
• Bottom tier: The bottom tier consists of a data warehouse server, usually a relational
database system, which collects, cleanses, and transforms data from multiple data sources
through a process known as Extract, Transform, and Load (ETL) or a process known as
Extract, Load, and Transform (ELT).
• Middle tier: The middle tier consists of an OLAP (i.e. online analytical processing)
server which enables fast query speeds. Three types of OLAP models can be used in this
tier, which are known as ROLAP, MOLAP and HOLAP. The type of OLAP model used
is dependent on the type of database system that exists.
• Top tier: The top tier is represented by some kind of front-end user interface or reporting
tool, which enables end users to conduct ad-hoc data analysis on their business data.
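A toy Extract-Transform-Load pass for the bottom tier described above, assuming a CSV source file and a sqlite3 warehouse table; the file name, columns, and cleansing rules are illustrative.

```python
# ETL sketch: extract rows from a CSV source, clean them, load them into a warehouse table.
import csv, sqlite3

def etl(csv_path: str, conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS fact_sales (region TEXT, amount REAL)")
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):                    # Extract
            region = row["region"].strip().title()       # Transform: tidy names/codes
            amount = float(row["amount"] or 0)
            if amount >= 0:                              # drop obviously bad records
                conn.execute("INSERT INTO fact_sales VALUES (?, ?)", (region, amount))
    conn.commit()                                        # Load is complete

# etl("daily_sales.csv", sqlite3.connect("warehouse.db"))   # example invocation
```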
An enterprise warehouse collects all the information and the subjects spanning an entire
organization
It provides us enterprise-wide data integration.
The data is integrated from operational systems and external information providers.
This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond. It generally contains detailed information as well as summarized information.
Index
• Data Integration
• Challenges in Data Integration
• Technologies in Data Integration
• Need of Data Integration
• Advantages of Data Integration
• Common Data Integration Approaches
Data Integration
• Involves combining data residing at different sources and providing users with a unified view of the data.
• Technological challenges
Various formats of data
Structured and unstructured data
Huge volumes of data
• Organizational challenges
Unavailability of data
Manual integration risk, failure
Main Approaches in Data Integration
Integration is divided into two main approaches:
Schema integration – reconciles schema elements
Multiple data sources may provide data on the same entity type. The main goal is to
allow applications to transparently view and query this data as one uniform data
source, and this is done using various mapping rules to handle structural differences.
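A minimal sketch of schema integration via mapping rules: two hypothetical sources describe the same customer entity with different column names, and the rules rename them into one uniform view.

```python
# Mapping-rule based schema integration for two source systems (names are invented).
MAPPING_RULES = {
    "crm":     {"cust_name": "name", "cust_email": "email"},
    "billing": {"full_name": "name", "mail": "email"},
}

def unify(source: str, record: dict) -> dict:
    rules = MAPPING_RULES[source]
    return {rules.get(col, col): value for col, value in record.items()}

rows = [
    unify("crm", {"cust_name": "Asha", "cust_email": "asha@example.com"}),
    unify("billing", {"full_name": "Ravi", "mail": "ravi@example.com"}),
]
print(rows)   # both records now share one uniform schema: {'name': ..., 'email': ...}
```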
Data Warehousing
The various primary concepts used in data warehousing would be:
ETL (Extract Transform Load)
Component-based (Data Mart)
Dimensional Models and Schemas
Metadata driven
Data Warehouse Approaches: Ralph Kimball's & Bill Inmon's Approach
Index
The Kimball data model follows a bottom-up approach to data warehouse (DW)
architecture design in which data marts are first formed based on the business
requirements.
The primary data sources are then evaluated, and an Extract, Transform and Load (ETL) tool
is used to fetch different types of data formats from several sources and load them into a
staging area of the relational database server. Once data is uploaded in the staging area in the
data warehouse, the next phase includes loading data into a dimensional data warehouse
model that’s de-normalized by nature. This model partitions data into the fact table, which is
numeric transactional data, or dimension table, which is the reference information that
supports facts.
Star schema is the fundamental element of the dimensional data warehouse model.
Kimball dimensional modelling allows users to construct several star schemas to fulfill
various reporting needs.
Ralph Kimball’s Approach
• Data isn’t entirely integrated before reporting; the idea of a ‘single source of truth is lost.’
• Irregularities can occur when data is updated in Kimball DW architecture. This is because, in
de-normalization techniques data warehouse, redundant data is added to database tables.
• In the Kimball DW architecture, performance issues may occur due to the addition of
columns in the fact table, as these tables are quite in-depth. The addition of new columns can
expand the fact table dimensions, affecting its performance.
• The dimensional data warehouse model becomes difficult to alter with any change in
business needs.
• As the Kimball model is business process-oriented, instead of focusing on the enterprise as a
whole, it cannot handle all the BI reporting requirements.
Bill Inmon, the father of data warehousing, came up with the concept to develop a data
warehouse that starts designing the corporate data warehouse data model, which identifies the
main subject areas and entities the enterprise works with, such as customers, product, vendor,
and so on.
Bill Inmon’s definition of a data warehouse is that it is a “subject-oriented, nonvolatile,
integrated, time-variant collection of data in support of management’s decisions.”
It is based on Top-down approach.
The model then creates a thorough, logical model for every primary entity. For instance, a
logical model is constructed for products with all the attributes associated with that entity.
This logical model could include ten diverse entities under product, including all the details,
such as business drivers, aspects, relationships, dependencies, and affiliations.
• The Inmon design approach uses the normalized form for building entity structure,
avoiding data redundancy as much as possible.
• This results in clearly identifying business requirements and preventing any data update
irregularities.
• Moreover, the advantage of this top-down approach in database design is that it is robust to
business changes and contains a dimensional perspective of data across data mart.
• Next, the physical model is constructed, which follows the normalized structure.
• This Inmon model creates a single source of truth for the whole business.
• Data loading becomes less complex due to the normalized structure of the model.
• This arrangement for querying is challenging as it includes numerous tables and links.
• Data warehouse acts as a unified source of truth for the entire business, where all data is
integrated.
• This approach has very low data redundancy. So, there’s less possibility of data update
irregularities, making the ETL data warehouse process more straightforward and less
susceptible to failure.
• It simplifies business processes, as the logical model represents detailed business objects.
• This approach offers greater flexibility, as it’s easier to update the data warehouse in case
there’s any change in the business requirements or source data.
• It can handle diverse enterprise-wide reporting requirements.
• Complexity increases as multiple tables are added to the data model with time.
• Resources skilled in data warehouse data modeling are required, which can be expensive and
challenging to find.
• The preliminary setup and delivery are time-consuming.
• Additional ETL operation is required since data marts are created after the creation of the
data warehouse.
• This approach requires experts to manage a data warehouse effectively.
(Figures: Inmon's data warehousing architecture and Kimball's data warehousing architecture.)
Multidimensional data model
Data Model
A data model is a diagrammatic representation of the data and the relationship
between its different entities. It assists in identifying how the entities are
related through a visual representation of their relationships and thus helps
reduce possible errors in the database design. It helps in building a robust
database/data warehouse.
Types of Data Model
Conceptual Data Model
Logical Data Model
Physical Data Model
The conceptual data model is designed by identifying the various entities and
the highest- level relationships between them as per the given requirements.
Let us look at some features of a conceptual data model –
• It identifies the most important entities.
• It identifies relationships between different entities.
• It does not support the specification of attributes.
• It does not support the specification of the primary key.
The entities can be identified as
• Category (to store the category details of products).
• SubCategory (to store the details of sub-categories that belong to different categories)
• Product (to store product details).
• ProductOffer (to map the promotion offer to a product).
• Date (to keep track of the sale date and also to analyze sales in different time periods)
• OperatorType (to store the details of types of operator, viz. company-operated or franchise)
• Outlet (to store the details of various stores distributed over various locations).
• Sales (to store all the daily transactions made at various stores)
The logical data model is used to describe data in as much detail as possible.
While describing the data, no consideration is given to the physical
implementation aspect.
Normalization:
1NF
2NF
3NF and so on
Outcome of Logical Data Model
• The entity names of the logical data model are table names in the physical data
model.
• The attributes of the logical data model are column names in the physical
data model.
• In the physical data model, the data type for each column is specified.
However, data types differ depending on the actual database (MySQL,DB2,
SQL Server 2008, Oracle etc.) being used. In a logical data model, only the
attributes are identified without going into the details about the data type
specifications.
• There are cases where a couple (both husband and wife) are employed either in the same BU
or a different one.
• In such a case, they (the couple) have same address.
• An employee can be on a project, but at any given point in time, he or she can be working on a
single project only.
• Each project belongs to a client. There could be chances where a client has awarded more than
one project to the company (either to the same BU or different BUs).
• A project can also be split into modules which can be distributed to BUs according to their
fields of specialization.
• For example, in an insurance project, the development and maintenance work is with Insurance
Services (IS) and the testing task is with Testing Services (TS). Each BU usually works on
several projects at a time.
Pros -
• The ER diagram is easy to understand and is represented in a language that the
business users can understand.
• It can also be easily understood by a non-technical domain expert.
• It is intuitive and helps in the implementation on the chosen database
platform.
• It helps in understanding the system at a higher level.
Cons –
•The physical designs derived using ER model may have some amount of
redundancy.
•There is scope for misinterpretations because of the limited information available in the
diagram.
Fact Table: A fact table consists of various measurements. It stores the measures of business
processes and points to the lowest detail level of each dimension table. The measures are factual or
quantitative in representation and are generally numeric in nature. They represent the how much or
how many aspects of a question. For example, price, product sales, product inventory, etc.
Types of Fact:
Additive facts: These are the facts that can be summed up/aggregated across all dimensions in a fact
table. For example, discrete numerical measures of activity — quantity sold, dollars sold, etc.
Consider a scenario where a retail store “Northwind Traders” wants to analyze the revenue
generated. The revenue generated can be by the employee who is selling the products; or it can be
in terms of any combination of multiple dimensions. Products, time, region, and employee are the
dimensions in this case.
The revenue, which is a fact, can be aggregated along any of the above dimensions to give the
total revenue along that dimension. Such scenarios where the fact can be aggregated along all the
dimensions make the fact a fully additive, or simply additive, fact. Here, revenue is the additive fact.
Types of Facts (figure): Additive Facts, Semi-Additive Facts, Non-Additive Facts, Factless Facts.
This figure depicts the “SalesFact” fact table along with its corresponding
dimension tables.
This fact table has one measure, “SalesAmount”, and three dimension keys,
“DateID”, “ProductID”, and “StoreID”.
The purpose of the “SalesFact” table is to record the sales amount for each
product in each store on a daily basis. In this table, “SalesAmount” is an additive
fact because we can sum up this fact along any of the three dimensions present in
the fact table i.e. “DimDate”, “DimStore”, and “DimProduct”. For example – the
sum of “SalesAmount” for all 7 days in a week represents the total sales amount
for that week.
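To see additivity in practice, here is a minimal sketch assuming pandas is available; the miniature “SalesFact” rows are made up, and the only point is that SUM gives a meaningful result along any of the three dimension keys.

import pandas as pd

# Tiny illustrative "SalesFact" table; SalesAmount is the additive fact.
sales_fact = pd.DataFrame({
    "DateID":      [1, 1, 2, 2, 3],
    "ProductID":   [10, 20, 10, 20, 10],
    "StoreID":     [100, 100, 200, 200, 100],
    "SalesAmount": [250.0, 120.0, 300.0, 80.0, 150.0],
})

# Because SalesAmount is fully additive, summing it is meaningful along
# any dimension key:
print(sales_fact.groupby("DateID")["SalesAmount"].sum())     # by date
print(sales_fact.groupby("ProductID")["SalesAmount"].sum())  # by product
print(sales_fact.groupby("StoreID")["SalesAmount"].sum())    # by store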
Data Modeling Techniques – Dimensional Modeling- Semi-Additive Facts:
Semi Additive facts: These are the facts that can be summed up for some dimensions in the fact table, but not
all. For example, account balances, inventory level, distinct counts etc.
Consider a scenario where the “Northwind Traders” warehouse manager needs to find the total number of
products in the inventory. One inherent characteristic of any inventory is that there will be incoming products to
the inventory from the manufacturing plants and outgoing products from the inventory to the distribution
centres or retail outlets.
So if the total products in the inventory need to be found out, say, at the end of a month, it cannot be a simple
sum of the products in the inventory of individual days of that month. Actually, it is a combination of addition of
incoming products and subtraction of outgoing ones. This means the inventory level cannot be aggregated
along the “time” dimension.
But if a company has warehouses in multiple regions and would like to find the total products in inventory
across those warehouses, a meaningful number can be arrived at by aggregating inventory levels across those
warehouses. This simply means inventory levels can be aggregated along the “region” dimension. Such
scenarios where a fact can be aggregated along some dimensions but not along all dimensions give rise to
semi-additive facts. In this case, the number of products in inventory or the inventory level is the semi-
additive fact.
Let us discuss another example of semi-additive facts.
Figure depicts the “AccountsFact” fact table along with its
corresponding dimension tables. The “AccountsFact” fact table has
two measures: “CurrentBalance” and “ProfitMargin”, and two dimension keys:
“DateID” and “AccountID”. “CurrentBalance” is a semi-additive fact. It makes
sense to add up the current balances of all accounts to answer “what is the
total current balance for all accounts in the bank?” However, it does not make
sense to add up current balances through time, i.e., to add up the current
balance of a given account for each day of the month. Similarly,
“ProfitMargin” is a non-additive fact, as it does not make sense to add
profit margins either at the account level or at the day level.
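As a small illustration of this semi-additive behaviour, the sketch below assumes pandas and uses made-up balances: summing across accounts on a given day is meaningful, while the usual workaround across time is to take the last (or average) balance rather than the sum.

import pandas as pd

# Illustrative "AccountsFact"; CurrentBalance is the semi-additive fact.
accounts_fact = pd.DataFrame({
    "DateID":         [1, 1, 2, 2],
    "AccountID":      ["A", "B", "A", "B"],
    "CurrentBalance": [1000.0, 500.0, 1100.0, 450.0],
})

# Meaningful: total balance across all accounts for each day.
print(accounts_fact.groupby("DateID")["CurrentBalance"].sum())

# Not meaningful to SUM across time; take the latest balance per account
# instead (rows ordered by DateID).
print(accounts_fact.sort_values("DateID")
                   .groupby("AccountID")["CurrentBalance"].last())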
Data Modeling Techniques – Dimensional Modeling-
Non-Additive Facts:
Non-Additive facts: These are the facts that cannot be summed up across any of the dimensions present in the
fact table. For example, measurement of room temperature, percentages, ratios, factless facts, etc. Non-additive
facts cannot be added meaningfully across any dimension. In other words, non-additive facts are facts where the
SUM operator cannot be used to produce any meaningful result. The following illustration shows why room
temperature is a non-additive fact.
Date               Temperature
5th May (7 AM)     27
5th May (12 AM)    33
5th May (5 PM)     10
Sum                70   (not a meaningful result)
Average            23.3 (a meaningful result)
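As a quick check of the table above, a short pandas sketch (pandas assumed; same illustrative readings) shows that summing the fact is meaningless while averaging it is not.

import pandas as pd

temps = pd.Series([27, 33, 10], index=["7 AM", "12 AM", "5 PM"])
print(temps.sum())   # 70   -> not a meaningful "total temperature"
print(temps.mean())  # 23.3 -> averaging is the meaningful summary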
Types of Dimensions (figure): Degenerate Dimension, Junk (garbage) Dimension, Rapidly Changing Dimension, Slowly Changing Dimension, Role-playing Dimension.
We have seen how to handle very slow changes in a dimension, but what would happen if changes occur
more frequently?
A dimension is considered a fast changing dimension, also called a rapidly changing dimension, if one or
more of its attributes change frequently and in several rows. For example, consider a customer table
having 1,00,000 rows. Assuming that, on average, 10 changes occur in the dimension every year, then
in one year the number of rows will increase to 1,00,000 x 10 = 10,00,000.
To identify a fast changing dimension, look for attributes having continuously variable values. Some of
the fast changing dimension attributes have been identified as:
• Age
• Income
• Test score
• Rating
• Credit history score
• Customer account status
• Weight
One method of handling fast changing dimensions is to break off the fast changing attributes into one or
more separate dimensions known as mini-dimensions. The fact table would then have two separate
foreign keys — one for the primary dimension table and another for the fast changing attributes.
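The sketch below shows one possible shape of that split, using Python's sqlite3 module; the table and column names (DimCustomer, DimCustomerProfile, the banded attributes) are illustrative assumptions, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The primary dimension keeps the stable customer attributes.
cur.execute("""
    CREATE TABLE DimCustomer (
        CustomerKey  INTEGER PRIMARY KEY,
        CustomerName TEXT
    )
""")

# Fast changing attributes (age, income, score) are broken off into a
# small mini-dimension of pre-defined bands.
cur.execute("""
    CREATE TABLE DimCustomerProfile (
        ProfileKey INTEGER PRIMARY KEY,
        AgeBand    TEXT,
        IncomeBand TEXT,
        ScoreBand  TEXT
    )
""")

# The fact table carries two separate foreign keys: one to the primary
# dimension and one to the mini-dimension.
cur.execute("""
    CREATE TABLE SalesFact (
        DateID      INTEGER,
        CustomerKey INTEGER REFERENCES DimCustomer(CustomerKey),
        ProfileKey  INTEGER REFERENCES DimCustomerProfile(ProfileKey),
        SalesAmount REAL
    )
""")

conn.close()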
SurrogateKeyID     CountOfPatients
1                  2
2                  3
3                  5
R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”, Wiley India
Publishers.
http://www.punjabiuniversity.ac.in/Pages/Images/elearn/MultidimensionalDataModeling.pdf
As discussed earlier, the Entity Relationship (ER) data model is a commonly
used data model for relational databases. Here, the database schema is
represented by a set of entities and the relationships between them. It is an
ideal data model for On-Line Transaction Processing (OLTP).
Let us now look at a data model that is considered apt for on-line data
analysis. Multidimensional data modeling is the most popular data model
when it comes to designing a data warehouse.
• Normalization and expansion of the dimension tables in a star schema result in the
implementation of a snowflake design.
• A dimension table is said to be snowflaked when the low-cardinality attributes of the
dimension have been moved to separate normalized tables, and these normalized
tables are then joined back to the original dimension table.
As we have seen in the example of “ElectronicsForAll”, the main difference between the Star and
Snowflake schemas is that the dimension tables of the Snowflake schema are maintained in
normalized form to reduce redundancy. The advantage is that such normalized tables save storage
space. However, it also means that more joins will be needed to execute a query, which will
adversely impact system performance.
Identifying Dimensions to be Snowflaked
In this section, we will observe the practical implementation of the dimensional design.
What is snowflaking?
The snowflake design is the result of further expansion and normalization of the dimension table.
In other words, a dimension table is said to be snowflaked if the low-cardinality attributes of the
dimensions have been divided into separate normalized tables. These tables are then joined back to the
dimension table with referential (foreign key) constraints.
Generally, snowflaking is not recommended in the dimension table, as it hampers the
understandability and performance of the dimensional model as more tables would be required to
satisfy the queries.
When do we snowflake?
The dimensional model is snowflaked under the following two conditions:
• The dimension table consists of two or more sets of attributes which define information at different grains.
• The sets of attributes of the same dimension table are populated by different source systems.
Snowflake Schema
To understand why and when we snowflake, consider the “Product” dimension table shown in the figure below.
Figure (snowflaked geography dimension): the Region table (RegionID, Country Code, State Code, City Code) is normalized into separate Country (Country Code, Country Name), State (State Code, State Name) and City (City Code, City Name, ZIP) tables. Snowflaking decreases performance because more tables will need to be joined to satisfy queries.
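The sketch below reproduces the snowflaked geography tables from the figure with Python's sqlite3 module (column names taken from the figure, otherwise illustrative); the query merely shows the extra joins a snowflaked dimension forces on every lookup.

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE City    (CityCode TEXT PRIMARY KEY, CityName TEXT, ZIP TEXT);
    CREATE TABLE State   (StateCode TEXT PRIMARY KEY, StateName TEXT);
    CREATE TABLE Country (CountryCode TEXT PRIMARY KEY, CountryName TEXT);
    CREATE TABLE Region  (RegionID INTEGER PRIMARY KEY,
                          CountryCode TEXT REFERENCES Country(CountryCode),
                          StateCode   TEXT REFERENCES State(StateCode),
                          CityCode    TEXT REFERENCES City(CityCode));
""")

# Reassembling one denormalized Region row now needs three joins -- the
# performance cost noted above.
cur.execute("""
    SELECT r.RegionID, c.CountryName, s.StateName, ci.CityName, ci.ZIP
    FROM Region r
    JOIN Country c  ON c.CountryCode = r.CountryCode
    JOIN State   s  ON s.StateCode   = r.StateCode
    JOIN City    ci ON ci.CityCode   = r.CityCode
""")
print(cur.fetchall())  # empty here; the point is the join structure
conn.close()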
The constellation schema is shaped like a constellation of stars (i.e. Star schemas). This is
more complex than Star or Snowflake schema variations, as it contains multiple fact tables.
This allows the dimension tables to be shared among the various fact tables. It is also called
“Galaxy schema”. The main disadvantage of the fact constellation is its more complicated
design, because multiple aggregations must be taken into consideration (see figure below).
Figure: a multidimensional cube illustrating the terms Dimension, Hierarchy, Grain, and Fact, with service lines (Testing, Consulting, Production Support) by region (N. America, Europe, Asia Pacific) across quarters Q1-Q4.
A cluster is a group of objects that belong to the same class. In other words, similar
objects are grouped in one cluster and dissimilar objects are grouped in another
cluster.
• Partition data set into clusters based on similarity, and store cluster representation
(e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering and be stored in multi-dimensional index tree
structures
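The first bullet above is the idea behind cluster-based data reduction; the sketch below, assuming NumPy and a toy hand-assigned clustering, stores only a centroid and a diameter per cluster instead of the raw points.

import numpy as np

# Toy 2-D data with a hand-made cluster assignment (purely illustrative).
points = np.array([[1.0, 1.2], [0.9, 1.1], [1.1, 0.8],
                   [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

# Keep only a compact representation per cluster: centroid and diameter.
for k in np.unique(labels):
    members = points[labels == k]
    centroid = members.mean(axis=0)
    diffs = members[:, None, :] - members[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum(axis=-1)).max()  # largest pairwise distance
    print(k, centroid, round(float(diameter), 3))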
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based
on data similarity and then assign the labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to
changes and helps single out useful features that distinguish different groups.
Clustering Methods
Partitioning Method: Suppose we are given a database of ‘n’ objects and the
partitioning method constructs ‘k’ partition of data. Each partition will represent a
cluster and k ≤ n. It means that it will classify the data into k groups, which satisfy
the following requirements −
• Each group contains at least one object.
• Each object must belong to exactly one group.
Hierarchical Methods: This method creates a hierarchical decomposition of the
given set of data objects. We can classify hierarchical methods on the basis of how
the hierarchical decomposition is formed.
Density-based Method: This method is based on the notion of density. The basic
idea is to continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within a given
cluster, the radius of a given cluster has to contain at least a minimum number of
points.
Grid-based Method: In this method, the object space is quantized into a finite
number of cells that together form a grid structure.
• The major advantage of this method is fast processing time.
• It is dependent only on the number of cells in each dimension of the quantized
space.
Model-based methods: In this method, a model is hypothesized for each cluster to
find the best fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data points.
Constraint-based Method: In this method, the clustering is performed by the
incorporation of user or application-oriented constraints. A constraint refers to the
user expectation or the properties of desired clustering results.
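To contrast a partitioning method with a density-based one in code, the sketch below assumes scikit-learn is installed; the data and the eps/min_samples values are illustrative only.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
# Two well separated blobs of toy 2-D points.
data = np.vstack([rng.normal(0, 0.3, (50, 2)),
                  rng.normal(5, 0.3, (50, 2))])

# Partitioning method: the number of clusters k is fixed up front.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("k-means labels:", np.unique(km.labels_))

# Density-based method: a cluster keeps growing while every point has at
# least min_samples neighbours within radius eps.
db = DBSCAN(eps=0.5, min_samples=5).fit(data)
print("DBSCAN labels:", np.unique(db.labels_))  # label -1 marks noise points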
The following are the typical requirements of clustering in data mining.
Scalability: Require highly scalable clustering algorithms to work with large
databases.
Ability to deal with different kinds of attributes: Algorithms should be able to
work with the type of data such as categorical, numerical, and binary data.
Discovery of clusters with arbitrary shape: The algorithm should be able to detect
clusters of arbitrary shape; it should not be bounded to distance measures alone.
Interpretability: The results should be comprehensible, usable, and interpretable.
High dimensionality: The algorithm should be able to handle high dimensional
space instead of only handling low dimensional data.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used
to solve the clustering problems in machine learning or data science.
The algorithm takes the unlabelled dataset as input, divides the dataset
into k clusters, and repeats the process until it finds the best clusters.
The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
1. Determines the best value for K center points or centroids by an
iterative process.
2. Assigns each data point to its closest k-center. The data points that are
nearest to a particular k-center form a cluster.
Hence each cluster has data points with some commonalities, and it is
away from other clusters.
K-Means Clustering Algorithm
The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (these may be points other than
those from the input dataset).
Step-3: Assign each data point to its closest centroid, which will
form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to
the new closest centroid of its cluster.
Step-6: If any reassignment occurs, go to Step-4; else go to
FINISH.
Step-7: The model is ready.
As we get the new centroids, we again draw the median line and reassign the data points.
Once there are no dissimilar data points on either side of the line, no further reassignment
occurs and the model is formed.
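The steps above translate directly into code; the following is a minimal NumPy sketch (illustrative only, not an optimised implementation) that uses Euclidean distance and stops when the centroids no longer move.

import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """A plain NumPy version of the steps listed above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and pick K random data points as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3 and 5: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the centroid of each cluster.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no centroid (and hence no assignment) changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids  # Step 7: the model is ready.

# Tiny usage example with two obvious groups.
pts = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                [6.0, 6.0], [6.1, 5.9], [5.9, 6.2]])
labels, centers = k_means(pts, k=2)
print(labels, centers)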
Text mining
Text mining is the application of data mining to non-structured or less structured text files. It
entails the generation of meaningful numerical indices from the
unstructured text and then processing these indices using various data
mining algorithms.
Text mining helps organizations:
Find the “hidden” content of documents, including additional useful relationships
Relate documents across previously unnoticed divisions
Group documents by common themes
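As a small illustration of turning unstructured text into numerical indices and then mining them, the sketch below assumes scikit-learn; the documents, the TF-IDF representation, and the choice of k-means for grouping by theme are all illustrative.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "quarterly revenue grew across retail outlets",
    "room occupancy and average room rate improved",
    "retail sales revenue by product category",
    "hotel occupancy statistics for the summer season",
]

# Step 1: generate numerical indices (TF-IDF weights) from the raw text.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Step 2: feed those indices to an ordinary data mining algorithm, here
# k-means, to group documents by common themes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)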
Web mining
The discovery and analysis of interesting and useful information from the
Web, about the Web, and usually through Web-based tools
Figure: Types of Web Mining.
Knowledge Management: Introduction, Purpose and Strategies
Successful knowledge management initiatives depend on a few key factors. Ensure you take
the following elements into account when designing your knowledge management strategy:
• People and Culture — Knowledge management is not a stand-alone function in an
organization. It should include a cross-functional, culture-driven approach to how the
organization operates and needs to be made a top priority to be successful.
• Process — Your company needs to develop a plan for how your knowledge management
becomes part of your everyday business operations.
• Technology — Successful knowledge management initiatives depend heavily on
technology. You will need a technology infrastructure that supports your knowledge
management plan.
• Strategy — Your knowledge management process should focus on identifying and
eliminating knowledge and process gaps.
A knowledge management strategy is a written plan of action that outlines the steps your company will
take to implement a knowledge management system. A strategy will help you identify what
knowledge you need to manage and keep your project on track.
1. Build Your Knowledge Management Team: To build a comprehensive strategy, gather team
members who understand the value of managing your company’s knowledge. Members of your
knowledge management team should become role models and influencers when it comes time for
employees to use your system.
2. Identify Your Goals: Identify your company’s business goals and create goals for your knowledge
management system that align. Next, figure out how your knowledge management system will benefit
employees, customers, and your organization as a whole. This will help you get buy-in from
leadership as you move through the strategy and implementation process and provide a solid road map
you can refer back to at any time.
3. Perform a Knowledge Audit: A knowledge audit takes a look at your company’s information to
understand how you are currently managing that information. Unlike a content audit, a knowledge
audit takes a step back to look at the overall amount of content you are storing.
How to Build a Knowledge Management Strategy
4. Choose Your Technology: Choose the primary tool you’ll use to build your knowledge management system.
Knowledge management tools provide a central location for all of your knowledge, making it easy to store and
retrieve your information. Examples of knowledge management tools include customer relationship
management (CRM) systems, knowledge base software, internal wikis, and learning management systems.
5. Create a Communication Plan: Create a plan for sharing your new knowledge management system with
your employees to make sure they know and understand how it works. This plan should include the messaging
you’ll use and the channels you will use to distribute communications.
6. Establish Milestones: Create specific milestones to keep your project on track. Be specific when designing
your milestones so they can be easily measured and managed. For example, a proper milestone will include
specific dates so you can set delivery expectations. A milestone looks like “select a knowledge base by April
27th” instead of “find a knowledge base to use.”
7. Build a Roadmap: As soon as you have put all the pieces in place, you can begin constructing your
implementation roadmap. The roadmap should describe the complete picture of your implementation, broken up
into stages, and include your objectives, milestones, and timelines. Describe each step clearly so stakeholders
can easily understand it.