
Data Mining Questions


Assignment 3.

Data Mining

Guided By- Asst. Prof. Alka Ma'am.


Submitted By- Swarnim Shukla
============================================================================
ANSWERS
Q.1 How is data partitioning helpful in reducing query access time for a data warehouse?
Ans. Partitioning is done to enhance performance and facilitate easy management of data. Partitioning also helps in
balancing the various requirements of the system. It optimizes hardware performance and simplifies the management
of the data warehouse by dividing each fact table into multiple separate partitions. The different partitioning strategies
are discussed below.
Partitioning is important for the following reasons –
For Easy Management
The fact table in a data warehouse can grow to hundreds of gigabytes in size. A fact table of this size is very hard
to manage as a single entity, so it needs partitioning.
To Assist Backup/Recovery
If we do not partition the fact table, then we have to load the complete fact table with all the data. Partitioning allows us
to load only as much data as is required on a regular basis. It reduces the time to load and also enhances the performance
of the system.
Note − To cut down on the backup size, all partitions other than the current partition can be marked as read-only and
backed up once. After that, only the current partition needs to be backed up.
To Enhance Performance
By partitioning the fact table into sets of data, query performance is enhanced because a query now scans only the
partitions that are relevant rather than the whole table.
Horizontal Partitioning
There are various ways in which a fact table can be partitioned. In horizontal partitioning, we have to keep in mind the
requirements for manageability of the data warehouse.
Partitioning by Time into Equal Segments
In this partitioning strategy, the fact table is partitioned on the basis of time period, where each period represents a
significant retention period within the business. For example, if the user queries for month-to-date data, it is
appropriate to partition the data into monthly segments. We can reuse partitioned tables by removing the data in
them.
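For example, a minimal Oracle-style sketch of monthly range partitioning (table, column and partition names are hypothetical):

-- Fact table range-partitioned by month on the transaction date.
CREATE TABLE sales_fact (
  transaction_date DATE         NOT NULL,
  product_id       NUMBER       NOT NULL,
  region_id        NUMBER       NOT NULL,
  amount           NUMBER(12,2)
)
PARTITION BY RANGE (transaction_date) (
  PARTITION sales_2023_01 VALUES LESS THAN (DATE '2023-02-01'),
  PARTITION sales_2023_02 VALUES LESS THAN (DATE '2023-03-01'),
  PARTITION sales_2023_03 VALUES LESS THAN (DATE '2023-04-01')
);

A query restricted to one month, such as WHERE transaction_date >= DATE '2023-02-01' AND transaction_date < DATE '2023-03-01', then scans only the matching partition.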
Partition by Time into Different-sized Segments
This kind of partitioning is done where aged data is accessed infrequently. It is implemented as a set of small partitions
for relatively current data and a larger partition for inactive data.
Points to Note
1. The detailed information remains available online.
2. The number of physical tables is kept relatively small, which reduces the operating cost.
3. This technique is suitable where a mix of data dipping into recent history and data mining through the entire history is
required.
4. This technique is not useful where the partitioning profile changes on a regular basis, because repartitioning will
increase the operating cost of the data warehouse.

Partition on a Different Dimension


The fact table can also be partitioned on the basis of dimensions other than time such as product group, region, supplier,
or any other dimension. Let's have an example.
Suppose a marketing function has been structured into distinct regional departments, for example on a state-by-state
basis. If each region wants to query information captured within its own region, it is more effective to partition the fact
table into regional partitions. This speeds up queries because they do not need to scan information that is not relevant.
Points to Note
 The query does not have to scan irrelevant data which speeds up the query process.
 This technique is not appropriate where the dimension is likely to change in future; it is worth determining that the
dimension will not change before partitioning on it.
 If the dimension changes, then the entire fact table would have to be repartitioned.
Note − We recommend performing the partitioning only on the basis of the time dimension, unless you are certain that
the suggested dimension grouping will not change within the life of the data warehouse.
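As an illustration of the regional partitioning described above, a hedged Oracle-style sketch of list partitioning on a region code (all names are hypothetical):

-- Fact table partitioned by region so each regional department
-- queries only its own partition.
CREATE TABLE regional_sales (
  region_code VARCHAR2(2)  NOT NULL,
  sale_date   DATE         NOT NULL,
  amount      NUMBER(12,2)
)
PARTITION BY LIST (region_code) (
  PARTITION p_north VALUES ('UP', 'DL'),
  PARTITION p_west  VALUES ('MH', 'GJ'),
  PARTITION p_other VALUES (DEFAULT)
);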
Partition by Size of Table
When there is no clear basis for partitioning the fact table on any dimension, we should partition it on the basis of
size. We can set a predetermined size as a critical point; when the table exceeds that size, a new table partition is
created.

Points to Note
This partitioning is complex to manage.
It requires metadata to identify what data is stored in each partition.
Partitioning Dimensions
If a dimension contains a large number of entries, then it may be required to partition the dimension. Here we have to
check the size of the dimension.
Consider a large design that changes over time. If we need to store all the variations in order to apply comparisons, that
dimension may be very large. This would definitely affect the response time.
Round Robin Partitions
In the round robin technique, when a new partition is needed, the old one is archived. Metadata is used to allow the user
access tool to refer to the correct table partition.
This technique makes it easy to automate table management facilities within the data warehouse.
Vertical Partition
Vertical partitioning splits the data vertically, by columns rather than rows.

Vertical partitioning can be performed in the following two ways −

 Normalization
 Row Splitting

Normalization
Normalization is the standard relational method of database organization. In this method, duplicated rows are collapsed
into a single row, hence reducing space.
Row Splitting
Row splitting tends to leave a one-to-one map between the partitions. The motive of row splitting is to speed up access
to a large table by reducing its size.
Note − While using vertical partitioning, make sure that there is no requirement to perform a major join operation
between two partitions.
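For example, a minimal sketch of row splitting, assuming a wide customer table whose rarely used columns are moved to a one-to-one companion table (all names are hypothetical):

-- Frequently accessed columns stay in the main table.
CREATE TABLE customer_main (
  customer_id NUMBER PRIMARY KEY,
  name        VARCHAR2(100),
  region_id   NUMBER
);

-- Rarely used, bulky columns move to a 1:1 companion table,
-- linked by the same primary key.
CREATE TABLE customer_detail (
  customer_id NUMBER PRIMARY KEY
              REFERENCES customer_main (customer_id),
  biography   CLOB,
  preferences CLOB
);

Queries that touch only the main columns now read a much smaller table; the note above applies, since joining the two halves on every query would cancel the benefit.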
Identify Key to Partition

It is crucial to choose the right partition key; choosing a wrong one will lead to reorganizing the fact table. We can
choose to partition on any key.
Suppose the business is organized into 30 geographical regions and each region has a different number of branches.
That gives us 30 partitions, which is reasonable. This partitioning is good enough because our requirements capture has
shown that the vast majority of queries are restricted to the user's own business region.
If we partition by transaction_date instead of region, then the latest transactions from every region will be in one
partition. A user who wants to look at data within his own region then has to query across multiple partitions.
Hence it is worth determining the right partitioning key.

Q.2 Mention the guidelines given by E.F. Codd for an OLAP system.
Ans. OLAP Guidelines (Codd's Rules)
On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and
executives to gain insight into data through fast, consistent, interactive access to a wide variety of information that has
been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

OLAP was introduced by Dr. E.F. Codd in 1993, and he presented 12 rules regarding OLAP:

 Multidimensional Conceptual View:


A multidimensional data model should be provided that is intuitively analytical and easy to use; it determines how
users perceive business problems.

 Transparency:
It makes the technology, underlying data repository, computing architecture and the diverse nature of source data
totally transparent to users.

 Accessibility:
Access should be provided only to the data that is actually needed to perform the specific analysis, presenting a single,
coherent and consistent view to the users.
 Consistent Reporting Performance:
Users should not experience any significant degradation in reporting performance as the number of dimensions or the
size of the database increases. Users should also perceive consistent run time, response time and machine utilization
every time a given query is run.
 Client/Server Architecture:
The system should conform to the principles of client/server architecture for optimum performance, flexibility,
adaptability and interoperability.
 Generic Dimensionality:
It should be ensured that every data dimension is equivalent in both structure and operational capabilities, i.e., there is
one logical structure for all dimensions.
 Dynamic Sparse Matrix Handling:
The physical schema should adapt to the specific analytical model being created and loaded, so as to optimize sparse
matrix handling.
 Multi-user Support:
Support should be provided for end users to work concurrently with either the same analytical model or to create
different models from the same data.
 Unrestricted Cross-dimensional Operations:
The system should be able to recognize dimensional hierarchies and automatically perform roll-up and drill-down
operations within a dimension or across dimensions.
 Intuitive Data Manipulation:
Consolidation path reorientation, drill-down, roll-up and other manipulations should be accomplished intuitively and
directly via point-and-click actions.
 Flexible Reporting:
Business users should be provided with capabilities to arrange columns, rows and cells in a manner that facilitates easy
manipulation, analysis and synthesis of information.
 Unlimited Dimensions and Aggregation Levels:
The system should support at least fifteen to twenty data dimensions within a common analytical model.

Q.3 Define the physical design process in detail.


Ans. During the physical design process, you translate the expected schemas into actual database structures. At this time,
you have to map:
 Entities to tables
 Relationships to foreign key constraints
 Attributes to columns
 Primary unique identifiers to primary key constraints
 Unique identifiers to unique key constraints
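A minimal DDL sketch of this mapping (table and column names are hypothetical):

CREATE TABLE product_dim (                        -- entity -> table
  product_id   NUMBER       PRIMARY KEY,          -- primary unique identifier -> primary key constraint
  product_code VARCHAR2(20) UNIQUE NOT NULL,      -- unique identifier -> unique key constraint
  category     VARCHAR2(50)                       -- attribute -> column
);

CREATE TABLE sale (
  sale_id    NUMBER PRIMARY KEY,
  product_id NUMBER NOT NULL
             REFERENCES product_dim (product_id)  -- relationship -> foreign key constraint
);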
Physical Design Structures
Once you have converted your logical design to a physical one, you will need to create some or all of the following
structures:

 Tablespaces

 Tables and Partitioned Tables

 Views

 Integrity Constraints

 Dimensions
Some of these structures require disk space. Others exist only in the data dictionary. Additionally, the following structures
may be created for performance improvement:

 Indexes and Partitioned Indexes

 Materialized Views
Tablespaces
A tablespace consists of one or more datafiles, which are physical structures within the operating system you are using. A
datafile is associated with only one tablespace. From a design perspective, tablespaces are containers for physical design
structures.

Tablespaces should be separated by their differences. For example, tables should be separated from their indexes, and
small tables should be separated from large tables. Tablespaces should also represent logical business units if possible.
Because a tablespace is the coarsest granularity for backup and recovery or for the transportable tablespaces mechanism,
the logical business design affects availability and maintenance operations.
You can now use ultralarge (bigfile) data files, a significant improvement for very large databases.
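For instance, a hedged Oracle-style sketch that separates fact data and indexes into their own tablespaces, using a BIGFILE tablespace for the very large fact data (file paths and sizes are illustrative):

-- One very large datafile backs the fact data.
CREATE BIGFILE TABLESPACE fact_ts
  DATAFILE '/u01/oradata/dw/fact_ts01.dbf' SIZE 10G AUTOEXTEND ON;

-- Indexes live in a separate, smaller tablespace.
CREATE TABLESPACE index_ts
  DATAFILE '/u01/oradata/dw/index_ts01.dbf' SIZE 2G;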
Tables and Partitioned Tables
Tables are the basic unit of data storage. They are the container for the expected amount of raw data in your data
warehouse.
Using partitioned tables instead of nonpartitioned ones addresses the key problem of supporting very large data volumes
by allowing you to divide them into smaller and more manageable pieces. The main design criterion for partitioning is
manageability, though you will also see performance benefits in most cases because of partition pruning or intelligent
parallel processing. For example, you might choose a partitioning strategy based on a sales transaction date and a monthly
granularity. If you have four years' worth of data, you can delete a month's data as it becomes older than four years with a
single, fast DDL statement and load new data while only affecting 1/48th of the complete table. Business questions
regarding the last quarter will only affect three months, which is equivalent to three partitions, or 3/48ths of the total
volume.
Partitioning large tables improves performance because each partitioned piece is more manageable. Typically, you
partition based on transaction dates in a data warehouse. For example, each month, one month's worth of data can be
assigned its own partition.
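A sketch of this rolling-window maintenance, assuming the hypothetical monthly range-partitioned sales_fact table from the earlier example (partition names are illustrative):

-- Age out the oldest month and open a new one; each statement
-- touches only a single partition, not the whole table.
ALTER TABLE sales_fact DROP PARTITION sales_2021_01;
ALTER TABLE sales_fact ADD PARTITION sales_2025_01
  VALUES LESS THAN (DATE '2025-02-01');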
Table Compression
You can save disk space by compressing heap-organized tables. A typical type of heap-organized table you should
consider for table compression is partitioned tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a
compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression
can also speed up query execution. There is, however, a cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid
compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable,
there is some overhead in updating these tables, and high update activity may work against compression by causing some
space to be wasted.
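For example, a minimal sketch of basic table compression on a read-mostly, range-partitioned history table (names are hypothetical):

-- COMPRESS suits data that is loaded once and then mostly read.
CREATE TABLE sales_archive (
  transaction_date DATE,
  product_id       NUMBER,
  amount           NUMBER(12,2)
)
COMPRESS
PARTITION BY RANGE (transaction_date) (
  PARTITION p2022 VALUES LESS THAN (DATE '2023-01-01'),
  PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01')
);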
Views
A view is a tailored presentation of the data contained in one or more tables or other views. A view takes the output of a
query and treats it as a table. Views do not require any space in the database.
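A minimal example, assuming the hypothetical sales_fact table from earlier:

-- The view stores only the query definition, not the data.
CREATE VIEW monthly_sales AS
SELECT TRUNC(transaction_date, 'MM') AS sale_month,
       SUM(amount)                   AS total_amount
FROM   sales_fact
GROUP  BY TRUNC(transaction_date, 'MM');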
Indexes and Partitioned Indexes
Indexes are optional structures associated with tables or clusters. In addition to the classical B-tree indexes, bitmap
indexes are very common in data warehousing environments. Bitmap indexes are optimized index structures for set-
oriented operations. Additionally, they are necessary for some optimized data access methods such as star
transformations.
Indexes are just like tables in that you can partition them, although the partitioning strategy is not dependent upon the
table structure. Partitioning indexes makes it easier to manage the data warehouse during refresh and improves query
performance.
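As a sketch, again assuming the hypothetical sales_fact table: a bitmap index on a low-cardinality column, and a LOCAL b-tree index whose partitions mirror the table's partitions:

-- Bitmap indexes on partitioned tables must be LOCAL in Oracle.
CREATE BITMAP INDEX sales_region_bix ON sales_fact (region_id) LOCAL;

-- A local b-tree index is maintained partition by partition
-- during refresh operations.
CREATE INDEX sales_date_ix ON sales_fact (transaction_date) LOCAL;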
Materialized Views

Materialized views are query results that have been stored in advance so long-running calculations are not necessary when
you actually execute your SQL statements. From a physical design point of view, materialized views resemble tables or
partitioned tables and behave like indexes in that they are used transparently and improve performance.
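For instance, a hedged sketch of a materialized view that precomputes monthly totals and can transparently answer matching queries via query rewrite (names are hypothetical):

CREATE MATERIALIZED VIEW sales_by_month_mv
  BUILD IMMEDIATE
  REFRESH FORCE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(transaction_date, 'MM') AS sale_month,
       SUM(amount)                   AS total_amount
FROM   sales_fact
GROUP  BY TRUNC(transaction_date, 'MM');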
Dimensions
A dimension is a schema object that defines hierarchical relationships between columns or column sets. A hierarchical
relationship is a functional dependency from one level of a hierarchy to the next one. A dimension is a container of logical
relationships and does not require any space in the database. A typical dimension is city, state (or province), region, and
country.
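A minimal Oracle-style sketch of such a dimension object, assuming a hypothetical geography table with city, state, region and country columns:

-- Declares the functional dependency city -> state -> region -> country;
-- no data is stored, only the logical relationship.
CREATE DIMENSION geography_dim
  LEVEL city    IS geography.city
  LEVEL state   IS geography.state
  LEVEL region  IS geography.region
  LEVEL country IS geography.country
  HIERARCHY geo_rollup (
    city CHILD OF state CHILD OF region CHILD OF country
  );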

Q.4 Explain the characteristics of the different classes of users in a data warehouse.


Ans. The success of a data warehouse is measured solely by its acceptance by users. Without users, historical data might
as well be archived to magnetic tape and stored in the basement. Successful data warehouse design starts with
understanding the users and their needs.
Following is the general classification of users:
Casual or Naive Users: Use the data warehouse occasionally, not daily. Need a very intuitive information interface.
Look for the information delivery to prompt them with available choices. Need big-button navigation.
Regular Users: Use the data warehouse almost daily. Comfortable with computing options but cannot create their own
reports and queries from scratch. Need query templates and predefined reports.
Power Users: Highly proficient with technology. Can create reports and queries from scratch. Some can write their own
macros and scripts, and can import data into spreadsheets and other applications.
Following is the classification of users according to how they interact with the system to obtain information:
Preprocessed Reports: Use routine reports run and delivered at regular intervals.
Predefined Queries and Templates: Enter their own set of parameters and run queries with predefined templates and
reports with predefined formats.
Limited Ad Hoc Access: Create from scratch and run a limited number of simple queries and analyses.
Complex Ad Hoc Access: Create complex queries and run analysis sessions from scratch regularly. Provide the basis
for preprocessed and predefined queries and reports.
High-Level Executives and Managers: Need information for high-level strategic decisions. Standard reports on key
metrics are useful. Customized and personalized information is preferable.
Technical Analysts: Look for complex analysis, statistical analysis, drill-down and slice-dice capabilities, and freedom
to access the entire data warehouse.
Business Analysts: Although comfortable with technology, are not quite adept at creating queries and reports from
scratch. Predefined navigation is helpful. Want to look at the results in many different ways. To some extent, can
modify and customize predefined reports.
Business-Oriented Users: Knowledge workers who like point-and-click GUIs. Desire standard reports and some
measure of ad hoc querying.

Q.5 Describe the 4 main activities of data warehouse deployment.
Ans. Project Scoping and Planning

Project Triangle – Scope, Time and Resources.

 Determine the scope of the project – what would you like to accomplish? This can be defined by the questions to be
answered, the number of logical stars, and the number of OLTP sources.
 Time – What is the target date for the system to be available to the users?
 Resources – What is our budget? What are the role and profile requirements of the resources needed to make this happen?

1. Requirement

 What are the business questions? How can the answers to these questions change business decisions or trigger
actions?
 What are the roles of the users? How often do they use the system? Do they do any interactive reporting or just view
predefined reports in guided navigation?
 How do you measure? What are the metrics?

2. Front-End Design

 The front-end design needs to cover both interactive analysis and the designed analytics workflows.
 How does the user interact with the system?
 What is their analysis process?

3. Warehouse Schema Design

Dimensional modeling – define the dimensions and facts, and define the grain of each star schema.
Define the physical schema – depending on the technology decision. If you use relational technology, design the
database tables.
4. OLTP to data warehouse mapping

 Logical mapping – table-to-table and column-to-column mapping. Also define the transformation rules.
 You may need to perform OLTP data profiling. How often does the data change? What is the data distribution?
 ETL design – include data staging and the detailed ETL process flow (a mapping sketch follows this list).
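A minimal sketch of one such column-to-column mapping with simple transformation rules, loading from a hypothetical staging table into the warehouse fact table (all names and rules are illustrative):

INSERT INTO sales_fact (transaction_date, product_id, region_id, amount)
SELECT TRUNC(s.txn_ts),        -- transformation: OLTP timestamp -> daily grain
       s.prod_id,              -- column-to-column mapping
       s.reg_id,
       s.amt_cents / 100       -- transformation: cents -> currency units
FROM   stg_oltp_sales s;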

5. Implementation

1. Create the warehouse and ETL staging schema


2. Develop the ETL programs
3. Create the logical to physical mapping in the repository
4. Build the end user dashboard and reports

6. Deployment

1. Install the Analytics reporting and the ETL tools.


2. Specific Setup and Configuration for OLTP, ETL, and data warehouse.
3. Sizing of the system and database
4. Performance Tuning and Optimization

7. Management and Maintenance of the system

1. Ongoing support of the end-users, including security, training, and enhancing the system.
2. You need to monitor the growth of the data.
