Star Schema The Complete Reference
()
About this ebook
Learn the best practices of dimensional design. Star Schema: The Complete Reference offers in-depth coverage of design principles and their underlying rationales. Organized around design concepts and illustrated with detailed examples, this is a step-by-step guidebook for beginners and a comprehensive resource for experts.
This all-inclusive volume begins with dimensional design fundamentals and shows how they fit into diverse data warehouse architectures, including those of W.H. Inmon and Ralph Kimball. The book progresses through a series of advanced techniques that help you address real-world complexity, maximize performance, and adapt to the requirements of BI and ETL software products. You are furnished with design tasks and deliverables that can be incorporated into any project, regardless of architecture or methodology.
- Master the fundamentals of star schema design and slow change processing
- Identify situations that call for multiple stars or cubes
- Ensure compatibility across subject areas as your data warehouse grows
- Accommodate repeating attributes, recursive hierarchies, and poor data quality
- Support conflicting requirements for historic data
- Handle variation within a business process and correlation of disparate activities
- Boost performance using derived schemas and aggregates
- Learn when it's appropriate to adjust designs for BI and ETL tools
Related to Star Schema The Complete Reference
Related ebooks
Modelling Business Information: Entity relationship and class modelling for Business Analysts Rating: 0 out of 5 stars0 ratingsThe Data Model Resource Book: Volume 3: Universal Patterns for Data Modeling Rating: 0 out of 5 stars0 ratingsDatabase Management for Business Leaders: Building and Using Data Solutions That Work for You Rating: 0 out of 5 stars0 ratingsData Analytics with Google Cloud Platform: Build Real Time Data Analytics on Google Cloud Platform Rating: 0 out of 5 stars0 ratingsData warehouse Complete Self-Assessment Guide Rating: 4 out of 5 stars4/5The Data Governance Imperative Rating: 0 out of 5 stars0 ratingsData Governance A Complete Guide - 2021 Edition Rating: 0 out of 5 stars0 ratingsTabular Modeling with SQL Server 2016 Analysis Services Cookbook Rating: 4 out of 5 stars4/5Expert Cube Development with SSAS Multidimensional Models Rating: 0 out of 5 stars0 ratingsMicrosoft Tabular Modeling Cookbook Rating: 0 out of 5 stars0 ratingsData Strategy: From definition to execution Rating: 0 out of 5 stars0 ratingsManaging Data Quality: A practical guide Rating: 0 out of 5 stars0 ratingsImplementing Power BI in the Enterprise Rating: 5 out of 5 stars5/5The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling Rating: 0 out of 5 stars0 ratingsTableau 10 Business Intelligence Cookbook Rating: 0 out of 5 stars0 ratingsAzure Storage, Streaming, and Batch Analytics: A guide for data engineers Rating: 0 out of 5 stars0 ratingsData Engineering on Azure Rating: 0 out of 5 stars0 ratingsDAX Patterns: Second Edition Rating: 5 out of 5 stars5/5The Data Model Resource Book, Volume 1: A Library of Universal Data Models for All Enterprises Rating: 0 out of 5 stars0 ratingsLearn Data Warehousing in 24 Hours Rating: 0 out of 5 stars0 ratingsPractical Data Analysis Rating: 4 out of 5 stars4/5Microsoft Power BI The Ultimate Step-By-Step Guide Rating: 0 out of 5 stars0 ratingsMastering Data Analysis with R Rating: 5 out of 5 stars5/5
Databases For You
Oracle DBA Mentor: Succeeding as an Oracle Database Administrator Rating: 0 out of 5 stars0 ratingsBlockchain Basics: A Non-Technical Introduction in 25 Steps Rating: 5 out of 5 stars5/5SQL QuickStart Guide: The Simplified Beginner's Guide to Managing, Analyzing, and Manipulating Data With SQL Rating: 4 out of 5 stars4/5Beginning Microsoft Power BI: A Practical Guide to Self-Service Data Analytics Rating: 0 out of 5 stars0 ratingsAccess 2019 For Dummies Rating: 0 out of 5 stars0 ratingsGrokking Algorithms: An illustrated guide for programmers and other curious people Rating: 4 out of 5 stars4/5COMPUTER SCIENCE FOR ROOKIES Rating: 0 out of 5 stars0 ratingsPractical Data Analysis Rating: 4 out of 5 stars4/5CompTIA DataSys+ Study Guide: Exam DS0-001 Rating: 0 out of 5 stars0 ratingsBehind Every Good Decision: How Anyone Can Use Business Analytics to Turn Data into Profitable Insight Rating: 5 out of 5 stars5/5Python Projects for Everyone Rating: 0 out of 5 stars0 ratingsLearn SQL in 24 Hours Rating: 5 out of 5 stars5/5Access 2016 For Dummies Rating: 0 out of 5 stars0 ratingsGo in Action Rating: 5 out of 5 stars5/5The Analytic Detective: Decipher Your Company’s Data Clues and Become Irreplaceable Rating: 0 out of 5 stars0 ratingsAccess for Beginners: Access Essentials, #1 Rating: 0 out of 5 stars0 ratingsLearn SQL Server Administration in a Month of Lunches Rating: 3 out of 5 stars3/5Learning Oracle 12c: A PL/SQL Approach Rating: 0 out of 5 stars0 ratingsAccess 2010 All-in-One For Dummies Rating: 4 out of 5 stars4/5Learn Git in a Month of Lunches Rating: 0 out of 5 stars0 ratingsAzure SQL Revealed: A Guide to the Cloud for SQL Server Professionals Rating: 0 out of 5 stars0 ratingsA Concise Guide to Object Orientated Programming Rating: 0 out of 5 stars0 ratingsGetting Started with SQL Server 2014 Administration Rating: 0 out of 5 stars0 ratingsPython and SQLite Development Rating: 0 out of 5 stars0 ratingsPractical SQL Rating: 4 out of 5 stars4/5SQL in 30 Pages Rating: 4 out of 5 stars4/5Learning PostgreSQL Rating: 1 out of 5 stars1/5
Reviews for Star Schema The Complete Reference
0 ratings0 reviews
Book preview
Star Schema The Complete Reference - Christopher Adamson
Star Schema
The Complete Reference™
About the Author
Christopher Adamson is an expert in star schema design and an experienced data warehousing practitioner. He speaks and teaches worldwide, writes about data warehousing, and provides data warehousing services through Oakton Software LLC.
As a consultant, Chris works with customers in all industries to develop data warehousing strategies, identify and prioritize projects, design and modify solutions, and manage implementations. His customers have included Fortune 500 companies, government agencies, and data warehousing software vendors.
Chris has taught dimensional design to thousands of students worldwide. He is a faculty member at The Data Warehousing Institute (TDWI), teaches at conference events, and provides onsite education.
Chris has penned many articles and books about data warehousing. He is author of Mastering Data Warehouse Aggregates (Wiley 2006) and co-author of Data Warehouse Design Solutions (Wiley 1998). He blogs about data warehousing and answers questions from readers at www.StarSchemaCentral.com.
Chris is the founder of Oakton Software LLC, and can be contacted through his web site at www.ChrisAdamson.net.
Star Schema
The Complete Reference™
Christopher Adamson
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher.
ISBN: 978-0-07-174433-1
MHID: 0-07-174433-9
The material in this eBook also appears in the print version of this title: ISBN: 978-0-07-174432-4, MHID: 0-07-174432-0.
All trademarks are trademarks of their respective owners. Rather than put a trademark symbol after every occurrence of a trademarked name, we use names in an editorial fashion only, and to the benefit of the trademark owner, with no intention of infringement of the trademark. Where such designations appear in this book, they have been printed with initial caps.
McGraw-Hill eBooks are available at special quantity discounts to use as premiums and sales promotions, or for use in corporate training programs. To contact a representative please e-mail us at bulksales@mcgraw-hill.com.
Information has been obtained by McGraw-Hill from sources believed to be reliable. However, because of the possibility of human or mechanical error by our sources, McGraw-Hill, or others, McGraw-Hill does not guarantee the accuracy, adequacy, or completeness of any information and is not responsible for any errors or omissions or the results obtained from the use of such information.
TERMS OF USE
This is a copyrighted work and The McGraw-Hill Companies, Inc. (McGraw-Hill
) and its licensors reserve all rights in and to the work. Use of this work is subject to these terms. Except as permitted under the Copyright Act of 1976 and the right to store and retrieve one copy of the work, you may not decompile, disassemble, reverse engineer, reproduce, modify, create derivative works based upon, transmit, distribute, disseminate, sell, publish or sublicense the work or any part of it without McGraw-Hill’s prior consent. You may use the work for your own noncommercial and personal use; any other use of the work is strictly prohibited. Your right to use the work may be terminated if you fail to comply with these terms.
THE WORK IS PROVIDED AS IS.
McGRAW-HILL AND ITS LICENSORS MAKE NO GUARANTEES OR WARRANTIES AS TO THE ACCURACY, ADEQUACY OR COMPLETENESS OF OR RESULTS TO BE OBTAINED FROM USING THE WORK, INCLUDING ANY INFORMATION THAT CAN BE ACCESSED THROUGH THE WORK VIA HYPERLINK OR OTHERWISE, AND EXPRESSLY DISCLAIM ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. McGraw-Hill and its licensors do not warrant or guarantee that the functions contained in the work will meet your requirements or that its operation will be uninterrupted or error free. Neither McGraw-Hill nor its licensors shall be liable to you or anyone else for any inaccuracy, error or omission, regardless of cause, in the work or for any damages resulting there from. McGraw-Hill has no responsibility for the content of any information accessed through the work. Under no circumstances shall McGraw-Hill and/or its licensors be liable for any indirect, incidental, special, punitive, consequential or similar damages that result from the use of or inability to use the work, even if any of them has been advised of the possibility of such damages. This limitation of liability shall apply to any claim or cause whatsoever whether such claim or cause arises in contract, tort or otherwise.
To Gladys, Justin, and Carter. You are the source of unending happiness, you lift me up when I need it, and I will always be proud of you.
Contents at a Glance
Part I Fundamentals
1 Analytic Databases and Dimensional Design
2 Data Warehouse Architectures
3 Stars and Cubes
Part II Multiple Stars
4 A Fact Table for Each Process
5 Conformed Dimensions
Part III Dimension Design
6 More on Dimension Tables
7 Hierarchies and Snowflakes
8 More Slow Change Techniques
9 Multi-Valued Dimensions and Bridges
10 Recursive Hierarchies and Bridges
Part IV Fact Table Design
11 Transactions, Snapshots, and Accumulating Snapshots
12 Factless Fact Tables
13 Type-Specific Stars
Part V Performance
14 Derived Schemas
15 Aggregates
Part VI Tools and Documentation
16 Design and Business Intelligence
17 Design and ETL
18 How to Design and Document a Dimensional Model Index
Contents
Acknowledgments
Introduction
Part I Fundamentals
Chapter 1 Analytic Databases and Dimensional Design
Dimensional Design
Purpose
Measurement and Context
Facts and Dimensions
Grouping Dimensions and Facts
The Star Schema
Dimension Tables
Keys and History
Fact Tables
Using a Star Schema
Querying Facts
Browsing Dimensions
Guiding Principles
Summary
Further Reading
Chapter 2 Data Warehouse Architectures
Inmon’s Corporate Information Factory
Kimball’s Dimensional Data Warehouse
Stand-Alone Data Marts
Architecture and Dimensional Design
Contrasting the Approaches
The Common Element
Terms Used in This Book
Summary
Further Reading
Chapter 3 Stars and Cubes
Dimension Table Features
Surrogate Keys and Natural Keys
Rich Set of Dimensions
Grouping Dimensions into Dimension Tables
Fact Table Features
Fact Tables and Processes
Capturing Facts
Grain
Sparsity
Degenerate Dimensions
Slowly Changing Dimensions
Type 1 Change
Type 2 Change
Choosing and Implementing Response Types
Cubes
Summary
Further Reading
Part II Multiple Stars
Chapter 4 A Fact Table for Each Process
Fact Tables and Business Processes
Facts that Have Different Timing
A Single Fact Table Causes Difficulties
Modeling in Separate Fact Tables
Facts that Have Different Grain
A Single Fact Table Causes Difficulties
Modeling in Separate Fact Tables
Analyzing Facts from More than One Fact Table
The Peril of Joining Fact Tables
Drilling Across
Drill-Across Implementations
Summary
Further Reading
Chapter 5 Conformed Dimensions
The Synergy of Multiple Stars
Dimensions and Drilling Across
What Causes Failure?
Identical Tables Not Required
Conformed Dimensions
Types of Dimensional Conformance
Planning Conformance
Architecture and Conformance
Dimensional Data Warehouse
Corporate Information Factory
Stand-Alone Data Marts
Summary
Further Reading
Part III Dimension Design
Chapter 6 More on Dimension Tables
Grouping Dimensions into Tables
Two Ways of Relating Dimension Attributes
When Struggling with Dimension Groupings
Breaking Up Large Dimensions
Splitting Dimension Tables Arbitrarily
Alternatives to Split Dimensions
Mini-Dimensions Alleviate ETL Bottlenecks and Excessive Growth
Dimension Roles and Aliasing
Avoiding the NULL
Problems Caused by NULL
Avoiding NULL Foreign Key Values
Uses for Special-Case Rows
Behavioral Dimensions
Converting Facts to Dimensions at Query Time
Designing and Using Behavioral Dimensions
Design Considerations for Behavioral Dimensions
Summary
Further Reading
Chapter 7 Hierarchies and Snowflakes
Drilling
The Concept of Drilling
The Reality of Drilling
Attribute Hierarchies and Drilling
The Attribute Hierarchy
Drilling Within an Attribute Hierarchy
Other Ways to Drill
Documenting Attribute Hierarchies
Snowflakes
Avoiding the Snowflake
Embracing the Snowflake
Outriggers
Repeating Groups
Eliminating Repeating Groups with Outriggers
Outriggers and Slow Change Processing
Summary
Further Reading
Chapter 8 More Slow Change Techniques
Time-Stamped Dimensions
Point-in-Time Status of a Dimension
The Time-Stamped Solution
Type 3 Changes
Study All Facts with Old or New Dimension Values
The Type 3 Solution
Hybrid Slow Changes
Conflicting Requirements
The Hybrid Response
Evaluating and Extending the Hybrid Approach
Summary
Further Reading
Chapter 9 Multi-Valued Dimensions and Bridges
Standard One-to-Many Relationships
Multi-Valued Dimensions
Simplifying the Relationship
Using a Bridge for Multi-Valued Dimensions
Multi-Valued Attributes
Simplifying the Multi-Valued Attribute
Using an Attribute Bridge
Summary
Further Reading
Chapter 10 Recursive Hierarchies and Bridges
Recursive Hierarchies
Rows Referring to Other Rows
The Reporting Challenge
Flattening a Recursive Hierarchy
A Flattened Hierarchy
Drawbacks of Flattening
When Flattening Works Best
The Hierarchy Bridge
Hierarchy Bridge Design
Using the Bridge
Double-Counting
Resolving the Many-to-Many Relationship
Potential Misuse
Changes and the Hierarchy Bridge
Type 1 Changes in the Dimension or Bridge
Type 2 Changes to the Dimension
Type 2 Changes to the Hierarchy
Variations on the Hierarchy Bridge
Embellishing the Bridge
Multiple Parents
Multiple Hierarchies
Summary
Further Reading
Part IV Fact Table Design
Chapter 11 Transactions, Snapshots, and Accumulating Snapshots
Transaction Fact Tables
Describing Events
Properties of Transaction Fact Tables
Snapshot Fact Tables
The Challenge: Studying Status
The Snapshot Model
Snapshot Considerations
Accumulating Snapshot Fact Tables
Challenge: Studying Elapsed Time Between Events
The Accumulating Snapshot
Accumulating Snapshot Considerations
Summary
Further Reading
Chapter 12 Factless Fact Tables
Events with No Facts
Nothing to Measure?
The Factless Fact Table
Using a Factless Fact Table
Adding a Fact
Conditions, Coverage, or Eligibility
Why Model Conditions?
Factless Fact Tables for Conditions
Comparing Activities and Conditions
Slowly Changing Dimensions and Conditions
Summary
Further Reading
Chapter 13 Type-Specific Stars
Type-Specific Attributes
Operational Systems
Analytic Systems
Core and Custom Stars
Core and Custom Dimension Tables
Core and Custom Fact Tables
Other Considerations
Using Generic Attributes
Generic Attributes
Using a Generic Design
Summary
Further Reading
Part V Performance
Chapter 14 Derived Schemas
Restructuring Dimensional Data
Uses for Derived Schemas
Derived Schemas Already Covered
The Cost of Derived Schemas
The Merged Fact Table
Precomputed Drill-Across Results
Simplified Process Comparison
Improved Performance
Supporting Tools that Cannot Drill Across
Single-Process Analysis
Including a Nonshared Dimension
The Pivoted Fact Table
The Need to Pivot Data
The Pivoted Advantage
Drawbacks to Pivoting
The Sliced Fact Table
Creating Slices of a Star
Uses for Sliced Fact Tables
Slices First
Set Operation Fact Tables
Comparing Two Sets of Data
Several Possible Comparisons
Choosing to Precompute Set Operations
Summary
Further Reading
Chapter 15 Aggregates
Fundamentals of Aggregates
Summarizing Base Data
Using Aggregates
Loading Aggregates
Cubes as Aggregates
Making Aggregates Invisible
Aggregate Navigation
Aggregate Generation
Alternative Summary Designs
Transformative Summaries May Also Be Useful
Single Table Designs Should Be Avoided
Summary
Further Reading
Part VI Tools and Documentation
Chapter 16 Design and Business Intelligence
Business Intelligence and SQL Generation
SQL Generators
The Limitations of SQL Generators
Guidelines for the Semantic Layer
Features to Avoid
Features to Use
Working with SQL-Generating BI Tools
Multiple Stars
Semi-Additivity
Browse Queries
Bridge Tables
Working with Cube-Based BI
Cube-Centric Business Intelligence
Auto-Generation of Cubes
Summary
Further Reading
Chapter 17 Design and ETL
The ETL Process
A Complex Task
Tools Used by the ETL Process
Architecture and the ETL Process
Loading a Star
A Top-Level Dependency
Loading a Dimension Table
Loading the Fact Table
Optimizing the Load
Changed Data Identification
Simplifying Processing
Cleansing Data
What Should Be Cleaned Up
Cleaning Up Dimensional Data
Facts with Invalid Details
Housekeeping Columns
Housekeeping Columns in Dimension Tables
Housekeeping and Fact Tables
Summary
Further Reading
Chapter 18 How to Design and Document a Dimensional Model
Dimensional Design and the Data Warehouse Life Cycle
The Strategic Importance of Dimensional Design
When to Do Dimensional Design
Design Activities
Planning the Design Effort
Conducting Interviews
Designing the Dimensional Model
Prioritizing Plans
Documenting the Results
Documenting a Dimensional Model
Requirements Documentation
Top-Level Design Documentation
Detailed Design Documentation
Logical vs. Physical
Summary
Further Reading
Index
Acknowledgments
This book would not have been possible without the help of many people. That list starts with people like you—people who have read my books, attended my classes, e-mailed me questions, and visited my blog. Over the years, your comments and questions have helped me shape this reference to dimensional design. I hope you will find it relevant, practical, and useful. Thanks for your feedback, and keep it coming.
This book was also made possible by my customers, past and present. Thank you for always providing new challenges. Meeting them together never fails to be a fulfilling experience.
Lisa McClain of McGraw-Hill made this book a reality. My thanks to her for championing the project, and to everyone who has worked on this book’s development, production, marketing, and distribution.
Several people have given me opportunities over the years, without which you would not be reading these words today. Thanks to Cameron Hendershot and Marilyn Feldman, Joe Warren, Argee Mahecha, Matt Comstock, Mike Venerable, and David Wells. Many colleagues have also influenced my trajectory, some unknowingly so. Thanks to Jesse Baldwin, Jim Hadley, Mike Lynott (who taught me about unboiling frogs
), Greg Jones, and Randall Porter. I am also grateful to Chris Date and Laura Reeves, who have given me useful advice.
Special thanks to Ralph Kimball, whose generous advice, support, and encouragement have been greatly valued, and whose contributions to this field are of immeasurable value to us all.
I love teaching and want to thank everyone at The Data Warehousing Institute who makes it possible. Wayne Eckerson, Paul Kautza, Yvonne Baho, Sandra Prescott, Brenda Woodbridge, Jennifer Noland, Nancy Hanlon, and Heather Flynn are a few of the people at TDWI to whom I am indebted. My gratitude also goes out to all others not named here who organize TDWI events with professionalism and attention to detail.
To my friends and family: your friendship and love are what is most valuable to me. I cannot name all of you here, but you know who you are. Special thanks and love to Mom, Sis, Jason, Aya, Papa, Magal, Shell, Eli, and Sofie. My love also to Dad who, though no longer here, continues to be an important part of who I am.
Introduction
Dimensional design is a pillar of every modern-day data warehouse architecture. Based on a disarmingly simple approach to process measurement, dimensional design enables extraordinarily powerful analytics. The products of dimensional design—the star schema, the snowflake, and the cube—can be found in virtually every data warehouse implementation.
Despite this popularity, relatively little is written about dimensional design. Although some outstanding works are available, most assume a particular architecture or philosophy—my own prior work included. Additionally, these treatments are organized around vertical industries or major business functions, making them difficult to refer to when faced with a specific design challenge.
This book is a complete reference to dimensional design—the first intended for any reader. The best practices presented in this volume cut across all architectures, including those espoused by W.H. Inmon and Ralph Kimball. Organized around the key concepts of dimensional design, this book provides full, in-depth treatment of each topic, sequenced in a logical progression from fundamentals through advanced techniques.
This book is designed for both beginners and experts in dimensional design. If you are a beginner, it is the ideal place to start. Each chapter provides you with best practices and their underlying rationale, detailed examples, and the criteria you need to make design decisions. If you are an expert, you will be able to use this guide as a reference. Whenever you face a particular design challenge, you will find a chapter or section dedicated to the topic.
Dimensional design enables profoundly powerful business analysis. A solid understanding of the underlying principles is essential, whether you are directly involved in design activities, work with dimensional data structures, manage projects, or fund implementations. Mastery of the techniques and best practices in this book will help you unleash the full potential of your data warehouse, regardless of architecture, implementation scope, or software tools.
About This Book
This book has been designed as a complete, in-depth reference for anyone who works with dimensional data—the star, the snowflake, or the cube.
• The content is organized into chapters and sections dedicated to the core concepts of dimensional design so you can find everything you need to know about a particular topic in one place.
• Each topic is treated comprehensively. Full explanations for best practices allow you to make informed design decisions based on operational realities.
• No assumptions are made about your data warehouse environment. The best practices here apply in all architectures, including those espoused by W.H. Inmon and Ralph Kimball.
• Specific software products are not referenced, but the ways in which your tools may influence design decisions are fully explored.
The result is a treatment that is comprehensive and useful, regardless of your level of experience, data warehouse architecture, or available tools.
Organized Around Core Concepts
This book is organized around the core concepts of dimensional modeling, rather than a series of business scenarios by vertical industry. Focusing on these concepts allows a complete treatment of each topic, without forcing you to flip back and forth between various business cases. Each topic is explored in depth, rather than spread across multiple chapters.
This comprehensive treatment of each concept allows Star Schema: The Complete Reference to serve as a useful resource. Experienced modelers will find what they need with a quick scan through the Table of Contents. Need to brush up on the implications of a snapshot design? Everything you need can be found in Chapter 11. Thinking about implementing a bridge table? It’s all there in Chapter 9. Need to implement a hybrid slow change? A complete discussion can be found in Chapter 8. Each chapter concludes with references to external treatments of the topic, should you wish to search for more examples.
For those new to dimensional design, the material has been sequenced so the book can be read cover to cover. The first three chapters explore fundamentals, and subsequent sections delve deeply into various aspects of dimensional design. Help on choosing where to start is provided at the end of this introduction.
Comprehensive and Practical, Not Dogmatic
While this book highlights a series of best practices, the underlying motivation is always fully explored. You will learn the reasons for these guidelines, and develop the ability to make informed decisions on how to apply them. The result is a practical approach to data warehouse design—one that is responsive to organizational and operational context, rather than independent of it.
Dimensional designers, for example, are often trained to record information at the lowest level of detail possible. The reasons for this guideline are fully explained in Chapter 3, along with situations where these reasons might not apply. Similarly, designers are always taught that different business processes deserve their own models, or stars. Chapter 4 explains why this is the case, and fully explores what happens when this guideline is relaxed.
Even when you stick to the best practices, there is no single right way
to model a particular business process. You will learn how each design option strikes a balance among business value, the required effort to construct reports, the complexity of the load process, and cost. Flattening a recursive hierarchy, for example, simplifies reporting and reduces development cost, but limits the power of the final solution; the alternatives are fully explored in Chapter 10. Derived schemas can make reporting easier and improve performance, but provide significant additional work to load data into the data warehouse, as described in Chapter 14.
Architecture-Neutral
This book makes no assumptions about your data warehouse architecture. The best practices outlined in these pages apply whether you follow W.H. Inmon’s Corporate Information Factory approach or Ralph Kimball’s dimensional data warehouse bus
approach, or simply build subject-area data marts. In each of these paradigms, there is a place for dimensional data. No matter how you put dimensional design to work, this book will allow you to make the most of it.
If you don’t know anything about these thought leaders or their recommended architectures, you will learn something about them in Chapter 2. There, you will find a high-level overview of various approaches, and information on how dimensional design fits into each. What you won’t find is an argument in favor of one approach over another. This book’s coverage of dimensional design is disentangled from such considerations. Anyone can use it.
Common Vocabulary
This book is designed to service any data warehouse architecture, but it is necessary to establish a common vocabulary. When it comes to dimensional design, that vocabulary comes from Ralph Kimball. By providing a way to talk about dimensional design, he has made a valuable contribution to the world of data warehousing, giving us terms like grain, conformance, and slowly changing dimensions. These and other terms can be found in his seminal work on dimensional design: The Data Warehouse Toolkit, Second Edition, by Ralph Kimball and Margy Ross (Wiley, 2002).
Wherever possible, this book makes use of terminology established by Kimball and Ross. Each term will be fully explained. However, it is not presumed that the reader adheres to Kimball’s approach to data warehousing. His approach is one of several architectures that make use of dimensional design. These architectures are discussed in Chapter 2; the principles in this book can be employed in any of these situations.
Product-Independent
This book makes no assumptions about specific hardware or software products in your data warehouse architecture. The dimensional techniques described are largely universal, and can be implemented using tools and technologies from a variety of vendors.
This is not to say that the software products used by your organization will not influence your dimensional design. To the contrary, they can, will, and should bear such influence. Although specific software products will not be discussed, the influence of various kinds of tools will be explored. These include database management systems (DBMSs), reporting or business intelligence (BI) software, and data integration or extract transform load (ETL) tools.
The capabilities of your RDBMS and reporting tools, for example, may drive the decision to produce a snowflake
design, rather than a star, as you will learn in Chapter 7. The capabilities of a business intelligence tool, or the sophistication of its users, may shape your approach to schema design issues outlined in Chapter 16. Development of the ETL process is complex, and may benefit from some design considerations discussed in Chapter 17.
Snowflakes and Cubes
Most of the examples in this book feature the star schema. The principles of dimensional modeling can also be used to design snowflakes and cubes. The best practices are largely the same, with a few exceptions that are highlighted and explored. The snowflake is featured in Chapter 7; the influence of business intelligence tools on this design option are discussed in Chapter 16. The cube is introduced in Chapter 3; many useful ways to pair stars with cubes are explored in Chapters 14, 15, and 16.
Who Should Read This Book
This book is written for you, the data warehouse practitioner. If your work in any way involves stars, snowflakes, or cubes, then this is your guide to all things dimensional. No assumptions are made regarding your skill level, role, or preferred architecture.
You may design dimensional models, work with dimensional data, manage activities, or pay the bills. Your role may fall into a variety of categories, including:
• Business Analysis
• Data Architecture / Star Schema Design
• Business Intelligence and Reporting
• Data Integration or ETL
• Database Administration
• Quality Assurance
• Data Administration
• Project Management
• Executive Leadership / IT Management
• Power User
It will be assumed that you have a basic familiarity with relational database concepts like tables, columns, and joins. There will be occasional examples of SQL code; these will be fully explained for the benefit of novice readers.
No assumptions are made about your level of experience. If you are new to dimensional design, you will probably want to read this book from cover to cover. Experienced practitioners may prefer to skip directly to areas of particular interest. The next section provides advice on how to proceed.
Using This Book
This book is designed to be used in two ways. You can read it cover to cover, or consult it as a reference. The book is divided into six parts, with chapters organized in a logical progression. When looking for a particular topic, you can scan the chapter and section headings to find what you need. Key features of each chapter help highlight important concepts and aid your understanding of them.
Key Features of Each Chapter
Each chapter in this book provides in-depth treatment of a core topic in dimensional design. Design techniques are evaluated, best practices are identified, and alternatives are fully explored. When you finish a chapter, you will understand the design principles, the reasons behind best practices, and how to evaluate possible design alternatives.
Some special elements are used to help highlight important information.
Tips
Tips
are used to highlight best practices:
TIP Whether reading the book cover to cover, or skipping around to pertinent passages, tips will be there to highlight the key lessons of each section.
These tips are like conclusions; they result from a full exploration of design alternatives. Reading the body of the chapter will bring to light the reasons behind these guidelines, along with situations in which you may wish to deviate.
Examples
Detailed examples are used to illustrate every design technique. Most examples are accompanied with schema designs, as well as instance diagrams that illustrate the content of important tables. These diagrams will help you understand design techniques; the text will refer to them, call out important features, highlight the sample data, and show how the technique works to solve the design challenge.
Notes
Some examples may feature design elements that are not central to the focus of the topic or section. Interesting but off-topic considerations are highlighted in Notes,
which may also direct you to other chapters where the topic is fully explored.
NOTE Notes are used to alert you to additional considerations dealt with elsewhere in the book, or to touch on topics not central to dimensional design.
Further Reading
Each chapter ends with a section on Further Reading.
Here you can get information on where to find more examples of the techniques presented in the chapter. Some of these references highlight refinements or alternatives to the techniques presented; others provide examples drawn from different business cases or industries. The majority of books cited focus on the Kimball approach to data warehouse design, but can be employed in other architectures as well.
Contents of This Book
Star Schema: The Complete Reference is divided into six parts, each of which focuses on a major category of dimensional design techniques. A summary of each section follows. For additional details, you may wish to scan the Table of Contents.
Part I: Fundamentals
Part I focuses on the fundamentals of dimensional design. It includes chapters that focus on process measurement, data warehouse architecture, and star schema design.
Chapter 1: Analytic Databases and Dimensional Design The fundamentals of process measurement are introduced in this chapter, including facts, dimensions, and the star schema.
Chapter 2: Data Warehouse Architectures Three very different architectures make use of the star schema, including those advocated by W.H. Inmon and Ralph Kimball. This chapter sorts through each architecture’s use of the star, and highlights how the same terms take on different meanings in each paradigm.
Chapter 3: Stars and Cubes In this chapter, you will learn the fundamentals of star schema design and slowly changing dimensions, and explore the different ways cubes may be incorporated into a data warehouse architecture.
Part II: Multiple Stars
Part II takes the first steps out of the neat and perfect world of the simple example and ventures into the real world of complex designs. It deals with a fundamental challenge that novice designers must learn to tackle: modeling different business processes as different stars.
Chapter 4: A Fact Table for Each Process This chapter teaches you how to identify discrete processes and provide separate stars for each. It also looks at how to produce analysis that crosses process boundaries.
Chapter 5: Conformed Dimensions The concept of conformed dimensions allows you to support and compare a variety of business processes, ensuring compatibility even if implementations make use of different technologies. Dimensional conformance is often considered to be of strategic importance, and can serve as the basis of a roadmap for incremental implementation.
Part III: Dimension Design
Part III dives deeply into advanced techniques that surround the dimensions of a dimensional design. It is divided into five chapters.
Chapter 6: More on Dimension Tables In this chapter, you will learn how to determine what dimensions to place in the same table, how to stem unmanageable growth in dimension tables, and how to handle information that is optional or unavailable.
Chapter 7: Hierarchies and Snowflakes This chapter explores the technique known as snowflaking, and explains how modeling attribute hierarchies may facilitate the implementation of reporting tools.
Chapter 8: More Slow Change Techniques This chapter goes beyond the basic type 1 and type 2 slow changes presented in Chapter 3, covering type 3 slow changes, time-stamping techniques, and hybrid slow change responses.
Chapter 9: Multi-Valued Dimensions and Bridges Sometimes, a dimension can take on multiple values with respect to a single fact, such as multiple salespeople collaborating on a single order. This chapter explores techniques for dealing with these situations, from simple flattening to the use of bridge tables.
Chapter 10: Recursive Hierarchies and Bridges Dimensions often embody recursive hierarchies, such as departments that report to other departments. This chapter shows how to flatten these hierarchies for a simple solution, and how to make use of a hierarchy bridge for powerful and flexible analysis.
Part IV: Fact Table Design
Part IV provides in-depth treatment of advanced features that center on fact tables. It is composed of three chapters.
Chapter 11: Transactions, Snapshots, and Accumulating Snapshots This chapter covers situations in which a standard transaction-based fact table falls short, and shows how periodic snapshots and accumulating snapshots can be put to use.
Chapter 12: Factless Fact Tables Sometimes fact tables contain no facts. In this chapter, you will learn about transaction-based factless fact tables, as well as factless fact tables that capture coverage or conditions.
Chapter 13: Type-Specific Stars This chapter looks at situations where subsets of data have different facts and dimensions, and shows how to make use of core and custom stars for a single process.
Part V: Performance
Any dimensional schema can be supplemented with additional structures that are intended to improve performance or simplify the reporting process. Part V looks at two kinds of supplemental data structures that support high performance.
Chapter 14: Derived Schemas This chapter teaches you to use derived schemas to provide enhanced performance, simplify report development, or address specific departmental needs.
Chapter 15: Aggregates This chapter shows how summary tables, or aggregates, can provide powerful performance boosts when implemented wisely.
Part VI: Tools and Documentation
The last part of this book looks at additional topics of interest for the developers of dimensional models.
Chapter 16: Design and Business Intelligence This chapter explores the influence of business intelligence (BI) tools on your dimensional design. It explores some common dimensional features that often strain BI tools, as well as techniques to mitigate any shortcomings.
Chapter 17: Design and ETL This chapter provides an overview of the process of loading the dimensional schema, and highlights numerous model features that can assist ETL developers in optimizing the process.
Chapter 18: How to Design and Document a Dimensional Model This provides standard tasks and deliverables that can be worked into your data warehouse life cycle, regardless of architecture.
Where to Start
As mentioned earlier, you can read this book from cover to cover, or skip directly to sections that deal with a topic of interest. How you make use of it will largely depend on your current skill level.
Beginners Those new to dimensional design should start by reading Part I: Fundamentals. Once you’ve finished that section, you can read the rest in sequence, or skip to any section of particular interest. If you are getting ready to start a project, you may wish to turn to the last chapter: How to Design and Document a Dimensional Model.
Novices If you have some dimensional design experience, you may wish to skip the fundamentals in Part I. However, you are encouraged to review the data warehouse terms described at the end of Chapter 2. Terms like data warehouse,
data mart,
and source system
can take on very different meanings in different architectures; Chapter 2 explains how these terms will be used in this book. If you’ve done most of your learning on the job,
you may also want to review Chapter 3 for standard dimensional terminology used in this book—terms such as grain,
natural key,
and slowly changing dimension
are all defined there.
Experts Experienced dimensional modelers can use this book as a reference, consulting it for detailed information on design issues as they arise. A quick scan through the Table of Contents will help you find any dimensional design topic of interest. You may also wish to read the book cover to cover, enhancing your mastery of dimensional design.
Regardless of how you approach the material in these pages, it will enable you to master the principles of dimensional design, unlocking the full potential of your data warehouse.
PART I
Fundamentals
CHAPTER 1
Analytic Databases and Dimensional Design
CHAPTER 2
Data Warehouse Architectures
CHAPTER 3
Stars and Cubes
CHAPTER 1
Analytic Databases and Dimensional Design
This book describes a set of powerful and effective techniques for the design of analytic databases. These techniques unleash business analytics in a very simple way: they model the measurement of business processes.
The dimensional model of a business process is made up of two components: measurements and their context. Known as facts and dimensions, these components are organized into a database design that facilitates a wide variety of analytic usage. Implemented in a relational database, the dimensional model is called a star schema. Implemented in a multidimensional database, it is known as a cube. If any part of your data warehouse includes a star schema or a cube, it leverages dimensional design.
This chapter introduces the basic concepts of dimensional design, which arose in response to the unique requirements of analytic systems. The concept of measurement is defined in terms of facts and dimensions, and translated into a database design, or star schema. This chapter shows how basic interaction with a star schema can support a wide variety of measurement scenarios, and defines two simple guiding principles for the development of dimensional solutions.
Dimensional Design
Information systems fall into two major categories: those that support the execution of business processes and those that support the analysis of business processes. The principles of dimensional design have evolved as a direct response to the unique requirements of analytic systems. The core of every dimensional model is a set of business metrics that captures how a process is evaluated, and a description of the context of every measurement.
Purpose
Analytic systems and operational systems serve fundamentally different purposes. An operational system supports the execution of a business process, while an analytic system supports the evaluation of the process. Their distinct purposes are reflected in contrasting usage profiles, which in turn suggest that different principles will guide their design.
Operational Systems
An operational system directly supports the execution of a business process. By capturing details about significant events or transactions, it constructs a record of activities. A sales system, for example, captures information about orders, shipments, and returns; a human resources system captures information about the hiring and promotion of employees; an accounting system captures information about the management of the financial assets and liabilities of the business. The activities recorded by these systems are sometimes known as transactions. The systems themselves are sometimes called online transaction processing (OLTP) systems, or transaction systems for short.
To facilitate the execution of a business process, operational systems must enable several types of database interaction, including inserts, updates, and deletes. The focus of these interactions is almost always atomic: a specific order, a shipment, a refund. These interactions will be highly predictable in nature. For example, an order entry system must provide for the management of lists of products, customers, and salespeople; the entering of orders; the printing of order summaries, invoices, and packing lists; and the tracking order status.
Because it is focused on process execution, the operational system is likely to update data as things change, and purge or archive data once its operational usefulness has ended. When a customer moves, for example, his or her old address is no longer useful for shipping products or sending invoices, so it is simply overwritten.
Implemented in a relational database, the optimal schema design for an operational system is widely accepted to be one that is in third normal form. The design may be depicted as an entity-relationship model, or ER model. Coupled with appropriate database technology, this design supports high-performance inserting, updating, and deleting of atomic transactions in a consistent and predictable manner. Developers refer to the characteristics of transaction processing as the ACID properties—atomic, consistent, isolated, and durable.
Analytic Systems
While the focus of the operational system is the execution of a business process, the analytic system supports the evaluation of the process. How are orders trending this month versus last? Where does this put us in comparison to our sales goals for the quarter? Is a particular marketing promotion having an impact on sales? Who are our best customers? These questions deal with the measurement of the overall orders process, rather than asking about individual orders.
Interaction with an analytic system takes place exclusively through queries that retrieve data about business processes; information is not created or modified. These queries can involve large numbers of transactions, in contrast to the operational system’s typical focus on individual transactions. Specific questions asked are less predictable, and more likely to change over time. Historic data will remain important to the analytic system long after its operational use has passed. The differences between operational systems and analytic systems are highlighted in Figure 1-1.
The principles of dimensional modeling address the unique requirements of analytic systems. A dimensional design is optimized for queries that may access large volumes of transactions, not just individual transactions. It is not burdened with supporting concurrent, high-performance updates. It supports the maintenance of historic data, even as the operational systems change or delete information.
Figure 1-1 Operational systems vs. analytic systems
Measurement and Context
The founding principle of dimensional design is disarmingly simple. Dimensional design supports analysis of a business process by modeling how it is measured.
Measurement is easy to discern, whether by listening to people talk or reading a report or chart. Consider the following business questions:
• What are gross margins by product category for January?
• What is the average account balance by education level?
• How many sick days were taken by marketing employees last year?
• What are the outstanding payables by vendor?
• What is the return rate by supplier?
Each of these questions centers on a business process: sales, account management, attendance, payables, return processing. These process-centric questions do not focus on individual activities or transactions. To answer them, it is necessary to look at a group of transactions.
Most importantly, each of these questions reveals something about how its respective business process is measured. The study of sales involves the measurement of gross margin. Financial institutions measure account balance. In human resources, they measure number of absences. The finance department measures payables. Purchasing managers watch the return quantities.
Without some kind of context, a measurement is meaningless. If you are told sales were $10,000,
there is not much you can do with this information. Is that sales of a single product, or many products? Does it represent a single transaction, or the company’s total sales from conception to date? Without some context, the measurement is useless.
As with the measurements themselves, context is revealed in business questions or reports. In the preceding questions, for example, gross margin is viewed in the context of product categories and time (the month of January). Sick days are viewed in the context of a department (marketing) and time (last year). Payables are viewed in the context of their status (outstanding) and vendor.
These two simple concepts, measurement and context, are the foundation of dimensional design. Every dimensional solution describes a process by capturing what is measured and the context in which the measurements are evaluated.
Facts and Dimensions
In a dimensional design, measurements are called facts, and context descriptors are called dimensions. Every dimensional design sorts information requirements into these categories. They may be identified within statements or questions, or found within report specifications. Sorting them into categories for facts and dimensions is easy, once you know what to look for.
Facts and Dimensions in Speech
In a spoken or written statement, the word by
is almost always followed by a dimension. Consider the question "What are order dollars by product category for January? Clearly, the person asking this question wants a separate measurement for each product category, as indicated by the words
by product category." Product category is a dimension.
Similarly, the word for
is also a good indicator of the presence of a dimension. In this case, the next word is likely to be an instance value of a dimension. For example, "What are order dollars by product category for January? Here,
for is followed by
January." January is a value or instance of a dimension. The dimension is not named, but we can infer that it is month.
Facts tend to be numeric in value, and people want to see them at various levels of detail. You can identify facts by looking for things that people want to roll up
or break out.
Again, look at the question What are order dollars by product category for January?
Here, order dollars will presumably be numeric, and they can be rolled up by categories.
Not everything that is numeric is a fact. Sometimes, a numeric data element is really a dimension. The key is to determine how it is being used. Is it something that can be specified at varying levels of detail? If so, it is a fact. Is it providing context? If so, it is a dimension. Show me margin dollars by order number,
contains two numeric elements. Margin dollars can be specified at various levels of detail, and the speaker is asking that it be rolled up by order number. Clearly, margin dollars
is a fact. What about order number? It too is numeric, but the speaker is not asking that order numbers be added together. Instead, order number is being used to specify the context for margin dollars. Order number is a dimension.
Other examples of numeric data elements behaving as dimensions include size, ages, phone numbers, document numbers, and unit amounts such as unit cost or unit price. The clue that these elements are dimensions lies in their use. The numbers themselves are not rolled up or down according to the question. Instead, they specify the context by which something else may be rolled up or down.
Facts and Dimensions in Reports
Facts and dimensions can also be recognized by considering how they would be used if they were stored in a database and shown on a report. Dimensions serve as filters,
or query predicates.
The report page shown in Figure 1-2, for example, is filtering detailed data for the western region and January 2009. These filters imply several dimensions: region, month, and year.
Figure 1-2 Sample page from an orders report
In a report, dimensions also serve to specify groupings or break levels,
or to identify levels of subtotals. This can be seen in the orders report, where category is used to break up the report into sections; individual categories are also used to label rows containing subtotals. These features are clues that category is a dimension.
Dimensions can also be identified by their use in controlling the aggregation of measurements. In the orders report, several measurements are specified for each product name and SKU. (SKU is a retail term that identifies an individual product, or stock keeping unit.
) Product and SKU are dimensions. Notice, too, that dimension values may be used to sort data.
Facts can also be spotted based on their use in queries or reports. Elements that are aggregated, summarized, or subtotaled are facts. Clearly, the report in Figure 1-2 contains the facts cost and order dollars. These numbers are being specified at various levels, including products, categories, and a grand total. Look again, and you may notice another fact: quantity ordered. Although it is not being subtotaled, this number is presumably being aggregated. In this case, it looks like quantity ordered by product is being aggregated from individual orders.
The style of the report in Figure 1-2 is sometimes referred to as tabular. Other reports may cross-reference facts and dimensions in matrix format, or even display facts graphically. The values being plotted are facts; axis values, bar labels, and qualifications are dimensions.
In some cases, it can be hard to tell whether a data element is a fact or a dimension. Sometimes, a single data element can be used in both ways. Chapter 6, More on Dimension Tables,
discusses behavioral attributes, which are dimensions that have been derived from facts. Later, Chapter 11, Transactions, Snapshots, and Accumulating Snapshots,
looks at techniques used to construct facts out of dimension values. It is also possible to find a process where there does not appear to be a clear measurement. Chapter 12, Factless Fact Tables,
describes how the factless fact table can be used to deal with this kind of situation. You also may find that some important business metrics are not so easily rolled up.
Chapter 3, Stars and Cubes,
considers nonadditive facts, while Chapter 11 discusses the curious case of semi-additive facts.
Grouping Dimensions and Facts
A dimensional design organizes facts and dimensions for storage in a database. It is common for a set of dimensions to share relationships to one another, independent of facts. These are grouped together in a single table to reflect their natural clustering. Similarly, facts that are available at the same level of detail are grouped together.
For example, suppose a study of the orders process reveals the facts and dimensions shown in Figure 1-3. In this example, products are being sold by a distributor to retail stores, so the customers are companies. Each of the facts listed in the column on the left may be rolled up according to the dimension values in the right-hand column.
Notice natural affinities exist among the dimension values. Some of them clearly belong together. For example, product and product description both seem to be associated with SKU. Salesperson name is determined by salesperson ID. Other dimensions are more distantly associated. Products, for example, have brands. Salespeople work in regions. Figure 1-4 shows how the facts and dimensions for the orders process might be grouped together.
Figure 1-3 Facts and dimensions for the orders process
The diagram in Figure 1-4 depicts measurement of the orders process. It identifies the important measurements for the process and shows the dimensional context in which those measurements are evaluated. The facts and dimensions in this diagram can be combined in a number of ways, answering a wide array of business questions. Take any fact (or facts), add the word by,
and then any combination of dimensions. The diagram can be easily understood by a businessperson. Things I measure
go in the middle; the ways I look at them
go in the surrounding boxes, loosely grouped or categorized.
The dimensional framework for the orders process in Figure 1-4 can easily be translated into a database design, or more specifically a dimensional design. A dimensional design organizes facts and dimensions for storage in a database management system. In a relational database management system (RDBMS), the design is referred to as a star schema. In a multidimensional database (MDB), the design is referred to as a cube. This chapter introduces the star schema; we will return to cubes in Chapter 3.
Figure 1-4 Sorting out facts and dimensions for the orders process
The Star Schema
A dimensional design for a relational database is called a star schema. Related dimensions are grouped as columns in dimension tables, and the facts are stored as columns in a fact table. The star schema gets its name from its appearance: when drawn with the fact table in the center, it looks like a star or asterisk. Figure 1-5 shows a simple star schema, based on the facts and dimensions for the orders process.
Dimension Tables
In a star schema, a dimension table contains columns representing dimensions. As discussed, these columns will provide context for facts. Sometimes, dimension table
is simply shortened to dimension.
This may give rise to confusion, since that word can also describe the individual columns. For example, when describing the schema shown in Figure 1-5, one might refer to the columns product, product_description, and SKU as dimensions, but it is also common practice to refer to the table itself as the product dimension.
Generally, you can tell from the context whether the word dimension
refers to a table or column. When it would not otherwise be clear, this book will use the terms dimension attribute and dimension table to make the distinction.
The dimension tables serve to provide the rich context needed for the study of facts. In queries and reports, the dimensions will be used to specify how facts will be rolled up—their level of aggregation. Dimension values may be used to filter reports. They will be used to provide context for each measurement, usually in the form of textual labels that precede facts on each row of a report. They may also be used to drive master-detail relationships, subtotaling, cross-tabulation, or sorts.
Figure 1-5 A simple star schema for the orders process
Readers with experience designing operational systems will notice that the dimension tables are not in third normal form. At first, many developers find this disconcerting. For example, there may be a strong desire to move brand, brand_code, and brand_manager into a separate table called brand, Remember, though, that a dimensional model serves a different purpose from an ER model. It is not necessary to isolate repeating values in an environment that does not support transaction processing. Designers do occasionally perform additional normalization within dimensions, although they usually avoid doing so. In such cases, the schema is referred to as a snowflake. The additional tables that result are sometimes called outriggers. Situations in which snowflaking may be useful are discussed in Chapter 7, Hierarchies and Snowflakes.
Because the dimensions in a dimension table are often loosely related, it can sometimes be difficult to decide how to group things. For example, should sales region be part of the salesperson dimension, or does it belong as a separate geographical dimension? Issues like this are tackled in Chapter 6.
Keys and History
In a star schema, each dimension table is given a surrogate key. This column is a unique identifier, created exclusively for the data warehouse. Surrogate keys are assigned and maintained as part of the process that loads the star schema. The surrogate key has no intrinsic meaning; it is typically an integer. Surrogate keys are sometimes referred to as warehouse keys. The surrogate key is the primary key of the dimension table.
In this book, surrogate keys will be easily identifiable by the suffix _key
in the column name. In Figure 1-5, for example, the surrogate key for the customer dimension is called customer_key, the surrogate key for the salesperson dimension is called salesperson_key, and so forth. Illustrations in this book will always list the surrogate key for a dimension table as its first attribute.
Dimension tables also contain key columns that uniquely identify something in an operational system. Examples in Figure 1-5 include customer_id, sku, and salesperson_id. In the operational systems, these columns identify specific customers, products, and salespeople, respectively. These key columns are referred to as natural keys.
The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not. For example, suppose that customer ABC Wholesalers is identified by customer_id 10711 in an operational system. If the customer changes its headquarters location, the operational system may simply overwrite the address for customer_id 10711. For analytic purposes, however, it may be useful to track the history of ABC Wholesalers. Since the star schema does not rely on customer_id to identify a unique row in the customer dimension, it is possible to store multiple versions of ABC Wholesalers, even though both have the same customer_id—10711. The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.
The term slowly changing dimension refers to the manner in which a dimensional schema responds to changes in a source system. Detailed examples will be explored in Chapter 3, Stars and Cubes.
Advanced techniques are explored in Chapter 8, More Slow Change Techniques.
Fact Tables
At the core of a star schema is the fact table. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. The simple orders star in Figure 1-5, for example, includes the facts quantity_ordered, cost_dollars, and order_dollars. It also includes surrogate keys that refer to products, salespeople, customers, and order dates.
Together, the foreign keys in a fact table are sometimes considered to identify a unique row in the fact table. This is certainly true in Figure 1-5, where each fact table row represents orders of a product sold by a salesperson to a customer on a given day. In other cases, however, the foreign keys in a fact table are not sufficient to identify a unique row. As we will see in Chapter 3, sometimes a fact table row can be uniquely identified by a subset of its foreign keys, or even by using some nonkey attributes.
Each row in the fact table stores facts at a specific level of detail. This level of detail is known as the fact table’s grain, a term that will be explored in Chapter 3. The information held in fact tables may be consumed at a variety of different levels, however, by aggregating the facts. In some data warehouse architectures, it is critical that the star schema capture information at the lowest level of detail possible. In other architectures, this is less important because a separate part of the data warehouse architecture is reserved for atomic data. These variations in data warehouse architecture are explored in Chapter 2, Data Warehouse Architectures.
Using a Star Schema
There is still much to be said about the fundamentals of dimensional design, a discussion that continues in Chapter 3, but enough background has now been provided to look at an important feature of the star schema: how it is actually used. Understanding the basic usage pattern of the star schema allows the dimensional designer to make intelligent choices.
Querying Facts
Most queries against a star schema follow a consistent pattern. One or more facts are requested, along with the dimensional attributes that provide the desired context. The facts will be summarized in accordance with the dimensions present in the query. Dimension values are also used to limit the scope of the query, serving as the basis for filters or constraints on the data to be fetched and aggregated.
A properly configured relational database is well equipped to respond to such a query, which is issued using Structured Query Language (SQL). Suppose that someone has asked to see a report showing order dollars by product category and product name during the month of January 2009. The orders star schema from Figure 1-5 can provide this information, even though order dollars is stored at a lower level of detail. The SQL query in Figure 1-6 produces the required results, summarizing tens of thousands of fact table rows.
The SELECT clause of the query indicates the dimensions that should appear in the query results (category and product), the fact that is requested (order dollars), and the manner in which it will be aggregated (through the SQL SUM() operation). The FROM clause specifies the star schema tables that are involved in the query.
The WHERE clause serves two purposes. First, it filters the query results based on the values of specific dimension columns (month and year). It also specifies the join relationships between tables in the query. In terms of processing time, joins are among the most expensive operations the database must perform; notice that in the case of a star schema, dimension attributes are always a maximum of one join away from facts. The GROUP BY clause specifies the context to which the fact will be aggregated by the relational database; the ORDER