Nothing Special   »   [go: up one dir, main page]

Discover millions of ebooks, audiobooks, and so much more with a free trial

Only $11.99/month after trial. Cancel anytime.

Star Schema The Complete Reference
Star Schema The Complete Reference
Star Schema The Complete Reference
Ebook1,090 pages12 hours

Star Schema The Complete Reference

Rating: 0 out of 5 stars

()

Read preview

About this ebook

The definitive guide to dimensional design for your data warehouse

Learn the best practices of dimensional design. Star Schema: The Complete Reference offers in-depth coverage of design principles and their underlying rationales. Organized around design concepts and illustrated with detailed examples, this is a step-by-step guidebook for beginners and a comprehensive resource for experts.

This all-inclusive volume begins with dimensional design fundamentals and shows how they fit into diverse data warehouse architectures, including those of W.H. Inmon and Ralph Kimball. The book progresses through a series of advanced techniques that help you address real-world complexity, maximize performance, and adapt to the requirements of BI and ETL software products. You are furnished with design tasks and deliverables that can be incorporated into any project, regardless of architecture or methodology.

  • Master the fundamentals of star schema design and slow change processing
  • Identify situations that call for multiple stars or cubes
  • Ensure compatibility across subject areas as your data warehouse grows
  • Accommodate repeating attributes, recursive hierarchies, and poor data quality
  • Support conflicting requirements for historic data
  • Handle variation within a business process and correlation of disparate activities
  • Boost performance using derived schemas and aggregates
  • Learn when it's appropriate to adjust designs for BI and ETL tools
LanguageEnglish
Release dateJul 22, 2010
ISBN9780071744331
Star Schema The Complete Reference

Related to Star Schema The Complete Reference

Related ebooks

Databases For You

View More

Related articles

Reviews for Star Schema The Complete Reference

Rating: 0 out of 5 stars
0 ratings

0 ratings0 reviews

What did you think?

Tap to rate

Review must be at least 10 words

    Book preview

    Star Schema The Complete Reference - Christopher Adamson

    Star Schema

    The Complete Reference™

    About the Author

    Christopher Adamson is an expert in star schema design and an experienced data warehousing practitioner. He speaks and teaches worldwide, writes about data warehousing, and provides data warehousing services through Oakton Software LLC.

    As a consultant, Chris works with customers in all industries to develop data warehousing strategies, identify and prioritize projects, design and modify solutions, and manage implementations. His customers have included Fortune 500 companies, government agencies, and data warehousing software vendors.

    Chris has taught dimensional design to thousands of students worldwide. He is a faculty member at The Data Warehousing Institute (TDWI), teaches at conference events, and provides onsite education.

    Chris has penned many articles and books about data warehousing. He is author of Mastering Data Warehouse Aggregates (Wiley 2006) and co-author of Data Warehouse Design Solutions (Wiley 1998). He blogs about data warehousing and answers questions from readers at www.StarSchemaCentral.com.

    Chris is the founder of Oakton Software LLC, and can be contacted through his web site at www.ChrisAdamson.net.

    Star Schema

    The Complete Reference™

    Christopher Adamson

    Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher.

    ISBN: 978-0-07-174433-1

    MHID: 0-07-174433-9

    The material in this eBook also appears in the print version of this title: ISBN: 978-0-07-174432-4, MHID: 0-07-174432-0.

    All trademarks are trademarks of their respective owners. Rather than put a trademark symbol after every occurrence of a trademarked name, we use names in an editorial fashion only, and to the benefit of the trademark owner, with no intention of infringement of the trademark. Where such designations appear in this book, they have been printed with initial caps.

    McGraw-Hill eBooks are available at special quantity discounts to use as premiums and sales promotions, or for use in corporate training programs. To contact a representative please e-mail us at bulksales@mcgraw-hill.com.

    Information has been obtained by McGraw-Hill from sources believed to be reliable. However, because of the possibility of human or mechanical error by our sources, McGraw-Hill, or others, McGraw-Hill does not guarantee the accuracy, adequacy, or completeness of any information and is not responsible for any errors or omissions or the results obtained from the use of such information.

    TERMS OF USE

    This is a copyrighted work and The McGraw-Hill Companies, Inc. (McGraw-Hill) and its licensors reserve all rights in and to the work. Use of this work is subject to these terms. Except as permitted under the Copyright Act of 1976 and the right to store and retrieve one copy of the work, you may not decompile, disassemble, reverse engineer, reproduce, modify, create derivative works based upon, transmit, distribute, disseminate, sell, publish or sublicense the work or any part of it without McGraw-Hill’s prior consent. You may use the work for your own noncommercial and personal use; any other use of the work is strictly prohibited. Your right to use the work may be terminated if you fail to comply with these terms.

    THE WORK IS PROVIDED AS IS. McGRAW-HILL AND ITS LICENSORS MAKE NO GUARANTEES OR WARRANTIES AS TO THE ACCURACY, ADEQUACY OR COMPLETENESS OF OR RESULTS TO BE OBTAINED FROM USING THE WORK, INCLUDING ANY INFORMATION THAT CAN BE ACCESSED THROUGH THE WORK VIA HYPERLINK OR OTHERWISE, AND EXPRESSLY DISCLAIM ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. McGraw-Hill and its licensors do not warrant or guarantee that the functions contained in the work will meet your requirements or that its operation will be uninterrupted or error free. Neither McGraw-Hill nor its licensors shall be liable to you or anyone else for any inaccuracy, error or omission, regardless of cause, in the work or for any damages resulting there from. McGraw-Hill has no responsibility for the content of any information accessed through the work. Under no circumstances shall McGraw-Hill and/or its licensors be liable for any indirect, incidental, special, punitive, consequential or similar damages that result from the use of or inability to use the work, even if any of them has been advised of the possibility of such damages. This limitation of liability shall apply to any claim or cause whatsoever whether such claim or cause arises in contract, tort or otherwise.

    To Gladys, Justin, and Carter. You are the source of unending happiness, you lift me up when I need it, and I will always be proud of you.

    Contents at a Glance

    Part I Fundamentals

    1 Analytic Databases and Dimensional Design

    2 Data Warehouse Architectures

    3 Stars and Cubes

    Part II Multiple Stars

    4 A Fact Table for Each Process

    5 Conformed Dimensions

    Part III Dimension Design

    6 More on Dimension Tables

    7 Hierarchies and Snowflakes

    8 More Slow Change Techniques

    9 Multi-Valued Dimensions and Bridges

    10 Recursive Hierarchies and Bridges

    Part IV Fact Table Design

    11 Transactions, Snapshots, and Accumulating Snapshots

    12 Factless Fact Tables

    13 Type-Specific Stars

    Part V Performance

    14 Derived Schemas

    15 Aggregates

    Part VI Tools and Documentation

    16 Design and Business Intelligence

    17 Design and ETL

    18 How to Design and Document a Dimensional Model Index

    Contents

    Acknowledgments

    Introduction

    Part I Fundamentals

    Chapter 1 Analytic Databases and Dimensional Design

    Dimensional Design

    Purpose

    Measurement and Context

    Facts and Dimensions

    Grouping Dimensions and Facts

    The Star Schema

    Dimension Tables

    Keys and History

    Fact Tables

    Using a Star Schema

    Querying Facts

    Browsing Dimensions

    Guiding Principles

    Summary

    Further Reading

    Chapter 2 Data Warehouse Architectures

    Inmon’s Corporate Information Factory

    Kimball’s Dimensional Data Warehouse

    Stand-Alone Data Marts

    Architecture and Dimensional Design

    Contrasting the Approaches

    The Common Element

    Terms Used in This Book

    Summary

    Further Reading

    Chapter 3 Stars and Cubes

    Dimension Table Features

    Surrogate Keys and Natural Keys

    Rich Set of Dimensions

    Grouping Dimensions into Dimension Tables

    Fact Table Features

    Fact Tables and Processes

    Capturing Facts

    Grain

    Sparsity

    Degenerate Dimensions

    Slowly Changing Dimensions

    Type 1 Change

    Type 2 Change

    Choosing and Implementing Response Types

    Cubes

    Summary

    Further Reading

    Part II Multiple Stars

    Chapter 4 A Fact Table for Each Process

    Fact Tables and Business Processes

    Facts that Have Different Timing

    A Single Fact Table Causes Difficulties

    Modeling in Separate Fact Tables

    Facts that Have Different Grain

    A Single Fact Table Causes Difficulties

    Modeling in Separate Fact Tables

    Analyzing Facts from More than One Fact Table

    The Peril of Joining Fact Tables

    Drilling Across

    Drill-Across Implementations

    Summary

    Further Reading

    Chapter 5 Conformed Dimensions

    The Synergy of Multiple Stars

    Dimensions and Drilling Across

    What Causes Failure?

    Identical Tables Not Required

    Conformed Dimensions

    Types of Dimensional Conformance

    Planning Conformance

    Architecture and Conformance

    Dimensional Data Warehouse

    Corporate Information Factory

    Stand-Alone Data Marts

    Summary

    Further Reading

    Part III Dimension Design

    Chapter 6 More on Dimension Tables

    Grouping Dimensions into Tables

    Two Ways of Relating Dimension Attributes

    When Struggling with Dimension Groupings

    Breaking Up Large Dimensions

    Splitting Dimension Tables Arbitrarily

    Alternatives to Split Dimensions

    Mini-Dimensions Alleviate ETL Bottlenecks and Excessive Growth

    Dimension Roles and Aliasing

    Avoiding the NULL

    Problems Caused by NULL

    Avoiding NULL Foreign Key Values

    Uses for Special-Case Rows

    Behavioral Dimensions

    Converting Facts to Dimensions at Query Time

    Designing and Using Behavioral Dimensions

    Design Considerations for Behavioral Dimensions

    Summary

    Further Reading

    Chapter 7 Hierarchies and Snowflakes

    Drilling

    The Concept of Drilling

    The Reality of Drilling

    Attribute Hierarchies and Drilling

    The Attribute Hierarchy

    Drilling Within an Attribute Hierarchy

    Other Ways to Drill

    Documenting Attribute Hierarchies

    Snowflakes

    Avoiding the Snowflake

    Embracing the Snowflake

    Outriggers

    Repeating Groups

    Eliminating Repeating Groups with Outriggers

    Outriggers and Slow Change Processing

    Summary

    Further Reading

    Chapter 8 More Slow Change Techniques

    Time-Stamped Dimensions

    Point-in-Time Status of a Dimension

    The Time-Stamped Solution

    Type 3 Changes

    Study All Facts with Old or New Dimension Values

    The Type 3 Solution

    Hybrid Slow Changes

    Conflicting Requirements

    The Hybrid Response

    Evaluating and Extending the Hybrid Approach

    Summary

    Further Reading

    Chapter 9 Multi-Valued Dimensions and Bridges

    Standard One-to-Many Relationships

    Multi-Valued Dimensions

    Simplifying the Relationship

    Using a Bridge for Multi-Valued Dimensions

    Multi-Valued Attributes

    Simplifying the Multi-Valued Attribute

    Using an Attribute Bridge

    Summary

    Further Reading

    Chapter 10 Recursive Hierarchies and Bridges

    Recursive Hierarchies

    Rows Referring to Other Rows

    The Reporting Challenge

    Flattening a Recursive Hierarchy

    A Flattened Hierarchy

    Drawbacks of Flattening

    When Flattening Works Best

    The Hierarchy Bridge

    Hierarchy Bridge Design

    Using the Bridge

    Double-Counting

    Resolving the Many-to-Many Relationship

    Potential Misuse

    Changes and the Hierarchy Bridge

    Type 1 Changes in the Dimension or Bridge

    Type 2 Changes to the Dimension

    Type 2 Changes to the Hierarchy

    Variations on the Hierarchy Bridge

    Embellishing the Bridge

    Multiple Parents

    Multiple Hierarchies

    Summary

    Further Reading

    Part IV Fact Table Design

    Chapter 11 Transactions, Snapshots, and Accumulating Snapshots

    Transaction Fact Tables

    Describing Events

    Properties of Transaction Fact Tables

    Snapshot Fact Tables

    The Challenge: Studying Status

    The Snapshot Model

    Snapshot Considerations

    Accumulating Snapshot Fact Tables

    Challenge: Studying Elapsed Time Between Events

    The Accumulating Snapshot

    Accumulating Snapshot Considerations

    Summary

    Further Reading

    Chapter 12 Factless Fact Tables

    Events with No Facts

    Nothing to Measure?

    The Factless Fact Table

    Using a Factless Fact Table

    Adding a Fact

    Conditions, Coverage, or Eligibility

    Why Model Conditions?

    Factless Fact Tables for Conditions

    Comparing Activities and Conditions

    Slowly Changing Dimensions and Conditions

    Summary

    Further Reading

    Chapter 13 Type-Specific Stars

    Type-Specific Attributes

    Operational Systems

    Analytic Systems

    Core and Custom Stars

    Core and Custom Dimension Tables

    Core and Custom Fact Tables

    Other Considerations

    Using Generic Attributes

    Generic Attributes

    Using a Generic Design

    Summary

    Further Reading

    Part V Performance

    Chapter 14 Derived Schemas

    Restructuring Dimensional Data

    Uses for Derived Schemas

    Derived Schemas Already Covered

    The Cost of Derived Schemas

    The Merged Fact Table

    Precomputed Drill-Across Results

    Simplified Process Comparison

    Improved Performance

    Supporting Tools that Cannot Drill Across

    Single-Process Analysis

    Including a Nonshared Dimension

    The Pivoted Fact Table

    The Need to Pivot Data

    The Pivoted Advantage

    Drawbacks to Pivoting

    The Sliced Fact Table

    Creating Slices of a Star

    Uses for Sliced Fact Tables

    Slices First

    Set Operation Fact Tables

    Comparing Two Sets of Data

    Several Possible Comparisons

    Choosing to Precompute Set Operations

    Summary

    Further Reading

    Chapter 15 Aggregates

    Fundamentals of Aggregates

    Summarizing Base Data

    Using Aggregates

    Loading Aggregates

    Cubes as Aggregates

    Making Aggregates Invisible

    Aggregate Navigation

    Aggregate Generation

    Alternative Summary Designs

    Transformative Summaries May Also Be Useful

    Single Table Designs Should Be Avoided

    Summary

    Further Reading

    Part VI Tools and Documentation

    Chapter 16 Design and Business Intelligence

    Business Intelligence and SQL Generation

    SQL Generators

    The Limitations of SQL Generators

    Guidelines for the Semantic Layer

    Features to Avoid

    Features to Use

    Working with SQL-Generating BI Tools

    Multiple Stars

    Semi-Additivity

    Browse Queries

    Bridge Tables

    Working with Cube-Based BI

    Cube-Centric Business Intelligence

    Auto-Generation of Cubes

    Summary

    Further Reading

    Chapter 17 Design and ETL

    The ETL Process

    A Complex Task

    Tools Used by the ETL Process

    Architecture and the ETL Process

    Loading a Star

    A Top-Level Dependency

    Loading a Dimension Table

    Loading the Fact Table

    Optimizing the Load

    Changed Data Identification

    Simplifying Processing

    Cleansing Data

    What Should Be Cleaned Up

    Cleaning Up Dimensional Data

    Facts with Invalid Details

    Housekeeping Columns

    Housekeeping Columns in Dimension Tables

    Housekeeping and Fact Tables

    Summary

    Further Reading

    Chapter 18 How to Design and Document a Dimensional Model

    Dimensional Design and the Data Warehouse Life Cycle

    The Strategic Importance of Dimensional Design

    When to Do Dimensional Design

    Design Activities

    Planning the Design Effort

    Conducting Interviews

    Designing the Dimensional Model

    Prioritizing Plans

    Documenting the Results

    Documenting a Dimensional Model

    Requirements Documentation

    Top-Level Design Documentation

    Detailed Design Documentation

    Logical vs. Physical

    Summary

    Further Reading

    Index

    Acknowledgments

    This book would not have been possible without the help of many people. That list starts with people like you—people who have read my books, attended my classes, e-mailed me questions, and visited my blog. Over the years, your comments and questions have helped me shape this reference to dimensional design. I hope you will find it relevant, practical, and useful. Thanks for your feedback, and keep it coming.

    This book was also made possible by my customers, past and present. Thank you for always providing new challenges. Meeting them together never fails to be a fulfilling experience.

    Lisa McClain of McGraw-Hill made this book a reality. My thanks to her for championing the project, and to everyone who has worked on this book’s development, production, marketing, and distribution.

    Several people have given me opportunities over the years, without which you would not be reading these words today. Thanks to Cameron Hendershot and Marilyn Feldman, Joe Warren, Argee Mahecha, Matt Comstock, Mike Venerable, and David Wells. Many colleagues have also influenced my trajectory, some unknowingly so. Thanks to Jesse Baldwin, Jim Hadley, Mike Lynott (who taught me about unboiling frogs), Greg Jones, and Randall Porter. I am also grateful to Chris Date and Laura Reeves, who have given me useful advice.

    Special thanks to Ralph Kimball, whose generous advice, support, and encouragement have been greatly valued, and whose contributions to this field are of immeasurable value to us all.

    I love teaching and want to thank everyone at The Data Warehousing Institute who makes it possible. Wayne Eckerson, Paul Kautza, Yvonne Baho, Sandra Prescott, Brenda Woodbridge, Jennifer Noland, Nancy Hanlon, and Heather Flynn are a few of the people at TDWI to whom I am indebted. My gratitude also goes out to all others not named here who organize TDWI events with professionalism and attention to detail.

    To my friends and family: your friendship and love are what is most valuable to me. I cannot name all of you here, but you know who you are. Special thanks and love to Mom, Sis, Jason, Aya, Papa, Magal, Shell, Eli, and Sofie. My love also to Dad who, though no longer here, continues to be an important part of who I am.

    Introduction

    Dimensional design is a pillar of every modern-day data warehouse architecture. Based on a disarmingly simple approach to process measurement, dimensional design enables extraordinarily powerful analytics. The products of dimensional design—the star schema, the snowflake, and the cube—can be found in virtually every data warehouse implementation.

    Despite this popularity, relatively little is written about dimensional design. Although some outstanding works are available, most assume a particular architecture or philosophy—my own prior work included. Additionally, these treatments are organized around vertical industries or major business functions, making them difficult to refer to when faced with a specific design challenge.

    This book is a complete reference to dimensional design—the first intended for any reader. The best practices presented in this volume cut across all architectures, including those espoused by W.H. Inmon and Ralph Kimball. Organized around the key concepts of dimensional design, this book provides full, in-depth treatment of each topic, sequenced in a logical progression from fundamentals through advanced techniques.

    This book is designed for both beginners and experts in dimensional design. If you are a beginner, it is the ideal place to start. Each chapter provides you with best practices and their underlying rationale, detailed examples, and the criteria you need to make design decisions. If you are an expert, you will be able to use this guide as a reference. Whenever you face a particular design challenge, you will find a chapter or section dedicated to the topic.

    Dimensional design enables profoundly powerful business analysis. A solid understanding of the underlying principles is essential, whether you are directly involved in design activities, work with dimensional data structures, manage projects, or fund implementations. Mastery of the techniques and best practices in this book will help you unleash the full potential of your data warehouse, regardless of architecture, implementation scope, or software tools.

    About This Book

    This book has been designed as a complete, in-depth reference for anyone who works with dimensional data—the star, the snowflake, or the cube.

    • The content is organized into chapters and sections dedicated to the core concepts of dimensional design so you can find everything you need to know about a particular topic in one place.

    • Each topic is treated comprehensively. Full explanations for best practices allow you to make informed design decisions based on operational realities.

    • No assumptions are made about your data warehouse environment. The best practices here apply in all architectures, including those espoused by W.H. Inmon and Ralph Kimball.

    • Specific software products are not referenced, but the ways in which your tools may influence design decisions are fully explored.

    The result is a treatment that is comprehensive and useful, regardless of your level of experience, data warehouse architecture, or available tools.

    Organized Around Core Concepts

    This book is organized around the core concepts of dimensional modeling, rather than a series of business scenarios by vertical industry. Focusing on these concepts allows a complete treatment of each topic, without forcing you to flip back and forth between various business cases. Each topic is explored in depth, rather than spread across multiple chapters.

    This comprehensive treatment of each concept allows Star Schema: The Complete Reference to serve as a useful resource. Experienced modelers will find what they need with a quick scan through the Table of Contents. Need to brush up on the implications of a snapshot design? Everything you need can be found in Chapter 11. Thinking about implementing a bridge table? It’s all there in Chapter 9. Need to implement a hybrid slow change? A complete discussion can be found in Chapter 8. Each chapter concludes with references to external treatments of the topic, should you wish to search for more examples.

    For those new to dimensional design, the material has been sequenced so the book can be read cover to cover. The first three chapters explore fundamentals, and subsequent sections delve deeply into various aspects of dimensional design. Help on choosing where to start is provided at the end of this introduction.

    Comprehensive and Practical, Not Dogmatic

    While this book highlights a series of best practices, the underlying motivation is always fully explored. You will learn the reasons for these guidelines, and develop the ability to make informed decisions on how to apply them. The result is a practical approach to data warehouse design—one that is responsive to organizational and operational context, rather than independent of it.

    Dimensional designers, for example, are often trained to record information at the lowest level of detail possible. The reasons for this guideline are fully explained in Chapter 3, along with situations where these reasons might not apply. Similarly, designers are always taught that different business processes deserve their own models, or stars. Chapter 4 explains why this is the case, and fully explores what happens when this guideline is relaxed.

    Even when you stick to the best practices, there is no single right way to model a particular business process. You will learn how each design option strikes a balance among business value, the required effort to construct reports, the complexity of the load process, and cost. Flattening a recursive hierarchy, for example, simplifies reporting and reduces development cost, but limits the power of the final solution; the alternatives are fully explored in Chapter 10. Derived schemas can make reporting easier and improve performance, but provide significant additional work to load data into the data warehouse, as described in Chapter 14.

    Architecture-Neutral

    This book makes no assumptions about your data warehouse architecture. The best practices outlined in these pages apply whether you follow W.H. Inmon’s Corporate Information Factory approach or Ralph Kimball’s dimensional data warehouse bus approach, or simply build subject-area data marts. In each of these paradigms, there is a place for dimensional data. No matter how you put dimensional design to work, this book will allow you to make the most of it.

    If you don’t know anything about these thought leaders or their recommended architectures, you will learn something about them in Chapter 2. There, you will find a high-level overview of various approaches, and information on how dimensional design fits into each. What you won’t find is an argument in favor of one approach over another. This book’s coverage of dimensional design is disentangled from such considerations. Anyone can use it.

    Common Vocabulary

    This book is designed to service any data warehouse architecture, but it is necessary to establish a common vocabulary. When it comes to dimensional design, that vocabulary comes from Ralph Kimball. By providing a way to talk about dimensional design, he has made a valuable contribution to the world of data warehousing, giving us terms like grain, conformance, and slowly changing dimensions. These and other terms can be found in his seminal work on dimensional design: The Data Warehouse Toolkit, Second Edition, by Ralph Kimball and Margy Ross (Wiley, 2002).

    Wherever possible, this book makes use of terminology established by Kimball and Ross. Each term will be fully explained. However, it is not presumed that the reader adheres to Kimball’s approach to data warehousing. His approach is one of several architectures that make use of dimensional design. These architectures are discussed in Chapter 2; the principles in this book can be employed in any of these situations.

    Product-Independent

    This book makes no assumptions about specific hardware or software products in your data warehouse architecture. The dimensional techniques described are largely universal, and can be implemented using tools and technologies from a variety of vendors.

    This is not to say that the software products used by your organization will not influence your dimensional design. To the contrary, they can, will, and should bear such influence. Although specific software products will not be discussed, the influence of various kinds of tools will be explored. These include database management systems (DBMSs), reporting or business intelligence (BI) software, and data integration or extract transform load (ETL) tools.

    The capabilities of your RDBMS and reporting tools, for example, may drive the decision to produce a snowflake design, rather than a star, as you will learn in Chapter 7. The capabilities of a business intelligence tool, or the sophistication of its users, may shape your approach to schema design issues outlined in Chapter 16. Development of the ETL process is complex, and may benefit from some design considerations discussed in Chapter 17.

    Snowflakes and Cubes

    Most of the examples in this book feature the star schema. The principles of dimensional modeling can also be used to design snowflakes and cubes. The best practices are largely the same, with a few exceptions that are highlighted and explored. The snowflake is featured in Chapter 7; the influence of business intelligence tools on this design option are discussed in Chapter 16. The cube is introduced in Chapter 3; many useful ways to pair stars with cubes are explored in Chapters 14, 15, and 16.

    Who Should Read This Book

    This book is written for you, the data warehouse practitioner. If your work in any way involves stars, snowflakes, or cubes, then this is your guide to all things dimensional. No assumptions are made regarding your skill level, role, or preferred architecture.

    You may design dimensional models, work with dimensional data, manage activities, or pay the bills. Your role may fall into a variety of categories, including:

    • Business Analysis

    • Data Architecture / Star Schema Design

    • Business Intelligence and Reporting

    • Data Integration or ETL

    • Database Administration

    • Quality Assurance

    • Data Administration

    • Project Management

    • Executive Leadership / IT Management

    • Power User

    It will be assumed that you have a basic familiarity with relational database concepts like tables, columns, and joins. There will be occasional examples of SQL code; these will be fully explained for the benefit of novice readers.

    No assumptions are made about your level of experience. If you are new to dimensional design, you will probably want to read this book from cover to cover. Experienced practitioners may prefer to skip directly to areas of particular interest. The next section provides advice on how to proceed.

    Using This Book

    This book is designed to be used in two ways. You can read it cover to cover, or consult it as a reference. The book is divided into six parts, with chapters organized in a logical progression. When looking for a particular topic, you can scan the chapter and section headings to find what you need. Key features of each chapter help highlight important concepts and aid your understanding of them.

    Key Features of Each Chapter

    Each chapter in this book provides in-depth treatment of a core topic in dimensional design. Design techniques are evaluated, best practices are identified, and alternatives are fully explored. When you finish a chapter, you will understand the design principles, the reasons behind best practices, and how to evaluate possible design alternatives.

    Some special elements are used to help highlight important information.

    Tips

    Tips are used to highlight best practices:


    TIP Whether reading the book cover to cover, or skipping around to pertinent passages, tips will be there to highlight the key lessons of each section.

    These tips are like conclusions; they result from a full exploration of design alternatives. Reading the body of the chapter will bring to light the reasons behind these guidelines, along with situations in which you may wish to deviate.

    Examples

    Detailed examples are used to illustrate every design technique. Most examples are accompanied with schema designs, as well as instance diagrams that illustrate the content of important tables. These diagrams will help you understand design techniques; the text will refer to them, call out important features, highlight the sample data, and show how the technique works to solve the design challenge.

    Notes

    Some examples may feature design elements that are not central to the focus of the topic or section. Interesting but off-topic considerations are highlighted in Notes, which may also direct you to other chapters where the topic is fully explored.


    NOTE Notes are used to alert you to additional considerations dealt with elsewhere in the book, or to touch on topics not central to dimensional design.

    Further Reading

    Each chapter ends with a section on Further Reading. Here you can get information on where to find more examples of the techniques presented in the chapter. Some of these references highlight refinements or alternatives to the techniques presented; others provide examples drawn from different business cases or industries. The majority of books cited focus on the Kimball approach to data warehouse design, but can be employed in other architectures as well.

    Contents of This Book

    Star Schema: The Complete Reference is divided into six parts, each of which focuses on a major category of dimensional design techniques. A summary of each section follows. For additional details, you may wish to scan the Table of Contents.

    Part I: Fundamentals

    Part I focuses on the fundamentals of dimensional design. It includes chapters that focus on process measurement, data warehouse architecture, and star schema design.

    Chapter 1: Analytic Databases and Dimensional Design The fundamentals of process measurement are introduced in this chapter, including facts, dimensions, and the star schema.

    Chapter 2: Data Warehouse Architectures Three very different architectures make use of the star schema, including those advocated by W.H. Inmon and Ralph Kimball. This chapter sorts through each architecture’s use of the star, and highlights how the same terms take on different meanings in each paradigm.

    Chapter 3: Stars and Cubes In this chapter, you will learn the fundamentals of star schema design and slowly changing dimensions, and explore the different ways cubes may be incorporated into a data warehouse architecture.

    Part II: Multiple Stars

    Part II takes the first steps out of the neat and perfect world of the simple example and ventures into the real world of complex designs. It deals with a fundamental challenge that novice designers must learn to tackle: modeling different business processes as different stars.

    Chapter 4: A Fact Table for Each Process This chapter teaches you how to identify discrete processes and provide separate stars for each. It also looks at how to produce analysis that crosses process boundaries.

    Chapter 5: Conformed Dimensions The concept of conformed dimensions allows you to support and compare a variety of business processes, ensuring compatibility even if implementations make use of different technologies. Dimensional conformance is often considered to be of strategic importance, and can serve as the basis of a roadmap for incremental implementation.

    Part III: Dimension Design

    Part III dives deeply into advanced techniques that surround the dimensions of a dimensional design. It is divided into five chapters.

    Chapter 6: More on Dimension Tables In this chapter, you will learn how to determine what dimensions to place in the same table, how to stem unmanageable growth in dimension tables, and how to handle information that is optional or unavailable.

    Chapter 7: Hierarchies and Snowflakes This chapter explores the technique known as snowflaking, and explains how modeling attribute hierarchies may facilitate the implementation of reporting tools.

    Chapter 8: More Slow Change Techniques This chapter goes beyond the basic type 1 and type 2 slow changes presented in Chapter 3, covering type 3 slow changes, time-stamping techniques, and hybrid slow change responses.

    Chapter 9: Multi-Valued Dimensions and Bridges Sometimes, a dimension can take on multiple values with respect to a single fact, such as multiple salespeople collaborating on a single order. This chapter explores techniques for dealing with these situations, from simple flattening to the use of bridge tables.

    Chapter 10: Recursive Hierarchies and Bridges Dimensions often embody recursive hierarchies, such as departments that report to other departments. This chapter shows how to flatten these hierarchies for a simple solution, and how to make use of a hierarchy bridge for powerful and flexible analysis.

    Part IV: Fact Table Design

    Part IV provides in-depth treatment of advanced features that center on fact tables. It is composed of three chapters.

    Chapter 11: Transactions, Snapshots, and Accumulating Snapshots This chapter covers situations in which a standard transaction-based fact table falls short, and shows how periodic snapshots and accumulating snapshots can be put to use.

    Chapter 12: Factless Fact Tables Sometimes fact tables contain no facts. In this chapter, you will learn about transaction-based factless fact tables, as well as factless fact tables that capture coverage or conditions.

    Chapter 13: Type-Specific Stars This chapter looks at situations where subsets of data have different facts and dimensions, and shows how to make use of core and custom stars for a single process.

    Part V: Performance

    Any dimensional schema can be supplemented with additional structures that are intended to improve performance or simplify the reporting process. Part V looks at two kinds of supplemental data structures that support high performance.

    Chapter 14: Derived Schemas This chapter teaches you to use derived schemas to provide enhanced performance, simplify report development, or address specific departmental needs.

    Chapter 15: Aggregates This chapter shows how summary tables, or aggregates, can provide powerful performance boosts when implemented wisely.

    Part VI: Tools and Documentation

    The last part of this book looks at additional topics of interest for the developers of dimensional models.

    Chapter 16: Design and Business Intelligence This chapter explores the influence of business intelligence (BI) tools on your dimensional design. It explores some common dimensional features that often strain BI tools, as well as techniques to mitigate any shortcomings.

    Chapter 17: Design and ETL This chapter provides an overview of the process of loading the dimensional schema, and highlights numerous model features that can assist ETL developers in optimizing the process.

    Chapter 18: How to Design and Document a Dimensional Model This provides standard tasks and deliverables that can be worked into your data warehouse life cycle, regardless of architecture.

    Where to Start

    As mentioned earlier, you can read this book from cover to cover, or skip directly to sections that deal with a topic of interest. How you make use of it will largely depend on your current skill level.

    Beginners Those new to dimensional design should start by reading Part I: Fundamentals. Once you’ve finished that section, you can read the rest in sequence, or skip to any section of particular interest. If you are getting ready to start a project, you may wish to turn to the last chapter: How to Design and Document a Dimensional Model.

    Novices If you have some dimensional design experience, you may wish to skip the fundamentals in Part I. However, you are encouraged to review the data warehouse terms described at the end of Chapter 2. Terms like data warehouse, data mart, and source system can take on very different meanings in different architectures; Chapter 2 explains how these terms will be used in this book. If you’ve done most of your learning on the job, you may also want to review Chapter 3 for standard dimensional terminology used in this book—terms such as grain, natural key, and slowly changing dimension are all defined there.

    Experts Experienced dimensional modelers can use this book as a reference, consulting it for detailed information on design issues as they arise. A quick scan through the Table of Contents will help you find any dimensional design topic of interest. You may also wish to read the book cover to cover, enhancing your mastery of dimensional design.

    Regardless of how you approach the material in these pages, it will enable you to master the principles of dimensional design, unlocking the full potential of your data warehouse.

    PART I

    Fundamentals

    CHAPTER 1

    Analytic Databases and Dimensional Design

    CHAPTER 2

    Data Warehouse Architectures

    CHAPTER 3

    Stars and Cubes

    CHAPTER 1

    Analytic Databases and Dimensional Design

    This book describes a set of powerful and effective techniques for the design of analytic databases. These techniques unleash business analytics in a very simple way: they model the measurement of business processes.

    The dimensional model of a business process is made up of two components: measurements and their context. Known as facts and dimensions, these components are organized into a database design that facilitates a wide variety of analytic usage. Implemented in a relational database, the dimensional model is called a star schema. Implemented in a multidimensional database, it is known as a cube. If any part of your data warehouse includes a star schema or a cube, it leverages dimensional design.

    This chapter introduces the basic concepts of dimensional design, which arose in response to the unique requirements of analytic systems. The concept of measurement is defined in terms of facts and dimensions, and translated into a database design, or star schema. This chapter shows how basic interaction with a star schema can support a wide variety of measurement scenarios, and defines two simple guiding principles for the development of dimensional solutions.

    Dimensional Design

    Information systems fall into two major categories: those that support the execution of business processes and those that support the analysis of business processes. The principles of dimensional design have evolved as a direct response to the unique requirements of analytic systems. The core of every dimensional model is a set of business metrics that captures how a process is evaluated, and a description of the context of every measurement.

    Purpose

    Analytic systems and operational systems serve fundamentally different purposes. An operational system supports the execution of a business process, while an analytic system supports the evaluation of the process. Their distinct purposes are reflected in contrasting usage profiles, which in turn suggest that different principles will guide their design.

    Operational Systems

    An operational system directly supports the execution of a business process. By capturing details about significant events or transactions, it constructs a record of activities. A sales system, for example, captures information about orders, shipments, and returns; a human resources system captures information about the hiring and promotion of employees; an accounting system captures information about the management of the financial assets and liabilities of the business. The activities recorded by these systems are sometimes known as transactions. The systems themselves are sometimes called online transaction processing (OLTP) systems, or transaction systems for short.

    To facilitate the execution of a business process, operational systems must enable several types of database interaction, including inserts, updates, and deletes. The focus of these interactions is almost always atomic: a specific order, a shipment, a refund. These interactions will be highly predictable in nature. For example, an order entry system must provide for the management of lists of products, customers, and salespeople; the entering of orders; the printing of order summaries, invoices, and packing lists; and the tracking order status.

    Because it is focused on process execution, the operational system is likely to update data as things change, and purge or archive data once its operational usefulness has ended. When a customer moves, for example, his or her old address is no longer useful for shipping products or sending invoices, so it is simply overwritten.

    Implemented in a relational database, the optimal schema design for an operational system is widely accepted to be one that is in third normal form. The design may be depicted as an entity-relationship model, or ER model. Coupled with appropriate database technology, this design supports high-performance inserting, updating, and deleting of atomic transactions in a consistent and predictable manner. Developers refer to the characteristics of transaction processing as the ACID properties—atomic, consistent, isolated, and durable.

    Analytic Systems

    While the focus of the operational system is the execution of a business process, the analytic system supports the evaluation of the process. How are orders trending this month versus last? Where does this put us in comparison to our sales goals for the quarter? Is a particular marketing promotion having an impact on sales? Who are our best customers? These questions deal with the measurement of the overall orders process, rather than asking about individual orders.

    Interaction with an analytic system takes place exclusively through queries that retrieve data about business processes; information is not created or modified. These queries can involve large numbers of transactions, in contrast to the operational system’s typical focus on individual transactions. Specific questions asked are less predictable, and more likely to change over time. Historic data will remain important to the analytic system long after its operational use has passed. The differences between operational systems and analytic systems are highlighted in Figure 1-1.

    The principles of dimensional modeling address the unique requirements of analytic systems. A dimensional design is optimized for queries that may access large volumes of transactions, not just individual transactions. It is not burdened with supporting concurrent, high-performance updates. It supports the maintenance of historic data, even as the operational systems change or delete information.

    Figure 1-1 Operational systems vs. analytic systems

    Measurement and Context

    The founding principle of dimensional design is disarmingly simple. Dimensional design supports analysis of a business process by modeling how it is measured.

    Measurement is easy to discern, whether by listening to people talk or reading a report or chart. Consider the following business questions:

    • What are gross margins by product category for January?

    • What is the average account balance by education level?

    • How many sick days were taken by marketing employees last year?

    • What are the outstanding payables by vendor?

    • What is the return rate by supplier?

    Each of these questions centers on a business process: sales, account management, attendance, payables, return processing. These process-centric questions do not focus on individual activities or transactions. To answer them, it is necessary to look at a group of transactions.

    Most importantly, each of these questions reveals something about how its respective business process is measured. The study of sales involves the measurement of gross margin. Financial institutions measure account balance. In human resources, they measure number of absences. The finance department measures payables. Purchasing managers watch the return quantities.

    Without some kind of context, a measurement is meaningless. If you are told sales were $10,000, there is not much you can do with this information. Is that sales of a single product, or many products? Does it represent a single transaction, or the company’s total sales from conception to date? Without some context, the measurement is useless.

    As with the measurements themselves, context is revealed in business questions or reports. In the preceding questions, for example, gross margin is viewed in the context of product categories and time (the month of January). Sick days are viewed in the context of a department (marketing) and time (last year). Payables are viewed in the context of their status (outstanding) and vendor.

    These two simple concepts, measurement and context, are the foundation of dimensional design. Every dimensional solution describes a process by capturing what is measured and the context in which the measurements are evaluated.

    Facts and Dimensions

    In a dimensional design, measurements are called facts, and context descriptors are called dimensions. Every dimensional design sorts information requirements into these categories. They may be identified within statements or questions, or found within report specifications. Sorting them into categories for facts and dimensions is easy, once you know what to look for.

    Facts and Dimensions in Speech

    In a spoken or written statement, the word by is almost always followed by a dimension. Consider the question "What are order dollars by product category for January? Clearly, the person asking this question wants a separate measurement for each product category, as indicated by the words by product category." Product category is a dimension.

    Similarly, the word for is also a good indicator of the presence of a dimension. In this case, the next word is likely to be an instance value of a dimension. For example, "What are order dollars by product category for January? Here, for is followed by January." January is a value or instance of a dimension. The dimension is not named, but we can infer that it is month.

    Facts tend to be numeric in value, and people want to see them at various levels of detail. You can identify facts by looking for things that people want to roll up or break out. Again, look at the question What are order dollars by product category for January? Here, order dollars will presumably be numeric, and they can be rolled up by categories.

    Not everything that is numeric is a fact. Sometimes, a numeric data element is really a dimension. The key is to determine how it is being used. Is it something that can be specified at varying levels of detail? If so, it is a fact. Is it providing context? If so, it is a dimension. Show me margin dollars by order number, contains two numeric elements. Margin dollars can be specified at various levels of detail, and the speaker is asking that it be rolled up by order number. Clearly, margin dollars is a fact. What about order number? It too is numeric, but the speaker is not asking that order numbers be added together. Instead, order number is being used to specify the context for margin dollars. Order number is a dimension.

    Other examples of numeric data elements behaving as dimensions include size, ages, phone numbers, document numbers, and unit amounts such as unit cost or unit price. The clue that these elements are dimensions lies in their use. The numbers themselves are not rolled up or down according to the question. Instead, they specify the context by which something else may be rolled up or down.

    Facts and Dimensions in Reports

    Facts and dimensions can also be recognized by considering how they would be used if they were stored in a database and shown on a report. Dimensions serve as filters, or query predicates. The report page shown in Figure 1-2, for example, is filtering detailed data for the western region and January 2009. These filters imply several dimensions: region, month, and year.

    Figure 1-2 Sample page from an orders report

    In a report, dimensions also serve to specify groupings or break levels, or to identify levels of subtotals. This can be seen in the orders report, where category is used to break up the report into sections; individual categories are also used to label rows containing subtotals. These features are clues that category is a dimension.

    Dimensions can also be identified by their use in controlling the aggregation of measurements. In the orders report, several measurements are specified for each product name and SKU. (SKU is a retail term that identifies an individual product, or stock keeping unit.) Product and SKU are dimensions. Notice, too, that dimension values may be used to sort data.

    Facts can also be spotted based on their use in queries or reports. Elements that are aggregated, summarized, or subtotaled are facts. Clearly, the report in Figure 1-2 contains the facts cost and order dollars. These numbers are being specified at various levels, including products, categories, and a grand total. Look again, and you may notice another fact: quantity ordered. Although it is not being subtotaled, this number is presumably being aggregated. In this case, it looks like quantity ordered by product is being aggregated from individual orders.

    The style of the report in Figure 1-2 is sometimes referred to as tabular. Other reports may cross-reference facts and dimensions in matrix format, or even display facts graphically. The values being plotted are facts; axis values, bar labels, and qualifications are dimensions.

    In some cases, it can be hard to tell whether a data element is a fact or a dimension. Sometimes, a single data element can be used in both ways. Chapter 6, More on Dimension Tables, discusses behavioral attributes, which are dimensions that have been derived from facts. Later, Chapter 11, Transactions, Snapshots, and Accumulating Snapshots, looks at techniques used to construct facts out of dimension values. It is also possible to find a process where there does not appear to be a clear measurement. Chapter 12, Factless Fact Tables, describes how the factless fact table can be used to deal with this kind of situation. You also may find that some important business metrics are not so easily rolled up. Chapter 3, Stars and Cubes, considers nonadditive facts, while Chapter 11 discusses the curious case of semi-additive facts.

    Grouping Dimensions and Facts

    A dimensional design organizes facts and dimensions for storage in a database. It is common for a set of dimensions to share relationships to one another, independent of facts. These are grouped together in a single table to reflect their natural clustering. Similarly, facts that are available at the same level of detail are grouped together.

    For example, suppose a study of the orders process reveals the facts and dimensions shown in Figure 1-3. In this example, products are being sold by a distributor to retail stores, so the customers are companies. Each of the facts listed in the column on the left may be rolled up according to the dimension values in the right-hand column.

    Notice natural affinities exist among the dimension values. Some of them clearly belong together. For example, product and product description both seem to be associated with SKU. Salesperson name is determined by salesperson ID. Other dimensions are more distantly associated. Products, for example, have brands. Salespeople work in regions. Figure 1-4 shows how the facts and dimensions for the orders process might be grouped together.

    Figure 1-3 Facts and dimensions for the orders process

    The diagram in Figure 1-4 depicts measurement of the orders process. It identifies the important measurements for the process and shows the dimensional context in which those measurements are evaluated. The facts and dimensions in this diagram can be combined in a number of ways, answering a wide array of business questions. Take any fact (or facts), add the word by, and then any combination of dimensions. The diagram can be easily understood by a businessperson. Things I measure go in the middle; the ways I look at them go in the surrounding boxes, loosely grouped or categorized.

    The dimensional framework for the orders process in Figure 1-4 can easily be translated into a database design, or more specifically a dimensional design. A dimensional design organizes facts and dimensions for storage in a database management system. In a relational database management system (RDBMS), the design is referred to as a star schema. In a multidimensional database (MDB), the design is referred to as a cube. This chapter introduces the star schema; we will return to cubes in Chapter 3.

    Figure 1-4 Sorting out facts and dimensions for the orders process

    The Star Schema

    A dimensional design for a relational database is called a star schema. Related dimensions are grouped as columns in dimension tables, and the facts are stored as columns in a fact table. The star schema gets its name from its appearance: when drawn with the fact table in the center, it looks like a star or asterisk. Figure 1-5 shows a simple star schema, based on the facts and dimensions for the orders process.

    Dimension Tables

    In a star schema, a dimension table contains columns representing dimensions. As discussed, these columns will provide context for facts. Sometimes, dimension table is simply shortened to dimension. This may give rise to confusion, since that word can also describe the individual columns. For example, when describing the schema shown in Figure 1-5, one might refer to the columns product, product_description, and SKU as dimensions, but it is also common practice to refer to the table itself as the product dimension.

    Generally, you can tell from the context whether the word dimension refers to a table or column. When it would not otherwise be clear, this book will use the terms dimension attribute and dimension table to make the distinction.

    The dimension tables serve to provide the rich context needed for the study of facts. In queries and reports, the dimensions will be used to specify how facts will be rolled up—their level of aggregation. Dimension values may be used to filter reports. They will be used to provide context for each measurement, usually in the form of textual labels that precede facts on each row of a report. They may also be used to drive master-detail relationships, subtotaling, cross-tabulation, or sorts.

    Figure 1-5 A simple star schema for the orders process

    Readers with experience designing operational systems will notice that the dimension tables are not in third normal form. At first, many developers find this disconcerting. For example, there may be a strong desire to move brand, brand_code, and brand_manager into a separate table called brand, Remember, though, that a dimensional model serves a different purpose from an ER model. It is not necessary to isolate repeating values in an environment that does not support transaction processing. Designers do occasionally perform additional normalization within dimensions, although they usually avoid doing so. In such cases, the schema is referred to as a snowflake. The additional tables that result are sometimes called outriggers. Situations in which snowflaking may be useful are discussed in Chapter 7, Hierarchies and Snowflakes.

    Because the dimensions in a dimension table are often loosely related, it can sometimes be difficult to decide how to group things. For example, should sales region be part of the salesperson dimension, or does it belong as a separate geographical dimension? Issues like this are tackled in Chapter 6.

    Keys and History

    In a star schema, each dimension table is given a surrogate key. This column is a unique identifier, created exclusively for the data warehouse. Surrogate keys are assigned and maintained as part of the process that loads the star schema. The surrogate key has no intrinsic meaning; it is typically an integer. Surrogate keys are sometimes referred to as warehouse keys. The surrogate key is the primary key of the dimension table.

    In this book, surrogate keys will be easily identifiable by the suffix _key in the column name. In Figure 1-5, for example, the surrogate key for the customer dimension is called customer_key, the surrogate key for the salesperson dimension is called salesperson_key, and so forth. Illustrations in this book will always list the surrogate key for a dimension table as its first attribute.

    Dimension tables also contain key columns that uniquely identify something in an operational system. Examples in Figure 1-5 include customer_id, sku, and salesperson_id. In the operational systems, these columns identify specific customers, products, and salespeople, respectively. These key columns are referred to as natural keys.

    The separation of surrogate keys and natural keys allows the data warehouse to track changes, even if the originating operational system does not. For example, suppose that customer ABC Wholesalers is identified by customer_id 10711 in an operational system. If the customer changes its headquarters location, the operational system may simply overwrite the address for customer_id 10711. For analytic purposes, however, it may be useful to track the history of ABC Wholesalers. Since the star schema does not rely on customer_id to identify a unique row in the customer dimension, it is possible to store multiple versions of ABC Wholesalers, even though both have the same customer_id—10711. The two versions can be distinguished by different surrogate key values. While it would also be possible to support change tracking by supplementing a natural key with a sequence number, the surrogate key allows fact and dimension tables to be joined based on a single column.

    The term slowly changing dimension refers to the manner in which a dimensional schema responds to changes in a source system. Detailed examples will be explored in Chapter 3, Stars and Cubes. Advanced techniques are explored in Chapter 8, More Slow Change Techniques.

    Fact Tables

    At the core of a star schema is the fact table. In addition to presenting the facts, the fact table includes surrogate keys that refer to each of the associated dimension tables. The simple orders star in Figure 1-5, for example, includes the facts quantity_ordered, cost_dollars, and order_dollars. It also includes surrogate keys that refer to products, salespeople, customers, and order dates.

    Together, the foreign keys in a fact table are sometimes considered to identify a unique row in the fact table. This is certainly true in Figure 1-5, where each fact table row represents orders of a product sold by a salesperson to a customer on a given day. In other cases, however, the foreign keys in a fact table are not sufficient to identify a unique row. As we will see in Chapter 3, sometimes a fact table row can be uniquely identified by a subset of its foreign keys, or even by using some nonkey attributes.

    Each row in the fact table stores facts at a specific level of detail. This level of detail is known as the fact table’s grain, a term that will be explored in Chapter 3. The information held in fact tables may be consumed at a variety of different levels, however, by aggregating the facts. In some data warehouse architectures, it is critical that the star schema capture information at the lowest level of detail possible. In other architectures, this is less important because a separate part of the data warehouse architecture is reserved for atomic data. These variations in data warehouse architecture are explored in Chapter 2, Data Warehouse Architectures.

    Using a Star Schema

    There is still much to be said about the fundamentals of dimensional design, a discussion that continues in Chapter 3, but enough background has now been provided to look at an important feature of the star schema: how it is actually used. Understanding the basic usage pattern of the star schema allows the dimensional designer to make intelligent choices.

    Querying Facts

    Most queries against a star schema follow a consistent pattern. One or more facts are requested, along with the dimensional attributes that provide the desired context. The facts will be summarized in accordance with the dimensions present in the query. Dimension values are also used to limit the scope of the query, serving as the basis for filters or constraints on the data to be fetched and aggregated.

    A properly configured relational database is well equipped to respond to such a query, which is issued using Structured Query Language (SQL). Suppose that someone has asked to see a report showing order dollars by product category and product name during the month of January 2009. The orders star schema from Figure 1-5 can provide this information, even though order dollars is stored at a lower level of detail. The SQL query in Figure 1-6 produces the required results, summarizing tens of thousands of fact table rows.

    The SELECT clause of the query indicates the dimensions that should appear in the query results (category and product), the fact that is requested (order dollars), and the manner in which it will be aggregated (through the SQL SUM() operation). The FROM clause specifies the star schema tables that are involved in the query.

    The WHERE clause serves two purposes. First, it filters the query results based on the values of specific dimension columns (month and year). It also specifies the join relationships between tables in the query. In terms of processing time, joins are among the most expensive operations the database must perform; notice that in the case of a star schema, dimension attributes are always a maximum of one join away from facts. The GROUP BY clause specifies the context to which the fact will be aggregated by the relational database; the ORDER

    Enjoying the preview?
    Page 1 of 1