Data Warehousing in Microsoft SQL Server 2008
Data Warehouse Design
The preceding discussion gives you an idea of the issues that data warehousing evolved to address. In this section, we only scratch the surface of the design considerations involved in bringing a data warehouse into existence, and we hope to whet your appetite to learn more. Fortunately, it has never been easier to learn more about data warehousing than it is today.
Building a data warehouse requires addressing a myriad of technical and nontechnical issues, including the following:
Determining the enterprise goals and objectives the data warehouse will serve, and gaining organizational buy-in for them.
Identifying the various audiences for the data and their varying requirements.
Addressing latency requirements with the appropriate data architecture.
Performing extract, transform, and load (ETL)—the process and tools by which data is extracted from source OpApps, cleaned and otherwise transformed as needed, and then loaded into the data warehouse. SQL Server Integration Services (SSIS) is Microsoft’s primary ETL tool for data warehousing.
Designing entitlement, backup, mobility, scalability, delivery, and training schemes.
Choosing methods of end-user access to the information, including the distinction often made between reporting and analysis. The tools and products for this usually receive a disproportionate amount of attention in a data warehousing project because they are so visible.
Embedding an organizational ethos that the data warehouse will constantly evolve with the ever-changing needs it supports. The effort is never “done.”
The primary goal of any data warehouse is to integrate data from disparate sources into a centralized store (at least logically speaking), in a form that can be used across the enterprise for decision support by all who need it. Merely dumping all the data from various stand-alone applications into a common database is not the sort of integration we mean. Rather, a data warehouse requires a schema of some sort to which all the data brought in is made to conform. The data also needs to be “clean”—meaning that all the different ways of representing the “same” thing in the various source systems have been converted to a single consistent form. Both of these tasks are ETL responsibilities, as previously mentioned.
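A large part of ETL cleansing consists of conforming such inconsistent representations. The following is a minimal T-SQL sketch of the idea, assuming a hypothetical staging table (StagingCustomer), a hypothetical dimension table (DimCustomer), and invented source codes:

-- Hypothetical example: two source systems encode gender differently
-- ('M'/'F', 'Male'/'Female', '1'/'2'). The ETL step conforms all of
-- them to a single consistent form before loading the warehouse.
INSERT INTO DimCustomer (CustomerAltKey, CustomerName, Gender)
SELECT src.CustomerID,
       LTRIM(RTRIM(src.CustomerName)),    -- normalize stray whitespace
       CASE UPPER(src.GenderCode)
           WHEN 'M'      THEN 'Male'
           WHEN 'MALE'   THEN 'Male'
           WHEN '1'      THEN 'Male'
           WHEN 'F'      THEN 'Female'
           WHEN 'FEMALE' THEN 'Female'
           WHEN '2'      THEN 'Female'
           ELSE 'Unknown'                 -- park unrecognized codes for review
       END
FROM StagingCustomer AS src;

In practice, such a transformation would more likely be implemented in an SSIS data flow, but the logic is the same.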
Based on what we’ve said so far, the 35,000-foot view of a data warehouse is shown in Figure 14-1.
Figure 14-1. The generic data warehouse architecture
With this background in place, we can now consider the two predominant data warehousing architectures guiding practice today.
The Top-Down Approach of Inmon
William Inmon is recognized as “the father of data warehousing,” having invented the term in 1990. The data warehousing features he characterized can seem self-evident today, but no one had codified them previously as he did. According to his definition, the essential characteristics of data in a data warehouse are as follows:
Subject-oriented. Data is organized around major entities that are common across multiple OpApps. Customer, Product, Shipment, and Account are typical subject areas.
Integrated. Data sources are consistent with one another along common themes.
Nonvolatile. Once loaded, data is rarely, if ever, changed (updated or deleted).
Time-variant. Time is part of the key to everything—“as it was at this point in time,” also known as “history,” is preserved.
These features enable the previously stated goals of any data warehouse.
Though this is an oversimplification, the Inmon style of data warehousing presumes that an enterprise data model has been or will be created—one that identifies all the “subject-oriented” entities common across multiple OpApps, the required numeric measures, the required detail level of each, and the relationships between them. The logical data model representing this within the data warehouse is posited to be a normalized relational model of the sort associated with OLTP applications. Inmon refers to this as the “enterprise data warehouse” and to the data as being “architected.” The emphasis is on a centralized, normalized data store.
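To make that concrete, a small fragment of such a normalized model might look like the following T-SQL sketch. The table and column names here are hypothetical, and a real enterprise model would contain many more entities and attributes:

CREATE TABLE Customer (
    CustomerID   int IDENTITY(1,1) PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL
);

CREATE TABLE Account (
    AccountID   int IDENTITY(1,1) PRIMARY KEY,
    CustomerID  int NOT NULL REFERENCES Customer (CustomerID),  -- one customer, many accounts
    AccountType nvarchar(50) NOT NULL,
    OpenedDate  date NOT NULL
);

Every subject area is fully normalized and related, exactly as it would be in an OLTP design.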
Because the typical complexity of a normalized model does not lend itself to direct query, from both ease-of-use and performance perspectives, this architecture also posits various datamarts—additional derived databases whose structure is optimized for query and which generally contain only aggregated data derived from the data warehouse. The key point is that their architecture is secondary to and separate from the data warehouse proper. A refinement of Figure 14-1 that represents Inmon’s datamart concept is shown in Figure 14-2.
Figure 14-2. An Inmon-inspired data warehouse
Because this approach generally insists that a large-scale model already exists or will be created before construction of the data warehouse begins, it is usually characterized as top-down.
Inmon has written several books elaborating the principles and refinements of this architecture, and along with Claudia Imhoff (a long-term associate), he has elucidated an even larger architecture, the Corporate Information Factory (CIF), of which data warehousing is only a part. Space constraints preclude us from delving into further detail about the Inmon and CIF approaches. We do want to make two points before moving on, however.
The first you are probably already thinking: requiring the existence or creation of an enterprise data model is impractical in many organizations. It has been done successfully, typically in larger enterprises, but many would find it impossible to justify the time and expense required to develop the model (with nothing to show at the end but documentation). No doubt when it can be done, it lays a very powerful foundation for informational applications, but in many cases, it is simply not feasible.
The second point is that many find this approach relatively abstract—useful in articulating high-level architecture but less helpful with practical details during actual development. The next approach to data warehousing that we’ll discuss, at the other end of the design spectrum, evolved to address both these realities.
The Bottom-Up Approach of Kimball
From the mid-1990s to the present, Ralph Kimball has publicized an alternative to the Inmon approach to data warehousing, the heart of which he called the Dimensional Model. If the Inmon approach can be called top-down, Kimball’s is definitely bottom-up, although both advocate a step-by-step approach. Just as Inmon articulated and formalized concepts that were already in use by practitioners, Kimball codified several practices already in use but lacking an integrative vision.
The first is the Dimensional Model, held to represent the most elegant tradeoffs between end-user intelligibility, ease of use, good performance for both predefined and ad hoc queries, and easy extensibility. The second is the idea of building the data warehouse incrementally, something most enterprises find much more palatable than the all-at-once, “big bang” approach implied by Inmon’s architecture. A key part of this is the concept of “conformed dimensions” (which we’ll define in a moment) to ensure that each new incremental data warehouse development can be integrated with what was already built, as opposed to each effort becoming the next “island of automation” (or, as it is usually called today, “stovepipe”) application. Third, Kimball emphasizes implementation practicality, with very specific advice on a host of data design issues advanced through his books, Web site, regular seminars, and training offerings.
Many indeed seem to find this approach desirable, as evidenced by the fact that most data analysis tools on the market today, including Microsoft SQL Server Analysis Services (which we cover in Chapter 15 through Chapter 18), have a definite affinity for the Dimensional Model. For this reason, as well as because it is less abstract, we will devote the rest of this section to an overview of this approach.
This section does not purport to teach the Kimball approach. Space permits us merely to expose you to a few key concepts associated with it. This should make your further investigations easier and more effective.
Terminology
You should be aware of several useful data warehousing terms that—while closely associated with (if not always originated by) Kimball and the Dimensional Model—have come to be more broadly understood due to their representation in many tools (especially OLAP tools). You’ll see most of these terms again in the chapters that cover SQL Server Analysis Services (Chapter 15 through Chapter 18).
Measure. A typically numeric value of interest in reporting and analysis, such as price, balance, or inventory. As stored in a data warehouse, the relevant measures are defined by the industry of the enterprise and come from the OpApps that are its data sources. A measure is also characterized by grain, defined later in this list.
Dimension. The heart of the Dimensional Model, a dimension is variously described as an “axis of analysis” or a “what” qualifier. A dimension helps qualify a measure and give it context (discussed in the next section). In a query, a dimension can be part of the query result and/or part of the query constraints. The most fundamental dimension is Time, essential in almost any context. Others are industry-specific but typically include at a minimum Customer, Product, and Geography. Dimensions are typically recognized as referential or master data entities. A dimension is a collection of related values called members—for example, 2008 might be a member of the Time dimension and John Smith a member of the Customer dimension. In a Dimensional Model, the dimensions are considered to be independent of one another, even if they really are not. For example, Customer and Product are not independent, since not every customer buys every product, but by modeling each as a dimension, we treat them as if they are independent because doing so simplifies the conceptual model on which queries are based. Few if any dimensions have zero correlation with any other dimensions.
Hierarchy. A particular parent-child organization of members within a dimension. Each distinct set of parents is called a level of the hierarchy. For example, a Time dimension might have levels named Year and Month. The Year level might have members like 2007 and 2008, while the Month level might have members like Jan 2007 and Jan 2008, with parent members at the Year level of 2007 and 2008. Hierarchies occur naturally in a wide range of applications and are nothing more than a way of grouping members for summarization. A hierarchy reflects the fact that different members of the same dimension represent different levels of detail.
Dimension table. A relational table containing (typically) one row per member of the dimension (depending on what form of history, if any, is maintained in the dimension). A dimension table usually has a minimum of two columns, one representing the key or identifier that uniquely defines members of the dimension and another giving a descriptive name for the member.
Fact table. A relational table that functions, from a data modeling perspective, as an associative entity between various dimensions. It contains one or more measure columns and the key columns of all related dimensions. It is populated (by ETL) in such a way that the measure values are completely described by the related dimensional keys. A fact table is also characterized by its grain (defined later in this list), and all measures in the same fact table should have the same grain.
Star schema. So named for the shape of an Entity Relationship (E/R) diagram of a fact table and its related dimension tables, this has become a generic term for that pattern (discussed later in this section; a minimal sketch also follows this list).
Grain. A characteristic of a measure that is defined in terms of its related dimensions. Grain has two properties: first, precisely those dimensions that define the context of the measure; second, for each such dimension, the level within that dimension’s hierarchy that defines the measure’s level of detail. These two properties together define the measure’s grain. For example, if all measures in a fact table pertain to values of the Month level of the Year-Month hierarchy of the Time dimension, the Time grain of that fact table is Month. The overall grain of the fact table, referred to as its granularity, is defined by such characteristics for all its dimensions.
Conformed dimension. A dimension, as previously defined, that has been designed and built in such a way that each star schema including the dimension can be meaningfully joined (logically) on it. From a practical perspective, this means that all occurrences of the dimension in various fact tables mean the same thing—each includes exactly the same members, and each member has exactly the same meaning in relation to the facts whose context it helps define. Kimball refers to this state of affairs as the “Bus Architecture.”
It is not the case that each fact table using the dimension must use it at the same level (if it has a hierarchy). For example, if one fact table is at the Year level of the Time dimension and another is at the Month level, data from the two can still be meaningfully combined—it is simply necessary to aggregate the Month data to the level of Year first. Without conformed dimensions, various star schemas cannot be meaningfully combined along their common dimensions—in which case, the incremental approach to building up the data warehouse is not possible. Creating conformed dimensions is probably the most difficult part of the Dimensional Model approach, and where it most intersects with the Inmon approach—it is here that organizational agreement about which dimensions can be conformed, and what they will mean, must be secured. This is also where a lack of needed data (that is, at the required grain) in source OpApps will become apparent.
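To ground these terms, here is a minimal star schema in T-SQL. The table and column names are ours, not from any sample database, and the fact table’s grain is one row per product per month:

CREATE TABLE DimDate (
    DateKey      int PRIMARY KEY,         -- surrogate key; one row per month at this grain
    CalendarYear smallint NOT NULL,       -- Year level of the Year-Month hierarchy
    MonthName    nvarchar(20) NOT NULL    -- Month level of the Year-Month hierarchy
);

CREATE TABLE DimProduct (
    ProductKey  int IDENTITY(1,1) PRIMARY KEY,   -- surrogate key
    ProductName nvarchar(50) NOT NULL            -- descriptive member name
);

CREATE TABLE FactSales (
    DateKey     int NOT NULL REFERENCES DimDate (DateKey),
    ProductKey  int NOT NULL REFERENCES DimProduct (ProductKey),
    SalesAmount money NOT NULL,                  -- the measure
    PRIMARY KEY (DateKey, ProductKey)            -- grain: one row per product per month
);

Because DimDate carries both the Year and Month levels, facts stored at Month grain can be rolled up to Year and then meaningfully combined with facts kept at Year grain, which is precisely the kind of combination conformed dimensions make possible:

SELECT d.CalendarYear,
       SUM(f.SalesAmount) AS SalesAmount   -- aggregate Month grain up to Year
FROM FactSales AS f
INNER JOIN DimDate AS d ON f.DateKey = d.DateKey
GROUP BY d.CalendarYear;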
Context and the Star Schema
As mentioned earlier, dimensions provide the context of a measure. Figure 14-3 depicts an imaginary conversation that demonstrates how context is needed to make sense of data.
Figure 14-3. Determining the context of a measure
Now let’s diagram this conversation, as shown in Figure 14-4.
Figure 14-4. A representation of what we know about 492.00 (currency is assumed)
We can examine an actual implementation of the preceding example. Run the code shown in Example 14-1 against the AdventureWorksDW2008 sample database to retrieve our exact case.
Example 14-1. Querying AdventureWorksDW2008 for the value of a particular measure
USE AdventureWorksDW2008
GO

-- Each WHERE predicate below pins one dimension member, fully
-- qualifying the single Amount value returned.
SELECT dd.FullDateAlternateKey,
       do.OrganizationName,
       ddg.DepartmentGroupName,
       da.AccountDescription,
       ds.ScenarioName,
       ff.Amount
FROM FactFinance AS ff
INNER JOIN DimDate AS dd ON ff.DateKey = dd.DateKey
INNER JOIN DimOrganization AS do ON ff.OrganizationKey = do.OrganizationKey
INNER JOIN DimDepartmentGroup AS ddg ON ff.DepartmentGroupKey = ddg.DepartmentGroupKey
INNER JOIN DimScenario AS ds ON ff.ScenarioKey = ds.ScenarioKey
INNER JOIN DimAccount AS da ON ff.AccountKey = da.AccountKey
WHERE dd.FullDateAlternateKey = '2/1/2004'
  AND do.OrganizationName = 'Central Division'
  AND ddg.DepartmentGroupName = 'Corporate'
  AND da.AccountDescription = 'Travel Lodging'
  AND ds.ScenarioName = 'Actual'
From this query and the E/R diagram that represents the tables involved, we can see in Figure 14-5 what is meant by a star schema.
Figure 14-5. A star schema from AdventureWorksDW2008
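Incidentally, the distinction between dimensions as query constraints and dimensions as query results is easy to see by varying Example 14-1. The following variation (ours, not from the book’s example) drops the Date and Account constraints and aggregates instead, broadening the context of the measure to all dates and all accounts:

SELECT do.OrganizationName,
       ds.ScenarioName,
       SUM(ff.Amount) AS TotalAmount
FROM FactFinance AS ff
INNER JOIN DimOrganization AS do ON ff.OrganizationKey = do.OrganizationKey
INNER JOIN DimDepartmentGroup AS ddg ON ff.DepartmentGroupKey = ddg.DepartmentGroupKey
INNER JOIN DimScenario AS ds ON ff.ScenarioKey = ds.ScenarioKey
WHERE do.OrganizationName = 'Central Division'
  AND ddg.DepartmentGroupName = 'Corporate'
  AND ds.ScenarioName = 'Actual'
GROUP BY do.OrganizationName, ds.ScenarioName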
Surrogate Keys
The surrogate key concept is not original to Kimball or the Dimensional Model, but it is something they strongly advocate. A surrogate key is a system-assigned, typically integer, primary key to a table. In SQL Server, the surrogate key would typically be an identity column, although sometimes a particular architecture might find it preferable to have a central key generator that gives out surrogate keys as needed. Surrogate keys have two important characteristics, as follows (a brief sketch follows the list):
They have no embedded encodings—that is, they are not “smart” keys. This makes them immune to changes in the source data that would plague nonsurrogate primary keys. One reasonable exception to this is the surrogate key of the Time dimension, where making the surrogate integer key smart by representing YYYYMMDD (when applicable to the grain of the fact tables) can make partitioning the fact tables much easier.
As integers, they are the most efficient possible primary keys, from both performance and storage perspectives.
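A brief T-SQL sketch of both points (the DimCustomer table is the same hypothetical one used in the ETL sketch earlier in this section):

-- A typical surrogate key: a system-assigned integer with no embedded meaning.
CREATE TABLE DimCustomer (
    CustomerKey    int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    CustomerAltKey nvarchar(20) NOT NULL,          -- the source system's ("business") key
    CustomerName   nvarchar(100) NOT NULL,
    Gender         nvarchar(10) NOT NULL
);

-- The common exception: a "smart" integer date key in YYYYMMDD form,
-- which makes date-based partitioning of fact tables much easier.
DECLARE @d date = '2008-01-31';
SELECT CONVERT(int, CONVERT(char(8), @d, 112)) AS DateKey;  -- returns 20080131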
This concludes our brief review of the Kimball approach to data warehousing. You are strongly encouraged to consult the references at the end of this section, as well as appropriate Web searches, for a great deal more information. We’ll close here with Figure 14-6, which illustrates what a data warehouse built to Kimball principles looks like. An important aspect to observe in this figure is that the data warehouse is the collection of star schemas—there are no separate datamarts, as in the Inmon approach. (And by the way, in an Inmon data warehouse, there is no objection to the datamarts following the Kimball architecture.) Although not shown in this figure, it is assumed that the various star schemas are not disjoint, meaning that wherever they share a functional dimension such as Customer or Product, they have been constructed in such a way as to actually share a single version of the dimension. When this is done, the data in the various star schemas can be validly combined along the common dimensions—a property derived from their having been constructed to be “conformable,” in the parlance of the Dimensional Model.
Figure 14-6. A Kimball-oriented data warehouse