Basic Data Preparation Challenges
- By Gil Raviv
- 4/8/2019
Preparing the Model
Data preparation is key to the success of data analysis. To set the stage for effective analysis, you often need to split tables into multiple tables to ensure that you have a single flat table for facts or transactions (for example, Sales Order) and supplementary tables to support your facts (for example, Products, Clients).
In Exercise 2-2, Part 3, you learned how to create relationships between fact tables and lookup tables. While the modeling elements in Power Pivot and Power BI are not the focus of this book, the capability to shape your tables to meet your needs is indeed the focus. In the next three chapters, you will learn how to combine multiple tables to simplify your model. In Chapter 6, “Unpivoting Tables,” you will learn how to break down complex table structures to better support your analysis.
In this section, you will learn the imperative preparation steps that allow you to split an aggregated table into multiple tables, to build star schemas, and to control the granularity of your entities.
Exercise 2-7: Splitting Data into Lookup Tables and Fact Tables
In this exercise, you will import sample data from another fictitious company: Wide World Importers. This dataset, which is richer and newer than the AdventureWorks dataset, is also provided by Microsoft for learning purposes. To learn more about this data sample, go to https://blogs.technet.microsoft.com/dataplatforminsider/2016/06/09/wideworldimporters-the-new-sql-server-sample-database/.
The workbook C02E07.xlsx which can be downloaded from https://aka.ms/DataPwrBIPivot/downloads, summarizes Wide World Importers orders. The goal of this exercise is to show you how easy it is to use Power Query to split a table into a fact table and a lookup table by using the Reference and Remove Duplicate options.
Download the workbook C02E07.xslx and save it in the folder C:\Data\C02\.
Open a new blank Excel workbook or a new Power BI Desktop report and import the workbook C02E07.xslx to the Power Query Editor.
In the Navigator dialog, select the Sales_Order table and then select Edit.
Rename the Sales_Order query Sales Order - Base.
Your goal here is to split the original table into two tables, with the correct granularity levels, as shown in Figure 2-14. One of the tables (the bottom left table in Figure 2-14) is for the orders and their stock item identifiers (the fact table), and the second (the bottom right table in Figure 2-14) is for the stock items (the lookup table).
FIGURE 2-14 You can split the Sales Order table into a fact table and a lookup table.
Right-click Sales Order - Base and select Reference. Rename the new query Stock Items. (To rename the query, right-click the query in the Queries pane and select Rename in the shortcut menu, or rename the query in the Query Settings pane, under Properties, in the Name box.)
Select the Stock Items query, and in Home tab, select Choose Columns.
In the Choose Columns dialog box that opens, deselect all columns and then select the columns Stock ID, Stock Item, and Stock Lead Time. Click OK to close the dialog box.
Select the Stock ID column, and on Home tab, select Remove Rows and then Remove Duplicates.
You now have a lookup table for the stock items, with unique stock items on each row.
To create a new fact table for the orders, in the Queries pane, right-click on the Sales Order - Base query and select Reference.
Rename the new query Sales Orders. (To rename the query, right-click the query in the Queries pane and select Rename in the shortcut menu, or rename it in the Query Settings pane, under Properties, in the Name box.)
Select the Sales Orders query, and on the Home tab, select Choose Columns. The Choose Columns dialog box opens. Deselect the columns Stock Item and Stock Lead Time. Click OK to close the dialog box.
In Excel: On the Home tab, select Close & Load To to avoid loading the base query into your worksheet or Data Model. Then, in the Queries and Connections pane, select Sales Orders and Stock Items and load them into the Data Model, using the Load To option, as you did in Exercise 2-2, Part 2 step 11.
In Power BI Desktop: Disable the loading of Sales Order - Base, as you did in Exercise 2-2, Part 2 step 11.
Create a relationship between the two tables through the Stock ID columns.
When a Relationship Fails
When you create lookup tables in Power Query, as you did in Exercise 2-7, and you use text columns as keys in the relationship, two additional steps must be performed before you remove the duplicate keys in step 8 to ensure that your report does not fail to refresh.
In Exercise 2-7, in step 8 you removed the duplicate Stock ID values from the lookup table. Unfortunately, Power Query was designed differently than the Data Model in Excel and Power BI. While Power Query is always case sensitive, the Data Model is not. As a result, if your Stock ID value, “Stock_1”, will include a lowercase or uppercase variant (e.g. “STOCK_1”) or will include a trailing space (e.g. “Stock_1 “), Power Query will keep the two instances as unique values, but the Data Model will consider them duplicates and will issue the following error:
Column 'Stock ID' in Table 'Stock Items' contains a duplicate value 'Stock_1' and this is not allowed for columns on the one side of a many-to-one relationship or for columns that are used as the primary key of a table.
To experience this error, open the workbook C02E07.xslx, and modify two of the Stock ID values. Change one of the values to uppercase, and add a space character at the end of another value. Then save the workbook and try to refresh your Exercise 2-7 solution workbook or Power BI report.
To fix this issue, apply the following steps on the Sales Orders - Base query:
Launch the Power Query Editor and select the Sales Orders - Base query.
Select the Stock ID column, and in the Transform tab, select Format and then select Lowercase or Uppercase or Capitalize Each Word. Any of these three options will ensure that the keys are normalized, so variations of lowercase/uppercase values will be removed as duplicate values by Power Query.
Select the Stock ID column, and on the Transform tab, select Format and then Trim.
Refresh the workbook or report, and the error does not occur again.
Referencing and removing duplicates to create lookup tables are common steps. Many users who apply the Remove Duplicates step are not aware of the potential refresh failures that could occur for a dataset that contains lowercase/uppercase variations or trailing spaces. Oftentimes, the errors even appear sooner, as the users try to build the relationship. Handling the case-sensitive values and trimming the spaces before the removal of duplicates will resolve these issues.
Download the solution files C02E07 - Solution.xlsx and C02E07 - Solution.pbix from https://aka.ms/DataPwrBIPivot/downloads.
Exercise 2-8: Splitting Delimiter-Separated Values into Rows
In the preceding section, you learned how to split tables into fact and lookup tables. In this exercise, you will learn about a special case of fact and lookup tables. Imagine that your fact table acts as a mapping table that associates between members’ entities and their group unique identifiers. The lookup table in this scenario describes the group entities.
In this exercise, you will again use the AdventureWorks product table. Each product can be shipped in one or more colors. The associated colors are represented in comma-separated values, as illustrated in Figure 2-15, in the table to the left. In this exercise, you will learn how to split the products table into two tables: one for the entities (in this case, the products entities, without the colors) and the other one for the associations between groups and their members (in this case, product codes and colors).
FIGURE 2-15 You can split the comma-separated AdventureWorks product colors column to find how many products are released by color.
For this exercise, you will use the sample workbook C02E08.xlsx, which can be downloaded from https://aka.ms/DataPwrBIPivot/downloads. The sample workbook summarizes AdventureWorks product codes by average cost, average price, and comma-separated colors. To create a report that shows the number of products of each color, as shown in Figure 2-15, do the following.
Download the workbook C02E08.xslx and save it in the folder C:\Data\C02\.
Open a new blank Excel workbook or a new Power BI Desktop report and import the workbook C02E08.xslx from C:\Data\C02\ to the Power Query Editor.
In the Navigator dialog, select Products and then select Edit.
In the Queries pane, right-click Products and select Reference. Your goal is to create a new table with a mapping between the product codes and the colors.
Rename the new query Products and Colors. (To rename the query, you can right-click the query in the Queries pane and select Rename in the shortcut menu, or you can rename the query in the Query Settings pane, under Properties, in the Name box.)
With the new query selected, on the Home tab, select Choose Columns.
In the Choose Columns dialog box that opens, select ProductCodeNew and Colors and click OK to close the dialog box.
Select the Colors column, and on the Transform tab, select Split Column and then select By Delimiter.
In the Split Column by Delimiter dialog box that opens, note that the comma delimiter is selected by default, along with the option Each Occurrence of the Delimiter (see Figure 2-16). Expand the Advanced Options section and switch the Split Into option from Columns to Rows. Click OK to close the dialog box.
FIGURE 2-16 In the Split Column by Delimiter dialog box, don’t forget to select Split Into, and then select Rows under Advanced Options.
In the Colors and Products query, select the Colors column. On the Transform tab, select Format and then select Trim. This step removes the leading spaces from the color values.
You can now load the two tables into the Data Model in Excel or Power BI Desktop and create a relationship between ProductCodeNew in Products and ProductCodeNew in Products and Colors. You can download and review the solution files C02E08 - Solution.xlsx and C02E08 - Solution.pbix from https://aka.ms/DataPwrBIPivot/downloads.