Combining Data from Multiple Sources

  • 12/27/2024
Contents
×
  1. Appending a few tables
  2. Appending tables on a larger scale
  3. Summary

Appending tables on a larger scale

In the preceding section, you learned how to append multiple tables using Append queries and Append queries as new in the Power Query editor. Although these transformations come in handy when you have a predefined set of table names, in some common scenarios, you cannot know in advance which table names are available, but you are required to create a report of the combined data on a frequent basis. A manual manipulation of the queries to include the current table names each time they change would be a time-consuming effort. Thankfully, you have better options.

Appending tables from a folder

The most common scenario, and its challenge, is demonstrated in this example: Imagine that you have multiple tables that you need to append. Each table is in a separate Excel workbook whose name includes a different month and year:

  • Jan2025.xlsx

  • Feb2025.xlsx

  • Mar2025.xslx

  • Apr2025.xslx

Each month, a new Excel workbook is added to a shared folder, and you are asked to prepare an updated report that appends all the tables from the folder. Suppose that you already created a report that separately imports each of the tables and appends them using the preceding technique. When a new file is added (for example, May2025.xlsx), you would need to follow a sequence of manual steps to modify the queries: First, you would import the Excel file as a new query, named May2025. Then, you would edit the Append step in the Applied steps list in the Power Query editor to include the new query. You could add the query in the following formula (with the change highlighted in bold):

Table.Combine({Jan2025, Feb2025, Mar2025, Apr2025, May2025})

This method will ultimately become very tedious if you need to repeat it each month. Wouldn’t it be better if you could create an improved query that would automatically append any workbook from the folder—with a single refresh of the workbook and with no manual edits to the query? This is possible, thanks to the Import from Folder option.

Exercise 3-3: Appending AdventureWorks products from a folder

Let’s return to AdventureWorks for an illustration of the challenge at hand. Suppose that an IT VBA developer created a script that updates an Excel file whenever a new product is released. Each year, a new Excel file is created, and it is updated on the fly with newly released products. The product files are aggregated in a shared folder, as shown in Figure 3-5. As the chief analyst, you would like to create a report that combines all products.

FIGURE 3-5

FIGURE 3-5 In this example, AdventureWorks collects Excel workbooks with products by their release year in a shared folder.

To follow the steps in this exercise, you can download the file C03E03 - Products.zip from www.powerquerybook.com and extract the Excel files into the folder C:\Data\C03\C03E03 – Products.

  1. Open a blank new workbook in Excel or a new Power BI Desktop report.

    In Power BI Desktop, on the Home tab, select Get data > File > Folder > Connect.

    In Excel on the Data tab or on the Home tab, select Get data > From file > From folder.

  2. In the Folder dialog box that opens, browse to the folder created earlier (refer to Figure 3-5), or copy and paste its path to the Folder path and select OK.

  3. In the new dialog box that opens (titled with the name of the folder you selected in step 2), select Combine > Combine & transform data.

  4. In the Combine files dialog box that opens, select Sheet1, as shown in Figure 3-6, and select OK or Transform data.

    FIGURE 3-6

    FIGURE 3-6 Combine files enables you to select the worksheet or table to combine. Once selected, all worksheets or tables with the same name in each Excel workbook in the folder will be combined.

  5. In the Power Query editor that opens, look in the Preview pane, and see that the first column, Source.Name, contains the file names. Click the Filter control of the Source.Name header. In the Filter pane, you will notice that the values in this column are the same as the actual file names in the folder.

The Source.Name column can sometimes be of significant value. In the preceding example, the column SellStartDate shows the release date of each product in AdventureWorks. The release year can be extracted from this column if years are needed in the report. However, in many cases, you won’t have such a column, and the contextual information will be available only in the file name. Fortunately, you can extract this contextual information from Source.Name.

To extract the year 2024 from C03E03 - 2024.xlsx in Source.Name, you can apply various transformation techniques. Chapter 2 provides a detailed walkthrough of similar scenarios, so we highlight just a few of them in this context. The file names in this case contain the prefix C03E03 - and the suffix .xlsx, with the year in between. You can extract the year from Source.Name in several ways:

  • Replace C03E03 - with an empty string, and then replace .xlsx with an empty string. You learned how to replace values in Chapter 2.

  • Alternatively, split Source.Name at 9 characters from the left (using Split by number of characters), and then split the second column with a dot, or alternatively split the second column by position at 4 characters from the right. Remove the first and third columns and keep the second column, which contains the extracted years.

  • Extract text between the second space and a period:

    1. Select the Source.Name column.

    2. On the Transform tab, select Extract > Text between delimiters.

    3. Use a space ( ) as the start delimiter and a period (.) as the end delimiter.

    4. Under Advanced options, enter 1 as the number of start delimiters to skip.

    5. Select OK.

You are now ready to load the appended product table to your Power BI report or Excel workbook. You can review the solution files C03E03 - Solution.pbix and C03E03 - Solution.xlsx, which are available at www.powerquerybook.com.

Thoughts on Import from Folder

There are three important aspects of combining files using the Import from Folder option:

  • Importing multiple tables from a folder is very powerful. When a new file is added, the query you created earlier will not need to be updated to incorporate the new file into the append transformation. To prove this, you can run a small test. Copy and paste one of the existing source workbooks into the same folder and rename the duplicated file C03E03 - 2025.xlsx. Now, refresh the workbook or Power BI report that you created in Exercise 3-3 and confirm that you have new products from the year 2025. Isn’t this an outstanding feature?

  • After step 5 in Exercise 3-3, you might have noticed that the Queries pane was populated with new artifacts such as groups and various types of queries. You can ignore these elements for now. We will get back to them later, when we talk about query functions and other advanced topics.

  • You can combine multiple files from shared folders on a network or on SharePoint sites. Using shared folders enables you to share a Power BI report or an Excel workbook with other report authors and refresh it. If you use Power BI, you can publish the report to the Power BI service and enjoy a scheduled refresh. When the folder is located on your SharePoint Online or OneDrive for Business, the Power BI service will have direct access to the data and can be configured to refresh it periodically. When the shared folder is local, you can use the on-premises data gateway, so the Power BI service will have access to the folder to perform periodic refreshes.

Appending worksheets from a workbook

Let’s look at one last large-scale table append scenario: combining tables from multiple worksheets. In the preceding example, you combined AdventureWorks products that were separated in different workbooks. Each workbook had its own products, by their release year. Imagine, however, that the same tables are stored on separate worksheets in the same Excel workbook. Can you append all worksheets into a single table, while keeping the context of the release year for each product, as shown in Figure 3-7? Can you also ensure that each year you will not need to update the query when a new worksheet is added? How can you ensure that next year’s products are automatically combined into your report when you click Refresh?

FIGURE 3-7

FIGURE 3-7 Your goal with the AdventureWorks products divided into separate worksheets by their release year is to append them, while keeping their release year context, and avoid further modification of the query for future years.

Exercise 3-4: Appending worksheets: The solution

To follow the solution for this challenge, you can download the file C03E04 - Year per Worksheet.xlsx from www.powerquerybook.com.

  1. Open a blank new workbook in Excel or a new Power BI Desktop report.

    In Power BI Desktop: On the Home tab, select Excel workbook.

    In Excel: On the Data tab, select Get data > From file > From Excel workbook.

  2. Select the file C03E04 - Year per Worksheet.xlsx.

  3. In the Navigator dialog box, right-click the line with the folder icon (in this example, C03E04 - Year per Worksheet.xlsx) and select Transform data.

  4. When the Power Query editor opens, rename the query Products. In the main Preview pane, you can see a table with each worksheet in a separate row. The actual content from each worksheet is encapsulated in the Data column. Before you combine the worksheets, you can remove the three columns: Item, Kind, and Hidden.

  5. On the header of the Data column, click the Expand button.

  6. When the Expand pane opens, showing the columns you can expand, click OK to expand all columns. When you expand the Data column, the combined table has meaningless header names (Column1, Column2, and so forth) and contains both headers and rows from each worksheet, as shown in the first table in Figure 3-8.

    FIGURE 3-8

    FIGURE 3-8 This transformation flow diagram shows how to clean combined worksheets.

    The transformation flow diagram in Figure 3-8 outlines the three transformation steps that are required to clean the combined table. In the first step, you move up the first row, which was the headers row of table 2022, to become the headers of the combined table. Then, in the next step, you remove all the headers of worksheets 2023 and 2024. Finally, in the third step, you rename the 2022 column Year to represent the release year of each product. Throughout the book, we may use this type of diagram to depict the sequence of main transformations steps and capture the rationale for solving advanced data preparation problems. Next, we look at the steps in more detail.

  7. On the Transform tab, select Use first row as headers. This enables you to preserve the column headers from the first worksheet as the column headers for the appended table.

  8. In the Preview pane, notice that the column headers of worksheets 2023 and 2024 are interspersed as rows throughout the appended table. These rows include the value Name in the Name column and should be removed from the table. To filter out these rows, click the Filter control in the left side of the Name header. When the Filter pane opens, type Name in the search box and deselect the value Name. Then select OK.

  9. Double-click the first column header named 2022, and rename it Year.

    You can now load the combined table to your Power BI report or Excel workbook. You have successfully combined the three worksheets by their release year and have extracted the release year of each product from the worksheet names. It’s time to test the scalability of the solution to see if it will refresh correctly when a fourth worksheet is added with the products released in 2025.

  10. With your Power BI report or Excel workbook with the combined products still open, open the source Excel file C03E04 - Year per Worksheet.xlsx that was used in this exercise. Duplicate the 2024 worksheet as a fourth worksheet, and rename it 2025.

  11. Refresh the Products query in your Power BI report or Excel workbook, and ensure that the duplicated products from the 2025 worksheet are now appended with the value 2025 as their release date. You should find that it works!

    Add a 2021 worksheet with historical data for products released in 2021 as the first worksheet. To do so, you can duplicate one of the existing worksheets and add it as the first worksheet in C03E04 - Year per Worksheet.xlsx. Now, when the leftmost worksheet is 2021 instead of 2022, the refresh fails, with this error:

    Expression.Error: The column ‘2022’ of the table wasn’t found.
  12. For a short-term fix of this error, launch the Power Query editor, select the Products query in the Power Query editor, and select the step Changed types in the Applied steps list. In the formula bar, you see the following formula:

    Table.TransformColumnTypes(#"Promoted Headers",{{"2022", Int64.Type}, {"Name",
    type text}, {"ProductNumber", type text}, {"Color", type text}, {"StandardCost",
    type any}, {"ListPrice", type any}, {"Size", type text}, {"Weight", type any},
    {"ParentProductCategoryName", type text}, {"ProductCategoryName", type text}})

    Find the text "2022" in the preceding formula and replace it with "2021".

  13. Select the step Renamed Columns in the Applied steps list. In the formula bar, you see the following formula:

    Table.RenameColumns(#"Filtered Rows",{{"2022", "Year"}})

    Find the text "2022" in the preceding formula and replace it with "2021".

  14. Close the Power Query editor and refresh the report. The products released in 2021 are now correctly combined into your report.

You can review these steps in the solution files C03E04 - Solution 01.pbix or C03E04 - Solution 01.xlsx, which are available at www.powerquerybook.com.

If you could assume that the first worksheet will always be 2022, or now, after the fix, 2021, and if you could ensure that new products will always be added as new worksheets to the right, your mission is accomplished. But what should you do if this assumption cannot be made? How can you improve the query to ensure that you will not need to repeat steps 12–14 whenever a new worksheet is added as the first worksheet?

In the following, and final, part of this chapter, you will learn how to improve the query to support the scenario where you cannot know which worksheet will be the first one.

A robust approach to combining multiple worksheets

This section briefly introduces new concepts to improve the robustness of your queries and avoid refresh errors. In Exercise 3-4, you combined three worksheets of AdventureWorks products by their release year. While you could append the three worksheets in a scalable way and incorporate a combined table from the new worksheets as they are added to the data source, you found out that the query will fail when an unknown worksheet is added as the first worksheet.

In Exercise 3-4, you followed three steps (7–9) to combine the worksheets into a single table. In Chapter 10, “From Pitfalls to Robust Queries,” you will learn in more detail how each of these steps could have been done differently to avoid potential errors. For now, this section just briefly introduces the mistakes and how to resolve them in this particular scenario:

  • In step 7, you used First row as headers and allowed Power Query to automatically change the column types for you. As a result, the year 2022 was explicitly referenced in the formula. You kept the formula as is and, as a result, exposed the query to future refresh failures. Chapter 10 refers to this mistake as the changed types pitfall.

  • In step 9, you renamed the header 2022 to Year. This action triggered an explicit reference to column 2022 in the formula bar and exposed the query to refresh failures, as observed when you added 2021 as the first worksheet. Chapter 10 refers to this mistake as the rename columns pitfall.

See Also You should note that in step 8, you almost fell for another pitfall, the search in filters pitfall, when you applied the filter through the Search pane instead of selecting Text Filters and then defining the filter condition explicitly. You will learn about this pitfall and the others in Chapter 10.

To correct these two pitfalls, follow these steps:

  1. Open the Products query in the Power Query editor, and delete the step Changed types from the Applied steps list. (You will explicitly apply the correct types later.)

  2. Select the step Renamed columns in the Applied steps list. In the formula bar, you see the following formula:

    Table.RenameColumns(#"Filtered Rows",{{"2022", "Year"}})
  3. Replace "2022" in the preceding formula with the code Table.ColumnNames(#"Filtered Rows"){0}. The final formula line with the modified code in bold is as follows:

    Table.RenameColumns(#"Filtered Rows",{{Table.ColumnNames(#"Filtered Rows"){0}, "Year"}})

    Chapter 9 covers the M language in depth. You should feel completely comfortable at this stage if you don’t understand its syntax. You do not need to fully master M to resolve 99% of data challenges you will ever face. However, because you modified the M formula here, let’s briefly examine what happened.

    In a nutshell, instead of telling Power Query “Please rename column 2022 to Year,” you changed the instructions to “Please rename the first column to Year.”

    How did you refer to the first column? By using the function Table.ColumnNames. This function returns the list of column names of the table it receives as an argument. In this case, the table is #"Filtered Rows", which was named by the preceding step in which you filtered the table. Finally, to direct Power Query to the first element in the column names list, you can use the index zero inside braces. This method in M allows you to access specific elements in a list or a specific row in a table. The index in M starts at zero instead of one, so the index of the first element is 0, and the nth element is n–1. In Chapter 10, you will use the function Table.ColumnNames to avoid referring to columns by their hardcoded names—and therefore avoid refresh errors when column names change.

  4. You can conclude the creation of a robust query by explicitly changing the column types as follows:

    1. Make the Year column a Whole Number column.

    2. Make the StandardCost column a Decimal Number column.

    3. Make the ListPrice column a Decimal Number column.

You can now close the Power Query editor and refresh the report. The products will be combined correctly, no matter which year is placed as the first worksheet.

You can review the preceding steps in the solution files C03E04 - Solution 02.pbix and C03E04 - Solution 02.xlsx, which are available at www.powerquerybook.com.