Query and transform data
- By Daniil Maslyuk and Justin Frebault
- 12/24/2023
- Skill 2.1: Query data by using Azure Synapse Analytics
- Skill 2.2: Ingest and transform data by using Power BI
Skill 2.2: Ingest and transform data by using Power BI
Power BI includes Power Query, which is an extract-transform-load (ETL) tool that uses the M language. M is a functional, case-sensitive language that, unlike DAX, does not resemble Excel formula language in any way, and differs from DAX in important ways, too. In this section, we’ll look at the problems you may need to solve when working with large amounts of data in Power Query.
When the volume or number of data sources is significant, you may face performance degradation. There are tools within Power Query that will help you identify the performance problems, and later we’ll review the techniques you can use to improve performance.
In addition to Power BI Desktop, Power Query is available in Power BI dataflows, and we’ll review when you’d want to use dataflows and what you’d need to consider when using them.
Combining data from different data sources will lead to data privacy issues, which we’ll also discuss later in this chapter.
Finally, we’ll discuss how you can use Advanced Editor to write your own queries and functions, and how you can query some of the more complex data sources by using Power Query.
Identify data loading performance bottlenecks in Power Query or data sources
Several reasons could be responsible for poor performance when connecting to data in Power BI. Power BI Desktop has a few features that can help identify those issues.
View native query
When you get data in Power BI from some data sources, like databases, Power Query will do its best to translate the transformations you perform into the native language of the data source—for example, SQL. This feature of Power Query is known as query folding. Most of the time, this will make getting data more efficient. For instance, if you connect to a database and get a subset of columns from a table, Power Query may only retrieve those columns from the data source instead of loading all columns and then locally removing the ones you don’t want.
In some cases, it may be possible to view the query that Power Query sent to the data source to retrieve the data you wanted. For this, you need to right-click a query step in Power Query Editor and select View Native Query. The window that opens looks like Figure 2-14.
FIGURE 2.14 Native Query window
In the query shown in Figure 2-14, we connected to a SQL Server database, applied a filter, and selected a few columns. Because these operations can be translated to SQL, Power Query decided to do the transformations in the source instead of performing them after loading the whole table, which led to better performance.
You cannot edit the native query; it is provided for your information only. If you want Power BI to issue a specific query, you must provide a SQL statement when connecting to a database.
If the View Native Query option is grayed out, it means that the data source does not support query folding or that some query step could not be translated into the data source’s native language. For example, if we applied the Clean transformation to a text column, the query would not fold, because there is no direct equivalent in SQL yet.
Query diagnostics
Power BI contains the query diagnostics toolset, which can help you identify performance bottlenecks. Query diagnostics allow you to see the queries that you emit while authoring or refreshing a dataset. They are especially useful for working with data sources that support query folding. By using query diagnostics, you can look at all queries that happen during data refreshes or while you author queries, or you can analyze a single step in detail.
To learn how to use query diagnostics, you’ll connect to an OData feed first. It’s a feed from Microsoft based on their fictitious AdventureWorks company.
Create a new Power BI Desktop file.
Select Get data (or New Source if you’re already in Power Query Editor) > OData feed.
Enter https://services.odata.org/AdventureWorksV3/AdventureWorks.svc in the URL box and select OK.
If prompted, in the credentials window, ensure Anonymous is selected and select Connect.
Select the CompanySales check box in the Navigator window and select Transform Data or OK if you’re already in Power Query Editor.
Now that you are connected to an OData feed, you can apply some transformations and see the effect on our query. To start recording traces in Power Query, select Start Diagnostics on the Tools ribbon; when finished, select Stop Diagnostics. Alternatively, you can analyze a single step—for this, you must select the Diagnose Step button on the Tools ribbon, or you can right-click a step and select Diagnose. We are going to analyze a single step in the following way:
Filter the ProductCategory column to Bikes by using the filter button on the column header.
Right-click the ProductCategory column header and select Remove.
In the Query Settings pane, right-click the last step and select Diagnose.
After Power Query finishes recording the traces, it creates a new query group called Diagnostics (as in Figure 2-15, which contains several queries whose names start with CompanySales_Removed Columns, all ending with the current date and time). The queries are sources from JSON files stored locally on your computer. The Detailed query contains more rows and columns than the Aggregated query, which is a summary query.
Among other information available in the recorded traces, you will see the time it took for a query to run and whether a native query was sent to a data source, which can help you understand if query folding took place. In Aggregated and Detailed queries, you can find the Data Source Query column, which contains the query sent to the data source, if available.
Occasionally, you won’t be able to see the native query by using the View Native Query feature discussed earlier in this chapter, but you will see a native query sent to a data source when using query diagnostics. We can check whether query folding took place by following these steps:
In the Aggregated diagnostics query, filter the Operation column to only include CreateResult.
Go to the Data Source Query column and select the only column cell. You should see the result shown in Figure 2-15.
FIGURE 2.15 Native query sent to OData feed
The full query is as follows:
https://services.odata.org/AdventureWorksV3/AdventureWorks.svc/CompanySales?$filter= ProductCategory eq 'Bikes'&$select=ID,OrderQtr,OrderYear,ProductSubCategory,Sales&$top=1000 HTTP/1.1
Note that query folding occurs; the filter we placed on the ProductCategory column is included in the query and the ProductCategory column is not included in the query result. If you only relied on the View Native Query feature, you would not see the query because the option would be grayed out.
Some query diagnostics may require that you run Power BI Desktop as administrator. If you are unable to record some traces, like full data refreshes, due to IT policies, you can still record traces when previewing and authoring queries in Power Query Editor. For this, go to File > Options and settings > Options > Global > Diagnostics > Query diagnostics and select Enable in Query Editor.
Implement performance improvements in Power Query and data sources
If your Power Query queries need to be improved in terms of performance, the techniques you’ll use will highly depend on the data sources you’re using and how you’re using them. In this section, we’ll explore a few methods of improving the data loading performance, divided into four parts:
General advice
Working with files
Working with foldable data sources
Improving the merge operations
General advice
In almost every situation, you want to load only the data you need. You should filter the data and remove columns as early as possible. Many operations in Power Query are performed in memory, and less data translates into better performance.
If you can choose, in many cases you should use table functions instead of operating on lists and records because Power Query includes some built-in optimizations that can make use of table functions.
When working with tables or lists that are referenced multiple times, you may improve performance by using Table.Buffer or List.Buffer, respectively. Buffering functions store data in memory and ensure that data isn’t read multiple times from the source. When buffering data, you should note two things: first, buffering prevents query folding, so if you’re working with foldable data source and buffer a table, then further transformations won’t be folded even if they could be folded without buffering. Second, only scalar values are buffered; if your table or list includes nested structures such as records or other lists or tables, then those values won’t be buffered, and if you need them later, then your data source will be read again.
Depending on the nature of your data sources and how you’re using them, you may want to edit the Parallel loading of tables parameter in the current file data load settings (File > Options and settings > Options > Current File > Data Load), where the default value is 6, meaning six tables are going to be loaded in parallel at most. If your data model contains more than six tables, which take a long time to refresh, you can try to set a custom parameter value higher than 6, allowing more tables to load in parallel, and your data model may refresh more quickly as a result. On the other hand, if you’re getting some data from a web API and then reference the data a few times, you may want to disable parallel loading, because that way the API may be called fewer times, resulting in better data refresh performance.
In case you’ve got a slow data source that’s updated less frequently than you refresh your dataset, you may benefit from loading data to a dataflow first and then loading it to your dataset. Dataflows are discussed in the next section.
Working with files
If you work with Excel files, then you should ensure the third parameter of Excel.Workbook is set to true to leave the column types untyped, which can make reading the files quicker.
If you can choose the file types you work with, then instead of Excel you may want to choose CSVs. While sometimes Excel files may be smaller than the equivalent CSV files, the latter are simpler in structure and therefore faster to read.
While Parquet and CSV files offer approximately the same performance, Parquet files are smaller, which may be relevant when you’re using a gateway to access files from your company’s network.
Working with foldable data sources
If you can make use of incremental refresh, then it may decrease the data refresh times substantially. We’ll discuss incremental refresh in Chapter 3, “Implement and Manage Data Models.”
In general, you should push as many transformations as you can as close to the data source as possible, because in many cases data sources may perform data operations more efficiently than Power Query. Some transformations may still fold, even if the View native query option of the step is grayed out, as discussed earlier in this chapter in the “Query diagnostics” section. For example, if you need to filter some SQL data based on a data source of a different type, like Excel, then instead of merging tables, you should use a native query and make use of the WHERE clause in SQL. For example, the following query applies a filter on the pbi.Product table from a SQL database based on a list of items from a different Power Query query, ProductSubCategories:
let FilterItems = "'" & Text.Combine(List.Buffer(ProductSubCategories), "', '") & "'", Source = Sql.Databases("mydatabase.database.windows.net"), sakes = Source{[Name="sales"]}[Data], Result = Value.NativeQuery(sales, "SELECT * FROM pbi.Product WHERE [Product Type] IN (@Items)", [Items = FilterItems]) in Result
Improving the merge operations
When you merge two tables and one table has a unique key, you should add a key to it. While it’s possible to use the Table.AddKey function in Power Query, doing so won’t guarantee uniqueness; it’s preferable to remove duplicates from the key column or columns, which will add a key automatically and improve the merge performance.
While you should not keep data you don’t need in general, it’s especially important in case of merges; since merges take place in memory, fewer columns mean quicker merges. You can remove columns before the merge or immediately after, and the performance gains are going to be similar.
If you work with data that has its keys pre-sorted, then instead of Table.NestedJoin, you can use Table.Join with JoinAlgorithm.SortMerge as the last parameter. Note that if your data isn’t sorted, then you’re going to get unexpected results without any error message.
Create and manage scalable Power BI dataflows
In addition to Power BI Desktop, Power Query can be found in the Power BI service: you can prep, clean, and transform data in dataflows. Dataflows can be useful when you want your Power Query queries to be reused across your organization without necessarily being in the same dataset. For this reason, you cannot create a dataflow in your own workspace, because only you have access to it.
To create a dataflow in a workspace, select New >Dataflow. From there, you have several choices:
Add new tables—Define new tables from scratch by using Power Query.
Add linked tables—Linked entities are tables in other dataflows that you can reuse to reduce duplication of data and improve consistency across your organization.
Import model—If you have a previously exported dataflow model file, you can import it.
Create and attach—Attach a Common Data Model folder from your Azure Data Lake Storage Gen2 account and use it in Power BI.
The Power Query Online interface looks similar to Power Query Editor in Power BI Desktop and is shown in Figure 2-16.
FIGURE 2.16 Power Query interface when editing a dataflow
Once you finish authoring your queries, you can select Save & close and enter the name of the new dataflow. After saving, you’ll need to refresh it by selecting Refresh now from the dataflow options in the workspace—otherwise it won’t contain any data. When a dataflow finishes refreshing, you can connect to it from Power BI Desktop and get data from it.
If you transform data in dataflows, for best performance you may want to separate data extraction and transformation into different dataflows, which is especially helpful in case of slow data sources. If you then enable the enhanced compute engine in the dataflow settings, the transformations will be performed in a more efficient way.
When you used linked tables, the dataflow that depends on a linked table will be automatically refreshed when you refresh the dataflow that contains the original table.
Similar to datasets, you can configure incremental refresh in dataflows. We’ll review incremental refresh in the next chapter.