The VertiPaq Engine in DAX
- 11/3/2015
Understanding materialization
Now that you have a basic understanding of how VertiPaq stores data in memory, you need to learn what materialization is. Materialization is a step in query resolution that happens when using columnar databases. Understanding when and how it happens is of paramount importance.
In order to understand what materialization is, look at this simple query:
EVALUATE ROW ( "Result", COUNTROWS ( SUMMARIZE ( Sales, Sales[ProductKey] ) ) )
The result is the distinct count of product keys in the Sales table. Even if we have not yet covered the query engine (we will, in Chapter 15, “Analyzing DAX query plans” and Chapter 16, “Optimizing DAX”), you can already imagine how VertiPaq can execute this query. Because the only column queried is ProductKey, it can scan that column only, finding all the values in the compressed structure of the column. While scanning, it keeps track of values found in a bitmap index and, at the end, it only has to count the bits that are set. Thanks to parallelism at the segment level, this query can run extremely fast on very large tables and the only memory it has to allocate is the bitmap index to count the keys.
The previous query runs on the compressed version of the column. In other words, there is no need to decompress the columns and to rebuild the original table to resolve it. This optimizes the memory usage at query time and reduces the memory reads.
The same scenario happens for more complex queries, too. Look at the following one:
EVALUATE ROW ( "Result", CALCULATE ( COUNTROWS ( Sales ), Product[Brand] = "Contoso" ) )
This time, we are using two different tables: Sales and Product. Solving this query requires a bit more effort. In fact, because the filter is on Product and the table to aggregate is Sales, you cannot scan a single column.
If you are not yet used to columnar databases, you probably think that, to solve the query, you have to iterate the Sales table, follow the relationship with Products, and sum 1 if the product brand is Contoso, 0 otherwise. Thus, you might think of an algorithm similar to this one:
EVALUATE ROW ( "Result", SUMX ( Sales, IF ( RELATED ( Product[Brand] ) = "Contoso", 1, 0 ) ) )
This is a simple algorithm, but it hides much more complexity than expected. In fact, if you carefully think of the columnar nature of VertiPaq, this query involves three different columns:
- Product[Brand] used to filter the Product table.
- Product[ProductKey] used to follow the relationship between Product and Sales.
- Sales[ProductKey] used on the Sales side to follow the relationship.
Iterating over Sales[ProductKey], searching the row number in Products scanning Product[ProductKey], and finally gathering the brand in Product[Brand] would be extremely expensive and require a lot of random reads to memory, negatively affecting performance. In fact, VertiPaq uses a completely different algorithm, optimized for columnar databases.
First, it scans Product[Brand] and retrieves the row numbers where Product[Brand] is Contoso. As you can see in Figure 13-8, it scans the Brand dictionary (1), retrieves the encoding of Contoso, and finally scans the segments (2) searching for the row numbers where ID equals to 0, returning the indexes to the rows found (3).
FIGURE 13-8 The output of a brand scan is the list of rows where Brand equals Contoso.
At this point, VertiPaq knows which rows in the Product table have the given brand. The relationship between Product and Sales is based on Products[ProductKey] and, at this point VertiPaq knows only the row numbers. Moreover, it is important to remember that the filter will be placed on Sales, not on Products. Thus, in reality, VertiPaq does not need the values of Products[ProductKey], what it really needs is the set of values of Sales[ProductKey], that is, the data IDs in the Sales table, not the ones in Product.
You might remember, at this point, that VertiPaq stores relationships as pairs of row numbers in Product and data IDs in Sales[ProductKey]. It turns out that this is the perfect data structure to move the filter from row numbers in Products to ProductKeys in Sales. In fact, VertiPaq performs a lookup of the selected row numbers to determine the values of Sales[ProductKey] valid for those rows, as you can see in Figure 13-9.
FIGURE 13-9 VertiPaq scans the product key to retrieve the IDs where brand equals Contoso.
The last step is to apply the filter on the Sales table. Since we already have the list of values of Sales[ProductKey], it is enough to scan the Sales[ProductKey] column to transform this list of values into row numbers and finally count them. If, instead of computing a COUNTROWS, VertiPaq had to perform the SUM of a column, then it would perform another step transforming row numbers into column values to perform the last step.
As you can see, this process is made up of simple table scanning where, at each step, you access a single column. However, because data in a column is in the same memory area, VertiPaq sequentially reads blocks of memory and performs simple operations on it, producing every time as output a small data structure that is used in the following step.
The process of resolving a query in VertiPaq is very different from what common sense would suggest. At the beginning, it is very hard to think in terms of columns instead of tables. The algorithms of VertiPaq are optimized for column scanning; the concept of a table is a second-class citizen in a columnar database.
Yet there are scenarios where the engine cannot use these algorithms and reverts to table scanning. Look, for example, at the following query:
EVALUATE ROW ( "Result", COUNTROWS ( SUMMARIZE ( Sales, Sales[ProductKey], Sales[CustomerKey] ) ) )
This query looks very innocent, and in its simplicity it shows the limits of columnar databases (but also a row-oriented database faces the same challenge presented here). The query returns the count of unique pairs of product and customer. This query cannot be solved by scanning separately ProductKey and CustomerKey. The only option here is to build a table containing the unique pairs of ProductKey and CustomerKey, and finally count the rows in it. Putting it differently, this time VertiPaq has to build a table, even if with only a pair of columns, and it cannot execute the query directly on the original store.
This step, that is, building a table with partial results, which is scanned later to compute the final value, is known as materialization. Materialization happens for nearly every query and, by itself, it is neither good nor bad. It all depends on the size of the table materialized. In fact, temporary tables generated by materialization are not compressed (compressing them would take a lot of time, and materialization happens at query time, when latency is extremely important).
It is significant to note that materialization does not happen when you access multiple columns from a table. It all depends on what you have to do with those columns. For example, a query such as the following does not need any materialization, even if it accesses two different columns:
EVALUATE ROW ( "Result", SUMX ( Sales, Sales[Quantity] * Sales[Net Price] ) )
VertiPaq computes the sum performing the multiplication while scanning the two columns, so there is no need to materialize a table with Quantity and Net Price. Nevertheless, if the expression becomes much more complex, or if you need the table for further processing (as it was the case in the previous example, which required a COUNTROWS), then materialization might be required.
In extreme scenarios, materialization might use huge amounts of RAM (sometimes more than the whole database) and generate very slow queries. When this happens, your only chance is to rewrite the calculation or modify the model in such a way that VertiPaq does not need to materialize tables to answer your queries. You will see some examples of these techniques in the following chapters of this book.