The VertiPaq Engine in DAX

11/3/2015

Contents

Understanding database processing
Introduction to columnar databases
Understanding VertiPaq compression
Understanding segmentation and partitioning
Using Dynamic Management Views
Understanding materialization
Choosing hardware for VertiPaq

Using Dynamic Management Views

SSAS lets you discover all the information about the data model using Dynamic Management Views (DMV). DMVs are extremely useful to explore how your model is compressed, the space used by different columns and tables, the number of segments in a table, or the number of bits used by columns in different segments.

You can run DMVs from inside SQL Server Management Studio or, better, using DAX Studio. Moreover, DAX Studio offers you the list of all DMV in a simpler way, without the need to remember them or to reopen this book looking for the DMV name you forgot. You can also use the free tool VertiPaq Analyzer (http://www.sqlbi.com/tools/vertipaq-analyzer/) to automatically retrieve data from DMVs and see them in useful reports.

Although DMVs use an SQL-like syntax, you cannot use full SQL syntax to query them, because they do not run inside SQL Server, they are a convenient way to discover the status of SSAS and to gather information about data models. Moreover, DMVs were created when SSAS supported only Multidimensional, so the information provided is not optimized for Tabular. For example, when you query the column information, you get as a result CUBE_NAME, MEASURE_GROUP_NAME, and DIMENSION_NAME, although in VertiPaq there is no concept of cube, measure group, or dimension.

There are different DMVs, divided in two main categories:

SCHEMA views. These return information about SSAS metadata, such as database names, tables, and individual columns. They do not provide statistical information. Instead, they are used to gather information about datatypes, names, and similar data.
DISCOVER views. They are intended to gather information about the SSAS engine and/or discover statistics information about objects in a database. For example, you can use views in the discover area to enumerate the DAX keywords, the number of connections and sessions that are currently open, or the traces running.

In this book, we do not want to describe the details of all those views, because they would be off-topic. If you need more information, you can find it in Microsoft documentation on the web. Instead, we want to give some hints and point out the most useful DMVs related to databases used by DAX, which are in the DISCOVER area.

Moreover, while many DMVs report useful information in many columns, in this book we describe the most interesting ones related to the internal structure. For example, there are DMVs to discover the datatypes of all columns, which is not interesting information from the modeling point of view (it might be useful for client tools, but it is useless for the modeler of a solution). On the other hand, knowing the number of bits used for a column in a segment is very technical and definitely useful to optimize a model, so we highlighted it.

Using DISCOVER_OBJECT_MEMORY_USAGE

The first, and probably the most useful DMV lets you discover the memory usage of all the objects in the SSAS instance. This DMV returns information about all the objects in all the databases in the SSAS instance, and it is not limited to the current database.

SELECT * FROM $SYSTEM.DISCOVER_OBJECT_MEMORY_USAGE

The output of the DMV is a table containing many rows that are very hard to read, because the structure is a parent/child hierarchy that starts with the instance name and ends with individual column information.

The most useful columns in the dataset are as follows:

OBJECT_ID: Is the ID of the object of which it is reporting memory usage. By itself, it is not a key for the table. You need to combine it with the OBJECT_PARENT_PATH to make it a unique value working as a key.
OBJECT_PARENT_PATH: Is the full path of the parent in the parent/child hierarchy.
OBJECT_MEMORY_NON_SHRINKABLE: Is the amount of memory used by the object.

As we said, the raw dataset is nearly impossible to read. However, you can build a Power Pivot data model on top of this query, implementing the parent/child hierarchy structure and browse the full memory map of your instance. Kasper De Jonge published a workbook on his blog that does exactly this, and you can find it here: http://www.powerpivotblog.nl/what-is-using-all-that-memory-on-my-analysis-server-instance/

Using DISCOVER_STORAGE_TABLES

The DISCOVER_STORAGE_TABLES DMV is useful to quickly discover tables in a model. It returns only the tables of the current model. In reality, despite its name, it returns tables, hierarchies, and relationships, but the important information is the set of tables.

The most important columns are as follows:

DIMENSION_NAME: Even if it is named “dimension,” for Tabular models it is the table name.
TABLE_ID: The internal ID of the table, which might be useful to create relationships, because it contains the GUID used by SSAS as a suffix on most table names. Hierarchies and relationships, reported by the same DMV, have an ID starting with H$ and R$, respectively.
TABLE_PARTITIONS_COUNT: This represents the number of partitions of the table.
ROWS_COUNT: It is the total number of rows of the table.

A typical usage of this DMV is to run a query similar to the following one, which returns table name and number of rows for only the tables (by checking the first characters of DIMENSION_NAME and TABLE_ID).

SELECT
    DIMENSION_NAME AS TABLE_NAME,
    ROWS_COUNT AS ROWS_IN_TABLE
FROM  $SYSTEM.DISCOVER_STORAGE_TABLES
WHERE DIMENSION_NAME = LEFT ( TABLE_ID, LEN ( DIMENSION_NAME ) )
ORDER BY DIMENSION_NAME

Using DISCOVER_STORAGE_TABLE_COLUMNS

This DMV gives you detailed information about individual columns, either in tables or in relationships and hierarchies. It is useful to discover, for example, the size of the dictionary, its datatype, and the kind of encoding used for the column.

Most of the information is useful for columns, while it is of less use for hierarchies (either user or system hierarchies). Specifically, for hierarchies, the only useful information is the total size, because other attributes depend directly on the columns they use.

The most relevant columns are as follows:

DIMENSION_NAME: Even if it is named “dimension,” for Tabular models it is the table name.
TABLE_ID: The internal ID of the table, which might be useful to create relationships, because it contains the GUID used by SSAS as a suffix on most table names. Hierarchies and relationships, reported by the same DMV, have an ID starting with H$ or R$.
COLUMN_ID: For columns, it is the column name, while for hierarchies it indicates ID_TO_POS or POS_TO_ID, which are internal names for hierarchy structures.
COLUMN_TYPE: Indicates the type of column. Standard columns contain BASIC_DATA, whereas hierarchies contain different internal names of no interest for this book.
COLUMN_ENCODING: Indicates the encoding used for the column: 1 stands for hash (dictionary encoding), 2 is value encoding.
DICTIONARY_SIZE: Is the size, in bytes, of the dictionary of the column.

For example, to retrieve table name, column name, and dictionary size of all columns in your model, you can run this query:

SELECT
    DIMENSION_NAME AS TABLE_NAME,
    COLUMN_ID AS COLUMN_NAME,
    DICTIONARY_SIZE AS DICTIONARY_SIZE_BYTES
FROM
    $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMNS
WHERE COLUMN_TYPE = 'BASIC_DATA'

Using DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS

Among the various DMVs, this is the most detailed one, because it reports information about individual segments and partitions of columns. Its content is very detailed and might be overwhelming. Thus, it makes sense to use this information only when you are interested in squeezing the size of a large table, or in performing some kind of extreme optimization.

The most relevant columns are as follows:

DIMENSION_NAME: Even if it is named “dimension,” for Tabular models it is the table name.
TABLE_ID: The internal ID of the table.
COLUMN_ID: For columns, it is the column name, whereas for hierarchies it indicates ID_TO_POS or POS_TO_ID, which are internal names for hierarchy structures.
SEGMENT_NUMBER: The number of the segment reported, zero-based.
TABLE_PARTITION_NUMBER: The number of the partition to which the segment belongs.
RECORDS_COUNT: The number of rows in the segment.
ALLOCATED_SIZE: The size allocated for the segment.
USED_SIZE: The size actually used for the segment.
COMPRESSION_TYPE: Indicates the compression algorithm used for the column in the segment. Its content is private and not documented, because the algorithm is patented.
BITS_COUNT: Number of bits used to represent the column in the segment.
VERTIPAQ_STATE: can be SKIPPED, COMPLETED, or TIMEBOXED: And it indicates if the engine had the option to find the optimal sorting for the segment (COMPLETED), if it used the best found during the time it was allowed to use but stopped before finding the optimal one (TIMEBOXED), or if the sorting step was skipped (SKIPPED).