Data Aggregation
- By Mark Morowczynski, Rod Trent, Matthew Zorich
- 8/5/2024
- We Are Dealing with a Lot of Data Here
- Obfuscating Results
- Distinct and Count
- Min, Max, Average, and Sum
- Bins, Percentages, and Percentiles
- Lists and Sets
Min, Max, Average, and Sum
Counting totals and determining the distinct number of rows is just the start when it comes to using summarize. There are many additional statistical types of information we’ll frequently want to pull from our dataset, such as determining the first and last time something occurred. Perhaps you want to determine the average number of connections to a resource or the total amount of disk space consumed by your resources. There are aggregate functions to help you calculate these quickly.
Determining the Min and Max
A common scenario that will come up more often than you think is determining the first or last occurrence of something. You can use the min() or max() functions to find the minimum or maximum value of what is passed to it, such as finding the first time someone signed in to an application. Run the following query; your output should be similar to Figure 2-14:
SigninLogs | where TimeGenerated > ago (14d) | summarize TotalCount = count(), FirstEvent=min(TimeGenerated) by AppDisplayName | sort by FirstEvent asc
FIGURE 2.14 The first sign-in event in the application and the total sign-ins for that app
We can now quickly determine the first time a sign-in event was generated for that application and sort our results based on the earliest time. We can also do the opposite and determine the last time a sign-in event occurred for an application. To do that, we’ll use the max function. Update the query to match the one listed here; the output should be similar to Figure 2-15.
SigninLogs | where TimeGenerated > ago (14d) | summarize TotalCount = count(), LastEvent=max(TimeGenerated) by AppDisplayName | sort by LastEvent desc
FIGURE 2.15 The last sign-in event in the application and the total sign-ins for the app
The output is similar to our last result but now shows the last sign-in event for that application. As mentioned earlier, we can combine multiple summarize functions to refine our results further. We can get a side-by-side timeline view of the first and last events with just the min and max functions. Run the following query; your results should be similar to the output in Figure 2-16:
SigninLogs | where TimeGenerated > ago (14d) | summarize TotalCount = count(), FirstEvent = min(TimeGenerated), LastEvent=max(TimeGenerated) by AppDisplayName | project AppDisplayName, TotalCount, FirstEvent, LastEvent | sort by FirstEvent asc, LastEvent desc
FIGURE 2.16 The first and last sign-in event for each application and the total sign-ins for each application
Here, we are combining a few things that we’ve used so far in this book:
First, we use our new min and max aggregate functions to easily pull out the first and the last time a sign-in event occurred.
Next, we re-order the column’s output to put the functions’ results side by side to make it easier to see the difference.
Finally, we sort both columns, starting with the first event and then the last.
As we move into more advanced queries, you will see this similar pattern of combining multiple functions and filters, continuing to refine the query, and then formatting the output. You could easily add a filter for a specific user account to see this same information but for that user account.
Both min() and max() functions have a corresponding minif() and maxif() function. These work similarly to the countif() and dcountif() functions, where you can provide an expression to be evaluated; if the expression evaluates to true, it will then determine their min and max range.
The min and max functions return the value of a column, but what if you want the values for additional columns or find the columns where that value is located? You would use the arg_min() and arg_max() aggregate functions. You would provide the first column for which you want to find the minimum or maximum values, followed by the other columns for which you’d also like these values returned. You’d enter an asterisk (*) for all columns. Run the following query to find the minimum values of TimeGenerated; your output will be similar to Figure 2-17:
SigninLogs | where TimeGenerated > ago (14d) | summarize FirstEvent = arg_min(TimeGenerated, ConditionalAccessStatus, ClientAppUsed, AuthenticationRequirement) by AppDisplayName | sort by FirstEvent asc
FIGURE 2.17 The minimum value of TimeGenerated by application with the additional columns specified
Here, we are looking for the minimum value of TimeGenerated—the first result showing an application sign-in event. Then, we also included additional columns we want to see the values of when TimeGenerated is at its minimum value, such as conditional access status, the client application used to access the application, and finally, whether it was a single-factor or multifactor request. We can run a similar query using the arg_max and return all columns using a *. Run the following query; your output will be similar to Figure 2-18:
SigninLogs | where TimeGenerated > ago (14d) | summarize LastEvent = arg_max(TimeGenerated, *) by AppDisplayName | sort by LastEvent desc
This is similar to the minimum-value results, except we start with the most recent event and return all the columns in the table. The scrollbar at the bottom of Figure 2-18 shows that we have many more output columns to see all the values for each application’s most recent event.
FIGURE 2.18 Maximum value
Determining the Average and Sum
The final set of statistical functions we’ll look at in this section are average and summation. Just as you learned in school, these functions will find the avg(), otherwise known as the arithmetic mean, and sum(), which will find the sum of values in a column. Let’s run the following query to understand how these work; your output should be similar to Figure 2-19:
SigninLogs | where TimeGenerated > ago (14d) | summarize AvgCreatedTime = avg(CreatedDateTime)by AppDisplayName
FIGURE 2.19 The average time when a sign-in event occurred for each application
Here, we can see the average time an event was created per application. We can also expand this with the avgif() function. Like our previous aggregate functions that use an if function, we can evaluate an expression; if its results are true, that expression is used for the calculation. For this, let’s determine the average creation date if the user signed in from the US. Run the following query; your results should be similar to Figure 2-20:
SigninLogs | where TimeGenerated > ago (14d) | summarize AvgCreatedTime = avgif(CreatedDateTime, Location == "US")by AppDisplayName
FIGURE 2.20 Average time when a US sign-in occurred for each application
Similar to our previous results, we are now filtering on the average creation time if the sign-in came from the US. Some good examples of when to use average would be calculating the processor utilization or memory consumption of our IaaS virtual machines or even more advanced functionality from our Internet of Things (IoT) devices that might be reporting the temperature and humidity of their locations.
The next aggregate functions we will look at are sum() and sumif(). For these, you simply provide the column you want to summarize. The data type value in the column needs to be numeric, such as a decimal, double, long, or integer. For more information on data types, see Chapter 1, “Data Types and Statements.” Our sample sign-in logs don’t have any good columns to sum, so we are using a different table, AppPerformanceCounters, for this query because it has more data with values that can be totaled. Run the following query; the results should be similar to Figure 2-21:
AppPerformanceCounters | where TimeGenerated > ago(14d) | summarize sum(Value) by AppRoleName, Name
FIGURE 2.21 The sum of the application performance counters
Going through these performance counters for an application is a bit outside of the scope of this book, but the aggregate functions used so far can be applied to this table and columns. Understanding how much time an application has been executing or how much memory it has consumed might highlight places for optimization to drive some of the consumption costs down.
We can see that the Fabrikam-App handles 7,835 requests per second, more than ch1-usagegenfuncy37ha6, which performs 5,507 requests per second. We could have made this easier to read by only displaying that column. See “Visualizing Data” later in this chapter to see how to graph this data.
So far, everything we’ve been looking at is just doing the aggregate function for the 14-day timespan we’ve provided. In the previous example, Fabrikam-App handled 7,835 requests per second over those 14 days. Was one day busier for that application than another? Which day was the slowest day? Can we reduce our resource count? You could change your query to be only for the last day and run it daily, or you can have KQL do that using a concept called binning, which is covered next.