Data Aggregation
- By Mark Morowczynski, Rod Trent, Matthew Zorich
- 8/5/2024
- We Are Dealing with a Lot of Data Here
- Obfuscating Results
- Distinct and Count
- Min, Max, Average, and Sum
- Bins, Percentages, and Percentiles
- Lists and Sets
Bins, Percentages, and Percentiles
As we continue to analyze more of our data, we’ll often need ways to group it by different segments to answer questions. What day of the week was the most active? Which month of the year was the least active? We will use a common technique called binning to accomplish this and more. We’ll also frequently need to quickly convert the data into something a little easier to understand. Showing the percentage, or the 25th or 95th percentile distribution of the data, will help you tell a story with it.
Grouping Data By Values (Binning)
Binning, which you’ll see done with the bin() function (also called floor()), allows you to group your datasets by a smaller, specific set of values. The bin function takes two parameters:
- The first is the value you want to round down. This can be of type int, long, real, datetime, or timespan. (You’ll end up using timespan often.)
- The second parameter is the bin size by which the values will be divided. This can be of type int, long, real, or timespan.
The most common type of binning is by a date interval, frequently a per-day interval: bin(TimeGenerated, 1d). Another type of binning could be on different size groupings. For example, you could query how much free space is on each disk across your entire fleet and then bin the results into 50 GB intervals to see how many disks fall into each bucket, as in the sketch that follows.
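As a minimal sketch of that idea, assuming your workspace collects the Perf table with the LogicalDisk counter Free Megabytes (the table, counter, and column names here are assumptions and will vary by environment), the query might look something like this:

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "LogicalDisk" and CounterName == "Free Megabytes"
| summarize FreeSpaceMB = max(CounterValue) by Computer, InstanceName   // highest reported free space per disk in the window
| summarize Disks = count() by bin(FreeSpaceMB, 50000)                  // 50,000 MB buckets, roughly 50 GB
| sort by FreeSpaceMB asc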
Let’s run through a few examples of using per-day bins. Run the following query; your results should be similar to Figure 2-22.
SigninLogs
| where TimeGenerated > ago(14d)
| where ResultType == 0
| summarize SuccessfulSignIn=count() by bin(TimeGenerated, 1d)
| sort by TimeGenerated asc
FIGURE 2-22 Daily successful sign-in count
We first filter for successful sign-ins. In the previous examples, we counted them across the entire 14 days, but now you can see that some days are busier than others. For most organizations, this is expected because people are off work on the weekend. The ability to bin by date is extremely useful, and we’ll use this functionality multiple times throughout this book.
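To answer the “What day of the week was the most active?” question from earlier more directly, one hypothetical variation (not shown in the figures) is to group on the day of the week instead of the calendar date using dayofweek():

SigninLogs
| where TimeGenerated > ago(14d)
| where ResultType == 0
| extend DayOfWeek = dayofweek(TimeGenerated)   // returns a timespan: 0.00:00:00 = Sunday, 1.00:00:00 = Monday, and so on
| summarize SuccessfulSignIn = count() by DayOfWeek
| sort by SuccessfulSignIn desc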
Let’s also look at our previous application example, where we looked at how many requests per second it handled. We can simply add binning to our existing query to break that summarized column down by a daily time interval. Run the following query; your output should be similar to Figure 2-23:
AppPerformanceCounters
| where TimeGenerated > ago(14d)
| where Name == "Requests/Sec" and AppRoleName == "Fabrikam-App"
| summarize sum(Value) by AppRoleName, Name, bin(TimeGenerated, 1d)
| project TimeGenerated, AppRoleName, Name, sum_Value
| sort by TimeGenerated asc
FIGURE 2-23 Total requests per second, per day
We made a few small modifications to the original query. First, we filtered for only the application and performance counter we were interested in. Our summarize function is the same as before, except we added a 1-day bin interval. We then cleaned up the output and sorted by date. If you wished any of the previous queries had been broken down by different intervals, feel free to alter them using the bin function, as in the sketch that follows.
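For example, a hypothetical tweak of the same query bins the requests by hour instead of by day; only the bin size changes:

AppPerformanceCounters
| where TimeGenerated > ago(14d)
| where Name == "Requests/Sec" and AppRoleName == "Fabrikam-App"
| summarize sum(Value) by AppRoleName, Name, bin(TimeGenerated, 1h)   // hourly buckets instead of daily
| project TimeGenerated, AppRoleName, Name, sum_Value
| sort by TimeGenerated asc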
Percentage
Calculating percentages is another common task. There is no built-in “to percentage” function, but we can calculate one ourselves using the todouble() function, dividing values, and multiplying the result by 100, just as you would by hand. Let’s use an example with real-life recommendations and combine it with some of the new KQL skills you’ve picked up so far. What percentage of sign-ins use single-factor authentication versus multifactor authentication? The summarize count() and countif() functions will tally the total and the number of each authentication method, and then we use extend to calculate the percentages. Run the following query; your results should be similar to Figure 2-24:
SigninLogs
| where TimeGenerated > ago(14d)
| where ResultType == 0
| project TimeGenerated, AppDisplayName, UserPrincipalName, ResultType, ResultDescription, AuthenticationRequirement, Location
| summarize TotalCount=count(), MultiFactor=countif(AuthenticationRequirement == "multiFactorAuthentication"), SingleFactor=countif(AuthenticationRequirement == "singleFactorAuthentication")
| extend ['MFA Percentage']=(todouble(MultiFactor) * 100 / todouble(TotalCount))
| extend ['SFA Percentage']=(todouble(SingleFactor) * 100 / todouble(TotalCount))
FIGURE 2-24 Percentage of MFA and single-factor sign-ins
Thankfully, this is a test environment because those numbers look bad. If you see similar numbers in your production environment, stop reading and roll out multifactor authentication immediately.
Let’s break down this query. The beginning is the normal stuff, where we filter by time and for successful sign-ins. Then, we project the columns we want to work with and summarize the total count of all sign-ins, along with separate totals for single-factor and multifactor sign-ins.
Now, we calculate the single-factor and multifactor percentages by casting the single-factor and multifactor counts to double using the todouble() function, dividing each by the total count, and multiplying by 100. Remember, as covered in the “Numerical Operators” section in Chapter 1, data types can impact the results of numerical calculations. As you can see, less than 1 percent of sign-ins use multifactor authentication!
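To see why the cast matters, here is a quick hypothetical illustration (the numbers are made up): without todouble(), the division happens on integers and the fractional part is truncated.

print IntegerMath = 5 * 100 / 200, DoubleMath = todouble(5) * 100 / 200
// IntegerMath returns 2; DoubleMath returns 2.5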
We can also round these results using the round() function, where you pass in the number you want to round and how many digits of precision you want. We’ll use 2 and 3 digits in the query below to show you the difference. Update your previous query to the following; your results will be similar to Figure 2-25:
SigninLogs
| where TimeGenerated > ago(14d)
| where ResultType == 0
| project TimeGenerated, AppDisplayName, UserPrincipalName, ResultType, ResultDescription, AuthenticationRequirement, Location
| summarize TotalCount=count(), MultiFactor=countif(AuthenticationRequirement == "multiFactorAuthentication"), SingleFactor=countif(AuthenticationRequirement == "singleFactorAuthentication")
| extend ['MFA Percentage']=round((todouble(MultiFactor) * 100 / todouble(TotalCount)), 2)
| extend ['SFA Percentage']=round((todouble(SingleFactor) * 100 / todouble(TotalCount)), 3)
FIGURE 2-25 The rounded percentage of multifactor sign-ins and single-factor sign-ins
As you can see, you can alter how many digits you round to. This is one of those common tactics you will use repeatedly when calculating percentages; the short sketch that follows shows the precision parameter on its own.
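If you want to see the effect of the precision parameter in isolation, a quick standalone example (the value is made up) makes the difference obvious:

print TwoDigits = round(66.66666, 2), ThreeDigits = round(66.66666, 3)
// TwoDigits returns 66.67; ThreeDigits returns 66.667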
Percentiles
What if you wanted to determine the value that a specific percentage of a column’s values fall at or below? For that, we’ll need to use the percentile() or percentiles() functions. Percentile() takes two parameters: the column you want to use for the calculation and the percentile you want to estimate for that sample set; it returns a value that is equal to or larger than that percentage of the values. Percentiles() works similarly, except you can specify multiple comma-separated percentiles. Let’s go back to the AppPerformanceCounters table and run the following query; your results should be similar to Figure 2-26:
AppPerformanceCounters
| where TimeGenerated > ago(14d)
| where Name == "Available Bytes"
| summarize percentile(Value, 50) by AppRoleName, Name
FIGURE 2-26 The 50th percentile value for Available Bytes per application
Here, for each application, we can see the value of Available Bytes that is equal to or larger than 50 percent of the observed values. We can get the values for multiple percentiles using percentiles(). Update your query to the following; your output will be similar to Figure 2-27:
AppPerformanceCounters
| where TimeGenerated > ago(14d)
| where Name == "Available Bytes"
| summarize percentiles(Value, 25, 50, 75) by AppRoleName, Name
FIGURE 2-27 The 25th, 50th, and 75th percentile values for Available Bytes per application
These values fall along the 25th, 50th, and 75th percentiles. This type of query is very useful when you are trying to size and allocate resources, such as choosing a virtual machine size or an Azure App Service plan for capacity planning, or when looking at usage spikes. You can also leverage it when looking for anomalies or outliers in your datasets. For example, a simple test application that authenticates 100 times a day isn’t particularly concerning on its own. However, if you looked at the percentiles of sign-ins per application and found that it sat at or above the 95th percentile, that would probably be a big cause for concern. A simple test application should not be one of the most signed-in-to applications in our environment. Either something is misconfigured, or it’s being used in a way outside its normal scope. Percentiles can help highlight those types of behaviors, as in the sketch that follows.
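As a minimal sketch of that idea, assuming the same SigninLogs table used earlier (the let statements and the SignInCount and P95 names are hypothetical, not from the original examples), you could count sign-ins per application and surface the applications at or above the 95th percentile:

let AppSignIns = SigninLogs
| where TimeGenerated > ago(14d)
| where ResultType == 0
| summarize SignInCount = count() by AppDisplayName;
let P95 = toscalar(AppSignIns | summarize percentile(SignInCount, 95));
AppSignIns
| where SignInCount >= P95      // applications at or above the 95th percentile of sign-in volume
| sort by SignInCount desc

Any application that shows up here deserves a closer look, especially one you would expect to be low-volume, such as a test app.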