Data Aggregation

  • 8/5/2024

Distinct and Count

Some common scenarios you will need to repeat repeatedly are narrowing down to the distinct number of elements returned and counting the elements. Often, you’ll want to combine those two things! We can do all that and much more.

Distinct

We’ll start with the distinct operator, which will return the results based on the distinct combination of columns you provide. We’ll start by trying to answer a simple question: How many different user agents are being used in the environment? If we run our query as we did in Figure 2-1, we’ll see we have many different records; see Figure 2-2.

SigninLogs
| where TimeGenerated > ago (14d)
| project UserAgent
FIGURE 2.2

FIGURE 2.2 User agents that have been used in the last 14 days

As you can see, in the last 14 days, we had 24,696 sign-ins, and the list of the different user agents available seems pretty varied. The first two results are the same; if we look near the bottom, the third and fifth results are the same. But to answer our question, we need to remove the duplicates and only return unique values. Let’s try our query again, but instead of using project, let’s use the distinct operator in its place. The results should look similar to Figure 2-3.

SigninLogs
| where TimeGenerated > ago (14d)
| distinct UserAgent
FIGURE 2.3

FIGURE 2.3 Distinct user agents that have been used in the last 14 days

Our dataset was further reduced to 154 unique UserAgent strings in this environment. We need to work on some of our device management and patching to reduce this number further and ensure that our environment is uniform. A few other things now easily stick out. First, the last row shows a user using Firefox on Ubuntu. Do our security policies and Microsoft Entra ID conditional access policies apply to the Linux platform? If not, we probably need to turn this insight into action and update our policies. Also, third from the bottom is the axios/0.21.4 user agent. This looks very different from our other user agents. Is this expected in this environment? It’s hard to say; this is a demo environment, so probably.

Looking through these types of results in your own data can lead to many interesting discoveries. Besides finding gaps in their Microsoft Entra ID conditional access policies, we’ve had customers find pockets of computers that were never upgraded to the latest operating system, running unpatched and unsupported in production. We can do a few other things to make important findings stand out a bit more, which we’ll get to shortly.

The distinct operator isn’t limited to one column. You can add multiple columns in your query and get the distinct values of that combination. Let’s expand on the previous scenario, where we looked for the unique number of user agents being used and now extend it to which user agents are accessing which applications. We can easily update our query to include applications. Run the following query and add the sorting direction for clarity. Your query should look similar to Figure 2-4:

SigninLogs
| where TimeGenerated > ago (14d)
| distinct AppDisplayName, UserAgent
| sort by AppDisplayName asc
FIGURE 2.4

FIGURE 2.4 Distinct applications and the user agents accessing them

We can now tell the unique instance of each user agent mapped to which application they were accessing. About halfway down the screen, we see five different UserAgent strings used against the AXA Google Cloud Instance application. This is easy enough for us to see, and we can actually see one of those browsers is much older than the others: Chrome 113. But what if we also need to determine the count across all the applications and user agents/browsers?

Summarize By Count

Before we can answer that question directly, we need to introduce a new operator: summarize. We’ll use this frequently in this chapter and the rest of the book. The summarize operator will summarize data and produce a table of the aggregated results. There are several aggregate values, such as count(), dcount(), countif(), and dcountif(), which we’ll discuss in this section. We’ll cover additional aggregate values later in this chapter, such as finding the minimum and maximum values.

The summarize operator follows an input pattern of first specifying a column name for the outputted results of the query you are about to run. This is optional; if nothing is chosen, the default name will be used. The second input is the name of the aggregate function you are using, such as count or dcount. The next output determines which column(s) you want passed through the aggregate function. That seems complicated, but you’ll see shortly that this can be extremely powerful.

We’ll start the first query with summarize, similar to what we did in the previous chapter, by selecting a random sample value—in this case, a table column—and pass it into the aggregate function. To do this, we will use the take_any() aggregate function. Note that any() has been deprecated. Run the following query; your output should be similar to Figure 2-5:

SigninLogs
| where TimeGenerated > ago (14d)
| project TimeGenerated, UserAgent, AppDisplayName
| summarize take_any(*)
FIGURE 2.5

FIGURE 2.5 A random sample row has been returned

This query returned a random row, and we altered our output to show the TimeGenerated, the UserAgent, and AppDisplayName columns. If we wanted to see just the value for UserAgent with summarize, we could also do that by specifying that column in the take_any() function.

Because we have a good handle on the UserAgent value, let’s try and answer a question: Which UserAgent string values do we have in this environment, and how often do they show up? To do that, run the following query; your output should look similar to Figure 2-6.

SigninLogs
| where TimeGenerated > ago (14d)
| summarize count() by UserAgent
FIGURE 2.6

FIGURE 2.6 UserAgents by how many times they were found

Again, a few things should stick out. First, we didn’t provide a column name for the count() aggregation, so it’s just named count_. We can set that display value, which we will do in the next query. Second, we have a wide range of values for count. A good operational practice is to look at the longer tail of these results by looking at user agents that have only a handful of results, which might identify clients that need to be updated or an attacker that has misspelled a user agent name when trying to blend in with the normal traffic. Run the following query; the output will be similar to Figure 2-7.

SigninLogs
| where TimeGenerated > ago (14d)
| summarize UserAgentCount = count() by UserAgent
| sort by UserAgentCount asc
FIGURE 2.7

FIGURE 2.7 UserAgents by how many times they were found, sorted from least to most

Many user agents have only been seen once in the last 14 days. But python-requests/2.28.1 sticks out; we should investigate it. We can add additional columns to the count() by. This will allow us to determine which user agent accessed each application. Run the following query; your output will be similar to Figure 2-8.

SigninLogs
| where TimeGenerated > ago (14d)
| summarize UserAgentCount = count() by UserAgent, AppDisplayName
| sort by UserAgent desc
FIGURE 2.8

FIGURE 2.8 UserAgents Sorted Z to A with what apps they accessed

The python-requests/2.28.1 request accessed the Microsoft Azure CLI application once. But even more interesting, we see other user agents named python-requests in this environment. Look to see what information you uncover in your environment.

We can also look at this query from the application perspective if we want to know which application has been accessed the most by which user agent. To determine this, we’ll simply flip our count() by. Instead of counting by user agent, we’ll count by application and show which user agent is accessing that application the most. Run the following query; your output should be similar to Figure 2-9.

SigninLogs
| where TimeGenerated > ago (14d)
| summarize AppDisplayNameCount = count() by AppDisplayName, UserAgent
| sort by AppDisplayNameCount desc
FIGURE 2.9

FIGURE 2.9 Most-accessed application by user agent

In this demo environment, the Azure Portal application with an Edge browser version 121.0.0.0 was used 2,653 times. At the start of this section, we focused on getting the distinct set of results returned, but we had to count manually. Then, we used a count() of the results returned, but these are not distinct. Let’s combine both of these with the aggregate function dcount(), which allows us to get the estimated distinct count by passing the column for which we want to get a distinct count and which additional columns we want to aggregate/group the data by. Let’s take our current example. What user agent is accessing the most unique applications? Run the following query; your output should be similar to Figure 2-10.

SigninLogs
| where TimeGenerated > ago (14d)
| summarize AppDisplayNameCount = dcount(AppDisplayName) by UserAgent
| sort by AppDisplayNameCount desc
FIGURE 2.10

FIGURE 2.10 Distinct applications and how many times a user agent has accessed them

This is extremely useful information as we can see our most used user agent in the environment regarding the total number of applications it is accessing. Sorting the opposite way is also interesting to see what user agent is accessing only a small number of apps. These might be good candidates to be updated and brought into the standard browser versions for the environment.

We can also flip this. What if we want to see how many unique user agents access each application? We can see this number pretty quickly by getting the dcount() for the UserAgent column and grouping by application. Run the following query; your results should be similar to Figure 2-11:

SigninLogs
| where TimeGenerated > ago (14d)
| summarize UserAgentCount = dcount(UserAgent) by AppDisplayName
| sort by UserAgentCount desc
FIGURE 2.11

FIGURE 2.11 Counting the distinct user agents and which applications they accessed

This is even more interesting; 100 different user agents access the Azure Portal! Thankfully, this is a test environment, but this tells a compelling story. Many customers will have their own line-of-business (LOB) applications in Microsoft Entra ID. Running a similar query and seeing many user agents will show the possible browsers that would need to be tested to ensure compatibility. That’s great data for the leadership team to show why standardization on specific versions should be warranted.

There are two other similar aggregation functions to count and dcount: countif and dcountif. These functions allow you to count the rows if the expression passed to it evaluates true. For example, we have many applications in our Microsoft Entra ID tenant. We want to be able to determine the number of access attempts per application, and we want to see how many occurred in the US region. You could accomplish this by running two separate queries, one for the total count and then another where you filter based on location. But with countif, you can accomplish this in one query and see the results side by side. Run the following query; your results should be similar to the output in Figure 2-12:

SigninLogs
| where TimeGenerated > ago(14d)
| summarize TotalCount = count(), USLogins=countif(Location == "US") by AppDisplayName
| sort by USLogins desc
FIGURE 2.12

FIGURE 2.12 Total logins per application and total US logins

This view is much easier to read than two separate queries. Those with a sharp eye will also notice that we combined two summarize aggregate functions. Like how we combined multiple data-filtering methods in Chapter 1, we can do some powerful things by combining those functions. We highlight a few of those throughout this chapter.

Going a step further, how many unique user agents are using that application in that US region? Again, we could run separate queries like before, but combining them is much more useful, so we will use the dcountif() to only count the distinct rows that evaluate to true based on the expression. Run the following query; the output should be similar to Figure 2-13:

SigninLogs
| where TimeGenerated > ago(14d)
| summarize TotalCount = count(), USUserAgent=dcountif(UserAgent,
Location == "US") by AppDisplayName
| where USUserAgent > 0
| sort by USUserAgent desc
FIGURE 2.13

FIGURE 2.13 Total logins per application and by US access

The dcountif function evaluates the column you want to have the distinct count of when the expression is evaluated to true. In this example, we are looking for the unique number of user agents when the location is US. Next, we grouped them by application display name (AppDisplayName).

You’ll also notice we then have another where operator after summarize. So far in this book, we have filtered first and then done something with the output. You can continue filtering your query to drill down to the data you are interested in. In this example, we then filter out all the results that don’t have a value and sort by descending order so the largest is at the top. Filtering and re-analyzing the data will be something we do repeatedly in the more advanced chapters of the book.

There is one last thing to know about dcount() and dcountif(). Earlier, we said that it provides an estimate of distinct values. If you need complete accuracy, you can use count_distinct() or count_distinctif(), which are limited to 100 million unique values. We are trading accuracy for speed because dcount() and dcountif() functions estimate based on the cardinality of the dataset. They are also less resource-intensive. If you only need an estimate, use dcount() or dcountif().