This blog post is part thirteen of the "Hunting with Splunk: The Basics" series. We might be abusing the term "basics" a titch more than usual, but I believe this subject is vital for organizations. I’m very proud of this piece that Andrew wrote, and I hope y'all like it as much as I do. – Ryan Kovar
The Problem
In this installment of "Hunting with Splunk: The Basics," we’re going to look at how to detect suspicious and potentially malicious network traffic to "new" domains. First, let’s delve into what we mean by "new" domains and why you should make a habit of detecting this activity in the first place.
Users (and applications) are creatures of habit. If your organization is typical, the domains that are requested from your network today have a tremendous amount of overlap with yesterday’s requests. If I look at my browsing, I visit the same 20 or so websites on a daily basis. After all, I can’t miss today’s xkcd or Dilbert! Applications making network requests from my laptop, phone, and other systems generally hit the same domains day in and day out as well.
But what about the small percentage of Internet domains that were requested from my network today, but were not previous destinations for my systems? This is what I mean by "new" domains. Sure, there’s going to be some legitimate traffic going to a few domains today that haven’t been seen on the network before, but it’s likely to be a small percentage of the overall set of domains. The remainder of these new, never-seen-before domains represent a potential threat.
Why should you consider network activity to new domains to be suspicious and potentially malicious? Malware and malicious actors regularly use domains they own or control for a variety of nefarious purposes. For example, an attacker-controlled domain can be used as a hub for command and control communications, while another domain is used for data exfiltration. These destinations may be dynamic DNS domains, domains or subdomains created with domain generation algorithms (DGAs), more legitimate-looking, human-readable domains, or something else entirely; all of these are standard techniques seen across a wide variety of modern attacks. By hunting for these new domains, we can increase the chances of finding threats and then quickly shift to investigation and mitigation.
Data Required
With this backdrop, let's discuss what data is needed. The short answer is any data in Splunk that has a field containing network requests to external domains. This could be data that neatly parses out a domain field; alternatively, we can extract domains from URLs.
Perhaps the best place to look for this data is in your web proxy logs. If you have this data in Splunk, you already have a massive repository of URLs being requested from your network. Use the free Splunkbase app URL Toolbox to extract domains from a URL.
Another good source of network traffic with domain requests is DNS data. You can get this from your outward-facing DNS servers, or use a wire data collection tool like Splunk Stream to pull it off the wire in JSON format.
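If DNS is your data source, the same pattern applies to the query field rather than a full URL. Here's a rough sketch using the URL Toolbox macro we'll walk through in a moment, assuming Splunk Stream's stream:dns sourcetype, where the queried names typically land in a multivalue query{} field; adjust the sourcetype and field names to match whatever your DNS source actually produces:

sourcetype=stream:dns
| rename "query{}" as query
| eval list="mozilla"
| `ut_parse_extended(query,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain

From here, everything that follows (baselining, lookups, alerting) works exactly the same way.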
The Approach
With a tremendous amount of thanks to SPL guru David Veuve, let's dive into some SPL ideas to hunt for these new domains. Make sure you've installed the free URL Toolbox app from Splunkbase for these searches to work. Start by validating that you have data: pull a list of domains, with the earliest and latest times each has been seen, from your proxy logs over the last 15 minutes.
Here's the search; let's go through how it works line by line:
tag=web url=*
| eval list="mozilla"
| `ut_parse_extended(url,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
The first line brings back our proxy data (checking for a value in the url field). You may need to change this in your environment by limiting it to a specific index and/or sourcetype instead of, or in addition to, the web tag. You can also optimize the search here by filtering out noisy destinations such as content delivery networks, known-good IPs, or IP ranges.
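For example, a filtered version of that first line might look something like this (the CDN domains shown are purely illustrative placeholders; substitute whatever noisy destinations dominate your own traffic):

tag=web url=* NOT (url=*gstatic.com* OR url=*cloudfront.net* OR url=*akamaized.net*)

The remaining lines of the search stay exactly the same.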
The eval command creates a new field, list, which we pass to the URL Toolbox macro on the following line; it tells the macro which list to use when parsing the URL ("mozilla" refers to the Mozilla Public Suffix List). The macro itself takes 2 parameters—the name of the field containing a URL (in this case, "url") and the list to use (in this case, "mozilla"). Note that, as with all macros in Splunk, these are `back ticks` and not 'single quotes.' The stats command then builds a table with the first time (earliest) and the most recent time (latest) we've seen each domain in our dataset, grouped by the ut_domain field that the macro extracted.
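If you want to eyeball what the macro extracts before aggregating, you can table a few of its output fields. ut_domain is the one we use here; ut_subdomain and ut_tld are other fields URL Toolbox's parsing produces (verify the exact field names against the version you have installed):

tag=web url=*
| eval list="mozilla"
| `ut_parse_extended(url,list)`
| table url ut_subdomain ut_domain ut_tld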
With any luck, your data should look somewhat like mine at this point. To find today's "new domains," compare today's domain requests to a baseline of the previous 6 days. This is easy enough to do by setting our time range to 7 days and expanding our previous search:
tag=web url=*
| eval list="mozilla"
| `ut_parse_extended(url,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| eval isOutlier=if(earliest >= relative_time(now(), "-1d@d"), 1, 0)
| convert ctime(earliest) ctime(latest)
The first 4 lines are the same as our original search. The 5th line is where the magic happens. The eval command creates a new field called "isOutlier." This command uses an if() function to determine if the earliest (first) time we've seen this domain in the dataset was within the last day (using the now() and relative_time() functions available in eval). The final line uses the convert command with the ctime() function to make the time fields human readable.
At this point, we can sort on the isOutlier field (click the column heading) to find our new domains. Alternatively, we can add | where isOutlier=1 to return only the new domains. If we wanted an alert, we could save the search after adding the where command and be notified when new domains are found.
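Putting that together, the alert-ready version of the search (the same search as above with the filter appended) would look something like this:

tag=web url=*
| eval list="mozilla"
| `ut_parse_extended(url,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| eval isOutlier=if(earliest >= relative_time(now(), "-1d@d"), 1, 0)
| convert ctime(earliest) ctime(latest)
| where isOutlier=1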
While this search does get the job done, it may not be optimal over the long term. First, as you might have noticed if you tried to run it, it can be slow. Second, we need to pull back 7 days' worth of data every time we run it, and even then our view is limited because we are only comparing today to the 6 days before it. What if last week included a federal holiday? Or if everyone decided to attend .conf18? This might not be a large enough baseline to avoid false positives.
Operationalizing and Tuning
Luckily with the power of Splunk we can solve these problems in a couple of ways. Basically, we are going to use Splunk’s lookup functionality to create a cache of previously seen domains and then run the search for new domains across a much smaller set of data, comparing it to the cache and updating the cache at the same time.
We will start by using Splunk to create an initial baseline cache for the previous 30 days and write it to a CSV lookup file in Splunk. This search will likely take a while to run, but after you’ve run it once, you won’t have to do this again. Our baseline-populating search looks something like this:
tag=web url=*
| eval list="mozilla"
| `ut_parse_extended(url,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| outputlookup previously_seen_domains.csv
This looks much like our very first search, except that it now writes the results out to a CSV file, which we'll use in the next search.
Once we have the baseline, we can create a search that compares the domains requested in the previous 15 minutes to the baseline. The search will update the CSV file with the new data (updating earliest and latest times for previously seen domains, and adding rows for new ones), while at the same time flagging any outliers. It will look something like this:
tag=web url=* earliest=-15m
| eval list="mozilla"
| `ut_parse_extended(url,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| inputlookup append=t previously_seen_domains.csv
| stats min(earliest) as earliest max(latest) as latest by ut_domain
| outputlookup previously_seen_domains.csv
| eval isOutlier=if(earliest >= relative_time(now(), "-1d@d"), 1, 0)
| convert ctime(earliest) ctime(latest)
| where isOutlier=1
The first 4 lines work as before, except this time we are limiting the search to the previous 15 minutes. The idea is that we will run this search as a correlation or alert search, and if Splunk returns any hits, we can use the Adaptive Response or alerting frameworks to take action, such as updating a risk score in Splunk Enterprise Security or kicking off a workflow.
The inputlookup command on the 5th line uses the append flag to retrieve the CSV file we created in our baseline step and add it to the data set from the last 15 minutes. The stats command on the 6th line then takes the "earliest earliest" and "latest latest" time for each domain, giving us a combined view of the previous 15 minutes and the baseline domain list. For domains already in the CSV, the latest value gets updated; domains not previously in the CSV are added with their earliest and latest seen times.
The remainder of the search simply writes the updated table to the same CSV lookup file, flags the outliers (new domains) as before, and cleans up the time formatting. Since we are alerting, the where command filters for just the outliers so you can take action on them.
Let's tstats That Search
Another way to optimize this search is to run it against CIM-compliant accelerated data models. All of the same principles from our previous searches apply, but we're going to take advantage of the speed of tstats (not familiar with tstats? Check out this response on Splunk Answers).
Assuming you have CIM-compliant data and populated data models, you can test the search by manually running it across 7 days like this:
| tstats count from datamodel=Web by Web.url _time
| rename "Web.url" as "uri"
| eval list="mozilla"
| `ut_parse_extended(uri,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| eval isOutlier=if(earliest >= relative_time(now(),"-1d@d"), 1, 0)
| convert ctime(earliest) ctime(latest)
The difference between the tstats version and our earlier search is in the first two lines. The search uses the tstats command, which is very fast against accelerated data. We then rename the default "Web.url" field to "uri" before passing it to the macro. The rest is exactly the same, but it runs MUCH faster.
To operationalize it with lookups, as above, we just need to make a few changes. The initial lookup populating search will look like this:
| tstats count from datamodel=Web by Web.url _time
| rename "Web.url" as "uri"
| eval list="mozilla"
| `ut_parse_extended(uri,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| outputlookup previously_seen_domains.csv
Similarly, an operationalized search would run every 15 minutes or so (with its time range set to the last 15 minutes), using the lookup file to supply the longer-term baseline while keeping the performance gains, like this:
| tstats count from datamodel=Web by Web.url _time
| rename "Web.url" as "uri"
| eval list="mozilla"
| `ut_parse_extended(uri,list)`
| stats earliest(_time) as earliest latest(_time) as latest by ut_domain
| inputlookup append=t previously_seen_domains.csv
| stats min(earliest) as earliest max(latest) as latest by ut_domain
| outputlookup previously_seen_domains.csv
| eval isOutlier=if(earliest >= relative_time(now(), "-1d@d"), 1, 0)
| convert ctime(earliest) ctime(latest)
| where isOutlier=1
Hopefully, this gives you a number of methods to hunt for new domains and perhaps provides other ideas for more "first seen" threat hunting. For further exploration of these concepts, I’d strongly recommend checking out the awesome Splunk Security Essentials app at Splunkbase.