Detecting attacks and improving response through the use of real-time security features

What do I mean by features?

When you create a machine learning model to detect anything from fraudulent actions on a website to anomalous SSH access to a web server, you need to provide the model with information it can use to make a prediction. These pieces of information are normally called features. The standard definition of a feature in machine learning is “an individual measurable property or characteristic of a phenomenon being observed”. While most solutions that create these kinds of features do so in an out-of-band process that might be minutes or even hours behind, this post focuses on features that need to be created in real time or near real time.

Beyond Machine Learning

Besides their use in machine learning, there are other situations where features can enhance the detection and response capabilities of security teams. Having built or helped build several SEMs and SIEMs in my career, one of the most common problems I faced was the lack of context when looking at a log line. You would usually have to pivot through different parsed components of the log to get a complete picture of the activity, but what if we could have that picture in the same parsed log? Then both the system and the analyst could understand not only what happened in a log line, but also the surrounding context.

As an example, this would be a high-level diagram of one of the SIEMs I use for personal projects:

In most cases, the process would involve logs being consumed, parsed, in some cases enriched with external or internal sources and finally delivered to some sort of storage. It’s in the middle of this process where we can add another step after enriching the log line with information, which is adding context to the processed log.

If we were to explode the SIEM component we would see the following:

Most people will be familiar with all the boxes except for the feature storage and the graph database, which are cornerstone components if you want to generate security features that can be used in real time and scale to millions of log lines per second. In this case, the feature storage (for example HBase) is used to compute the counter features in real time, and the graph database (for example JanusGraph) is used to store graph features.

To go through a specific example, a normal log line might look like this:

Aug 16 09:11:23 zero sshd[25177]: Invalid user unknown_user2 from 34.90.104.104 port 31632

After being enriched with data from databases, APIs or static lists, it might look like:

{
  "ssh_access_failure": "preauth",
  "created_date": "2019-08-16 09:11:23",
  "hostname": "zero",
  "source_ip": "34.90.104.104",
  "created_epoch": "1565946683",
  "log_type": "sshd",
  "user": "unknown_user2",
  "source_ip_is_known_cc": 0,
  "source_isp": "Google Cloud"
}

In this case, the last two entries would be populated through enrichments of the log source: one uses threat intelligence to check whether the source of the connection is a known malicious host, and the other performs a static lookup of the ISP the IP address belongs to.

While this enriches the information in the log line, it’s still missing quite a bit of context. Is this the first time the IP address tried to log in to the box? In general, how much traffic are we seeing from this particular source IP address?
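A minimal sketch of the parsing and enrichment step, assuming plain Python and in-memory lookups. `KNOWN_CC_HOSTS`, `ISP_BY_PREFIX` and `parse_and_enrich` are hypothetical stand-ins for a threat intelligence feed, an ISP database and the parser. The year and the UTC time zone are also assumptions, since the syslog line carries neither.

```python
import re
from datetime import datetime, timezone

# Pattern for the sshd "Invalid user" line shown above.
INVALID_USER = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<hostname>\S+) "
    r"(?P<log_type>\w+)\[\d+\]: Invalid user (?P<user>\S+) "
    r"from (?P<source_ip>\S+) port \d+$"
)

# Hypothetical stand-ins for a threat intelligence feed and an ISP lookup.
KNOWN_CC_HOSTS = set()
ISP_BY_PREFIX = {"34.90.": "Google Cloud"}


def parse_and_enrich(line, year=2019):
    """Turn one sshd log line into an enriched document, or None if it does not match."""
    match = INVALID_USER.match(line)
    if match is None:
        return None
    # Syslog timestamps carry no year or zone: the year is passed in, UTC is assumed.
    created = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
    created = created.replace(tzinfo=timezone.utc)
    ip = match["source_ip"]
    isp = next(
        (name for prefix, name in ISP_BY_PREFIX.items() if ip.startswith(prefix)),
        "unknown",
    )
    return {
        "ssh_access_failure": "preauth",
        "created_date": created.strftime("%Y-%m-%d %H:%M:%S"),
        "hostname": match["hostname"],
        "source_ip": ip,
        "created_epoch": str(int(created.timestamp())),
        "log_type": match["log_type"],
        "user": match["user"],
        "source_ip_is_known_cc": int(ip in KNOWN_CC_HOSTS),  # threat intel enrichment
        "source_isp": isp,                                   # static ISP enrichment
    }


document = parse_and_enrich(
    "Aug 16 09:11:23 zero sshd[25177]: Invalid user unknown_user2 "
    "from 34.90.104.104 port 31632"
)
```

A real pipeline would replace the two lookups with calls to whatever sources it already uses; the point is only that enrichment adds fields about the log line itself, not about what happened around it.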

At this stage, products might throw out a document to be stored, which means an analyst would have to start pivoting through different parts of the document, be it the IP address or the username to try and understand what was happening around that time, what actions led to the behavior witnessed in the original log.

Here is where features come into play. They not only increase the context of the log line, but also allow analysts to perform more intelligent queries and rule engines to apply context-aware rules.

{
  "ssh_access_failure": "preauth",
  "created_date": "2019-08-16 09:11:23",
  "hostname": "zero",
  "source_ip": "34.90.104.104",
  "created_epoch": "1565946683",
  "log_type": "sshd",
  "user": "unknown_user2",
  "source_ip_is_known_cc": 0,
  "source_isp": "Google Cloud",
  "ip_total_this_hour_sliding_window": "26",
  "user_total_for_ssh_access_failure_in_hour_sliding_window": "3",
  "user_total_this_hour_sliding_window": "3",
  "user_to_ip_counter": "1"
}

In the now context-aware document, we can see that this is not the first time the source IP address has tried to log in and failed. Furthermore, we have the number of logs related to that particular IP address in the last hour (26), and we can see that the user has only been seen with that particular source IP.

For this example I will call the “user” the feature identifier, the “total_for_ssh_access_failure” the feature, the “in_hour” section the period and “sliding_window” the type of period.
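As a small illustration of that naming scheme, here is a sketch that composes a document tag from its four parts. The `FeatureName` class and its field names are hypothetical, chosen only to mirror the terms above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureName:
    identifier: str   # feature identifier, e.g. "user" or "ip"
    feature: str      # the feature, e.g. "total_for_ssh_access_failure"
    period: str       # the period, e.g. "in_hour"
    window_type: str  # the type of period, e.g. "sliding_window"

    def tag(self) -> str:
        """Return the tag name as it would appear in the document."""
        return "_".join((self.identifier, self.feature, self.period, self.window_type))


name = FeatureName("user", "total_for_ssh_access_failure", "in_hour", "sliding_window")
print(name.tag())
```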

An important factor to consider is that this information is persisted when events are stored, which means that reviewing an event from 6 months ago would also have all the context present, making long term searches much simpler.

How can security features be used?

Now that I hopefully explained what I mean by security features, I will go into different examples of how these extra document tags which provide context to the log line can be used for improving detection and response.

As a way to provide context to an analyst

As mentioned in the previous example, a single document can provide information which was not available.

This means that when an analyst is looking at a particular log, they don’t need to start filtering on the IP address to check how many commands the IP address has executed in that period of time, or to realise that there are more than 20000 logs related to that IP address in an hour which in this context begs for a review of what might be going on.

As a way to search or visualize patterns

Usually, when searching in a tool such as Elasticsearch, we need to specify a query on the document data. Most searches are about what activity exists for a specific user, or where a host connected to. Features expand those capabilities: you can search not only for a particular event, but for any event triggered after it. For example, you can search for any activity of any user that has failed to log in 5 times to a box, and see all their activities from that moment on. From an analyst’s point of view, this is much simpler than searching for any user who failed 5 times, getting a list of those users, and then searching again for each of them while trying to figure out when the failures happened.

Another benefit is that you might want to use features to visualize patterns of behaviour. For example, plotting a graph based on all the traffic coming from users that had at least 3 failed login attempts could quickly point to unexpected patterns of behaviour, such as a particular focus on certain endpoints or network segments.
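To make the search example concrete, here is a sketch of the request body such a search could use, built as a plain Python dict in Elasticsearch Query DSL. The helper name `feature_threshold_query` is hypothetical, and the sketch assumes the feature is indexed as a numeric field (the sample document above shows the counters as strings). Sending the body to a cluster is left out.

```python
def feature_threshold_query(field, minimum):
    """Build a search body matching documents whose feature value is at least `minimum`."""
    return {
        "query": {
            "bool": {
                # A filter clause: no scoring is needed, only a yes/no match.
                "filter": [{"range": {field: {"gte": minimum}}}]
            }
        }
    }


body = feature_threshold_query(
    "user_total_for_ssh_access_failure_in_hour_sliding_window", 5
)
```

Because the count is already stored on each document, this is a plain filter with no aggregation step.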

As a way to create alerts in real-time

Let’s imagine you are using Sigma to create alerts while processing a log line and transforming it into a context-aware document. You could have the following rule to alert as soon as an event is seen for a user with more than 5 failed logins on more than 2 hosts:

For the previous example, most tools would use something like Elasticsearch to perform the aggregations needed to trigger the alert (for example using sigma2elastalert). While this is quite convenient, it means that the time to respond largely depends on how long it takes to deliver data to Elasticsearch and have it indexed so that a job can perform the query. On top of this, the load on the cluster is a lot higher due to the number of aggregations that constantly need to be executed.

Using features in your documents, the component that parses and creates the processed documents can perform the same checks itself (use a feature on a sliding window of the selected time frame and alert if the condition matches). For the example provided, this would be a unique counter feature of an authentication failure that is incremented for the number of failed users on a single host.
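A minimal sketch of such an in-pipeline check, under one possible reading of the rule (more than 5 failed logins and more than 2 hosts for the same user). The rule format and the `matches` helper are hypothetical, and the second tag name is invented for illustration; only the first appears in the sample document.

```python
# Thresholds per feature tag; every one must be exceeded for the rule to fire.
RULE = {
    "user_total_for_ssh_access_failure_in_hour_sliding_window": 5,
    # Hypothetical unique value feature: distinct hosts the user failed on.
    "user_unique_hostname_for_ssh_access_failure_in_hour_sliding_window": 2,
}


def matches(document, rule=RULE):
    """Return True when every feature named in the rule exceeds its threshold."""
    return all(
        int(document.get(tag, 0)) > threshold
        for tag, threshold in rule.items()
    )


quiet = {"user_total_for_ssh_access_failure_in_hour_sliding_window": "3"}
noisy = {
    "user_total_for_ssh_access_failure_in_hour_sliding_window": "6",
    "user_unique_hostname_for_ssh_access_failure_in_hour_sliding_window": "3",
}
print(matches(quiet), matches(noisy))
```

The check only reads values already present on the document, so it can run at processing time without querying the search cluster.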

As a way to perform simple historical queries

Another benefit of this approach is that if you store this data in long-term storage such as BigQuery, when evaluating a new alert you can rerun the same query over large periods of time, since the check is a simple comparison against the features present in the stored documents. This is possible because any rule is a combination of document tags and their values. Instead of having to perform aggregations on a year’s worth of data, the query simply compares each document against an expected set of tags and their values to trigger an alert.

Here is an example of a query in BigQuery that performs a simple check on any event that has, stored as a feature in the document, the number of failed logins for the user in a sliding window of one hour. This makes it easy to perform a one-year check of how many alerts would have triggered with a different threshold.
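The same idea can be sketched in plain Python over a list of stored documents (not BigQuery SQL, and not the original query). The `alerts_per_threshold` helper and the sample values are made up for illustration.

```python
def alerts_per_threshold(documents, tag, thresholds):
    """Count, for each candidate threshold, how many stored documents would have alerted."""
    return {
        threshold: sum(1 for doc in documents if int(doc.get(tag, 0)) > threshold)
        for threshold in thresholds
    }


TAG = "user_total_for_ssh_access_failure_in_hour_sliding_window"
stored = [{TAG: "3"}, {TAG: "6"}, {TAG: "10"}]  # made-up historical documents
counts_by_threshold = alerts_per_threshold(stored, TAG, [5, 8])
```

Each document is evaluated on its own, so trying a new threshold is a single pass over the data with no aggregation.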

Types of features

Now that I hopefully showed the value of these features, there are some ways we can categorize them which will have technical implications on their possible implementations.

By their time constraints

Fixed window features

These features are simple to calculate but usually provide less value, since depending on the window size they may not reflect an accurate picture of the context. Examples are features that count the unique or total number of logs related to a particular property within a fixed period of time, such as one particular day or hour. For example, if you state that an IP address should not perform more than 5 failed logins in one hour, and one performs 3 failed logins at 13:59:58 and 4 at 14:00:05, you would not alert, since the counts are 3 for the 13:00 hour and 4 for the 14:00 hour.
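The boundary problem can be reproduced with a few lines of Python. This is only a sketch with in-memory counters; `fixed_hour_counts` is a hypothetical helper and the date is arbitrary.

```python
from collections import Counter
from datetime import datetime


def fixed_hour_counts(timestamps):
    """Count events per calendar hour, which is the fixed window."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )


# 3 failed logins at 13:59:58 and 4 at 14:00:05, as in the example above.
failures = [datetime(2019, 8, 16, 13, 59, 58)] * 3 + [datetime(2019, 8, 16, 14, 0, 5)] * 4
counts = fixed_hour_counts(failures)
alerts = [hour for hour, total in counts.items() if total > 5]
```

Seven failures happen within seven seconds, yet `alerts` stays empty because each hour is counted separately.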

Sliding window features

Sliding window features are more complicated to calculate since they need to provide counts for any period of time of the length they represent, ending at the log being processed. In the previous example, an alert should trigger around 14:00:05, since within a sliding window of one hour we reached 7 failed logins for a particular IP address. This complexity is justified by the power it brings: for any log, you can state the properties present in the period of time preceding that log line.

In the previous image, there is a seemingly strange situation where the sliding window feature “user_total_for_ssh_access_failure_in_hour_sliding_window” goes up and down while a dictionary attack is trying to guess the root user’s password. This is because only the last hour is kept, and a constant attack averages around 700 password guesses per hour.

Note: There is a trade-off to consider depending on how a sliding window is implemented. If it is done through a feature log, the storage and the performance of counting can become unmanageable for counter features, or for unique features of high cardinality. If it is not done through a feature log, then the trade-off is how accurate you want the sliding window to be. In my case, I reduce the granularity to a single minute, meaning that the values of a 1-hour sliding window might be off by up to 1 minute.
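A minimal sketch of an exact sliding window, assuming events arrive in time order and keeping every timestamp in memory. The class name `SlidingWindowCounter` is hypothetical, and this is not how the storage-backed version described in this post works.

```python
from collections import deque
from datetime import datetime, timedelta


class SlidingWindowCounter:
    def __init__(self, window=timedelta(hours=1)):
        self.window = window
        self.events = deque()  # timestamps still inside the window, oldest first

    def add(self, ts):
        """Record an event and return the count within the window ending at ts."""
        self.events.append(ts)
        # Evict everything that is a full window (or more) older than ts.
        while self.events and self.events[0] <= ts - self.window:
            self.events.popleft()
        return len(self.events)


failures = [datetime(2019, 8, 16, 13, 59, 58)] * 3 + [datetime(2019, 8, 16, 14, 0, 5)] * 4
counter = SlidingWindowCounter()
totals = [counter.add(ts) for ts in failures]
```

The last value in `totals` is 7, so a threshold of 5 is crossed at 14:00:05 even though the failures straddle an hour boundary.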
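One way to picture the reduced-granularity option is a counter per minute instead of a log of every event. The sketch below is an in-memory assumption of how that could look, not the storage-backed implementation; `BucketedSlidingCounter` is a hypothetical name.

```python
from collections import defaultdict


class BucketedSlidingCounter:
    """Sliding window approximated with one counter per minute."""

    def __init__(self, window_minutes=60):
        self.window_minutes = window_minutes
        self.buckets = defaultdict(int)  # minute index -> number of events

    def add(self, epoch_seconds):
        """Record an event and return the approximate count for the window."""
        minute = epoch_seconds // 60
        self.buckets[minute] += 1
        # Keep only the most recent `window_minutes` buckets.
        oldest = minute - self.window_minutes + 1
        for stale in [m for m in self.buckets if m < oldest]:
            del self.buckets[stale]
        return sum(self.buckets.values())


bucketed = BucketedSlidingCounter()
first = bucketed.add(59)     # an event at 00:00:59
second = bucketed.add(3600)  # an event at 01:00:00
```

Storage is bounded by the number of buckets rather than the number of events. The cost is accuracy: the event at second 59 is only 3541 seconds old when the second event arrives, but its whole minute bucket has already been dropped, so `second` is 1 rather than 2.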

By how they store their values

Counter features

You can draw parallels between this type of feature and how typical rate limiting implementations work. By counting the number of times a particular event happened, you can keep track of those events over the period of time you are interested in. For example, you might count the number of failed logins for a user.

Unique value features

In this case, what we care about is not how many times a particular set of events has been seen, but how many unique tokens related to those events exist for a period of time. For example, you might care less about the number of failed logins to a particular host by a particular user than about how many boxes the user tried to log in to.
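A minimal sketch of a unique value feature using in-memory sets, with window expiry left out for brevity. The class name and the host names are hypothetical.

```python
from collections import defaultdict


class UniqueValueFeature:
    def __init__(self):
        self.values = defaultdict(set)  # identifier -> distinct values seen

    def add(self, identifier, value):
        """Record a value for the identifier and return the number of distinct values."""
        self.values[identifier].add(value)
        return len(self.values[identifier])


hosts_per_user = UniqueValueFeature()
hosts_per_user.add("unknown_user2", "zero")
hosts_per_user.add("unknown_user2", "zero")  # repeated host, count unchanged
distinct_hosts = hosts_per_user.add("unknown_user2", "one")
```

Here `distinct_hosts` is 2 after three failed logins, which is the difference from a counter feature.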

Graph features

While in practice graph features might not be terribly different from unique value features, the assumption is that we are using a different type of storage or an abstraction layer that provides more capabilities than normal unique value features. For example, in my personal setup I use JanusGraph with HBase as a backend, which gives me a single storage for all three types of features.

In the previous image, there was a tag in the Elasticsearch document called ip_to_user_counter which was using a graph feature. Not only can this be useful for performing alerts or getting context on the log line, but also to then perform visualizations if needed using the data from the graph database itself.

In this image, we can see the previous dictionary attack on several users, with the user “collins” selected. A graph feature, in this case, might count the number of unique users attempted by the IP address at the moment that log was processed.

In this picture we can visualize what happens to be legitimate activity from two IP addresses, both using the same user and the same commands to perform background jobs.

Graph features allow for more complex features to be used, although with care due to their increased performance cost. An improvement that can be added at this stage is using caching logic to reduce the number of operations needed on the graph database since this is a very high read use case.
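A rough sketch of a graph feature with a read cache in front of it. The graph is a plain in-memory adjacency map standing in for a graph database, and `GraphFeatureStore`, `backend_reads` and the sample IP address are all hypothetical. The cache is invalidated only when a new edge actually changes a node.

```python
from collections import defaultdict


class GraphFeatureStore:
    def __init__(self):
        self.edges = defaultdict(set)  # node -> neighbouring nodes
        self.cache = {}                # node -> cached neighbour count
        self.backend_reads = 0         # reads that had to hit the "database"

    def add_edge(self, ip, user):
        """Link an IP address to a user it attempted to log in as."""
        a, b = ("ip", ip), ("user", user)
        if b not in self.edges[a]:
            self.edges[a].add(b)
            self.edges[b].add(a)
            # Only a new edge changes the counts, so only then drop cached values.
            self.cache.pop(a, None)
            self.cache.pop(b, None)

    def degree(self, kind, value):
        """Number of distinct neighbours, e.g. unique users attempted by an IP."""
        node = (kind, value)
        if node not in self.cache:
            self.backend_reads += 1
            self.cache[node] = len(self.edges[node])
        return self.cache[node]


graph = GraphFeatureStore()
for attempted_user in ("root", "collins", "admin", "collins"):
    graph.add_edge("203.0.113.7", attempted_user)
unique_users = graph.degree("ip", "203.0.113.7")
unique_users_again = graph.degree("ip", "203.0.113.7")  # served from the cache
```

During a dictionary attack most log lines repeat edges that already exist, so most lookups in this sketch are served from the cache.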

Implementing security features

One key aspect of implementing these types of features for real-time or near real-time use is that the latency of the calculations and storage needs to be in the millisecond range. How the feature calculation and retrieval are implemented is key to achieving this, and it can change drastically depending on the technologies used. One of the main issues I have seen is in the feature storage, in particular the choice between row-based and column-based databases.

Why column-oriented databases work better for features

One of the challenges with implementing features is how to scale them to the thousands of possible features any particular log could have. While a user might, over its lifetime, perform hundreds of actions that would benefit from having a feature, they will probably not all happen at the same time or even on the same day, yet you would want to populate any of them when necessary.

Let’s imagine, for example, that we are storing this information with a row-per-feature model in a database such as Cassandra. We would not want to keep changing the schema every time a new feature is introduced, so we might have one row for each feature in a period of time, with a counter to increment. Every time we process a log that contains a feature we would write to the storage, but if a feature identifier might have 50 possible features at any given time, we might need to read up to 50 rows for each log line!

Besides the impact on storage due to the redundancy, this can severely affect the latency of the solution. The number of read requests comes from the fact that, even though we may only seldom see an action for a particular feature identifier (such as a user), keeping its context means always reading every feature that could currently be active for that identifier in the period of time. This also means that if you have several feature identifiers (for example, another one might be the source IP address), you need to multiply the number of read operations by the number of identifiers.

In contrast, with a column-oriented database such as HBase we can reduce the number of read operations by taking advantage of the fact that it allows sparse storage of columns. We would create a single row key for the feature identifier (for example the user), and then sparsely add one column to that row for each of its related features. Every read of that row returns all the information we hold for the identifier. This keeps the growth of read requests to HBase from being tied to the number of features we have; the number of features only affects how many times we need to write.

In the previous case, a log event with two feature identifiers and 1000 possible features might only perform two read operations in HBase (one for the source IP address and another one for the user).
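The difference in read operations can be illustrated with two toy in-memory stores that only count lookups. Both classes are hypothetical models of the two layouts, not clients for Cassandra or HBase.

```python
class RowPerFeatureStore:
    """One row per (identifier, feature): context costs one read per possible feature."""

    def __init__(self, possible_features):
        self.possible_features = possible_features
        self.rows = {}
        self.reads = 0

    def increment(self, identifier, feature):
        key = (identifier, feature)
        self.rows[key] = self.rows.get(key, 0) + 1

    def context(self, identifier):
        found = {}
        for feature in self.possible_features:
            self.reads += 1  # one lookup per possible feature, populated or not
            value = self.rows.get((identifier, feature))
            if value is not None:
                found[feature] = value
        return found


class WideRowStore:
    """One sparse row per identifier, with a column only for features that occurred."""

    def __init__(self):
        self.rows = {}
        self.reads = 0

    def increment(self, identifier, feature):
        columns = self.rows.setdefault(identifier, {})
        columns[feature] = columns.get(feature, 0) + 1

    def context(self, identifier):
        self.reads += 1  # a single row read returns every populated column
        return dict(self.rows.get(identifier, {}))


possible = [f"feature_{i}" for i in range(1000)]
row_store, wide_store = RowPerFeatureStore(possible), WideRowStore()
for store in (row_store, wide_store):
    store.increment("user:unknown_user2", "feature_3")
    store.increment("ip:34.90.104.104", "feature_7")
    store.context("user:unknown_user2")
    store.context("ip:34.90.104.104")
```

After processing this one log event, `row_store.reads` is 2000 while `wide_store.reads` is 2, matching the two read operations described above.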

Wrapping up

I hope this post has shown the value of adding security features to your SIEM, SEM or SIM. While they are no trivial matter to implement, security features can be very useful for near real-time detection and response, and for keeping easy access to the context of historical logs. They allow analysts to easily understand the context of a log and improve their searching capabilities. They also allow simple rule engines to become very powerful by taking advantage not only of the enrichment of a single log line but also of the context surrounding it. Finally, storing the data in a graph structure can help find threats that might be difficult to discover when using relational models.

As always, thanks for reading and let me know if you have any feedback or questions!
