Dave Gray & Azeem Aleem
"What's Measured Improves" Peter Drucker
It's mid-2017 and we have already witnessed the conundrum across organizations as the pressure of building a more efficient business creates loopholes for cyber criminals to gain an advantage.
In a previous blog we talked about the traditional perimeter melting away and how the "not in my backyard" siloed approach no longer works. Organizations need to assess cybersecurity as a business enabler, rather than a hindrance. For that, businesses need to develop a quantitative, statistical analysis that measures security functions (successes, failures, trends and workloads) over time, known as "Security Metrics".
Below is an outline of three types of standard metrics that should inform such security monitoring and evaluation: management, operational and technical security metrics.
[Image source: Aleem, A., Wakefield, A. & Button, M. Secur J (2013) 26: 236. doi:10.1057/sj.2013.14]
When we started working in the cybersecurity field as junior analysts, metrics were compiled manually by one of the admin support teams and consisted of the total number of IDS alerts (millions), incidents, false positives and details of advanced attackers (what we now know as APT). However, while all the information was used to show what was happening from a defense-wide viewpoint, just how useful was it? For the likes of total IDS alerts - not much! With the introduction of modern management systems for incident handling (for instance, RSA NetWitness® Security Operations Management) and GRC platforms (such as RSA® Archer), much of this historic (manual) collection is no longer required. As a result, we have to think very carefully about what matters to us, from both the business and the security points of view. Knowing there are millions of IDS or Anti-Virus alerts is all well and good (Fun Facts!), but what do they actually tell us? Perhaps the signatures are very loose, or we have every single alert enabled, or we need to do more on false positive reduction... or something else entirely?
Let's rewind a little first and identify what it is we want to do or, to phrase it more accurately, our "Mission Statement". This is ultimately what the SOC/CIRC/CDC wants to achieve. It is on this Mission Statement that we must focus our efforts, including the recording of the metrics which, in turn, drive improvement. Let's begin by creating a generic mission statement and go from there.
"The ACME Industries SOC is dedicated to defending ACME Industries, its Group Members and subsidiaries from cyber attacks, and to investigate and remediate all attacks against the organization."
From this we can identify the following:
- Network defense
- Cybersecurity investigations
- Attack remediation
- Networks (to be monitored)
- ACME Industries network
- ACME Industries group members
- ACME Industries subsidiaries
Those of you who have managed SOCs/CIRCs will already see some required metrics just from breaking these down, but let's move one more stage before actually planning out what we want from our metrics. The next step is dividing our internal security units into separate teams, e.g. Hunters, Malware Analysts, SOC L1/L2 Analysts, Content Engineering, Platform Engineering, Cyber Threat Intelligence, and so on. Each of these teams requires an additional Mission Statement supporting the overarching SOC/CIRC Statement. As an example, the L1/L2 SOC Analysts' mission statement could be:
"To Monitor, Detect, Triage and Escalate cybersecurity incidents and to defend ACME Industries against all measures of cyber attacks."
This gives us clear scope for the team and its business-driven objectives. With this we can look into identifying the metrics that drive this program, and as we noted at the start, "What's measured improves". Let's focus on the L1/L2 SOC/CIRC analyst team as our example (this would need to be replicated for each team).
Here, we require specific metrics that support the RSA® Business-Driven Security strategy and are aligned to give targeted results. Total Events Per Second or Total Anti-Virus alerts may be useful metrics for the Platform Engineers (platform capability), but they are just fun facts for the Analyst team. Let's drill down on the specifics of our four focus areas and remember, we want to focus on improving our service:
- Attacks thwarted by countermeasures: This is your attack picture; however, it is, to a certain extent, "white noise" as this is background Internet activity for any organization connected to the Internet. Be careful as this can fall into being a fun fact!
- Sources Monitored: This is where our Logs and Packets are coming from, giving us our overall detection capability. Once more, care is required here. If you are doing a fantastic job at blocking Phishing attacks, but you don't have any email logs... oops! This metric is especially useful when scoping out new Use Cases and other detection capabilities, e.g. where do we have gaps?
- Identify the security devices giving us our best detection value: This is more like it! During the course of investigations the Analyst team records which devices have had a positive and negative impact on the monitoring/detection capability. This provides the management a view of where their security budget is most effective and where improvements are required.
- Incidents from Signatures/Content Rules: The classic SOC metric. How many Incidents have I detected within a given timeframe with my automated rules and signatures?
- Incidents from Hunting: Hunting has become a core requirement for SOCs/CIRCs over the last few years, and if you are not doing it you are missing a lot of attacks. This metric can also be split into specific threats, e.g. SQL Injection, Phishing, etc. Showing successes/failures in specific areas allows management teams to review training, or even the scope of the hunting.
- Incidents from CTI: These incidents are generated from two areas: intelligence gathered from incidents detected by the Analysts, and standard Threat Intelligence research. These supplement the normal signature-based detections and are targeted against our biggest threats.
- Missed Incidents discovered from disclosure: This is painful; what did we miss? Nobody wants a phone call from their National CIRT or Intelligence Organization to tell them there is beaconing traffic (or other threats) originating from their IP address range. This will also include incidents which have been discovered by the Analysts, but where the adversary has already gained a foothold within the network. We want this metric to be as low as possible and, while not being directly correlated to a failure of the SOC/CIRC, it should be utilized to drive improvement in these attack scenarios.
- Total Triage Time: Triage time is measured from the moment the Analyst begins responding to an incident until it is either classed as a false positive and closed, or confirmed as a threat and escalated. This metric is useful in benchmarking the total number of incidents junior analysts can handle, establishing a resourcing baseline for the role.
- True Positive/False Positive ratio: Another classic SOC Metric. While this does still have a place here within the SOC Analysts' metrics it is more of a focal point for the Content Engineers. However, the following metric is definitely of concern.
- Man Hours in False Positives: This is, effectively, the total amount of time lost due to invalid incidents. Translating this into a cost-per-person view is guaranteed to change how incidents are handled in the future, with an emphasis on reducing wasted effort.
- Number of Incidents handled by L1 team: This is not tracking escalations. Understanding the ratio of True Positive Events the L1 Analysts are able to resolve helps in establishing the team resourcing requirements.
- Number of Escalated Incidents: The counterpart to the previous metric; this is the total number of incidents assigned to the more advanced analysis teams.
- Number of Escalated Incidents to External Teams: Which teams outside of the SOC/CIRC are required to assist with investigations (HR, Internal Security, and Law Enforcement)? This metric assists management by focusing on communications between teams, enabling an effortless communication chain between all of the parties.
- External Escalations Time to Resolve: This is where you judge the effectiveness of external teams in responding to an incident; however, keep in mind that some teams will take longer to respond than others (e.g. Law enforcement or HR).
- Time to Escalate: We previously mentioned triage time. This measures whether there is a delay in escalating to advanced analysis teams.
- Time to close Escalation/Incident: Just how long has it taken the team to close the escalation/incident, or pass it back to the junior analysts? Deviations from the baseline allow the SOC/CIRC management to investigate process and procedure improvements.
- Number of False Escalations: How many escalations did the advanced analysis teams identify as false positives, or close without needing any advanced analysis techniques? This allows us to gauge how effective the junior analysts are in decision-making and identify where additional training is required.
- Cost of Remediation: How much remediation costs, both in direct financial terms and in downtime to the organization.
- Content Development
- Rules Creation broken out by technology (packet/logs)
- Fine Tuning and Enhancement around workflows/apps/alerts etc.
- Break Fix frequency around workflows/apps/alerts etc.
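Several of the analyst metrics above reduce to simple arithmetic over incident records. The following is a minimal sketch of that arithmetic, not a reference implementation: the `Incident` fields, the `analyst_metrics` function, the hourly cost and the sample records are all hypothetical and would need to be mapped to whatever your incident management platform actually exports.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    # Hypothetical field names; adapt them to your own ticket export.
    opened: datetime                # analyst begins responding
    triaged: datetime               # closed as false positive, or escalated
    true_positive: bool
    escalated: bool
    false_escalation: bool = False  # advanced team found nothing to analyze


def triage_hours(incident: Incident) -> float:
    """Triage time in hours, from first response to close or escalation."""
    return (incident.triaged - incident.opened).total_seconds() / 3600


def analyst_metrics(incidents: List[Incident], hourly_cost: float) -> dict:
    """Summarize triage time, false positive effort and escalations."""
    total = len(incidents)
    false_pos = [i for i in incidents if not i.true_positive]
    escalated = [i for i in incidents if i.escalated]
    fp_hours = sum(triage_hours(i) for i in false_pos)
    return {
        "mean_triage_hours": sum(triage_hours(i) for i in incidents) / total if total else 0.0,
        "true_positive_share": (total - len(false_pos)) / total if total else 0.0,
        "hours_in_false_positives": fp_hours,
        "cost_of_false_positives": fp_hours * hourly_cost,
        # True positives the L1 team resolved without escalating.
        "true_positives_resolved_by_l1": sum(
            1 for i in incidents if i.true_positive and not i.escalated
        ),
        "escalated": len(escalated),
        "false_escalation_rate": (
            sum(1 for i in escalated if i.false_escalation) / len(escalated)
            if escalated else 0.0
        ),
    }


# Made-up sample data, purely for illustration.
t0 = datetime(2017, 6, 1, 9, 0)
sample = [
    Incident(t0, t0 + timedelta(minutes=30), true_positive=True, escalated=True),
    Incident(t0, t0 + timedelta(hours=1), true_positive=False, escalated=False),
    Incident(t0, t0 + timedelta(hours=2), true_positive=False, escalated=True,
             false_escalation=True),
    Incident(t0, t0 + timedelta(minutes=30), true_positive=True, escalated=False),
]
metrics = analyst_metrics(sample, hourly_cost=50.0)
print(metrics)
```

Expressing false positive hours as a cost (here an assumed flat hourly rate) is what turns the number from a fun fact into something management can act on.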
In addition to the specifics above we must also look at the bigger picture:
- Total Incidents Triaged Per Day
- Total Resolved Incidents Per Day
- Total Incidents by Business Unit Per Week
- Total Incidents by Geography
- Incident by Host Compliance
- Number of Drives Forensically Analyzed
- Incident Closure Rates
- Intelligence Collection
- Top Sources
- IOCs by Actor
- IOCs by Domain Categorization
The metrics above are only a sample of what we should be looking at and are designed to get you thinking about the decisions you need to make in assessing your metrics.
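The bigger-picture counts listed above are mostly group-and-count operations. As a rough sketch, assuming incidents can be exported as simple records (the tuple layout, business units, geographies and dates below are invented for illustration), Python's standard `collections.Counter` is enough to produce them:

```python
from collections import Counter
from datetime import date

# Hypothetical records: (day opened, business unit, geography, resolved?)
incidents = [
    (date(2017, 6, 5), "Finance", "EMEA", True),
    (date(2017, 6, 5), "Finance", "EMEA", False),
    (date(2017, 6, 6), "Engineering", "APAC", True),
    (date(2017, 6, 13), "Finance", "AMER", True),
]

# Total incidents triaged and resolved per day.
triaged_per_day = Counter(day for day, _, _, _ in incidents)
resolved_per_day = Counter(day for day, _, _, resolved in incidents if resolved)

# Total incidents by business unit per ISO week number.
by_unit_per_week = Counter(
    (day.isocalendar()[1], unit) for day, unit, _, _ in incidents
)

# Total incidents by geography.
by_geography = Counter(geo for _, _, geo, _ in incidents)

# Incident closure rate across the whole sample.
closure_rate = sum(resolved_per_day.values()) / len(incidents)

print(triaged_per_day, by_unit_per_week, by_geography, closure_rate)
```

The same pattern extends to the other breakdowns (host compliance, IOCs by actor, and so on) by swapping the key used for grouping.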
Thus far we have talked about the type of metrics to consider; however, that is only half of the battle. Consider the graph below for a hypothetical company's AV detections.
Is this good or bad? We don't know without proper context. First, have we done a comparison with one of our peer organizations running the same AV product? Do they have a comparable chart? Is 800 AV alerts in January good (compared to March, yes) or bad (compared to July, yes)? Why was there a major peak in March?
- Was there a new threat discovered and mitigated by the AV vendor?
- Did we deploy more AV agents?
- Were these incidents fully resolved, hence the drop in detections in April (or has the AV been bypassed?)
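One way to start answering questions like these is to put the raw count next to a piece of context. The sketch below normalizes monthly AV alerts by the number of deployed agents and compares month-over-month change in both series. Apart from the 800 alerts in January, every figure here is invented for illustration, as are the function names; in this made-up scenario the March spike in raw alerts disappears once the larger agent deployment is taken into account.

```python
from typing import Dict, List

months: List[str] = ["Jan", "Feb", "Mar", "Apr"]

# Illustrative numbers only; agent counts are an assumed extra data source.
alerts: Dict[str, int] = {"Jan": 800, "Feb": 900, "Mar": 2400, "Apr": 600}
agents: Dict[str, int] = {"Jan": 1000, "Feb": 1000, "Mar": 3000, "Apr": 3000}


def per_agent(alerts: Dict[str, int], agents: Dict[str, int]) -> Dict[str, float]:
    """Alerts per deployed AV agent for each month."""
    return {month: alerts[month] / agents[month] for month in alerts}


def pct_change(series: Dict[str, float], months: List[str]) -> Dict[str, float]:
    """Fractional change from each month to the next."""
    return {
        cur: (series[cur] - series[prev]) / series[prev]
        for prev, cur in zip(months, months[1:])
    }


normalized = per_agent(alerts, agents)
raw_change = pct_change(alerts, months)
normalized_change = pct_change(normalized, months)

for month in months[1:]:
    print(f"{month}: raw {raw_change[month]:+.0%}, per agent {normalized_change[month]:+.0%}")
```

Normalizing by agent count only addresses the "did we deploy more agents?" question; a new vendor signature or a bypassed AV product would need different context, such as signature release dates or endpoint health data.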
Metrics are ultimately just numbers (cardinal numbers!). It is the context surrounding them and how we use them that drives improvement into our security programs. Stopping to address exactly what we want to achieve and then collecting the metrics to support the mission statement will drive not only a Business-Driven Security strategy, but RESULTS! One final note... next time you want a capability improvement you'll have the evidence at hand to support it.
Author: David Gray