Managing your risk surface
Why you generally shouldn't care too much about individual vulnerability counts.
People usually like numbers.
Quantification provides them with the feeling of certainty and clarity that is often difficult to find in cybersecurity (and life in general).
This is all well and good, and I have often written about how quantitative assessments, recommendations, and analyses are generally preferable to qualitative ones. An important caveat, however, is that you need to focus on numbers that actually mean something significant.
One metric that, by itself, doesn’t mean much is the total number of individual vulnerability findings in your application.
At a philosophical level, distinguishing one discrete vulnerability from another can be challenging. Security flaws are usually (but not always) the result of more than just a single character being out of place in a single line of code. Furthermore, due to vulnerability chaining, attackers often string together flaws to get to their target. With that said, there needs to be some level of analysis that is reasonably consistent across findings, so it makes sense to have some discrete thing that we call a single “vulnerability.”
On a more concrete level, the total number of such individual findings (assuming they all describe things of roughly the same scope) isn’t something that you should index very heavily on. Not all vulnerabilities are created equal…not even close. And the incremental risk posed to your organization can vary massively from vulnerability to vulnerability.
The Common Vulnerability Scoring System (CVSS) exacerbates this issue by implying that there is at most an order of magnitude of difference between vulnerabilities, which we know isn’t true. CVE-2014-0160, AKA “Heartbleed,” facilitated 840 breaches, but most CVEs cause zero. The (Not)Petya ransomware, which targeted CVE-2017-0144, caused $10 billion in damage, while most CVEs are never exploited and thus never result in a single dollar of damage.
Some compliance regimes, such as the federal government’s FedRAMP standard, go even further in establishing clearly perverse incentives with respect to how organizations look at the total number of findings. For example, after initial authorization, organizations that identify an increase of 20% or more from the baseline, or 10 unique vulnerabilities, whichever is greater, are subject to a “Detailed Finding Review.” By punishing companies that identify more than an arbitrary number of vulnerabilities - regardless of their severity - during a given time frame, the government disincentivizes its vendors from looking too closely for these vulnerabilities in the first place.
Putting it into practice
Assuming the above to be true, the next reasonable question would be “what do I do about it?” My recommendation would be to develop your vulnerability management program and policy in a way that reflects the highly unequal nature of vulnerabilities, indexing on the total risk surface rather than the raw number of security flaws identified. Many policies set individual remediation timelines for specific issues, but I would suggest this is as misguided as focusing on the raw number of open vulnerabilities. Since one vulnerability might represent the same financial risk to an organization as a thousand others combined, it makes sense to apply equal resources to both groupings, despite their vastly different counts.
Thus, a well-designed vulnerability management policy should focus on the total risk surface of a given system or product. The appropriate business leader should identify the organizational risk appetite in financial terms. This represents the baseline of tolerable cybersecurity risk. That leader should also identify the velocity with which the organization must mitigate, transfer, or avoid the risk from vulnerabilities exceeding this level, or seek a risk acceptance decision. This represents the risk tolerance.
To give a hypothetical example, suppose that you value a given system at $1 million annually. Also assume a perfectly rational world where humans are not subject to loss aversion bias and where you have no other, better options. It would then make sense to expend up to $999,999.99/year to maintain and operate this system and protect the confidentiality, integrity, and availability of its data, or to accept any cybersecurity risk that does not cause you to exceed this level. Assuming you calculate everything perfectly, you are guaranteed to make at least $0.01, with everything made on top of that being gravy.
Unfortunately, humans are irrational, it’s impossible to calculate risk with perfect certainty, and we will need to separate cybersecurity risk from all other types of risk to make a concrete decision. Thus, let’s assume the organization has a cybersecurity risk appetite of $200,000/year. This means that the relevant business leader has declared a loss event of that magnitude every year, or a $1 million loss every five years, to be an accepted “cost of doing business.” Due to the intrepid efforts of the security and engineering teams, assume the business is currently operating at exactly that level of risk.
Now, assume that you identify 10 vulnerabilities, each of which you assess to have a 0.5% chance of being exploited in a given year, with each exploitation causing a loss of $500,000. The expected annual loss from these findings is 10 × 0.5% × $500,000 = $25,000, so you now exceed your risk appetite by $25,000/year and must get back to your acceptable baseline.
The question is, how quickly?
Adding some bugs to your backlog to get fixed in a few sprints has a radically different impact on engineering efficiency than does telling everyone to drop everything immediately to fix one or more flaws. Additionally, assigning remediation timelines to individual vulnerabilities can tie the hands of your team, forcing them to fix issues in a sub-optimal order from both a security and business perspective. Thus, this is not an academic question; you need to have some way of determining how quickly to act.
Assume you identify these issues on January 1st and fix them all on December 31st. For the entire year you have exceeded your risk appetite, thereby “incurring” $25,000 in excess risk. Conversely, assume you fix them the day after you identify them. In this case you have incurred only $68.45 ($25,000 divided by 365.25 days in a year) in excess risk. These are very different outcomes, and both business and security leaders should treat them differently.
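To make the arithmetic concrete, here is a minimal sketch of the calculation above. It assumes each open finding is characterized by an independent annual likelihood of exploitation and a loss magnitude, and that the system otherwise sits exactly at its risk appetite, so every finding’s expected loss counts as “excess” risk. The names and structure are purely illustrative, not a prescribed implementation.

```python
# Illustrative sketch: expected excess risk from a set of open findings.
# Assumes the baseline is already at the risk appetite, so each finding's
# expected annual loss is pure excess.

DAYS_PER_YEAR = 365.25

# Ten findings, each with a 0.5% annual chance of exploitation and a $500,000 loss
findings = [{"annual_likelihood": 0.005, "loss": 500_000}] * 10

excess_per_year = sum(f["annual_likelihood"] * f["loss"] for f in findings)
excess_per_day = excess_per_year / DAYS_PER_YEAR

print(f"Excess risk: ${excess_per_year:,.0f}/year, or ${excess_per_day:,.2f}/day")
# -> Excess risk: $25,000/year, or $68.45/day
```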
How to do so, however, is a tricky problem.
If you develop some sort of “risk budget” that refreshes every January 1st and disappears every December 31st, you would drive bizarre behavior. This would probably take the form of the security team being relatively lax early in the calendar year but then pressuring engineering to fix issues more rapidly as the year progresses. This is reminiscent of similarly counterproductive patterns, like sales teams closing deals at all costs as the end of the quarter approaches, or government agencies spending 4.9 times more in the last week of the fiscal year than they would otherwise.
Since hackers do not respect weekends, holidays, or fiscal or calendar years, you will need a different approach.
Thus, what I would propose is that organizations specify both a risk appetite - a baseline below which no action is necessary - as well as a rate of return to this baseline, which I consider to be synonymous with risk tolerance.
Continuing the scenario, presume that your risk tolerance is $1000/year.
Your rate of risk “expenditure” for the 10 aforementioned vulnerabilities is thus $68.45/day. This gives you slightly fewer than 15 days ($1000 divided by $68.45/day is ~14.6 days) to get back to your baseline, assuming you fix everything on the same day.
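That 15-day figure falls directly out of dividing the tolerance by the daily rate of excess risk. A short, self-contained sketch using the same illustrative numbers:

```python
# Illustrative sketch: how long can we stay above baseline within tolerance?
DAYS_PER_YEAR = 365.25
RISK_TOLERANCE = 1_000                    # acceptable excess above baseline, in dollars
excess_per_day = 25_000 / DAYS_PER_YEAR   # ~$68.45/day from the 10 open findings

days_to_baseline = RISK_TOLERANCE / excess_per_day
print(f"{days_to_baseline:.1f} days to return to baseline")  # -> 14.6 days
```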
A major benefit of this technique, however, is that it does not prescribe exactly how the engineering team does its work by setting timelines for individual issues.
Identifying the total allowed tolerance above baseline lets the engineering team attack the problem in several ways, choosing the one most efficient for them. For example, if nine of the 10 bugs are easy to resolve in the current sprint, which releases 10 days after their detection, fixing them there would result in a risk “expenditure” of ~$616 (nine bugs × ~$6.85/day × 10 days). The remaining bug would continue incurring risk at the rate of ~$6.85/day ($2,500 / 365.25 days in a year), giving you approximately 56 days from detection to resolve it while staying within your risk tolerance of $1000. This could allow the engineering team to plan it as part of next month’s architectural overhaul, thus reducing the burden of resolving the issue.
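One way to check the staged scenario above, again with purely illustrative numbers (the 56 days is measured from detection and accounts for the risk already spent on the nine bugs):

```python
# Illustrative sketch: nine findings fixed at the day-10 sprint release,
# one finding left open until the tolerance is exhausted.
DAYS_PER_YEAR = 365.25
RISK_TOLERANCE = 1_000
per_finding_per_day = 0.005 * 500_000 / DAYS_PER_YEAR   # ~$6.85/day per open finding

spent_on_nine = 9 * per_finding_per_day * 10             # nine findings open for 10 days: ~$616
days_for_last = (RISK_TOLERANCE - spent_on_nine) / per_finding_per_day

print(f"~{days_for_last:.0f} days from detection to fix the last finding")  # -> ~56 days
```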
Conversely, assume that instead of these 10 vulnerabilities, you just became aware of a single issue with a 95% chance of exploitation (realistic in the case of Log4Shell, CVE-2021-44228) that would result in a $500,000 loss. At this rate, you will exceed your risk appetite by $475,000 over one year, which equates to a risk expenditure of ~$1300/day. Since the clock has now started, you will need to fix this issue quickly - in less than one day - to remain within your risk tolerance. Based on the risk presented by such a vulnerability, stopping everything to fix it seems appropriate.
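The same formula, applied to this single high-likelihood issue (illustrative only), shows why it is a drop-everything event:

```python
# Illustrative sketch: a single finding with 95% annual exploitation likelihood.
DAYS_PER_YEAR = 365.25
RISK_TOLERANCE = 1_000

critical_per_day = 0.95 * 500_000 / DAYS_PER_YEAR        # ~$1,300/day
days_available = RISK_TOLERANCE / critical_per_day

print(f"{days_available:.2f} days to remediate")          # -> ~0.77 days
```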
Throughout the process, if more vulnerabilities are identified, you will need to pull more people off other work to get back down to your baseline. If vulnerabilities can be mitigated faster than expected, engineers can shift back to non-security projects. Your organization will need to constantly track the total amount of “outstanding” risk against the risk tolerance your business leader has accepted to determine how you should apportion these resources.
Technical debt
One final aspect that is important to explore when discussing the relevance of total vulnerability counts is technical debt. The number of CVEs (or other known vulnerabilities) in a product or component can correlate with technical debt in your first-party code, although it doesn’t necessarily. This is because many third-party packages and applications don’t fix these CVEs, even in their latest versions, so there is sometimes nothing your organization can do. All other things being equal (which they almost never are), however, it certainly makes sense to upgrade to the latest version of a library or application as soon as possible. Staying “evergreen” in terms of your software supply chain can prevent you from having to undertake monumental refactoring when you do discover an exploitable vulnerability in your technology stack. If the first- and third-party code haven’t “drifted” apart too much, you will likely be able to update more rapidly when required.
Thus, an optimal model would also incorporate the engineering costs of jumping several versions of a dependency. If you had a library with an identified CVE that you determined was not exploitable, there still might be some risk associated with not updating the library. If one or more subsequent versions of the library are released and a future CVE is identified that is in fact exploitable in your application, then there is a higher likelihood that upgrading several versions will require more engineering effort and happen more slowly than if you had stayed up to date all along. Perhaps I’ll develop a framework for calculating this technical debt risk in the future, but I don’t have anything well-developed at present, as this represents a separate problem.
Conclusion
Implementing a framework such as the one I have described will no doubt require some careful thinking as well as organizational design. The Deploy Securely Risk Assessment Model (DSRAM) provides a tool for calculating the financial risk posed by a given vulnerability, but establishing a program like the one I suggest will require much more work. Automated calculation of “unused” risk tolerance, projected resolution dates for issues, and workflows for managing these things would be a prerequisite for implementation.
With that said, the identification of known vulnerabilities in software is a fact of life, and you will simply need to deal with a constant stream of CVEs and other flaws being identified in your software supply chain. Having a quantitative, risk-based approach is the only way to logically tackle this problem. Research has shown that traditional, one-size-fits-all approaches of demanding the remediation of “all highs and criticals” are ineffective. By clearly establishing a numerical risk appetite and tolerance and evaluating your entire risk surface in light of these guidelines, your organization can mitigate the most security risk possible while still achieving its business and operational objectives.