I wanted to write down some thoughts on the CPU problems that are in the news. The first section has some analysis on how all this impacts computing and the cloud in general. The second section is Corda-specific.
Because things are moving fast I’ll try to keep this post high level. Nonetheless for space reasons I will use terms from the fields of CPU, cryptography and kernel engineering without defining them. There is some great background material in the Meltdown and Spectre papers if you aren’t already familiar with superscalar CPU design. Please ask in the comments section or on Discourse if you’d like further explanation.
Current state. The current state of the CPU industry might be described as “surprised”. The early collapse of the news embargo in the first week of January resulted in an uncoordinated release of information, but the real issue is deeper than that. It’s apparent that there have been numerous false starts around how to address Spectre attacks, and chip manufacturers were still working on mitigations by the planned expiry of the embargo. On Wednesday 3rd January, the US Computer Emergency Readiness Team (US-CERT) was advising that a solution could require the remanufacture of every computer in existence, and chip manufacturers had only made vague PR statements. By Thursday Intel and ARM had produced technical white papers that made it clear software mitigations were possible, and US-CERT had retroactively edited their website to remove their prior advice. On the Friday the industry’s most senior engineers were publicly debating the details of the right workarounds to use. This sort of thing is very unusual.
We have been spending quite a bit of time keeping up with the details of what the final solutions are going to be, as well as discussing them with the HotSpot JVM engineers, but there’s little point in my discussing that here because the precise approach could be different by next week. So I’ll focus on higher level thoughts.
Intel. There were a lot of initial reports that one of the attacks was specific to Intel, but this is not true. The Meltdown attack affects CPU cores designed by Intel, ARM and Apple. Spectre attacks affect all those manufacturers plus AMD, and a variant of the attack also affected the Mill (a new CPU design so exotic it doesn’t even have registers). That isn’t surprising because CPUs have been speculating past bounds checks and indirect jumps for decades – it’s a basic requirement given that electrical signals travel at only about 2/3rds the speed of light in metallic waveguides. By the second day of the response Intel and ARM were well on top of things and revealed that they can patch all the problems with microcode, compiler and kernel updates … albeit with a loss of performance in some cases.
It is tempting for people who aren’t CPU designers to try and explain all this as incompetence or malice. I argue that actually nobody is at fault – what we’re witnessing here is an evolution in human technology itself. Since the invention of CPUs they have been sold without anyone ever claiming they’re resistant to side channel attacks. In Intel’s case they specifically say they aren’t, in fact (see slide 115 in this presentation from 2015). Despite this there has been no market for side-channel resistant CPUs. Quite simply nobody has demanded such a product, so nobody built one. Perhaps that will start to change now.
Nobody has requested such a product because side channel attacks are deeply unintuitive. I gave several lectures on the topic last year (see below) and without fail these talks left the room in stunned silence. Jaws drop. For people who have never heard of side channels before, a typical first response is disbelief that such attacks could work at all. The idea that you can steal secrets you can’t see simply by timing how quickly you can do things you’re legitimately allowed to do is not obvious. Even the researchers who discover these attacks often doubt they’re possible at all, right up to the moment they succeed.
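To make the idea less abstract, here is a minimal Python sketch of a timing side channel. It is a toy, not taken from the Meltdown or Spectre papers: the names `leaky_equals` and `recover_secret` are hypothetical, the attacker is assumed to know the secret's length, and instead of measuring wall-clock time (which is noisy) the code counts the comparisons performed, which is the quantity that timing would reveal on real hardware.

```python
import hmac

SECRET = b"hunter2!"  # hypothetical secret the attacker is never shown directly


def leaky_equals(secret: bytes, guess: bytes):
    """Early-exit comparison. Returns (match, steps); steps stands in for elapsed time."""
    steps = 0
    if len(secret) != len(guess):
        return False, steps
    for s, g in zip(secret, guess):
        steps += 1
        if s != g:  # bails out at the first mismatch: this is the leak
            return False, steps
    return True, steps


def recover_secret(length: int) -> bytes:
    """Rebuild the secret byte by byte from nothing but the step counts."""
    known = bytearray(length)
    for pos in range(length):
        best_byte, best_steps = 0, -1
        for candidate in range(256):
            known[pos] = candidate
            matched, steps = leaky_equals(SECRET, bytes(known))
            if matched:
                return bytes(known)
            if steps > best_steps:  # a longer run means one more byte was right
                best_byte, best_steps = candidate, steps
        known[pos] = best_byte
    return bytes(known)


def constant_time_equals(secret: bytes, guess: bytes) -> bool:
    """Standard-library comparison designed not to exit at the first mismatch."""
    return hmac.compare_digest(secret, guess)
```

With the leaky comparison the 8-byte secret falls in at most 256 × 8 = 2,048 guesses, even though the attacker only ever performs an operation they are allowed to perform and is only ever told "no". Swapping in a comparison that does the same work regardless of where the mismatch is removes the signal.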
Mitigations. This helpful article published by Microsoft describes the various mitigations and performance impacts involved. The exact impact is a complicated function of CPU type, workload, hardware age, operating system patch level and even the manufacturer of the computer in question, so I won’t get into it here. Suffice it to say that the point of operating systems and compilers is to abstract developers from the details of the underlying hardware, and although side channel attacks are a hell of a detail, it looks for now like these abstractions will hold.
It is important to understand that in some cases the performance drops are optional. These attacks are a problem when you are running potentially malicious code on the same hardware as a sensitive workload. The types of workload most affected (on the server side) are things like databases, where there’s lots of communication with the hardware. So the performance loss can be avoided by simply ensuring you do not share hardware with potential attackers. In consumer applications sharing hardware with untrusted code is unavoidable, but the severity is quite low in my view (see below).
The primary place this matters is the cloud. So let’s talk about that.
The cloud. There has been a long running debate in many industries, especially the older and more conservative industries like finance, about whether or not it’s safe to migrate workloads to the cloud. I don’t think Spectre/Meltdown change the equation much – the cost/benefit analyses remain largely the same as before. This is for two reasons:
Firstly, the major cloud providers have all rolled out patches for these issues. More issues will be found, but it’s quite plausible that future issues will also be software patchable.
Secondly, and more importantly, exploitation of cross-VM side channel attacks is nowhere near as easy as some people are making out. It’s worth restating the obvious – to exploit these attacks you must be running on the same CPU as your target. VM scheduling inside the cloud is arbitrary and unpredictable. Cloud providers don’t let you just stroll up to their helpdesk and ask to be conveniently co-scheduled with major banks or other sensitive targets. So real world exploitation of these attacks requires you to constantly schedule, test and de-schedule your VMs over and over again, hoping to get lucky and land on the same machine as your target.
A description of this process can be found in this news article about side channel attacks in the cloud from 2015. This constant rescheduling is unusual, so it should be detectable statistically. There are no legitimate use cases for trying to get yourself co-scheduled with some particular organisation’s VMs in this way. Attempting to do this at all is prima facie evidence that you’re trying to hack someone.
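A back-of-the-envelope model gives a feel for how much rescheduling is involved. The sketch below assumes something no real cloud promises: that each launch lands uniformly at random on one of `hosts` machines, exactly one of which runs the target. Under that assumption the chance of co-location after `launches` attempts is 1 − (1 − 1/hosts)^launches. The function names and the fleet size used in the usage note are purely illustrative.

```python
import math


def p_colocated(hosts: int, launches: int) -> float:
    """Probability that at least one of `launches` VMs shares a host with the target."""
    return 1.0 - (1.0 - 1.0 / hosts) ** launches


def launches_needed(hosts: int, target_probability: float) -> int:
    """Smallest number of launches that reaches the requested success probability."""
    per_launch_miss = 1.0 - 1.0 / hosts
    return math.ceil(math.log(1.0 - target_probability) / math.log(per_launch_miss))
```

For a hypothetical pool of 1,000 hosts, `launches_needed(1000, 0.5)` comes out at several hundred launch-and-discard cycles just to reach even odds, and that is before the attacker has confirmed co-residency or extracted a single byte. Real placement algorithms are not uniform, so treat this as an order-of-magnitude illustration only.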
I don’t know if the big cloud providers currently watch out for this behaviour. But it seems likely that one of the best ways to stop these attacks is to detect the setup phases. This has the advantage that it generalises to hypervisor exploits.
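As an illustration of what detecting the setup phase could look like, here is a deliberately simple sketch. Everything in it is an assumption: the audit log format, the thresholds and the function name `flag_churners` are hypothetical, and nothing here describes what any cloud provider actually does. It flags tenants whose count of short-lived VMs is far above the fleet median.

```python
from collections import Counter
from statistics import median


def flag_churners(events, max_lifetime_s=300, factor=10, min_count=20):
    """Flag tenants that launch and discard VMs far more often than their peers.

    events: (tenant_id, vm_lifetime_seconds) pairs from a hypothetical audit log.
    """
    events = list(events)
    # Count, per tenant, the VMs that were destroyed shortly after being created.
    short_lived = Counter(t for t, life in events if life <= max_lifetime_s)
    tenants = sorted({t for t, _ in events})
    if not tenants:
        return []
    baseline = median(short_lived[t] for t in tenants)
    # Require both an absolute floor and a large multiple of typical behaviour.
    threshold = max(min_count, factor * max(baseline, 1))
    return [t for t in tenants if short_lived[t] > threshold]
```

A production detector would need to account for legitimate bursty workloads such as autoscaling and CI, so the thresholds here are placeholders rather than recommendations.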
Finally, a simple and guaranteed way to stop such attacks dead is by requesting dedicated hardware. Cloud providers do support that, albeit at extra cost. I expect clouds to let you request VM scheduling constraints in future to assist with secure bin packing – for instance, by requesting that your VMs are only co-scheduled with your own VMs, or with those of other trusted organisations.
Given the above let’s discuss the severity of Spectre and Meltdown attacks.
Severity. I’m going to swim against the tide here and say I don’t think these attacks are that big a deal. That might be surprising given all the fire and motion, so let me explain.
Side channel attacks are about letting attackers read data from a different trust domain. In other words, it’s a kind of sandbox escape. Code sandboxing is a feature that has a long history of proving difficult, despite large investments and many decades of trying. It’s a lot easier when you place lots of restrictions on what the sandboxed code can do (e.g. there haven’t been any EVM escapes as far as I’m aware), but in the case of native code process or VM isolation the sandboxed code has enormous freedom. We persist anyway because it’s worth it despite the difficulties – asking people to buy a new iPhone for every app they want to run is clearly impractical, and most sandbox escapes are found by corporate ‘red teams’.
It is worth observing that Meltdown and Spectre only allow you to read memory and even then, only at very low speeds and with data corruption during the process. This is bad, but it’s not as bad as full write access.
In contrast, plain old kernel/hypervisor bugs typically grant total control of the machine to the attacker – both read and write access – and they happen all the time. The Xen hypervisor had 43 vulnerabilities in 2017 alone. Many of them allow a guest VM to take control of everything running on the physical hardware. For example CVE-2017-12137 is described as “Xen allows local paravirtualised guest OS users to gain host OS privileges via vectors related to map_grant_ref”. That’s actually much more severe than Spectre or Meltdown for machines that use Xen, because the attacker gets line-rate, error free write access to your entire memory space. They can both learn things and change things.
Other platforms fare no better. The macOS security update that fixed Meltdown also fixed dozens of other security bugs, like CVE-2017-7154 “A local user may be able to cause unexpected system termination or read kernel memory”. This one has the same effect as Meltdown on all Mac systems and was in fact found by Jann Horn, the same Google engineer who found Meltdown and Spectre. But it passed by without a sound. Researchers fuzzed the Linux kernels used on Android and found 32+ new local root exploits. Nobody noticed. And so on.
Meltdown and Spectre are getting a lot of attention because they’re new, they’re very clever and because early reports implied there might be catastrophic impact … a security bug that couldn’t be fixed without replacing every CPU. This turned out to not be true, and they now look a lot like any other bug that will be quietly fixed via software updates. Other more serious bugs get ignored because they’re just minor variants on the same theme that has been plaguing the industry for decades (i.e. anything written in C is guaranteed to be riddled with holes) so there’s nothing new to add or say about them.
Finally, it’s worth noting that side channel attacks have been around for a long time. So have Rowhammer attacks, which also allow software security to be undermined by hardware problems. But I’m not aware of reports of actual usage of either by a malicious attacker, at least not since the days of pay TV smartcard hacking, which was decades ago. Side channel attacks get a lot of attention from academics but reproduction outside of the lab is often plagued with practical problems (like the detectability of VM rescheduling, or the existence of simpler ways to get what you want!).
So. The combination of “can be easily mitigated via software” + “no history of actual exploits by real attackers” + “more severe problems are found every week” makes it hard for me to get excited about the security impact of these issues in the general case (for SGX specifically see below). Full marks for cleverness, but as an attacker I’ll take a buffer overflow in Xen over a side channel attack any day.
Final thoughts. Side channel attacks have been around for a while and given the recent uptick in research attention, I expect there to be more events like Spectre/Meltdown in 2018. They will be variants on a theme, in the same way that buffer overflows are a theme and yet each individual overflow is slightly different. We should also expect to see back and forth on mitigations. In the same way that we’ve seen mitigations like ASLR against traditional C based attacks get stronger and weaker over time as research progresses, we will see the same for mitigations around side channels.
Impact on Corda
On reading news reports about Spectre a reasonable person might ask themselves the following questions:
- Does this catch the Corda team by surprise?
- Does this hurt the credibility of Intel SGX?
- Should we be betting more on zero knowledge proofs instead of SGX?
I think the answer is no to all three.
Corda. We’ve been aware of micro-architectural side channels and the need to take precautions due to them for a long time. There is more to this than Spectre and Meltdown, which are two specific types of side channel attack. Last year I spent significant amounts of time on this issue:
- I travelled to Portland to meet with Intel engineers. I presented our research and plans around the construction of an “oblivious JVM” i.e. JVM modifications designed to block side channel attacks. They also presented some of the work they’ve been doing on the topic to us.
- I gave a training talk to the Corda team titled “Advanced cryptography”, introducing them to the topic of side channel attacks, cache timing attacks, exploitation of branch target buffers and so on. The techniques covered in that talk show up in the Meltdown and Spectre papers too – our work has all been in the context of SGX (thus assuming kernel mode access), so we didn’t focus on the possibility of crossing user-mode address spaces with side channel attacks. But the class of techniques involved is the same.
- I also visited Oracle’s JVM compiler research team. It is helpfully located in Zürich, just down the road from me. I gave them a talk on the topic of side channel attacks that work by exploiting caches and branch target buffers, again in the context of SGX.
SGX. The SGX documentation discusses side channel attacks. If anyone was under the impression that Intel or other chip manufacturers were unaware of these sorts of attacks, that isn’t the case … what wasn’t clear until last year is that speculative execution specifically could be used to quite such impressive effect. But the general idea of side channels in CPU design has been around for a long time.
Given how Spectre attacks work, I expect them to be effective against enclaves at the moment. But SGX appears in Skylake+ CPUs and these CPUs have various features that can be used to mitigate certain kinds of side channel attacks. Namely, the microcode updates Intel is distributing add new speculation barriers and other forms of precise control over the problematic features, and Intel TSX allows memory writes to be rolled back automatically in such a way that the cache is unaffected if certain types of attacks are mounted. Other features allow attacks to be detected rather than blocked, so the enclave can shut down and refuse to cooperate if it thinks it’s under attack.
We’re waiting for Intel to produce some formal published analysis of Spectre/Meltdown & SGX interactions, but based on available information my expectation is that whilst CPUs are currently vulnerable, a combination of a microcode update and compiler changes will be sufficient to block these attacks.
This is all good but full mitigation against side channels in general requires more than what any CPU can provide. A JVM is an ideal piece of software to abstract developers from side channel attacks. For example here’s a patch that fixes Variant 1 Spectre attacks. The reason I delivered a training session to the team developing the Graal JIT compiler for Java is that compilers are a fundamental building block for side channel mitigation, and the Graal compiler is likely to be the future of compilation in HotSpot (reminder: Graal is a new open source project to upgrade the Java JIT compiler).
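The patch referred to above is not reproduced here. As a generic illustration of the kind of change a compiler can make for Variant 1 (bounds check bypass), the sketch below shows index masking, a mitigation style used in several runtimes and kernels. It assumes a table whose size is a power of two, and the names `checked_load` and `masked_load` are hypothetical. Python cannot exhibit speculative execution, so this shows only the data flow: the value actually used for the load is forced into range by arithmetic rather than by a branch that the CPU might run ahead of.

```python
TABLE_SIZE = 16                              # must be a power of two for this trick
TABLE = list(range(100, 100 + TABLE_SIZE))
INDEX_MASK = TABLE_SIZE - 1                  # 0b1111


def checked_load(index: int) -> int:
    """Ordinary bounds check: correct architecturally, but a CPU may speculate past it."""
    if 0 <= index < TABLE_SIZE:
        return TABLE[index]
    raise IndexError(index)


def masked_load(index: int) -> int:
    """Bounds check plus masking: the index fed to the load is always in range."""
    if 0 <= index < TABLE_SIZE:
        # Even if the branch above were mispredicted, index & INDEX_MASK
        # can only ever address a slot inside the table.
        return TABLE[index & INDEX_MASK]
    raise IndexError(index)
```

For in-range indexes the two functions return the same value, so the change is invisible to application code, which is why a JIT compiler is a natural place to apply it.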
Because we’re building on Java we are in a great position to benefit from compilation techniques that can block attacks automatically, like polymorphic call devirtualisation, the insertion of speculation barriers, TSX hardware memory transactions, Path ORAM storage layers and so on. Side channel attacks often rely on very precise knowledge of the exact machine code being executed too, so simple randomisation of the JIT compiler is also a possibility.
The Corda deterministic JVM is a key part of this because it’s much easier to eliminate side channels from deterministic pure functions than computations which arbitrarily mutate state. We already started discussions with compiler engineers last year on development of features to mitigate and block side channel attacks. The long term direction is a compiler that can systematically eliminate all side channels without reliance on hardware features designed for specific types of attack, by doing things like explicitly executing both sides of a branch and fully unrolling loops up to their statically determined bounds. This means that smart contract/CorDapp developers shouldn’t have to adjust their code.
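The "execute both sides of a branch" transformation can be sketched as follows. This is an illustration of the idea only, not the design of any Corda or Graal feature: the fee rules and function names are made up, and Python itself offers no timing guarantees. The point is the shape of the code a compiler would emit: both outcomes are always computed, and a mask picks one without a secret-dependent jump.

```python
MASK64 = (1 << 64) - 1   # work in 64-bit words, as compiled code would


def select(bit: int, if_one: int, if_zero: int) -> int:
    """Return if_one when bit == 1 and if_zero when bit == 0, without branching."""
    mask = -bit & MASK64                      # all ones for 1, all zeros for 0
    return (if_one & mask) | (if_zero & (mask ^ MASK64))


def fee_branching(amount: int, is_vip: int) -> int:
    """What a developer writes: control flow depends on the secret flag."""
    if is_vip:
        return amount // 100
    return amount // 50


def fee_both_sides(amount: int, is_vip: int) -> int:
    """What the transformed code does: compute both fees, then select one."""
    vip_fee = amount // 100
    standard_fee = amount // 50
    return select(is_vip, vip_fee, standard_fee)
```

Both versions return the same results, which is what lets the transformation be applied automatically. It is also much easier to apply to pure functions, since executing the "wrong" side of a branch must not have visible effects.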
With respect to the performance drops associated with Spectre mitigations, given the nature of Corda transaction verification enclaves we do not anticipate much difference. But we’ll have to wait and see for exact numbers.
Finally, please note that SGX was designed on the assumption that implementation issues would be found, and therefore SGX remote attestations include the patch levels of CPU microcode and various other version numbers. CPUs that are running down-level SGX versions can be detected over the internet and excluded from a compatibility zone. The precise mechanics of how microcode updates are phased in across Corda zones are being worked out at the moment (this work already started last year, in fact). We’ll provide more detail on how that might work later this year.
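The exclusion of down-level CPUs amounts to a version check over attested values. The sketch below is a heavily simplified assumption of how such a policy could be expressed: `AttestationReport`, `ZonePolicy` and their fields are hypothetical and do not mirror the real SGX attestation format or any Corda API, and verification of the attestation signature is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttestationReport:      # hypothetical, heavily simplified
    node: str
    microcode_version: int
    enclave_version: int


@dataclass(frozen=True)
class ZonePolicy:             # hypothetical minimum versions set by the zone operator
    min_microcode_version: int
    min_enclave_version: int


def admitted(reports, policy):
    """Return the nodes whose attested versions meet the zone's minimums."""
    return [
        r.node
        for r in reports
        if r.microcode_version >= policy.min_microcode_version
        and r.enclave_version >= policy.min_enclave_version
    ]
```

Phasing in an update then becomes a matter of raising the minimums in the policy after operators have had time to patch, at which point unpatched nodes drop out of the admitted set.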
Zero knowledge proofs. Although it may seem surprising, the work I’ve described above is actually very useful for any practical integration of ZKP into blockchain technology. Many of the compiler changes I’ve described above are not mutually exclusive with ZKP or even hardware specific at all.
The reason for that isn’t really obvious unless you’ve read the underlying research papers. But the essence is this – to express a program such that it can be proved to be satisfied under zero knowledge requires it to be in the form of algebraic constraints. This isn’t at all an intuitive or easy way to write programs; it can be thought of as a kind of mathematical assembly language. One reason it’s hard is because, as Ben-Sasson et al. describe in their paper “Succinct non-interactive zero knowledge for a von Neumann machine”:
Most of the difficulties that arise when designing a circuit generator have to do with data dependencies. A circuit’s topology does not depend on its inputs but, in contrast, program flow and memory accesses depend on the choice of program and the program’s inputs. Thus, a circuit tasked with verifying program executions must be “ready” to support a multitude of program flows and memory accesses, despite the fact that its topology has already been fixed
E.g., [PGHR13] requires array accesses and loop iteration bounds to be compile-time constants; also, while [BFRS+13] supports data-dependent memory accesses, most program flow is also restricted to be known (or bounded) at compile-time; mitigations are possible, but only in special cases [ZE13].
That sounds a lot like how you make code side channel free in other contexts: eliminating branches or executing both sides, unrolling loops to their statically determined bounds, avoiding or always executing data dependent loads etc. The paper describes implementing a CPU emulation in which you pick how many cycles will execute ahead of time, so the execution of the program always takes the same amount of “time”, but that approach is too expensive to be actually deployed at the moment.
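The fixed-cycle idea can be shown with a toy machine. This sketch is not the construction from the paper; it is a hypothetical example (`run_fixed` is a made-up name) in which the cycle budget is chosen ahead of time and the loop always runs for exactly that many steps. Once the real work is finished, the remaining steps turn into no-ops rather than being skipped.

```python
def run_fixed(n: int, cycles: int):
    """Sum 1..n on a toy machine that always executes exactly `cycles` steps.

    Returns (result, finished, steps_executed).
    """
    counter, acc, steps = n, 0, 0
    for _ in range(cycles):                # bound fixed ahead of time, not by the input
        active = 1 if counter > 0 else 0   # 1 while real work remains, else 0
        acc += active * counter            # becomes a no-op once the program has halted
        counter -= active
        steps += 1
    return acc, counter == 0, steps
```

The step count is the same for every input, so it reveals nothing about `n`. The cost is also visible: the budget must cover the worst case, every run pays for it, and an input that needs more cycles than the budget allows simply does not finish.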
ZKP research doesn’t use the same “side channels” lingo that other parts of cryptography do, as the underlying issues are subtly different, but the solutions end up amounting to something very similar.
Given the difficulty of converting ordinary code into ZKP form, the only ZKP currently in use anywhere (Zcash) takes the approach of simply writing out the equations by hand, using a C++ library to build up the arithmetic circuit gates one at a time. There is no higher level language or anything that resembles an ordinary programming language. This results in an optimal and constant-time program, but only at massive development cost – essentially only ZKP researchers can craft useful programs in this way, and only very small and simple ones at that. Regular business developers cannot.
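To give a flavour of what "algebraic constraints" look like, here is a small sketch that flattens the statement "I know an x such that x³ + x + 5 = 35" into one gate per operation and checks a candidate witness against the gates. It is an assumption-laden toy: the gate format, the names and the choice of prime are arbitrary, this is not how the Zcash circuit or its library is written, and none of the actual zero knowledge machinery (proof generation, succinct verification) is shown.

```python
P = 2**61 - 1   # arbitrary prime modulus chosen for illustration

# One gate per operation: (operation, left input, right input, output wire).
GATES = [
    ("mul", "x", "x", "v1"),        # v1 = x * x
    ("mul", "v1", "x", "v2"),       # v2 = x^3
    ("add", "v2", "x", "v3"),       # v3 = x^3 + x
    ("add", "v3", "five", "out"),   # out = x^3 + x + 5
]


def build_witness(x: int) -> dict:
    """Evaluate every gate in order, recording the value on each wire."""
    w = {"x": x % P, "five": 5}
    for op, a, b, c in GATES:
        w[c] = (w[a] * w[b] if op == "mul" else w[a] + w[b]) % P
    return w


def satisfies(witness: dict) -> bool:
    """Check that every gate's constraint holds for the given wire values."""
    for op, a, b, c in GATES:
        lhs = witness[a] * witness[b] if op == "mul" else witness[a] + witness[b]
        if lhs % P != witness[c] % P:
            return False
    return True
```

Note that the list of gates is the same whatever `x` is: there are no branches and no input-dependent loops, which is exactly the fixed topology the quoted passage describes. Writing four gates by hand is easy; writing a realistic contract this way is what makes the hand-crafted approach so expensive.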
But hey, if you write your smart contracts to be side channel free up front, you don’t need to worry about Spectre either. So ZKP based approaches don’t solve or even avoid these challenges – they assume a solution already exists.
Of course, that is not the only issue ZKPs have … there is also the problematic performance, the backdoor issues and so on. Algorithmic research is steadily making progress on all of these issues and I imagine that within a few years most of the big problems will be solved, although making performance good enough for some use cases might require new accelerator chips to be designed and manufactured. At that point serious peer review can begin, because these techniques are not invulnerable to human error either (see Remark 2.5 in the above paper for an example of where a mistake was made in the mathematics). In the long run we will probably migrate to mathematical techniques as they mature, leaving hardware techniques behind.
But that eventual deployment of ZKP for all smart contracts, not just hand-crafted and quite limited token logics, will require the development of constraint compilers that convert an ordinary program written by ordinary line-of-business domain experts into algebraic constraints. The work and experience that we will gain by doing these transforms for SGX will therefore be reusable.
Despite the drama Spectre and Meltdown are a continuation of a research trend that has been around for some time. We are not taken by surprise by this class of attacks and had already set events in motion to develop long term solutions for SGX users.
Short term mitigations will involve software and CPU microcode updates, and can close these specific types of attack at some cost in performance. Sooner or later people will find workarounds for these mitigations, I’m sure. Longer term solutions will mostly focus on compiler changes made in concert with hardware changes.
Although it is tempting to imagine ZKP and SGX as two equally plausible paths towards ledger privacy, for reasons we have argued extensively elsewhere this is not the case and there is really no alternative at the moment to secure hardware. The difficulties in bringing ZKP to production for “normal” programs are a superset of the challenges involved in making secure hardware side channel resistant. Experience gained solving these problems is directly applicable to solving ZKP.