Certificate revocation and expiry
As we prepare for the first Corda compatibility zone, R3Net, we have been receiving questions from customers about our planned approach to certificate revocation and expiry. This blog post is not a formal certificate policy document, but outlines some of our thinking so we can gather feedback before implementation. If you have any questions, please get in touch via Discourse, Slack, comments on this blog, or by emailing me: firstname.lastname@example.org
Corda’s philosophy is to reuse proven infrastructure when we can, and as such it uses the PKIX suite of protocols for management of keys and identity. There are four ways to implement revocation in PKIX:
- Certificate revocation lists (RFC 5280)
- OCSP, the Online Certificate Status Protocol (RFC 6960)
- OCSP Stapling
- Custom revocation via software upgrades
Certificate revocation lists were the first revocation mechanism created. They are simply signed lists of certificates that should no longer be trusted, fetched via HTTP and cached locally. Their advantage is that they’re simple and revocation data is cached, so if the web server hosting the CRL goes offline for a while, revoked certificates continue to be rejected. But because they’re just static lists fetched by everyone, they can’t scale to large numbers of revocations. Any change to the list requires all users to redownload it.
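The caching behaviour described above can be sketched roughly as follows. This is only an illustration: `CrlCache` and its fetch function are hypothetical names, and a plain set of serial numbers stands in for a real signed X.509 CRL downloaded over HTTP.

```java
import java.math.BigInteger;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical illustration: keeps the last successfully fetched revocation list. */
final class CrlCache {
    // Stands in for "download the CRL and verify its signature" (an assumption).
    private final Supplier<Set<BigInteger>> fetcher;
    private Set<BigInteger> revokedSerials = Collections.emptySet();

    CrlCache(Supplier<Set<BigInteger>> fetcher) {
        this.fetcher = fetcher;
    }

    /** Tries to refresh the list; on failure the previously cached list stays in force. */
    boolean refresh() {
        try {
            revokedSerials = Collections.unmodifiableSet(new HashSet<>(fetcher.get()));
            return true;
        } catch (RuntimeException e) {
            return false; // CRL server unreachable: keep answering from the cache
        }
    }

    /** Checks a certificate serial number against the cached list, with no network access. */
    boolean isRevoked(BigInteger serial) {
        return revokedSerials.contains(serial);
    }
}
```

The point of the sketch is that `isRevoked` never touches the network, so an outage of the CRL host delays new revocations but does not undo the ones already cached.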
OCSP is an interactive protocol in which a server (the “responder”) provides digitally signed answers to revocation queries. The signed certificate contains the URL of the OCSP responder to use. OCSP can scale to large numbers of revocations, but not large numbers of users as they must individually poll the server. OCSP has the disadvantage that if the responder goes offline you face an ugly choice: either stop using TLS and experience an outage, or continue without doing revocation checks. If you choose the latter this incentivises attackers to mount denial of service attacks against the certificate authority, to ensure you don’t spot their use of a revoked certificate. To make things worse the protocol leaks private information, because the CA sees the identity of everyone you connect to.
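The "ugly choice" above is often described as hard fail versus soft fail. A minimal model of that decision, using hypothetical type names that are not part of any real OCSP library, might look like this:

```java
/** Hypothetical model of the outcome of one OCSP query. */
enum OcspResult { GOOD, REVOKED, RESPONDER_UNREACHABLE }

/** What the client does when the responder cannot be reached. */
enum FailureMode { HARD_FAIL, SOFT_FAIL }

final class OcspPolicy {
    private OcspPolicy() {}

    /** Returns true if the connection should be allowed to proceed. */
    static boolean allowConnection(OcspResult result, FailureMode mode) {
        switch (result) {
            case GOOD:
                return true;
            case REVOKED:
                return false;
            case RESPONDER_UNREACHABLE:
                // Hard fail turns a responder outage into a service outage;
                // soft fail lets a revoked certificate through while the
                // responder is down, which is what rewards a DoS attack.
                return mode == FailureMode.SOFT_FAIL;
            default:
                throw new IllegalArgumentException("Unknown result: " + result);
        }
    }
}
```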
OCSP Stapling is an improvement over OCSP. With stapling, the server (node) you connect to effectively proxies and caches the OCSP response. At a stroke this fixes the privacy and scaling issues by distributing the query load across the servers that are already talking to the user anyway. Although you still face the question of what to do about responder outages, it’s a significant upgrade over plain OCSP. Unfortunately, OCSP Stapling support was only added in Java 9 and Corda currently targets Java 8.
Finally, it is always possible to revoke certificates by issuing a software update. This is effectively the only way to revoke trust in compromised root certificates, and has been used several times by browser and OS vendors to revoke certificate authorities themselves.
Because we are using ordinary PKIX, which of the first three approaches to use is up to the operator of the compatibility zone. For R3Net, we are planning to start with certificate revocation lists. Although old, this protocol is appropriate because CRLs can be distributed via caching content delivery networks, providing very high availability and robustness against DDoS attacks. It also offers perfect privacy, as revocation checks are done locally. Whilst the technique does not scale to very large numbers of revocations, the number of R3Net users in the world will not be high enough for this to pose practical problems any time soon. Even the web PKI, which encompasses millions of websites, has CRLs that are only a few megabytes in size. Node operators would not have to do anything: the software will poll CRL URLs for updates by itself and cache the results in the database. If CRL sizes did become too large, the CRL standard supports “delta CRLs”, which reference an earlier base CRL by its CRL number and list only the changes made since then. In this way the amount of data to download can be minimised.
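For readers curious what CRL-based checking looks like at the Java level, the standard JDK exposes it through `PKIXRevocationChecker`. The sketch below shows one way to ask for CRLs only, with no fallback to OCSP. The class name `CrlOnlyRevocation` is hypothetical, and this is not necessarily how the Corda node will wire it up.

```java
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertPathValidator;
import java.security.cert.PKIXRevocationChecker;
import java.security.cert.PKIXRevocationChecker.Option;
import java.util.EnumSet;

final class CrlOnlyRevocation {
    private CrlOnlyRevocation() {}

    /** Builds a revocation checker that consults CRLs and never falls back to OCSP. */
    static PKIXRevocationChecker newChecker() {
        try {
            CertPathValidator validator = CertPathValidator.getInstance("PKIX");
            PKIXRevocationChecker checker =
                    (PKIXRevocationChecker) validator.getRevocationChecker();
            // PREFER_CRLS: try CRLs before OCSP. NO_FALLBACK: do not try OCSP afterwards.
            checker.setOptions(EnumSet.of(Option.PREFER_CRLS, Option.NO_FALLBACK));
            return checker;
        } catch (NoSuchAlgorithmException e) {
            // "PKIX" is a standard algorithm name, so this is not expected in practice.
            throw new IllegalStateException("PKIX validator unavailable", e);
        }
    }
}
```

The resulting checker would typically be attached to a `PKIXParameters` instance with `addCertPathChecker` before validating a certificate path. On Oracle and OpenJDK builds, fetching CRLs from the distribution point named in the certificate may also require the `com.sun.security.enableCRLDP` system property to be set to `true`.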
Once we have upgraded to Java 9 we may revisit this and examine an implementation of OCSP Stapling instead.
All approaches require a decision about what polling interval to use. This will be a parameter controlled by the operator of the compatibility zone the node is a part of, and defines the window of time required for a revocation to take effect. In the case of an OCSP based solution, it also reflects a tradeoff between revocation speed and availability.
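To make the relationship between polling interval and revocation delay concrete, here is a small sketch that computes when a node would first see a newly published revocation. It assumes polls happen at a fixed interval from a known start time and ignores CDN cache lifetimes and publication latency; `RevocationWindow` is a hypothetical name.

```java
import java.time.Duration;
import java.time.Instant;

final class RevocationWindow {
    private RevocationWindow() {}

    /**
     * Returns the time of the first poll at or after {@code published},
     * given polls at firstPoll, firstPoll + interval, firstPoll + 2 * interval, ...
     */
    static Instant firstPollSeeing(Instant firstPoll, Duration interval, Instant published) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (!published.isAfter(firstPoll)) {
            return firstPoll;
        }
        long elapsedMillis = Duration.between(firstPoll, published).toMillis();
        long stepMillis = interval.toMillis();
        long polls = (elapsedMillis + stepMillis - 1) / stepMillis; // ceiling division
        return firstPoll.plusMillis(polls * stepMillis);
    }
}
```

Under these assumptions the delay is always less than one polling interval: with a six hour interval, a revocation published one second after a poll is picked up just under six hours later.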
A more detailed design doc will be written as part of implementing this feature that will address questions like the precise time a revocation check occurs, what happens if a certificate is revoked whilst a flow is in flight, whether/how node administrators can override revocations and so on.
X.509 certificates contain expiry times after which TLS stacks will return an error rather than establish a connection. The original purpose of expiry was to handle the steady degradation of RSA key strength in the face of improvements to factorisation algorithms. Over time, expiry also came to be used to limit the lifetime of stolen private keys and to enforce the ability to quickly phase out old algorithms. Modern web EV certificates are limited to a lifetime of a year: this is not to ensure cryptographic strength but rather to ensure that browser vendors can enforce new certificate security policies in a reasonable length of time.
Whilst these are noble goals, key expiry has created serious problems for the internet community over the years. Expiry creates an immediate and total outage of the service with the expired certificate the moment the clock strikes the (very literal) deadline. According to a survey of over 2000 businesses:
- Expired certificates cost businesses $15 million per outage.
- The average organization had two unplanned certificate-related outages over the past two years.
- When it comes to business continuity costs, the biggest part, or $4.2 million, is brand image damage, followed by $4.1 million in lost revenues, and $3.4 million each for lost productivity and remediation expenses.
Unexpected certificate expiry has caused major outages at Google, Microsoft and Instagram. Worse, these outages were pointless: expiry requires administrators to guess when a key might become weak, but they can’t really know and in all these cases expiry was not protecting anyone against any real problem. When even the world’s most sophisticated technology firms can’t avoid hours-long disruptions, it’s time to ask hard questions about the cost/benefit ratio of this feature.
For this reason our current thinking is that R3Net certificates will not automatically expire. When a certificate uses a key that is suspected to have become weak, for example due to a cryptanalytic breakthrough or evidence that it may have leaked, a combination of written policies and email/phone communication with administrators will be used to encourage key rotation. Zone agreements will be used to set expectations around cryptographic hygiene. Humans will take control of a process that has traditionally been delegated to machines. That delegation has led to disastrous results when the machines mechanically and blindly close the doors to your business, usually without even knowing if there is a problem with the cryptography at all.
We appreciate that some organisations may have rules in place that mandate them to use key expiry. Whilst R3Net will not insist on an expiry time, and the software will default to not using one, if a certificate signing request does contain an expiry time we won’t reject it. It will be up to the organisation that requested expiry to ensure that it does not cause unexpected outages.
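An organisation that does opt into expiry could guard against surprise outages with a simple early warning check along these lines. This is a sketch only: `ExpiryMonitor` and the warning window are hypothetical, and the check relies on nothing more than the standard `X509Certificate.getNotAfter()` accessor.

```java
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;

final class ExpiryMonitor {
    private ExpiryMonitor() {}

    /** True if {@code notAfter} falls at or before {@code now + warningWindow}. */
    static boolean expiresWithin(Instant notAfter, Instant now, Duration warningWindow) {
        return !notAfter.isAfter(now.plus(warningWindow));
    }

    /** Convenience overload that reads the expiry time from a certificate. */
    static boolean expiresWithin(X509Certificate cert, Instant now, Duration warningWindow) {
        return expiresWithin(cert.getNotAfter().toInstant(), now, warningWindow);
    }
}
```

Run on a schedule with a window of, say, 30 days, a check like this gives administrators time to rotate the certificate before the deadline rather than after it.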
It is simply not acceptable that international commerce slams to a halt because the clock reached midnight. Although slightly unconventional, we believe this proposed policy will deliver an exceptionally strong PKI with unparalleled uptime, robustness and trustworthiness.