December 02, 2020
A few weeks ago, the Ethereum network suffered a chain split that subsequently led to outages of various services. Following that, I spent a considerable amount of time trying to understand how this really happened and whether it could have been prevented. This is not only because, as a software engineer, I am naturally curious, but also because I am part of the engineering team building Corda, a distributed ledger platform that can be perceived as a competitor of some ‘enterprise’ versions of Ethereum. As a result, I am always interested in identifying areas for improvement, and I also believe different platforms in this space can learn a lot from each other. In this post, I’ll try to analyse the incident a bit more and reflect on the underlying factors that contributed to it.
Let’s start with how users experienced the issues. Some users started seeing two different chains, depending on their vantage point. For instance, Etherscan and Blockchair were showing two different chains after block 11234873. This led some exchanges, such as Binance, to temporarily disable withdrawals.
Another big provider – Infura – suffered an outage that caused a delay in price feeds of ether (ETH) and ERC-20 tokens and a general disruption in other services in the wider DeFi space.
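To make the "two different chains after block 11234873" observation a bit more concrete, here is a minimal sketch of how one could locate the point where two views of a chain diverge. It assumes each view is simply a mapping from block number to block hash. The function name `last_common_height` and the hash values are made up for illustration, and this is not how any of the explorers mentioned above actually work.

```python
from typing import Dict, Optional


def last_common_height(view_a: Dict[int, str], view_b: Dict[int, str]) -> Optional[int]:
    """Return the highest block number at which both views still agree.

    Assumption: once the two views disagree at some height, they are on
    different chains from that height onwards.
    """
    last_agreed = None
    # Only heights known to both views can be compared.
    for height in sorted(set(view_a) & set(view_b)):
        if view_a[height] != view_b[height]:
            break  # first disagreement: the split happened after last_agreed
        last_agreed = height
    return last_agreed


if __name__ == "__main__":
    # Placeholder hashes, NOT real Ethereum block hashes.
    explorer_a = {11234872: "0xaa", 11234873: "0xbb", 11234874: "0xcc"}
    explorer_b = {11234872: "0xaa", 11234873: "0xbb", 11234874: "0xdd"}
    print(last_common_height(explorer_a, explorer_b))
```

With the placeholder data above, the two views agree up to and including block 11234873 and disagree on the next block, which mirrors what users were seeing.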
Let’s have a look at how the incident unfolded in a bit more detail. Fortunately, many of the parties that were involved have already performed very thoughtful analyses¹ that we can synthesise here. I will try and take it slow, so that people who are not extremely familiar with Ethereum can follow along.
An Ethereum network consists of a set of Ethereum nodes. Each of those nodes runs a piece of software known as a client. There are many implementations of Ethereum clients, but they all have to follow a formal specification that defines the proper behaviour of a client. This is essential so that all these clients can agree on whether a transaction is valid or not, thus ensuring they will reach the same decision when operating on the same data. A year ago, one of these clients – Geth – released a version that contained a bug, which meant this version of the client was not fully compliant with the specification². As a result, for the very specific category of transactions that could trigger this bug, Geth and other client implementations could reach different decisions on whether a transaction is valid. The bug remained dormant in the Geth codebase for almost a year until it was reported on 20th July, 2020. Soon after that, the developers of Geth fixed the bug and released a new version, v1.9.17. This aligned Geth with the other available clients that conformed to the specification, but Geth clients from this version onwards could now disagree with Geth clients of previous versions. And this is what happened.
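For readers who prefer code, the mechanism can be illustrated with a toy model. This is only a sketch under heavy simplifying assumptions, not Geth's actual logic: the `Node` and `Block` classes, the `execute` function and the "edge-case" transaction tag are all hypothetical. Each node re-executes the transactions of a block and accepts the block only if it arrives at the same resulting state as the block producer. A node with a bug in its execution rules computes a different state for the triggering transaction, rejects the block and stops following the chain.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Block:
    number: int
    txs: Tuple[str, ...]
    state_root: str  # state claimed by the block producer after applying txs


def execute(state: str, tx: str, buggy: bool) -> str:
    """Apply one transaction to the state (toy model: state is just a hash)."""
    effect = tx
    # Hypothetical bug: the old client mishandles one narrow category of txs.
    if buggy and tx.startswith("edge-case"):
        effect = tx + ":mishandled"
    return hashlib.sha256(f"{state}|{effect}".encode()).hexdigest()


def apply_txs(state: str, txs, buggy: bool) -> str:
    for tx in txs:
        state = execute(state, tx, buggy)
    return state


class Node:
    def __init__(self, name: str, buggy: bool):
        self.name, self.buggy = name, buggy
        self.state, self.height = "genesis", 0

    def import_block(self, block: Block) -> bool:
        computed = apply_txs(self.state, block.txs, self.buggy)
        if computed != block.state_root:
            return False  # disagrees with the producer: block is rejected
        self.state, self.height = computed, block.number
        return True


def simulate():
    fixed, old = Node("fixed", buggy=False), Node("old", buggy=True)
    producer_state = "genesis"  # blocks are produced by a spec-compliant client
    results: List[Tuple[bool, bool]] = []
    for number, txs in enumerate([["transfer-1"], ["edge-case-tx"], ["transfer-2"]], start=1):
        block = Block(number, tuple(txs), apply_txs(producer_state, txs, buggy=False))
        producer_state = block.state_root
        results.append((fixed.import_block(block), old.import_block(block)))
    return fixed.height, old.height, results


if __name__ == "__main__":
    print(simulate())
```

In this toy run, both nodes accept the first block, but only the fixed node accepts the block containing the triggering transaction and the one after it. The old node stays stuck at height 1, which is roughly what "falling out of sync" looks like.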
For obvious reasons, the developers of Geth did not publicise the fact that the v1.9.17 release contained a fix for a critical security bug that could affect consensus. Their main argument was the following:
In this particular instance, the consensus bug was dormant in the code for over 1 year. The probability after all that time for someone to accidentally trigger it is tiny. Opposed to that, the probability of someone maliciously triggering it if highlighted as a security issue is not insignificant. The Geth team made the conscious decision not to mention it, hoping that people eventually upgrade to versions that contain the fix and the issue is gradually ejected from the network.
However, luck plays funny games! Some people building on top of Geth came across this bug and decided to run an experiment by submitting a transaction that would trigger it on the Ethereum mainnet! Most of the nodes in the network that were using the Geth client had upgraded, but some hadn’t yet. As a result, the nodes that hadn’t upgraded fell out of sync with the rest of the network.
Infura was predominantly using the Geth client and hadn’t upgraded to v1.9.17 yet, since they didn’t know earlier versions had a serious bug and were following their regular upgrade cadence. As a result, a consensus failure happened at the block containing this transaction, which led to a complete sync halt affecting several of their systems. Clients that were unable to use these services started retrying with increasing frequency, which overloaded other services and forced the team to temporarily disable them too. The team tried to upgrade the Geth version they were using, which proved to be more complicated than expected because they were actually using a forked version of Geth. As soon as the upgrade to the new version of Geth was completed, the corresponding nodes were able to switch back to the right chain and their systems returned to normal operation.
Incidents with such a big impact are a good opportunity for reflection. The Ethereum community has already started this process to understand how they can prevent these issues in the future. From my point of view, I can’t help but use this opportunity to understand how these risks translate to a permissioned setup and how a platform like Corda could help mitigate issues like these.
RETURNDATACOPY. It is not only hard to implement a full virtual machine; judging from the many smart contracts that were built in an insecure way, with the DAO hack probably being the best known so far, it has also proved hard to use and reason about. For this reason, Corda makes use of the battle-tested Java Virtual Machine (JVM), which provides higher-level APIs that are easier and safer to use.
With all that said, I have to acknowledge that this incident also highlighted some positive aspects of the Ethereum ecosystem. The model of a single formal specification that allows many diverse implementations of the software provides a good degree of fault tolerance, since it reduces single points of failure. This was evident from the fact that, even though a fault in a single client had considerable impact, the majority of the network was left unaffected, which is what mattered most in this case. This is also one of the reasons why the Corda protocol has always been defined by the open-source codebase, as we share the belief that “the more eyes on the code, the better”.
²: The version was v1.9.7 and the part of the specification that got broken was EIP-211, which is related to a gas charging mechanism for functions that return data of arbitrary size.
: For those of you that don’t know, Corda is itself open-source and thus subject to similar dynamics. However, we also maintain Corda Enterprise, which is an interoperable commercial distribution that allows us to mitigate some of the issues described.
— Dimos Raptis is a Software Engineer at R3, an enterprise blockchain software firm working with a global ecosystem of more than 350 participants across multiple industries from both the private and public sectors to develop on Corda, its open-source blockchain platform, and Corda Enterprise, a commercial version of Corda for enterprise usage.