On May 11 and May 12, 2023, Ethereum Mainnet experienced two significant interruptions that delayed block production for 4 epochs and 9 epochs, respectively. During the second incident, the inactivity penalty was triggered. The network nevertheless recovered on its own on both occasions.
The first outage resulted in approximately 47 missed blocks, and the second in approximately 149. The delays and missed blocks cost affected block producers approximately 5 ETH in lost revenue. This figure is expected to be significantly higher once builder bundle rewards are taken into account.
An estimated 65% of validators were offline for 8 epochs, which triggered an inactivity leak costing an estimated 28 ETH, plus around 50 ETH in lost revenue from missed attestations. Together with the roughly 5 ETH in missed block revenue, the estimated total loss was approximately 83 ETH, an average of less than 0.00015 ETH per validator.
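The 83 ETH total appears to be the sum of the three loss figures quoted above (5 + 28 + 50). The short sketch below reproduces that arithmetic and the per-validator average. The validator count is an assumption chosen for illustration (roughly 560,000 active validators); it is not stated in the text.

```python
# Loss figures quoted in the text, in ETH.
MISSED_BLOCK_REVENUE = 5
INACTIVITY_LEAK = 28
MISSED_ATTESTATIONS = 50

# Assumption (not from the text): roughly 560,000 active validators at the time.
ASSUMED_ACTIVE_VALIDATORS = 560_000

total_loss = MISSED_BLOCK_REVENUE + INACTIVITY_LEAK + MISSED_ATTESTATIONS
per_validator = total_loss / ASSUMED_ACTIVE_VALIDATORS

print(f"total loss: {total_loss} ETH")
print(f"average per validator: {per_validator:.6f} ETH")
```

Under that assumed validator count, the average stays below the 0.00015 ETH bound given above.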
Notably, no validator slashings were attributed to these incidents, indicating that the issues were systemic rather than the fault of individual validators.
The root cause of the outages lay in some of the consensus clients, including Prysm, which could not efficiently process valid attestations with an old target checkpoint. To verify these attestations, Prysm had to recompute prior beacon states, which exhausted resources and significantly slowed its responses to validator client requests.
A series of old attestations voting for an old beacon block (a block from epoch N-2 during epoch N) were broadcast, triggering the issues in Prysm and Teku. These valid but problematic attestations filled Prysm's cache so quickly that it was forced to regenerate the same state multiple times.
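The cache behaviour described above can be illustrated with a toy model. The sketch below is not Prysm's cache: `StateCache`, its LRU eviction policy, and the capacities are all assumptions made for illustration. It only shows why a cache that is slightly too small for the set of requested checkpoints ends up regenerating the same states over and over.

```python
from collections import OrderedDict


class StateCache:
    """Toy LRU cache standing in for a beacon-state cache (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.regenerations = 0  # counts expensive state rebuilds

    def get(self, checkpoint):
        if checkpoint in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(checkpoint)
            return self.entries[checkpoint]
        # Cache miss: stands in for replaying blocks to rebuild the state.
        self.regenerations += 1
        state = f"state@{checkpoint}"
        self.entries[checkpoint] = state
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return state


def count_regenerations(capacity, requests):
    """Return how many state rebuilds a request sequence causes."""
    cache = StateCache(capacity)
    for checkpoint in requests:
        cache.get(checkpoint)
    return cache.regenerations


# Attestations cycling over four distinct old checkpoints, 25 rounds.
requests = ["A", "B", "C", "D"] * 25

print(count_regenerations(4, requests))  # cache large enough for all four
print(count_regenerations(3, requests))  # cache one entry too small
```

With room for all four checkpoints, each state is built once. With one slot fewer, every request evicts a state that is needed again shortly afterwards, so all 100 requests trigger a rebuild.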
The issues were detected after network participation dropped substantially at epochs 200,551 and 200,750, which temporarily halted chain finalization.
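The epoch numbers can be related to the dates given earlier by converting them to wall-clock time. The sketch below uses the Mainnet beacon chain constants (32 slots per epoch, 12 seconds per slot, genesis at Unix time 1606824023); the helper name `epoch_start_utc` is hypothetical.

```python
from datetime import datetime, timezone

# Mainnet beacon chain constants.
SLOTS_PER_EPOCH = 32
SECONDS_PER_SLOT = 12
GENESIS_TIME = 1606824023  # 2020-12-01 12:00:23 UTC


def epoch_start_utc(epoch):
    """Return the UTC time at which the first slot of `epoch` begins."""
    first_slot = epoch * SLOTS_PER_EPOCH
    timestamp = GENESIS_TIME + first_slot * SECONDS_PER_SLOT
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


for epoch in (200_551, 200_750):
    print(epoch, epoch_start_utc(epoch).isoformat())
```

Both epochs fall on the evenings (UTC) of May 11 and May 12, 2023, matching the dates of the two incidents.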
The main issue was that the network failed to finalize because of missing blocks and attestations. The network was also under stress from increased processing of max deposits. Prysm in particular repeatedly replayed blocks (the replayBlocks function), which led to high CPU usage.
Despite these problems, the duration of the incidents was relatively short, with no mass slashings reported. The network's client diversity and some clients' ability to propose blocks and create attestations enabled the chain to recover. Importantly, no manual intervention or emergency release was needed to address the finality issue.
These incidents highlighted the limitations of testnets, which are not representative of the Mainnet environment, and underscored the need for more robust stress tests and contingency planning. They also served as a successful field test of inactivity leak penalties.
Several fixes were introduced to prevent a recurrence. These include using the head state to validate attestations whose target root is a recent canonical block, using the next slot cache to validate attestations for boundary slots in the previous epoch, and discarding any attestation that cannot be validated under either of these two rules. Under normal conditions, these measures should reduce the likelihood of state replays and cause attestations for old blocks to be ignored.
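The three rules can be read as a simple decision procedure. The sketch below is one interpretation, not the clients' actual code: the names `StateSource`, `AttestationTarget`, and `choose_state_source` are hypothetical, "recent canonical block" is modelled as membership in a caller-supplied set of roots, and "boundary slot in the previous epoch" is assumed to mean the first slot of the previous epoch.

```python
from dataclasses import dataclass
from enum import Enum

SLOTS_PER_EPOCH = 32


class StateSource(Enum):
    HEAD_STATE = "head state"
    NEXT_SLOT_CACHE = "next slot cache"
    DISCARD = "discard"


@dataclass(frozen=True)
class AttestationTarget:
    root: str  # root of the target block
    slot: int  # slot of the target block


def choose_state_source(target, recent_canonical_roots, current_epoch):
    """Pick how to validate an attestation, following the three rules."""
    # Rule 1: the target is a recent canonical block, so the head state suffices.
    if target.root in recent_canonical_roots:
        return StateSource.HEAD_STATE
    # Rule 2 (assumed meaning): the target sits at the first slot of the
    # previous epoch, so the next slot cache can be used.
    if current_epoch > 0 and target.slot == (current_epoch - 1) * SLOTS_PER_EPOCH:
        return StateSource.NEXT_SLOT_CACHE
    # Rule 3: anything else is dropped instead of triggering a state replay.
    return StateSource.DISCARD


recent = {"0xaa"}
current_epoch = 200_553

# A target from epoch N-2 that is neither recent nor a previous-epoch boundary.
old_target = AttestationTarget(root="0xcc", slot=(current_epoch - 2) * SLOTS_PER_EPOCH)
print(choose_state_source(old_target, recent, current_epoch))
```

The point of this ordering is that no branch requires regenerating an old state: an attestation is either validated against a state that is already available or it is discarded.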
While the Mainnet outage posed significant challenges, the swift recovery and the valuable lessons learned pave the way for a more resilient Ethereum network.