09-30-22 Solana Mainnet Beta Outage Report

At approximately 22:41 UTC on Friday, Sep. 30, the Solana Mainnet Beta cluster halted when the network was unable to recover from a fork caused by a bug in the consensus algorithm implementation. Block production resumed at approximately 06:57 UTC on Saturday, Oct. 1, after a coordinated restart, and network operators continued to restore client services over the next several hours.

What caused the outage?

  • Due to a validator operator’s malfunctioning hot-spare node, which the operator had deployed as part of a high-availability configuration, duplicate blocks were produced at the same slot height.
  • Both the primary and spare validators became active at the same time, operating with the same node identity, but proposed blocks of differing composition. This situation persisted for at least 24 hours prior to the outage, with most of the validator’s leader slots producing duplicate blocks which were handled safely by the cluster.
  • Initially, duplicate blocks were handled by the network as expected. For example, duplicate blocks were produced in slot 153139220 (220) and the cluster reached consensus on one of those blocks before it continued on to slot 153139221 (221), as should happen for duplicate block conflict resolution.
  • However, at the next slot, 221, duplicate blocks were observed again and an edge case was encountered. Even though the correct version of block 221 was confirmed, a bug in the fork selection logic prevented block producers from building on top of 221, which in turn prevented the cluster from achieving consensus.
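
The duplicate-block condition described above can be illustrated with a minimal sketch. This is not the validator's actual detection code: the `Block` record, its fields, the `find_duplicate_slots` helper, and the identity and hash strings are all hypothetical names and placeholder values introduced for illustration. Only the slot numbers 153139220 and 153139221 come from this report. The sketch flags any slot for which a single node identity proposed more than one distinct block.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    slot: int        # slot height the block was proposed for
    leader: str      # node identity of the proposer (hypothetical field)
    block_hash: str  # identifies the block's contents (hypothetical field)


def find_duplicate_slots(blocks):
    """Return {(slot, leader): sorted hashes} for slots with conflicting blocks."""
    seen = defaultdict(set)
    for block in blocks:
        seen[(block.slot, block.leader)].add(block.block_hash)
    # A slot counts as duplicated only when one identity produced differing blocks.
    return {key: sorted(hashes) for key, hashes in seen.items() if len(hashes) > 1}


# Placeholder data: a primary and a hot spare sharing one identity
# both propose blocks for slots 220 and 221.
observed = [
    Block(153139220, "validator-A", "hash-p"),  # from the primary
    Block(153139220, "validator-A", "hash-q"),  # from the hot spare
    Block(153139221, "validator-A", "hash-r"),
    Block(153139221, "validator-A", "hash-s"),
]
duplicates = find_duplicate_slots(observed)
```

Detection alone does not resolve the conflict. As the report notes, the cluster must still reach consensus on one version of the block and build on it, and it was that fork selection step that failed at slot 221.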

Timeline

  • 09-30-2022 21:46 UTC: Validators start reporting consensus failure. Voting is still occurring but roots are not advancing.
  • 09-30-2022 22:00 UTC: Investigation commences to see if recovery is possible without a restart.
  • 09-30-2022 22:41 UTC: The Solana Mainnet Beta cluster halts.
  • 09-30-2022 23:09 UTC: It is discovered that a validator had been producing duplicate blocks for its leader slots. Its operator is contacted and the validators are taken offline.
  • 10-01-2022 00:08 UTC: Attempts to recover the cluster fail. Restart planning begins.
  • 10-01-2022 01:10 UTC: Restart instructions issued.
  • 10-01-2022 06:57 UTC: 80% of stake-weight online, roots advancing and network online.
  • 10-01-2022 07:30 UTC: The core team identified the likely bug that caused the consensus failure.
  • 10-01-2022 09:30 UTC: A fix was proposed and a test was added to reproduce the edge-case bug.
  • 10-03-2022 08:30 UTC: After review by the core team, the patch was merged into the master branch and backported to all release branches.
  • 10-03-2022 15:00 - 20:00 UTC: New release binaries were built and deployed to canary nodes for testing.
  • 10-04-2022 15:30 UTC: An announcement to upgrade was issued to validators, who then began actively upgrading their systems to versions v1.10.40 and v1.13.2.
  • 10-07-2022 04:00 UTC: 90% of stake-weight had applied the patch to fix the consensus bug, and the core team determined the risk of the bug to the network to be sufficiently mitigated.
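
For reference, the intervals implied by the timeline can be worked out directly from its timestamps. The sketch below is only a convenience calculation; the `utc` helper and the variable names are introduced here for illustration and are not part of the report.

```python
from datetime import datetime, timezone


def utc(stamp):
    """Parse a timeline timestamp of the form MM-DD-YYYY HH:MM as UTC."""
    return datetime.strptime(stamp, "%m-%d-%Y %H:%M").replace(tzinfo=timezone.utc)


first_report = utc("09-30-2022 21:46")  # validators report consensus failure
halt = utc("09-30-2022 22:41")          # cluster halts
resumed = utc("10-01-2022 06:57")       # roots advancing, network online
patched = utc("10-07-2022 04:00")       # 90% of stake-weight patched

halt_duration = resumed - halt             # 8 hours 16 minutes
degraded_duration = resumed - first_report  # 9 hours 11 minutes
exposure_after_restart = patched - resumed  # 5 days 21 hours 3 minutes

print(halt_duration, degraded_duration, exposure_after_restart)
```

By these timestamps, the cluster was halted for about 8 hours 16 minutes, and roughly 5 days 21 hours passed between the restart and the point at which 90% of stake-weight was running the fix.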