Chain Halt Recovery
This document describes how to recover from a chain halt.
It assumes that the cause of the chain halt has been identified, and that the new release has been created and verified to function correctly.
See Chain Halt Troubleshooting for more information on identifying the cause of a chain halt.
Background
Pocket Network is built on top of the cosmos-sdk, which utilizes the CometBFT consensus engine.
CometBFT's Byzantine Fault Tolerant (BFT) consensus algorithm requires that more than 2/3 of Validators
are online and voting for the same block to reach consensus. To maintain liveness
and avoid a chain halt, we need a supermajority (> 2/3) of Validators to participate
and use the same version of the software.
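As a toy illustration of the supermajority requirement (not protocol code; the voting-power figure is made up), the smallest amount of voting power that can commit a block is the smallest integer strictly greater than 2/3 of the total:

```shell
# Illustrative only: compute the minimum voting power needed to commit a block.
# CometBFT requires strictly MORE than 2/3 of total voting power.
TOTAL_POWER=90
QUORUM=$((TOTAL_POWER * 2 / 3 + 1))   # floor(2n/3) + 1
echo "quorum=$QUORUM"                 # quorum=61
```

If more than 1/3 of voting power is offline or running a different binary, no block can reach this threshold and the chain halts.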
Resolving halts during a network upgrade
If the halt is caused by a network upgrade, the solution can be as simple as
skipping the upgrade (i.e. unsafe-skip-upgrade) and creating a new (fixed) upgrade.
Read more about upgrade contingency plans.
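As a sketch, skipping a faulty upgrade uses the cosmos-sdk start flag --unsafe-skip-upgrades, which takes the height(s) of the upgrade(s) to skip (the height below is illustrative; every validator must pass the same heights to avoid a fork):

```bash
# Illustrative only: skip the upgrade scheduled at height 103.
# All validators must start with the same --unsafe-skip-upgrades heights.
poktrolld start --unsafe-skip-upgrades 103
```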
Manual binary replacement (preferred)
This is the preferred way of resolving consensus-breaking issues.
Significant side effect: this breaks the ability to sync from genesis without manual intervention.
For example, when a consensus-breaking issue occurs on a node that is syncing from the first block, node operators need
to manually replace the binary with the new one. There are efforts underway to mitigate this issue, including
configuration for cosmovisor
that could automate the process.
Since the chain is not moving, it is impossible to issue an automatic upgrade with an upgrade plan. Instead, we need social consensus to manually replace the binary and get the chain moving.
The steps to do so are:
- Prepare and verify a new binary that addresses the consensus-breaking issue.
- Reach out to the community and validators so they can upgrade the binary manually.
- Update the documentation to include the height range in which the binary needs to be replaced.
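For step 2, a minimal sketch of what each operator does, assuming a systemd-managed node (the service name, paths, and the name of the fixed binary are hypothetical placeholders):

```bash
# Hypothetical service name and paths; adjust to your deployment.
sudo systemctl stop poktrolld
sudo install -m 0755 ./poktrolld-fixed /usr/local/bin/poktrolld
poktrolld version   # verify the fixed release before restarting
sudo systemctl start poktrolld
```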
TODO_MAINNET(@okdas):
- For step 2: Investigate if the CometBFT rounds/steps need to be aligned as in Morse chain halts. See this ref.
- For step 3: Add cosmovisor documentation so it is configured to automatically replace the binary when syncing from genesis.
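For context on the cosmovisor TODO above, a minimal sketch of the environment cosmovisor reads before managing binary swaps (variable names are from cosmovisor's documentation; the values are illustrative, not official network settings):

```shell
# Illustrative cosmovisor environment; adjust paths to your deployment.
export DAEMON_NAME=poktrolld
export DAEMON_HOME=$HOME/.poktroll
# Restart the daemon automatically after an upgrade is applied:
export DAEMON_RESTART_AFTER_UPGRADE=true
# Auto-downloading binaries is convenient but trusts the upgrade metadata:
export DAEMON_ALLOW_DOWNLOAD_BINARIES=false
echo "$DAEMON_NAME"
```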
Rollback, fork and upgrade
These instructions are only relevant to Pocket Network's Shannon release.
We do not currently use x/gov
or on-chain voting for upgrades.
Instead, all participants in our DAO vote on upgrades off-chain, and the Foundation
executes transactions on their behalf.
This approach should be avoided unless absolutely necessary, and it requires more testing. In our tests, the full nodes kept propagating the existing blocks signed by the Validators, making it hard to roll back.
Performing a rollback is analogous to forking the network at an older height.
However, if necessary, the instructions to follow are:
- Prepare and verify a new binary that addresses the consensus-breaking issue.
- Create a release.
- Prepare an upgrade transaction to the new version.
- Disconnect the Validator set from the rest of the network 3 blocks prior to the height of the chain halt. For example:
  - Assume an issue at height 103.
  - Revert the validator set to height 100.
  - Submit an upgrade transaction at 101.
  - Upgrade the chain at height 102.
  - Avoid the issue at height 103.
- Ensure all validators rolled back to the same height and use the same snapshot (how to get a snapshot).
  - The snapshot should be imported into each Validator's data directory.
  - This is necessary to ensure data continuity and prevent forks.
- Isolate the validator set from full nodes (why this is necessary).
  - This is necessary to prevent full nodes from gossiping blocks that have been rolled back.
  - This may require using a firewall or a private network.
  - Validators should only be permitted to gossip blocks amongst themselves.
- Start the validator set and perform the upgrade. For example, reiterating the process above:
  - Start all Validators at height 100.
  - On block 101, submit the MsgSoftwareUpgrade transaction with a Plan.height set to 102.
  - x/upgrade will perform the upgrade in the EndBlocker of block 102.
    - The node will stop progressing, with an error indicating it is waiting for the upgrade to be performed.
    - Cosmovisor deployments automatically replace the binary.
    - Manual deployments will require a manual replacement at this point.
  - Start the node back up.
- Wait for the network to reach the height of the previous ledger (104+).
- Allow validators to open their network to full nodes again.
  - Note: full nodes will need to perform the rollback or use a snapshot as well.
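The height bookkeeping in the example above can be sketched as simple arithmetic relative to the halt height (heights mirror the worked example; substitute the real halt height):

```shell
# Illustrative height plan for the rollback-and-upgrade procedure above.
HALT_HEIGHT=103
ROLLBACK_HEIGHT=$((HALT_HEIGHT - 3))        # revert the validator set to 100
UPGRADE_TX_HEIGHT=$((ROLLBACK_HEIGHT + 1))  # submit MsgSoftwareUpgrade at 101
PLAN_HEIGHT=$((ROLLBACK_HEIGHT + 2))        # Plan.height = 102; x/upgrade runs in EndBlocker
echo "$ROLLBACK_HEIGHT $UPGRADE_TX_HEIGHT $PLAN_HEIGHT"
```

The key invariant is that the upgrade transaction lands after the rollback height but before the height where the issue occurred, so the fixed binary is live when block 103 is re-proposed.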
Troubleshooting
Data rollback - retrieving snapshot at a specific height (step 5)
There are two ways to get a snapshot from a prior height:
- Execute poktrolld rollback --hard repeatedly, until the command responds with the desired block number.
- Use a snapshot from below the halt height (e.g. 100) and start the node with the --halt-height=100 parameter so it only syncs up to that height and then gracefully shuts down. Add this argument to poktrolld start like this: poktrolld start --halt-height=100
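The first option can be scripted as a loop; the check on the command's output is an assumption about how the rollback command reports the new height and should be verified against your binary's actual output before relying on it:

```bash
# Sketch: repeat hard rollbacks until the target height is reached.
# Assumes the rollback command prints the height it rolled back to.
TARGET_HEIGHT=100
while :; do
  OUT=$(poktrolld rollback --hard)
  echo "$OUT"
  echo "$OUT" | grep -q "height $TARGET_HEIGHT" && break
done
```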
Validator Isolation - risks (step 6)
Having even one node with knowledge of the forking ledger can jeopardize the whole process. In particular, the following log errors are a sign of nodes syncing blocks from the wrong fork:
found conflicting vote from ourselves; did you unsafe_reset a validator?
conflicting votes from validator