Chain Halt Recovery
This document describes how to recover from a chain halt.
It assumes that the cause of the chain halt has been identified, and that the new release has been created and verified to function correctly.
See Chain Halt Troubleshooting for more information on identifying the cause of a chain halt.
Background
Pocket Network is built on top of the cosmos-sdk, which utilizes the CometBFT consensus engine.
CometBFT's Byzantine Fault Tolerant (BFT) consensus algorithm requires that more than 2/3 of Validators
are online and voting for the same block to reach consensus. To maintain liveness
and avoid a chain halt, we need a supermajority (> 2/3) of Validators to participate
and use the same version of the software.
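As a toy illustration of the supermajority requirement (not protocol code; the voting-power figure is made up), the smallest amount of voting power that can commit a block is the smallest integer strictly greater than 2/3 of the total:

```shell
# Illustrative only: compute the minimum voting power needed to commit a block.
# CometBFT requires strictly MORE than 2/3 of total voting power.
TOTAL_POWER=90
QUORUM=$((TOTAL_POWER * 2 / 3 + 1))   # floor(2n/3) + 1
echo "quorum=$QUORUM"                 # quorum=61
```

If more than 1/3 of voting power is offline or running a different binary, no block can reach this threshold and the chain halts.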
Resolving halts during a network upgrade
If the halt is caused by a network upgrade, the solution can be as simple as
skipping the upgrade (i.e. unsafe-skip-upgrade) and creating a new (fixed) upgrade.
Read more about upgrade contingency plans.
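As a sketch, skipping a faulty upgrade uses the cosmos-sdk start flag --unsafe-skip-upgrades, which takes the height(s) of the upgrade(s) to skip (the height below is illustrative; every validator must pass the same heights to avoid a fork):

```bash
# Illustrative only: skip the upgrade scheduled at height 103.
# All validators must start with the same --unsafe-skip-upgrades heights.
poktrolld start --unsafe-skip-upgrades 103
```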
Manual binary replacement (preferred)
This is the preferred way of resolving consensus-breaking issues.
Significant side effect: this breaks the ability to sync from genesis without manual intervention.
For example, when a consensus-breaking issue occurs on a node that is syncing from the first block, node operators need
to manually replace the binary with the new one. There are efforts underway to mitigate this issue, including
configuration for cosmovisor
that could automate the process.
Since the chain is not moving, it is impossible to issue an automatic upgrade with an upgrade plan. Instead, we need social consensus to manually replace the binary and get the chain moving.
The steps to do so are:
- Prepare and verify a new binary that addresses the consensus-breaking issue.
- Reach out to the community and validators so they can upgrade the binary manually.
- Update the documentation to include the height range in which the binary needs to be replaced.
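For step 2, a minimal sketch of what each operator does, assuming a systemd-managed node (the service name, paths, and the name of the fixed binary are hypothetical placeholders):

```bash
# Hypothetical service name and paths; adjust to your deployment.
sudo systemctl stop poktrolld
sudo install -m 0755 ./poktrolld-fixed /usr/local/bin/poktrolld
poktrolld version   # verify the fixed release before restarting
sudo systemctl start poktrolld
```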
TODO_MAINNET(@okdas):
- For step 2: Investigate if the CometBFT rounds/steps need to be aligned as in Morse chain halts. See this ref.
- For step 3: Add cosmovisor documentation so it is configured to automatically replace the binary when syncing from genesis.
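For context on the cosmovisor TODO above, a minimal sketch of the environment cosmovisor reads before managing binary swaps (variable names are from cosmovisor's documentation; the values are illustrative, not official network settings):

```shell
# Illustrative cosmovisor environment; adjust paths to your deployment.
export DAEMON_NAME=poktrolld
export DAEMON_HOME=$HOME/.poktroll
# Restart the daemon automatically after an upgrade is applied:
export DAEMON_RESTART_AFTER_UPGRADE=true
# Auto-downloading binaries is convenient but trusts the upgrade metadata:
export DAEMON_ALLOW_DOWNLOAD_BINARIES=false
echo "$DAEMON_NAME"
```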
Rollback, fork and upgrade
These instructions are only relevant to Pocket Network's Shannon release.
We do not currently use x/gov
or on-chain voting for upgrades.
Instead, all participants in our DAO vote on upgrades off-chain, and the Foundation
executes transactions on their behalf.
This approach should be avoided unless absolutely necessary, and it requires more testing. In our tests, the full nodes kept propagating the existing blocks signed by the Validators, making it hard to roll back.
Performing a rollback is analogous to forking the network at an older height.
However, if necessary, the instructions to follow are:
- Prepare and verify a new binary that addresses the consensus-breaking issue.
- Create a release.
- Prepare an upgrade transaction to the new version.
- Disconnect the Validator set from the rest of the network 3 blocks prior to the height of the chain halt. For example:
  - Assume an issue at height 103.
  - Revert the validator set to height 100.
  - Submit an upgrade transaction at 101.
  - Upgrade the chain at height 102.
  - Avoid the issue at height 103.
- Ensure all validators rolled back to the same height and use the same snapshot (how to get a snapshot).
  - The snapshot should be imported into each Validator's data directory.
  - This is necessary to ensure data continuity and prevent forks.
- Isolate the validator set from full nodes (why this is necessary).
  - This is necessary to prevent full nodes from gossiping blocks that have been rolled back.
  - This may require using a firewall or a private network.
  - Validators should only be permitted to gossip blocks amongst themselves.
- Start the validator set and perform the upgrade. For example, reiterating the process above:
  - Start all Validators at height 100.
  - On block 101, submit the MsgSoftwareUpgrade transaction with a Plan.height set to 102.
  - x/upgrade will perform the upgrade in the EndBlocker of block 102.
    - The node will stop progressing, with an error indicating it is waiting for the upgrade to be performed.
    - Cosmovisor deployments automatically replace the binary.
    - Manual deployments will require a manual replacement at this point.
  - Start the node back up.
- Wait for the network to reach the height of the previous ledger (104+).
- Allow validators to open their network to full nodes again.
  - Note: full nodes will need to perform the rollback or use a snapshot as well.
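The height bookkeeping in the example above can be sketched as simple arithmetic relative to the halt height (heights mirror the worked example; substitute the real halt height):

```shell
# Illustrative height plan for the rollback-and-upgrade procedure above.
HALT_HEIGHT=103
ROLLBACK_HEIGHT=$((HALT_HEIGHT - 3))        # revert the validator set to 100
UPGRADE_TX_HEIGHT=$((ROLLBACK_HEIGHT + 1))  # submit MsgSoftwareUpgrade at 101
PLAN_HEIGHT=$((ROLLBACK_HEIGHT + 2))        # Plan.height = 102; x/upgrade runs in EndBlocker
echo "$ROLLBACK_HEIGHT $UPGRADE_TX_HEIGHT $PLAN_HEIGHT"
```

The key invariant is that the upgrade transaction lands after the rollback height but before the height where the issue occurred, so the fixed binary is live when block 103 is re-proposed.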
Troubleshooting
Data rollback - retrieving snapshot at a specific height (step 5)
There are two ways to get a snapshot from a prior height:
- Execute poktrolld rollback --hard repeatedly, until the command responds with the desired block number.
- Use a snapshot from below the halt height (e.g. 100) and start the node with the --halt-height=100 parameter so it only syncs up to that height and then gracefully shuts down. Add this argument to poktrolld start like this: poktrolld start --halt-height=100
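The first option can be scripted as a loop; the check on the command's output is an assumption about how the rollback command reports the new height and should be verified against your binary's actual output before relying on it:

```bash
# Sketch: repeat hard rollbacks until the target height is reached.
# Assumes the rollback command prints the height it rolled back to.
TARGET_HEIGHT=100
while :; do
  OUT=$(poktrolld rollback --hard)
  echo "$OUT"
  echo "$OUT" | grep -q "height $TARGET_HEIGHT" && break
done
```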
Validator Isolation - risks (step 6)
Having even one node with knowledge of the forking ledger can jeopardize the whole process. In particular, the following log errors are a sign of nodes syncing blocks from the wrong fork:
found conflicting vote from ourselves; did you unsafe_reset a validator?
conflicting votes from validator