Failed upgrade contingency plan
Contingency plans
There's always a chance an upgrade will fail due to a variety of unknown unknowns.
This document is intended to help you recover with minimal downtime.
- Option 0: The bug is discovered before the upgrade height is reached
- Option 1: The migration didn't start (i.e. migration halt)
- Option 2: The migration is stuck (i.e. incomplete/partial migration)
- Option 3: The migration succeeded but the network is stuck (i.e. migration had a bug)
- Failed Upgrade Checklist
Option 0: The bug is discovered before the upgrade height is reached
tl;dr cancel the upgrade plan!
See the instructions on how to do that here.
Option 1: The migration didn't start (i.e. migration halt)
tl;dr This is unlikely to happen.
Possible causes: the name of the upgrade handler differs from the one specified in the upgrade plan, or the binary suggested by the upgrade plan is wrong.
If the nodes on the network stopped at the upgrade height and the migration has not started yet (i.e. there are no logs indicating the upgrade handler and store migrations are being executed), we MUST gather social consensus to restart validators with the --unsafe-skip-upgrade=$upgradeHeightNumber flag.
This will skip the upgrade process, allowing the chain to continue and the protocol team to plan another release.
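As a minimal sketch of the restart step, the helper below validates the upgrade height before building the flag, so a typo cannot silently produce a bad restart command. The function name `skip_upgrade_flag` and the example height are hypothetical. The flag spelling is taken from this document, so confirm it against `pocketd start --help` for your binary.

```shell
#!/bin/sh
# Hypothetical helper: print the skip flag for a given upgrade height.
# Rejects anything that is not a positive integer without leading zeros.
skip_upgrade_flag() {
  case "$1" in
    ''|*[!0-9]*|0*)
      echo "invalid upgrade height: '$1'" >&2
      return 1
      ;;
  esac
  printf '%s\n' "--unsafe-skip-upgrade=$1"
}

# Usage (assumed height; replace with the height from the failed plan):
#   pocketd start "$(skip_upgrade_flag 12345)" # ... the rest of the arguments
```

Validating the height first matters because every validator has to restart with the same value for the chain to continue.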
--unsafe-skip-upgrade simply skips the upgrade handler and store migrations.
The chain continues as if the upgrade plan was never set.
The upgrade needs to be fixed, and then a new plan needs to be submitted to the network.
--unsafe-skip-upgrade needs to be documented in the list of upgrades and added to the scripts so the next time somebody tries to sync the network from genesis, they will automatically skip the failed upgrade.
TODO_IMPROVE(@okdas): Provide more documentation here and details on how cosmovisor UX can simplify this.
Option 2: The migration is stuck (i.e. incomplete/partial migration)
tl;dr Requires social consensus and protocol team support to issue a new upgrade.
If the migration is stuck, there's always a chance the upgrade handler was executed onchain as scheduled, but the migration didn't complete.
In such a case, we need:
- All full nodes and validators: Roll back to the backup. cosmovisor automatically takes a snapshot prior to the upgrade when UNSAFE_SKIP_BACKUP is set to false (the default recommended value; more information).
- All full nodes and validators: Skip the upgrade. Add the --unsafe-skip-upgrade=$upgradeHeightNumber argument to the pocketd start command like so:
  pocketd start --unsafe-skip-upgrade=$upgradeHeightNumber # ... the rest of the arguments
- Protocol team: Resolve the issue with the upgrade and schedule a new plan. The upgrade needs to be fixed, and then a new plan needs to be submitted to the network.
- Protocol team: Document the failed upgrade.
  - Document and add --unsafe-skip-upgrade=$upgradeHeightNumber to the scripts (such as docker-compose and cosmovisor installer).
  - The next time somebody tries to sync the network from genesis, they will automatically skip the failed upgrade.
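The rollback step above can be sketched as a small shell helper. This is an illustration under assumptions, not the project's tooling: the function name `restore_backup` is hypothetical, and the backup location depends on your cosmovisor configuration, so it is passed in as a parameter rather than guessed. Stop the node before running it.

```shell
#!/bin/sh
# Hypothetical helper: swap the node's data directory for a backup copy.
# $1 = node home directory (contains "data"), $2 = backup directory to restore.
restore_backup() {
  home_dir="$1"
  backup_dir="$2"
  if [ ! -d "$backup_dir" ]; then
    echo "backup not found: $backup_dir" >&2
    return 1
  fi
  # Keep the failed state for later debugging instead of deleting it.
  if [ -d "$home_dir/data" ]; then
    mv "$home_dir/data" "$home_dir/data-failed-$(date +%s)" || return 1
  fi
  # "data" no longer exists, so cp creates it as a copy of the backup.
  cp -R "$backup_dir" "$home_dir/data"
}

# Usage (paths are placeholders):
#   restore_backup "$HOME/.pocket" "/path/to/cosmovisor/backup"
```

Moving the partially migrated state aside rather than removing it leaves evidence for the protocol team to diagnose the failed migration.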
Option 3: The migration succeeded but the network is stuck (i.e. migration had a bug)
tl;dr This should be treated as a consensus or non-determinism bug that is unrelated to the upgrade.
See Recovery From Chain Halt for more information on how to handle such issues.
Failed Upgrade Checklist
The following is a list of documentation & scripts that need to be updated on a failed upgrade:
- The upgrade list should reflect a failed upgrade and provide the range of heights that were served by each version.
- Systemd service should include the --unsafe-skip-upgrade=$upgradeHeightNumber argument in its start command here.
- The Helm chart should point to the latest version; consider exposing via a values.yaml file.
- The docker-compose examples should point to the latest version.
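For the systemd item, one way to apply the change is sketched below. The function name `add_skip_flag` is hypothetical, and the sketch assumes the unit file has a single-line `ExecStart=` entry; a start command split across lines with backslashes needs a manual edit instead.

```shell
#!/bin/sh
# Hypothetical helper: append the skip flag to the ExecStart line of a unit file.
# $1 = path to the unit file, $2 = failed upgrade height.
# Assumes ExecStart is written on a single line.
add_skip_flag() {
  unit_file="$1"
  height="$2"
  case "$height" in
    ''|*[!0-9]*)
      echo "invalid height: '$height'" >&2
      return 1
      ;;
  esac
  flag="--unsafe-skip-upgrade=$height"
  # Idempotent: do nothing if this exact flag is already present.
  if grep -q -E -e "$flag( |\$)" "$unit_file"; then
    return 0
  fi
  # Append the flag to the end of the ExecStart line; a .bak copy is kept.
  sed -i.bak "/^ExecStart=/ s|\$| $flag|" "$unit_file"
}

# Usage (unit path is a placeholder):
#   add_skip_flag /etc/systemd/system/your-node.service 12345
```

After editing a unit file, run `systemctl daemon-reload` and restart the service so systemd picks up the new start command.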