A pool failed on one of the nodes
Situation: a network outage or another problem caused a pool to fail on one node.
Consequences: the pool migrated to the other node.
Recovery procedure:
- eliminate the cause of the problem
- mark the pool as usable on the node where it failed
The pool stays on the node where it is now running and does not move back automatically, even after the cause of the migration has been eliminated. Marking the pool as usable on the node where it failed tells the cluster that the pool can be started there if needed. You can also move it back manually by clicking the Move pool button on that pool.
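The behavior described above (failover without automatic failback) can be pictured with a small model. This is only an illustrative sketch, not the cluster's actual implementation: the Pool class, the node names node-a and node-b, and the rule that a manual move requires the target to be marked usable are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    running_on: str                               # node currently serving the pool
    usable_on: set = field(default_factory=set)   # nodes where the pool may start


def fail_pool(pool, failed_node, other_node):
    """The pool fails on failed_node and migrates to other_node."""
    pool.usable_on.discard(failed_node)
    pool.running_on = other_node


def mark_usable(pool, node):
    """Marking a pool usable only permits a start there; it does not move the pool."""
    pool.usable_on.add(node)


def move_pool(pool, target):
    """Manual move (the Move pool button). Assumption: the target must be usable."""
    if target not in pool.usable_on:
        raise ValueError(f"{pool.name} is not marked usable on {target}")
    pool.running_on = target


pool = Pool("Pool-0", running_on="node-a", usable_on={"node-a", "node-b"})
fail_pool(pool, "node-a", "node-b")      # pool migrates to node-b
mark_usable(pool, "node-a")              # cause fixed, pool marked usable again
location_after_mark = pool.running_on    # the pool has not moved back on its own
move_pool(pool, "node-a")                # explicit manual move
```

In this model, marking the pool usable changes only where it is allowed to start; the location changes only on a failure or an explicit move.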
A pool failed on both nodes
Situation: a pool fails on both nodes
Consequences: the pool is marked as failed and it appears in the Inactive cluster pools box.
Recovery procedure:
- eliminate the cause of the problem
- mark the pool as usable on the node where you want it to start
- mark the pool as usable on the other node
The pool starts on the node where you first marked it as usable.
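The ordering rule in this procedure can be sketched as follows. This is a hypothetical model for illustration only: the FailedPool class and the node names are assumptions, and the real cluster logic is not shown in this document.

```python
class FailedPool:
    """Toy model of a pool that has failed on both nodes."""

    def __init__(self, name):
        self.name = name
        self.usable_on = []      # nodes, in the order they were marked usable
        self.running_on = None   # None while the pool is inactive

    def mark_usable(self, node):
        if node not in self.usable_on:
            self.usable_on.append(node)
        # The pool starts on the first node marked usable and stays there;
        # marking the second node only makes it available for later use.
        if self.running_on is None:
            self.running_on = node


pool = FailedPool("Pool-0")
pool.mark_usable("node-b")   # the node where we want the pool to start
pool.mark_usable("node-a")   # the other node
```

Here the pool ends up running on node-b because it was marked first, which is why the order of the two "mark as usable" steps matters.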
A node lost power
Situation: a power outage occurred on one of the nodes.
Consequences: The pools migrated to the second node and continue to serve data. The first node is marked as Offline.
Recovery procedure:
- restore power to the node
- turn it on
- wait for the node to rejoin the cluster
- move the pools back to the node (click Move pool for each pool now running on the second node)
All nodes lost power
Situation: a power outage occurred on both nodes.
Consequences: The pools are inaccessible.
Recovery procedure:
- restore power to one node
- turn it on
- go to the HA menu and wait for the cluster to restart
- wait for all the pools to start
- restore power to the second node
- wait for the node to start
- move pools to the second node
A node was replaced with another machine
Situation: a fatal hardware error occurred on a cluster node.
Consequences: You have to replace the node. The pools running on that node migrated to the other node.
Recovery procedure:
- remove the failed node from the cluster
- install and configure the new node
- go to the HA menu and join the node to the cluster
One node panicked
Situation: the connection between the nodes fails and the connection with the client network also fails.
Consequences: A race starts in which both nodes try to take over all the pools. The node that acquires them wins the race; the other node freezes with a kernel panic.
Recovery procedure:
- fix connection between the nodes
- fix connection with the client network
- wait for the panicked node to recover
Heartbeat connection was lost
Situation: the heartbeat connection is lost between the nodes.
Consequences: the nodes cannot “see” and communicate with each other. Both nodes will try to start the pools, but only one of them will succeed. As a result, you will have a random number of pools running on one node and the rest of them on the other one. In the HA menu on both nodes, you will see that the other node is marked as Offline.
Recovery procedure:
- Restore the Heartbeat connection. The remote node will return to the Online state.
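The pool split described in the consequences above can be pictured with a short simulation. It is a sketch under stated assumptions, not the cluster's real arbitration mechanism: the function name split_on_heartbeat_loss, the node names, and the use of a random choice to stand in for "whichever node starts the pool first" are all hypothetical.

```python
import random


def split_on_heartbeat_loss(pools, nodes=("node-a", "node-b"), rng=None):
    """Assign each pool to exactly one node, chosen unpredictably.

    Both nodes try to start every pool, but only one attempt per pool
    succeeds. The random choice below stands in for that outcome.
    """
    rng = rng or random.Random()
    placement = {node: [] for node in nodes}
    for pool in pools:
        winner = rng.choice(nodes)   # the node whose start attempt succeeded
        placement[winner].append(pool)
    return placement


pools = ["Pool-0", "Pool-1", "Pool-2", "Pool-3"]
placement = split_on_heartbeat_loss(pools, rng=random.Random(0))
for node, started in placement.items():
    print(node, started)
```

Whatever the split turns out to be, each pool runs on exactly one node, which matches the description that only one of the two start attempts succeeds.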