Failover and recovery

The following describes typical failover scenarios and their recovery actions.

Pgpool stops on the standby database server

Verify that the standby database server is healthy:

  1. From the primary database server:

    # sudo -u postgres psql -h <standby> -p 5432 -c 'select pg_is_in_recovery()' 
    
    pg_is_in_recovery
    -------------------
    t

    (1 row)

  2. If the query returns "t" as shown above, you only need to start pgpool on the standby database server:

    # systemctl start pgpool.service
  3. Confirm pgpool is running properly:

    # systemctl status pgpool.service

Pgpool stops on the primary database server

Verify that the primary database server is healthy:

  1. From the standby database server:

    # sudo -u postgres psql -h <primary> -p 5432 -c 'select pg_is_in_recovery()'
    pg_is_in_recovery
    -------------------
    f

    (1 row)

  2. If the query returns "f" as shown above, you only need to start pgpool on the primary database server:

    # systemctl start pgpool.service
  3. Confirm pgpool is running properly:

    # systemctl status pgpool.service
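
Both pgpool procedures above follow the same pattern: query pg_is_in_recovery() to confirm the database role, and start pgpool only if the expected value is returned. The following is a minimal sketch of that check as a single script, intended to run as root on the server where pgpool stopped, with that server's hostname and the expected value ("t" for a standby, "f" for a primary) passed as arguments. The script name and arguments are illustrative, not part of the product.

    #!/bin/bash
    # check_and_start_pgpool.sh -- illustrative sketch only.
    # Usage: check_and_start_pgpool.sh <db-host> <expected>
    #   <expected> is "t" when checking a standby, "f" when checking a primary.
    DB_HOST="$1"
    EXPECTED="$2"

    # -tA prints only the bare value (t or f), with no headers or alignment.
    STATE=$(sudo -u postgres psql -h "$DB_HOST" -p 5432 -tA -c 'select pg_is_in_recovery()')

    if [ "$STATE" = "$EXPECTED" ]; then
        echo "$DB_HOST reports pg_is_in_recovery=$STATE as expected; starting pgpool."
        systemctl start pgpool.service
        systemctl status pgpool.service
    else
        echo "Unexpected recovery state '$STATE' on $DB_HOST; pgpool was not started." >&2
        exit 1
    fi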

PostgreSQL stops on the standby database server

  1. First, perform a backup of the primary database server.
  2. Address the root cause of PostgreSQL's stoppage.
  3. From the primary server, run the following command to confirm that connectivity exists with the standby server:

    # ssh <standby> service pgpool status

    <- should see normal pgpool status output ->

  4. From the standby server, restart PostgreSQL:

    # systemctl restart postgresql-9.5.service
  5. From the primary server, confirm that pgpool has attached to both the primary and standby servers:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"

    The command should return something like this:

    node_id | hostname | port | status | lb_weight | role
    ---------+-----------+------+--------+-----------+---------
    0 | <primary> | 5432 | 2 | 0.500000 | primary
    1 | <standby> | 5432 | 2 | 0.500000 | standby

    The "role" column should contain the appropriate primary/standy value and the status column should be "2" for both nodes.

  6. Confirm replication is active. To do this, run the following command from the primary server:

    # sudo -u postgres psql -h <primary> -p 5432 -c 'select sent_location, replay_location from pg_stat_replication'

    The command should return something like this:

    sent_location | replay_location
    ---------------+-----------------
    7D/90004B0 | 7D/9000478

    (1 row)

  7. Wait 60 seconds, and run the same command. The results should differ.
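
Steps 6 and 7 can also be combined into one scripted check that samples the replication positions twice and compares the readings. This is an illustrative sketch only; replace <primary> with the real primary hostname and run it as root on the primary server.

    # replication_progress.sh -- illustrative sketch only.
    PRIMARY="<primary>"   # replace with the primary server hostname
    QUERY='select sent_location, replay_location from pg_stat_replication'

    FIRST=$(sudo -u postgres psql -h "$PRIMARY" -p 5432 -tA -c "$QUERY")
    sleep 60
    SECOND=$(sudo -u postgres psql -h "$PRIMARY" -p 5432 -tA -c "$QUERY")

    echo "first reading:  $FIRST"
    echo "second reading: $SECOND"

    if [ -n "$FIRST" ] && [ "$FIRST" != "$SECOND" ]; then
        echo "Replication positions are advancing."
    else
        echo "WARNING: replication positions did not change, or no standby is connected." >&2
        exit 1
    fi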

PostgreSQL stops on the primary database server

Recovery will involve a Service Manager Service Portal service outage. After recovery, the primary and standby databases will swap roles.

Note In the following steps, "new-primary" refers to the original standby server, which has been promoted. Similarly, "new-standby" refers to the original primary server, which has stopped.

  1. On the new-standby, stop pgpool and confirm PostgreSQL is stopped:

    # systemctl stop pgpool.service
    # systemctl status postgresql-9.5.service
  2. Stop Service Manager Service Portal and OO on all Service Manager Service Portal nodes:

    # propel stop
    # systemctl stop central.service
  3. Perform a backup of the new-primary database server.
  4. On the new-primary, restart pgpool and PostgreSQL:

    # systemctl restart pgpool.service
    # systemctl restart postgresql-9.5.service
  5. Confirm that the new-primary has been promoted:

    # sudo -u postgres psql -h <new-primary> -p 5432 -c 'select pg_is_in_recovery()'
    pg_is_in_recovery
    -------------------
    f

    (1 row)

    The result should be "f", as shown above.

  6. Run the following command to check the pgpool node status:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
    node_id | hostname | port | status | lb_weight | role
    ---------+---------------+------+--------+-----------+---------
    0 | <new-standby> | 5432 | 3 | 0.500000 | standby
    1 | <new-primary> | 5432 | 2 | 0.500000 | primary

    The "role" values should reflect the swapped roles. The new-primary should have a status of "2" (up), and the new-standby should have a status of "3" (down).

  7. On the load-balancer node, create a new inventory file with the primary and standby servers reversed. For example, if the original primary server was db1.hpe.net and the original standby server was db2.hpe.net, your new inventory would have this content:

    [postgres]
    db2.hpe.net ansible_ssh_user=root
    db1.hpe.net ansible_ssh_user=root
    [db_master_nodes]
    db2.hpe.net ansible_ssh_user=root
    [db_slave_nodes]
    db1.hpe.net ansible_ssh_user=root
  8. For this example, we assume the new inventory file is /opt/hp/propel/contrib/propel-distributed.<version>/inventories/recovery_cluster. Then, from the directory /opt/hp/propel/contrib/propel-distributed.<version> on the load balancer node, rerun the Ansible playbook db.yml with the new inventory:

    # ansible-playbook -i inventories/recovery_cluster db.yml -c paramiko --ask-become-pass -u propel 2>&1 | tee recovery.out
  9. Verify that the new-primary and new-standby are running. From the load balancer node, run the following command:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
    node_id | hostname | port | status | lb_weight | role
    ---------+---------------+------+--------+-----------+---------
    0 | <new-primary> | 5432 | 2 | 0.500000 | primary
    1 | <new-standby> | 5432 | 2 | 0.500000 | standby

    The "role" column should reflect the new server roles, and the "status" column should be "2" for both nodes. (A scripted version of this check is sketched after this procedure.)

  10. Confirm replication is active:

    # sudo -u postgres psql -h <new-primary> -p 5432 -c 'select sent_location, replay_location from pg_stat_replication'
    sent_location | replay_location
    ---------------+-----------------
    7D/90004B0 | 7D/9000478
    (1 row)
  11. Wait 60 seconds, and run the same command. The results should differ.
  12. On each Service Manager Service Portal node, start Service Manager Service Portal and OO:

    # propel start
    # systemctl start central.service
  13. Verify that the mpp service has initialized properly and restart it if necessary:

    # service mpp status
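
Steps 6 and 9 above inspect the show pool_nodes output by eye. The sketch below performs the same check and exits non-zero if either node is not attached. It is illustrative only; it assumes the status value is the fourth column of the output, as in the examples above, and <DB-VIP> must be replaced with the real virtual IP or hostname.

    # check_pool_nodes.sh -- illustrative sketch only.
    DB_VIP="<DB-VIP>"   # replace with the database virtual IP or hostname

    # -tA prints one pipe-separated row per node, with no headers.
    NODES=$(sudo -u postgres psql -h "$DB_VIP" -p 9999 -tA -c "show pool_nodes")
    echo "$NODES"

    # Report any node whose status field (column 4) is not 2 (up and attached).
    BAD=$(echo "$NODES" | awk -F'|' '$4 != 2 {print $2 " status=" $4}')
    if [ -n "$BAD" ]; then
        echo "WARNING: the following nodes are not attached:" >&2
        echo "$BAD" >&2
        exit 1
    fi
    echo "Both nodes are attached (status 2)."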

Standby server down or unavailable

After addressing the root cause of the server outage, see the "PostgreSQL stops on the standby database server" failover scenario.

Note If the server exited abruptly, pgpool may not initialize properly. See troubleshooting note "Pgpool not attaching to nodes".

Primary server down or unavailable

After addressing the root cause of the server outage, see the "PostgreSQL stops on the primary database server" failover scenario.

Note If the server exited abruptly, pgpool may not initialize properly. See troubleshooting note "Pgpool not attaching to nodes".

Service Manager Service Portal node down or unavailable

  1. After addressing the root cause of the server outage, restart Service Manager Service Portal and OO:

    # propel stop
    # propel start
    # systemctl restart central.service
  2. Verify that the Portal service has initialized properly and restart it if necessary:

    # systemctl status portal
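
If you want to script this verification, the following sketch restarts the portal service only when systemd reports it as not active. It is a sketch only, assuming the portal unit is managed by systemd as the status command above implies.

    # Restart the portal service only if it is not currently active (illustrative sketch).
    if ! systemctl is-active --quiet portal; then
        echo "portal service is not active; restarting it."
        systemctl restart portal
        systemctl status portal
    else
        echo "portal service is already active."
    fi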

Load balancer down or unavailable

  1. After addressing the root cause of the server outage, restart nginx:

    # service nginx restart

    If the backup image of the load balancer contains a Service Manager Service Portal installation, it may be necessary to stop Service Manager Service Portal and restart OO:

    # propel stop
    # systemctl restart central.service
  2. Verify that no node processes are running:

    # ps -ef | grep node
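
The ps check above is visual only. A minimal sketch of the same verification, which exits non-zero if any node processes remain (the assumption being that none should run on a dedicated load balancer), is shown below; pgrep is used instead of ps piped to grep so the check does not match itself.

    # Fail if any node processes are still running on the load balancer (illustrative sketch).
    if pgrep -fl node; then
        echo "WARNING: node processes are still running on the load balancer." >&2
        exit 1
    else
        echo "No node processes found."
    fi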