Disaster recovery

This section assumes that a distributed Service Manager Service Portal Disaster Recovery (DR) cluster has already been installed.

Set up a Service Manager Service Portal Disaster Recovery (DR) cluster

  1. Make sure that Service Manager Service Portal is stopped on the Service Manager Service Portal nodes of the DR cluster.
  2. On the DR cluster's master DB node:
    1. [Optional] Make a backup of the /var/lib/pgsql/9.5/data directory.
    2. Delete the contents of the data directory:

      # rm -rf /var/lib/pgsql/9.5/data/*
    3. Set up replication from the Primary cluster master DB node to the DR cluster master DB node (the Primary cluster master must already accept replication connections from the repl user; see the prerequisites sketch after this procedure):

      # su - postgres
      # pg_basebackup --dbname="postgresql://repl:replpass@<primary-cluster-master-db>/" -D /var/lib/pgsql/9.5/data -P --xlog-method=stream
    4. Create recovery.conf under /var/lib/pgsql/9.5/data:

      # vi /var/lib/pgsql/9.5/data/recovery.conf
      standby_mode = 'on'
      primary_conninfo = 'host=<primary-cluster-master-db> user=repl password=replpass'
      restore_command = 'cp /var/lib/postgresql/9.5/archive/%f %p'
      recovery_target_timeline='latest'
      trigger_file = '/tmp/pgsql.trigger'
    5. Change the permissions:

      # chown postgres:postgres recovery.conf
    6. Restart postgres on both the DR master and slave nodes:

      # service postgresql-9.5 restart
    7. Verify the setup:
      1. On the DR cluster, make sure both DB nodes are on standby.

         

        a) Check the Master postgres:

           # sudo -u postgres psql -h <primary> -p 5432 -c "SELECT pg_is_in_recovery()"
            pg_is_in_recovery
           -------------------
            f

        b) Check the Master postgres using the DB-VIP:

           # sudo -u postgres psql -h <DB-VIP> -p 5432 -c "SELECT pg_is_in_recovery()"
            pg_is_in_recovery
           -------------------
            f

        c) Check the Slave postgres:

           # sudo -u postgres psql -h <standby> -p 5432 -c "SELECT pg_is_in_recovery()"
            pg_is_in_recovery
           -------------------
            t

      2. Verify that replication is occurring from the Primary cluster master to the DR cluster master (see the verification sketch after this procedure).

        Note Replication from the DR cluster master to the DR cluster slave is stopped in this mode. Once the DR site is enabled as the primary site, replication starts from the master to the slave.
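
The pg_basebackup command in step 3 above assumes that the Primary cluster master already accepts streaming replication connections from the repl user. If the base backup fails with a connection or authentication error, the following minimal sketch shows the standard PostgreSQL 9.5 prerequisites to check on the Primary cluster master DB node; the role name, password, and file paths match the placeholders used above, but the exact values in your environment may differ:

  # sudo -u postgres psql -c "\du repl"
  # grep replication /var/lib/pgsql/9.5/data/pg_hba.conf
  # grep -E "wal_level|max_wal_senders" /var/lib/pgsql/9.5/data/postgresql.conf

The repl role must have the Replication attribute, pg_hba.conf must contain a replication entry for the DR cluster master DB node (for example: host  replication  repl  <dr-cluster-master-db>/32  md5), and postgresql.conf must enable streaming replication (wal_level = hot_standby and a max_wal_senders value greater than 0).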
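
To verify that replication is flowing from the Primary cluster master to the DR cluster master (step 7.2), you can use the standard PostgreSQL 9.5 monitoring view and functions. A minimal sketch:

  On the Primary cluster master DB node:

    # sudo -u postgres psql -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"

  On the DR cluster master DB node:

    # sudo -u postgres psql -c "SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"

The DR cluster master should appear on the Primary cluster master with state 'streaming', and the receive and replay locations reported on the DR side should advance as changes are made on the Primary site.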

Switch Service Manager Service Portal to your Disaster Recovery cluster

  1. Make sure that all nodes on the Primary cluster are down.
  2. Rerun the Ansible playbook db.yml from the /opt/hp/propel/contrib/propel-distributed.<version> directory on the DR cluster LB node:

    # ansible-playbook db.yml -c paramiko --ask-become-pass -u propel 2>&1 | tee recovery.out
  3. Verify the configuration:
    1. On the DR cluster, verify the role of each DB node (after the switch, the master should no longer be in recovery, while the slave remains on standby):

       

      a) Check the Master postgres:

         # sudo -u postgres psql -h <primary> -p 5432 -c "SELECT pg_is_in_recovery()"
          pg_is_in_recovery
         -------------------
          f

      b) Check the Master postgres using the DB-VIP:

         # sudo -u postgres psql -h <DB-VIP> -p 5432 -c "SELECT pg_is_in_recovery()"
          pg_is_in_recovery
         -------------------
          f

      c) Check the Slave postgres:

         # sudo -u postgres psql -h <standby> -p 5432 -c "SELECT pg_is_in_recovery()"
          pg_is_in_recovery
         -------------------
          t

    2. Check that replication is now occurring from the DR cluster master to the DR cluster slave (see the sketch at the end of this procedure).
    3. Check pgpool (show pool_nodes):

      From the primary DB node of the DR cluster, confirm that pgpool is attached to both the primary and standby servers:

      # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
      node_id | hostname  | port | status | lb_weight | role
      --------+-----------+------+--------+-----------+---------
      0       | <primary> | 5432 | 2      | 0.500000  | primary
      1       | <standby> | 5432 | 2      | 0.500000  | standby

      The “role” column should contain the appropriate primary/standby value, and the “status” column should be “2” for both nodes.

  4. Start the Service Manager Service Portal nodes and restart Nginx on the DR cluster (see the command sketch at the end of this procedure).
  5. Log in to the DR user interface and check whether the data created on the Primary site is present on the DR site.
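
After the switch, the promoted DR cluster master should itself be streaming to the DR cluster slave (step 3.2). A minimal check using the standard pg_stat_replication view, run on the DR cluster master DB node:

  # sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"

A single row showing the DR cluster slave with state 'streaming' indicates that replication inside the DR site has resumed.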
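
The exact commands for step 4 depend on how the Service Manager Service Portal services and Nginx are managed in your installation; the following is an assumption-based sketch only, so verify the service names against your own deployment. On each Service Manager Service Portal node of the DR cluster:

  # propel start            # assumption: a Propel-based installation that provides the propel service utility
  # service nginx restart   # assumption: Nginx runs as a standard system service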