
Troubleshoot distributed Service Manager Service Portal clustering

This section provides troubleshooting hints and tips that can help you set up a distributed Service Manager Service Portal cluster.

NGINX 504 Gateway Time-out

When this error occurs, do the following:

  1. Check if pgpool is running and listening:

    # systemctl status pgpool
    # sudo -u postgres psql -h <DB VIP> -p 9999 -l
  2. Check if IdM on node 1 and node 2 can connect to the DB (see the logs in /var/log/propel/idm). If it cannot, restart the Launchpad and IdM. If IdM can connect to the DB, try the LB connection again. If that works, restart all other services on all Service Manager Service Portal nodes, and then test the LB connection again. A script that sketches these checks follows this list.
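
The following script sketches these checks in one place. It assumes the default ports from this guide and the IdM log location /var/log/propel/idm; pass your own DB VIP as the first argument.

#!/bin/bash
# Sketch: quick checks for an NGINX 504 in a distributed Service Manager Service Portal cluster.
DB_VIP="${1:?usage: $0 <DB VIP>}"

# Is pgpool running and listening on the VIP?
systemctl status pgpool --no-pager
sudo -u postgres psql -h "$DB_VIP" -p 9999 -l

# Show the most recent error lines from the IdM logs (database connection failures appear here).
grep -ri "error" /var/log/propel/idm | tail -n 20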

Pgpool not starting

Make sure that pgpool version 3.4.7 is installed. This is the version that HPE validated in the distributed Service Manager Service Portal configuration.
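
To check which pgpool version is installed on a database node, you can query the RPM database (a simple check; the package name corresponds to the RPM used in this guide):

# rpm -qa | grep -i pgpool

The output should include pgpool-II-pg94-3.4.7.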

You can update propel-distributed/roles/pgpool/tasks/main.yml for the pgpool role to force installation of a specific version:

- name: roles:pgpool Install older pgPool
  yum: name=http://www.pgpool.net/yum/rpms/3.4/redhat/rhel-7-x86_64/pgpool-II-pg94-3.4.7-1pgdg.rhel7.x86_64.rpm state=installed
  ignore_errors: yes

Pgpool not attaching to nodes

When both databases are running, the “show pool_nodes” query should show a status of “2” for both nodes. In the following example, the standby node reports a status of “3”, which means that it is not attached:

# sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
node_id | hostname | port | status | lb_weight | role
---------+-----------+------+--------+-----------+---------
0 | <PRIMARY> | 5432 | 2 | 0.500000 | primary
1 | <STANDBY> | 5432 | 3 | 0.500000 | standby

To obtain the expected result, try the following:

  1. On the primary server, restart pgpool:

    # service pgpool restart

    On the standby server, restart pgpool:

    # service pgpool restart

    Check the result:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
  2. If the status is still incorrect, perform the following steps:

    On the standby server, stop pgpool:

    # service pgpool stop

    On the primary server, stop pgpool:

    # service pgpool stop

    On the primary server, make sure that the eth0:0 virtual interface is down:

    # ifdown eth0:0

    On the primary server, remove any stale pgpool socket files (these are left behind if pgpool did not exit gracefully):

    # rm -i /tmp/.s.PGSQL.9898
    # rm -i /var/run/postgresql/.s.PGSQL.9999

    On the primary server, restart pgpool:

    # service pgpool start

    Check the result:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"

    If the status is “2” for both nodes, start pgpool on the standby server:

    # service pgpool start
  3. If the status is still incorrect, perform the following steps:

    On the standby server, stop pgpool:

    # service pgpool stop

    Confirm the status of the primary server. The result should be “f”:

    # sudo -u postgres psql -h <Primary> -p 5432 -c 'select pg_is_in_recovery()'
    pg_is_in_recovery
    -------------------
    f

    (1 row)

    Confirm the status of the standby server. The result should be “t”:

    # sudo -u postgres psql -h <Standby> -p 5432 -c 'select pg_is_in_recovery()'
    pg_is_in_recovery
    -------------------
    t

    (1 row)

    If these results are incorrect, the issue is more likely with the PostgreSQL replication configuration than with pgpool. Otherwise, perform these steps:

    On the primary server, run the following commands, using the node_id that reports a status of “3” (a script that wraps this sequence is sketched after this procedure):

    # /usr/pgpool-9.4/bin/pcp_detach_node -U pgpool -h localhost -p 9898 -W -n <node_id>
    Password:
    # /usr/pgpool-9.4/bin/pcp_attach_node -U pgpool -h localhost -p 9898 -W -n <node_id>
    Password:

    By default, the password is pgpool.

    Wait 60 seconds and then check the result:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
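
The following script sketches the detach/attach sequence above. It assumes that it runs on the primary server, that the pgpool PCP tools are installed under /usr/pgpool-9.4/bin as shown above, and that DB_VIP is set to the database virtual IP; you are prompted for the PCP password (pgpool by default) at each step.

#!/bin/bash
# Sketch: reattach a pgpool backend node that reports a status of "3".
NODE_ID="${1:?usage: $0 <node_id>}"
DB_VIP="${DB_VIP:?set DB_VIP to the database virtual IP}"
PCP=/usr/pgpool-9.4/bin

# Detach and then reattach the node (each command prompts for the PCP password).
"$PCP/pcp_detach_node" -U pgpool -h localhost -p 9898 -W -n "$NODE_ID"
"$PCP/pcp_attach_node" -U pgpool -h localhost -p 9898 -W -n "$NODE_ID"

# Give pgpool time to run its health checks, then verify the node status.
sleep 60
sudo -u postgres psql -h "$DB_VIP" -p 9999 -c "show pool_nodes"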

PostgreSQL queries on VIP fail

When one or both databases are up and pgpool is running, this error should not occur:

# sudo -u postgres psql -h <DB-VIP> -p 9999 -c 'SELECT now()'
psql: server closed the connection unexpectedly
This probably means the server terminated

To fix the issue, follow the steps in Pgpool not attaching to nodes.

“show pool_nodes” shows both databases as standby

When one or both databases are up and pgpool is running, both nodes should not be reported with the standby role, as in the following output:

# sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
node_id | hostname | port | status | lb_weight | role
---------+-------------+------+--------+-----------+---------
0 | <PRIMARY> | 5432 | 2 | 0.500000 | standby
1 | <STANDBY> | 5432 | 2 | 0.500000 | standby

To fix the issue, follow the steps in Pgpool not attaching to nodes.

Load Balancer node information

nginx logs: /var/log/nginx

nginx configuration: /etc/nginx/conf.d/virtual.conf

Command to restart nginx: service nginx restart
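
Before restarting, you can check the configuration syntax with the standard NGINX test command:

# nginx -t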

Database node information

The following section contains information about a DB node.

How to change PostgreSQL to listen on all interfaces

  1. Edit the pg_hba.conf file:

    # su - postgres
    # vi /var/lib/pgsql/9.5/data/pg_hba.conf

    host all all 0.0.0.0/0 trust

  2. Edit the postgresql.conf file:

    # vi /var/lib/pgsql/9.5/data/postgresql.conf

    listen_addresses = '*'

  3. Restart PostgreSQL:

    # service postgresql-9.5 restart
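
After the restart, you can confirm that PostgreSQL is listening on all interfaces (assuming the default port 5432):

# netstat -an | grep 5432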

DB Log locations:

/var/lib/pgsql/9.5/data/pg_log
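
To follow the most recently written log file in that directory:

# tail -f /var/lib/pgsql/9.5/data/pg_log/$(ls -t /var/lib/pgsql/9.5/data/pg_log | head -1)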

DB restart:

# service postgresql-9.5 restart

DB not responding:

If PostgreSQL runs out of disk space and does not respond, see:

http://blog.endpoint.com/2014/09/pgxlog-disk-space-problem-on-postgres.html
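
To see whether the WAL directory is consuming the space (assuming the default data directory shown above; in PostgreSQL 9.5 the WAL directory is pg_xlog):

# df -h /var/lib/pgsql
# du -sh /var/lib/pgsql/9.5/data/pg_xlog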

RabbitMQ commands

The following section provides some useful commands for RabbitMQ.

Broker status

# rabbitmqctl status

SX configuration file for RabbitMQ

/opt/hp/propel/sx/WEB-INF/classes/config/infrastructure.json
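
To quickly see which broker hosts SX is configured to use, you can grep this file (the exact key names depend on your installation):

# grep -i rabbit /opt/hp/propel/sx/WEB-INF/classes/config/infrastructure.json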

Check whether RabbitMQ is running correctly

  • Port 5671 is used by the RabbitMQ broker:

    # netstat -an | grep 5671

  • Port 25672 is used by RabbitMQ to manage clustering:

    # netstat -an | grep 25672
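
  • To check cluster membership in addition to the listening ports, you can use the standard RabbitMQ command:

    # rabbitmqctl cluster_status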

RabbitMQ failed to start on a node

BOOT FAILED - Timeout contacting cluster nodes: [rabbit@awha22p4].

BACKGROUND - This cluster node was shut down while other nodes were still running.

To avoid losing data, you should start the other nodes first, then start this one. To force this node to start, first invoke "rabbitmqctl force_boot". If you do so, any changes made on other cluster nodes after this one was shut down may be lost.

DIAGNOSTICS - attempted to contact: [rabbit@awha22p4]

If you see the type of error described above, run the following commands:

# rabbitmqctl force_boot
# rabbitmqctl start_app