Troubleshoot distributed Service Manager Service Portal clustering

This section provides troubleshooting hints and tips that can help you set up a distributed Service Manager Service Portal cluster.

Application node down

When a Service Manager Service Portal application node is down, failover works, but some services cannot start during the recovery steps because they cannot connect to PostgreSQL.

Solution: Check the /etc/sysconfig/iptables file; some rules in the nat table may be missing. Add these rules to the file (you can copy them from another Service Manager Service Portal node that functions well), restart iptables, and then restart Service Manager Service Portal.
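
For example, a minimal sketch of this procedure (iptables-save is a standard command; the exact nat rules that need to be copied depend on your environment):

On a healthy node, list the current nat table rules for reference:

# iptables-save -t nat

On the failing node, add the missing rules to /etc/sysconfig/iptables, and then restart iptables:

# systemctl restart iptables

Finally, restart the Service Manager Service Portal services.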

DB master down

When the DB master node is down, check the DB node status on the DB slave host:

sudo -u postgres psql -h <vip> -p 9999 -c "show pool_nodes"

If the roles of both nodes are shown as standby, you cannot log in to Service Manager Service Portal.

Solution: Stop the postgresql-9.5 and pgpool services on both DB nodes. Start postgresql-9.5 and pgpool on the DB master first, and then start these two services on the DB slave. Check the DB node status and make sure that the roles and status of the DB nodes are correct, and then restart nginx and all Service Manager Service Portal services.
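
A minimal sketch of this sequence, using the service names shown elsewhere in this section:

On both DB nodes:

# service pgpool stop
# service postgresql-9.5 stop

On the DB master first, and then on the DB slave:

# service postgresql-9.5 start
# service pgpool start

Check the node status, and then restart the front end:

# sudo -u postgres psql -h <vip> -p 9999 -c "show pool_nodes"
# service nginx restart

Then restart all Service Manager Service Portal services.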

Failover does not work well

Log in to both DB nodes as the ‘propel’ user and check whether the files ‘id_rsa’ and ‘id_rsa.pub’ exist in ~/.ssh. If they do not, run the following commands on each node:

ssh-keygen -t rsa -f ~/.ssh/id_rsa
ssh-copy-id propel@<Another DB host>
ssh propel@<Another DB host>

NGINX 504 Gateway Time-out

When this error occurs, do the following:

  1. Check if pgpool is running and listening:

    # systemctl status pgpool
    # sudo -u postgres psql -h <DB VIP> -p 9999 -l
  2. Check if IdM on node 1 and node 2 can connect to the DB (see the logs in /var/log/propel/idm, and the log check after this list). If IdM cannot connect, restart the Launchpad and IdM. If IdM can connect to the DB, try the LB connection again. If that works, restart all other services on all Service Manager Service Portal nodes, and then test the LB connection again.
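
A quick way to look for DB connection errors in the IdM logs (a generic grep; the exact log file names under /var/log/propel/idm may differ in your installation):

# grep -ri "connection" /var/log/propel/idm | tail -n 50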

Pgpool not starting

Make sure that pgpool version 3.4.7 is installed. This is the version that was validated in the distributed Service Manager Service Portal configuration.
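
For example, you can check the installed pgpool version with a standard rpm query (the exact package name may vary by platform):

# rpm -qa | grep pgpool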

You can update propel-distributed/roles/pgpool/tasks/main.yml for the pgpool role to force installation of a specific version:

- name: roles:pgpool Install older pgPool
  yum: name=http://www.pgpool.net/yum/rpms/3.4/redhat/rhel-7-x86_64/pgpool-II-pg94-3.4.7-1pgdg.rhel7.x86_64.rpm state=installed
  ignore_errors: yes

Pgpool not attaching to nodes

When both databases are running, the “show pool_nodes” query should show a status of “2” for both nodes. In the following example, the standby node shows a status of “3”, which means it is not attached:

# sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
node_id | hostname | port | status | lb_weight | role
---------+-----------+------+--------+-----------+---------
0 | <PRIMARY> | 5432 | 2 | 0.500000 | primary
1 | <STANDBY> | 5432 | 3 | 0.500000 | standby

To obtain the expected result, try the following:

  1. On the primary server, restart pgpool:

    # service pgpool restart

    On the standby server, restart pgpool:

    # service pgpool restart

    Check the result:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
  2. If the status is still incorrect, perform the following steps:

    On the standby server, stop pgpool:

    # service pgpool stop

    On the primary server, stop pgpool:

    # service pgpool stop

    On the primary server, make sure that the eth0:0 interface is down:

    # ifdown eth0:0

    On the primary server, remove any stale socket files that pgpool may have left behind if it did not exit gracefully:

    # rm -i /tmp/.s.PGSQL.9898
    # rm -i /var/run/postgresql/.s.PGSQL.9999

    On the primary server, restart pgpool:

    # service pgpool start

    Check the result:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"

    If the status is “2” for both nodes, restart pgpool on the standby server:

    # service pgpool start
  3. If the status is still incorrect, perform the following steps:

    On the standby server, stop pgpool:

    # service pgpool stop

    Confirm the status of the primary server. The result should be “f”:

    # sudo -u postgres psql -h <Primary> -p 5432 -c 'select pg_is_in_recovery()'
    pg_is_in_recovery
    -------------------
    f

    (1 row)

    Confirm the status of the standby server. The result should be “t”:

    # sudo -u postgres psql -h <Standby> -p 5432 -c 'select pg_is_in_recovery()'
    pg_is_in_recovery
    -------------------
    t

    (1 row)

    If these are incorrect, the issue is more likely with the configuration of PostgreSQL. Otherwise, perform these steps:

    On the primary server, run these commands using the node_id that reports a status of “3”:

    # vim /etc/pgpool-II/pcp.conf

    Uncomment the line that starts with "postgres". If your Postgres password has already been changed, generate the md5 hash of the password with the pg_md5 utility that is installed with pgpool:

    # pg_md5 <postgres password>

    Use the result of this command to replace the value after "postgres:" in the /etc/pgpool-II/pcp.conf file.

    Then, run the following command:

    #pcp_attach_node -h <fqdn of the master pgpool> -p 9898 -U postgres -n <node_id>
    Password:
    

    The default password is "postgres".

    Note: In SMSP 2.20 p2, the pgpool version is 3.4.8, and the pcp_attach_node usage differs from version 3.5.5. The usage is as follows:

    pcp_attach_node -d timeout <fqdn of the master pgpool> port# username password node_id

    For example:

    pcp_attach_node -d 30 sgdlitvm0085.hpeswlab.net 9898 postgres postgres 1

    Wait 60 seconds and then check the result:

    # sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"

PostgreSQL queries on VIP fail

When one or both databases are up and pgpool is running, this error should not occur:

# sudo -u postgres psql -h <DB-VIP> -p 9999 -c 'SELECT now()'
psql: server closed the connection unexpectedly
This probably means the server terminated

To fix the issue, follow the same steps in Pgpool not attaching to nodes.

“show pool_nodes” shows both databases as standby

When one or both databases are up and pgpool is running, the following output, in which both nodes report the standby role, indicates an error:

# sudo -u postgres psql -h <DB-VIP> -p 9999 -c "show pool_nodes"
node_id | hostname | port | status | lb_weight | role
---------+-------------+------+--------+-----------+---------
0 | <PRIMARY> | 5432 | 2 | 0.500000 | standby
1 | <STANDBY> | 5432 | 2 | 0.500000 | standby

To fix the issue, follow the same steps in Pgpool not attaching to nodes.

Load Balancer node information

nginx logs: /var/log/nginx

nginx conf: /etc/nginx/conf.d/virtual.conf

Command to restart nginx: service nginx restart
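
Before restarting, you can optionally validate the configuration syntax with the standard nginx check:

# nginx -t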

Database node information

The following section contains information about a DB node.

How to change PostgreSQL to listen on all interfaces

  1. Edit the pg_hba.conf file:

    # su - postgres
    # vi /var/lib/pgsql/9.5/data/pg_hba.conf

    host all all 0.0.0.0/0 trust

  2. Edit the postgresql.conf file:

    # vi /var/lib/pgsql/9.5/data/postgresql.conf

    listen_addresses = '*'

  3. Restart PostgreSQL:

    # service postgresql-9.5 restart
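
To confirm that PostgreSQL is now listening on all interfaces, a quick check on the default port 5432:

# netstat -an | grep 5432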

DB Log locations:

/var/lib/pgsql/9.5/data/pg_log

DB restart:

# service postgresql-9.5 restart

DB not responding:

If PostgreSQL runs out of space and does not respond, see:

http://blog.endpoint.com/2014/09/pgxlog-disk-space-problem-on-postgres.html
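
A quick way to check whether the data directory or its transaction log has filled the disk (paths as used elsewhere in this section):

# df -h /var/lib/pgsql
# du -sh /var/lib/pgsql/9.5/data/pg_xlog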

RabbitMQ commands

The following section provides some useful commands for RabbitMQ.

Broker status

# rabbitmqctl status
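
In a clustered setup, you may also want to verify cluster membership (a standard rabbitmqctl command):

# rabbitmqctl cluster_status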

SX configuration for MQ:

/opt/hp/propel/sx/WEB-INF/classes/config/infrastructure.json

Check if RabbitMQ is running correctly

  • Port 5671 is used by the RabbitMQ broker:

    # netstat -an | grep 5671

  • Port 25672 is used by RabbitMQ to manage clustering:

    # netstat -an | grep 25672

RabbitMQ failed to start on a node

BOOT FAILED - Timeout contacting cluster nodes: [rabbit@awha22p4].

BACKGROUND - This cluster node was shut down while other nodes were still running.

To avoid losing data, you should start the other nodes first, then start this one. To force this node to start, first invoke "rabbitmqctl force_boot". If you do so, any changes made on other cluster nodes after this one was shut down may be lost.

DIAGNOSTICS - attempted to contact: [rabbit@awha22p4]

If you see the type of error described above, run the following commands:

# rabbitmqctl force_boot
# rabbitmqctl start_app

Related topics

Set up a distributed Service Manager Service Portal cluster