PCS HA fail count reset error
Hi all,
I am running into an issue with PCS HA. Details below.
===========================================================================================
I have set up a PCS HA cluster which load-balances over a VIP. Here is the current status:
/root> pcs status
Cluster name: proxycluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-01-13 11:28:15 +05:30)
Cluster Summary:
* Stack: corosync
* Current DC: ha1 (version 2.1.4-5.el8_7.2-dc6eb4362e) - partition with quorum
* Last updated: Fri Jan 13 11:28:16 2023
* Last change: Fri Jan 6 15:43:49 2023 by hacluster via crmd on ha1
* 2 nodes configured
* 2 resource instances configured
Node List:
* Online: [ ha1 ha2 ]
Full List of Resources:
* Resource Group: GTPGroup:
* gtp_agent (lsb:gtp_agent): Started ha1
* VIP (ocf::heartbeat:IPaddr2): Started ha1
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
------------------------------------------------------------------------------
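For completeness, the group was created with commands along these lines (this is a reconstruction from the status output above, not my exact history; the VIP address is a placeholder):

```
# Hypothetical reconstruction of the configuration shown above.
# 192.168.1.100/24 is a placeholder VIP; adjust to your network.
pcs resource create gtp_agent lsb:gtp_agent op monitor interval=60s
pcs resource create VIP ocf:heartbeat:IPaddr2 ip=192.168.1.100 cidr_netmask=24 op monitor interval=30s
pcs resource group add GTPGroup gtp_agent VIP
```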
The resource agent is as follows:
/root> cat /usr/lib/ocf/resource.d/mobileum/GTPMonitor
#!/usr/bin/bash
#######################################################################
# Initialization:
: ${OCF_FUNCTIONS_DIR=/usr/lib/ocf/lib/heartbeat}
. /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs
#######################################################################
#######################################################################
# Cluster Actions
# ---------------------------------------------------------------------
GTPMonitor_usage()
{
cat <<END
usage: $0 {start|stop|status|monitor|meta-data}
- 'start' operation starts the Server.
- 'stop' operation stops the Server.
- 'status' operation reports whether the Server is running.
- 'monitor' operation reports whether the Server appears to be working.
- 'meta-data' operation shows the meta-data message.
END
}
meta_data()
{
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "gtp-api-1.dtd">
<resource-agent name="GTPMonitor">
<version>1.0</version>
<longdesc lang="en">
This is a GTP Resource Agent. It manages a GTP instance server as a
cluster resource.
All actions have a (timeout), that is a hint to the user what minimal timeout
should be configured for the action. This is meant to cater for the fact that
some resources are quick to start and stop (IP addresses or filesystems,
for example), some may take several minutes to do so (such as databases).
In addition, recurring actions (such as monitor) should also specify a recommended
minimum (interval), which is the time between two consecutive invocations of the
same action. Like (timeout), this value does not constitute a default, it is merely
a hint for the user which action interval to configure, at minimum.
</longdesc>
<shortdesc lang="en">Manages a GTP Server instance</shortdesc>
<actions>
<action name="start" timeout="300s" />
<action name="stop" timeout="300s" />
<action name="monitor" timeout="300s" interval="60s" depth="0" />
<action name="meta-data" timeout="5s" />
</actions>
</resource-agent>
END
}
gtp_start() {
su - roamware -c "
. /opt/Roamware/setall_gtpproxy.sh
bash /opt/Roamware/scripts/operations/gtpproxy/StartGTPProxy.sh"
rc=$?
if [ $rc -ne 0 ]
then
ocf_log err "gtp could not be started"
return $OCF_ERR_GENERIC
else
ocf_log info "gtp successfully started."
return $OCF_SUCCESS
fi
}
gtp_stop() {
su - roamware -c "
. /opt/Roamware/setall_gtpproxy.sh
bash /opt/Roamware/scripts/operations/gtpproxy/StopGTPProxy.sh"
rc=$?
if [ $rc -ne 0 ]
then
ocf_log err "gtp could not be stopped"
return $OCF_ERR_GENERIC
else
ocf_log info "gtp successfully shutdown"
return $OCF_SUCCESS
fi
}
gtp_monitor() {
su - roamware -c "
. /opt/Roamware/setall_gtpproxy.sh
bash /opt/Roamware/MonitorGTP.sh"
rc=$?
if [ $rc -ne 0 ]
then
ocf_log info "gtp is not running"
return $OCF_NOT_RUNNING
else
ocf_log info "gtp is running"
return $OCF_SUCCESS
fi
}
first_argument="$1"
case "$first_argument" in
meta-data)
meta_data
exit $OCF_SUCCESS
;;
start)
gtp_start
;;
stop)
gtp_stop
;;
status|monitor)
gtp_monitor
;;
usage|help)
GTPMonitor_usage
exit $OCF_SUCCESS
;;
*)
GTPMonitor_usage
ocf_log err "$0 was called with unsupported arguments: $*"
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
---------------------------------------------------------------------------------
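In case it is useful: my understanding is that the resource-agents package ships ocf-tester, which exercises an OCF agent's actions (including meta-data) outside the cluster, so something like the following should flag missing actions (the resource name here is arbitrary):

```
# Run the agent through the standard OCF compliance checks.
ocf-tester -n gtp_test /usr/lib/ocf/resource.d/mobileum/GTPMonitor
```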
Now let me describe the issue I am facing. When I stop or kill the application manually, Pacemaker should restart it automatically as long as the failure count stays within the limit, and I am using the default fail-count configuration (INFINITY). PCS does restart the application or fail over to the secondary node, but only after I run pcs resource cleanup <resource id>. The error below blocks the resource agent until it is cleaned up manually:
/root> ps -ef |grep gtp
root 19822 9777 0 11:32 pts/11 00:00:00 grep --color=auto gtp
roamware 23791 1 0 Jan06 ? 00:00:23 /opt/Roamware/binaries/gtpproxy/bin/gtpproxy -c /opt/Roamware/binaries/gtpproxy/config/GTPProxy.cfg
/root> kill -9 23791
/root> ps -ef |grep gtp
root 20131 9777 0 11:32 pts/11 00:00:00 grep --color=auto gtp
/root> pcs status
Cluster name: proxycluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-01-13 11:32:19 +05:30)
Cluster Summary:
* Stack: corosync
* Current DC: ha1 (version 2.1.4-5.el8_7.2-dc6eb4362e) - partition with quorum
* Last updated: Fri Jan 13 11:32:19 2023
* Last change: Fri Jan 6 15:43:49 2023 by hacluster via crmd on ha1
* 2 nodes configured
* 2 resource instances configured (1 BLOCKED from further action due to failure)
Node List:
* Online: [ ha1 ha2 ]
Full List of Resources:
* Resource Group: GTPGroup:
* gtp_agent (lsb:gtp_agent): FAILED ha1 (blocked)
* VIP (ocf::heartbeat:IPaddr2): Stopped
Failed Resource Actions:
* gtp_agent_stop_0 on ha1 'error' (1): call=30, status='complete', last-rc-change='Fri Jan 13 11:32:18 2023', queued=0ms, exec=353ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
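From what I can tell so far, the blocking is tied to stop-failure semantics: Pacemaker treats a failed stop action as fatal, and with no STONITH configured it leaves the resource blocked until cleanup. In case it helps the discussion, here is a minimal sketch of how an idempotent stop would report "already stopped" as success instead of an error (all names here are hypothetical stand-ins, not my real scripts):

```shell
#!/usr/bin/env bash
# Sketch only: OCF return codes, as fixed by the OCF spec.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical stand-in for the real monitor script (MonitorGTP.sh):
# reports whatever state the caller sets in GTP_STATE.
gtp_probe() {
    if [ "$GTP_STATE" = "running" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}

# Idempotent stop: if the vendor stop script fails but the process is
# already gone (e.g. it was killed with kill -9), report success so
# Pacemaker can recover/fail over instead of blocking the resource.
gtp_stop_idempotent() {
    stop_rc=$1   # exit code of the real stop script run
    if [ "$stop_rc" -ne 0 ]; then
        gtp_probe
        if [ $? -eq "$OCF_NOT_RUNNING" ]; then
            return "$OCF_SUCCESS"    # already stopped: not a failure
        fi
        return "$OCF_ERR_GENERIC"    # still running and stop failed: real error
    fi
    return "$OCF_SUCCESS"
}

# Demo: stop script "fails" (rc=1) after the process was killed out-of-band.
GTP_STATE=stopped
gtp_stop_idempotent 1 && echo "already-stopped case -> OCF_SUCCESS"
# Demo: stop script fails while the process is still alive.
GTP_STATE=running
gtp_stop_idempotent 1 || echo "still-running case -> OCF_ERR_GENERIC"
```

For now the only workaround I have is pcs resource cleanup gtp_agent.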
Any help is appreciated.