LinuxQuestions.org
01-13-2023, 12:17 AM   #1
YasminKundur
PCS HA fail count reset error


Hi all,

I am having an issue with PCS HA. The details are below.
===========================================================================================

I have set up a PCS HA cluster that load balances over a VIP. Here is the current status:
/root> pcs status
Cluster name: proxycluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-01-13 11:28:15 +05:30)
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.4-5.el8_7.2-dc6eb4362e) - partition with quorum
  * Last updated: Fri Jan 13 11:28:16 2023
  * Last change: Fri Jan 6 15:43:49 2023 by hacluster via crmd on ha1
  * 2 nodes configured
  * 2 resource instances configured

Node List:
  * Online: [ ha1 ha2 ]

Full List of Resources:
  * Resource Group: GTPGroup:
    * gtp_agent (lsb:gtp_agent): Started ha1
    * VIP (ocf::heartbeat:IPaddr2): Started ha1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

------------------------------------------------------------------------------
The resource agent is shown below:

/root> cat /usr/lib/ocf/resource.d/mobileum/GTPMonitor
#!/usr/bin/bash
param=$1

#######################################################################
# Initialization:

: ${OCF_FUNCTIONS_DIR=/usr/lib/ocf/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs

#######################################################################

#######################################################################
# Cluster Actions
# ---------------------------------------------------------------------
GTPMonitor_usage()
{
cat <<END
usage: $0 {start|stop|status|monitor|meta-data}

- 'start' operation starts the Server.
- 'stop' operation stops the Server.
- 'status' operation reports whether the Server is running
- 'monitor' operation reports whether the Server appears to be working.
- 'meta-data' operation shows the agent meta-data.
END
}

meta_data()
{
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="GTPMonitor">
<version>1.0</version>

<longdesc lang="en">
This is a GTP Resource Agent. It manages a GTP instance server as a
cluster resource.

All actions have a (timeout), that is a hint to the user what minimal timeout
should be configured for the action. This is meant to cater for the fact that
some resources are quick to start and stop (IP addresses or filesystems,
for example), some may take several minutes to do so (such as databases).

In addition, recurring actions (such as monitor) should also specify a recommended
minimum (interval), which is the time between two consecutive invocations of the
same action. Like (timeout), this value does not constitute a default, it is merely
a hint for the user which action interval to configure, at minimum.
</longdesc>
<shortdesc lang="en">Manages a GTP Server instance</shortdesc>

<actions>
<action name="start" timeout="300s" />
<action name="stop" timeout="300s" />
<action name="monitor" timeout="300s" interval="60s" depth="0" />
<action name="meta-data" timeout="5s" />
</actions>
</resource-agent>
END
}

gtp_start() {
    # Run the vendor start script as the roamware user.
    su - roamware -c "
        . /opt/Roamware/setall_gtpproxy.sh
        bash /opt/Roamware/scripts/operations/gtpproxy/StartGTPProxy.sh"
    rc=$?
    if [ $rc -ne 0 ]
    then
        ocf_log err "gtp could not be started"
        return $OCF_ERR_GENERIC
    else
        ocf_log info "gtp successfully started."
        return $OCF_SUCCESS
    fi
}

gtp_stop() {
    # Run the vendor stop script as the roamware user.
    # Any non-zero exit from the stop script is reported as a generic error.
    su - roamware -c "
        . /opt/Roamware/setall_gtpproxy.sh
        bash /opt/Roamware/scripts/operations/gtpproxy/StopGTPProxy.sh"
    rc=$?
    if [ $rc -ne 0 ]
    then
        ocf_log err "gtp could not be stopped"
        return $OCF_ERR_GENERIC
    else
        ocf_log info "gtp successfully shutdown"
        return $OCF_SUCCESS
    fi
}
gtp_monitor() {
    # Any non-zero exit from MonitorGTP.sh is reported as "not running".
    su - roamware -c "
        . /opt/Roamware/setall_gtpproxy.sh
        bash /opt/Roamware/MonitorGTP.sh"
    rc=$?
    if [ $rc -ne 0 ]
    then
        ocf_log info "gtp monitoring not working"
        return $OCF_NOT_RUNNING
    else
        ocf_log info "gtp monitoring working"
        return $OCF_SUCCESS
    fi
}

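first_argument="$1"

case "$first_argument" in
    start)
        gtp_start
        ;;
    stop)
        gtp_stop
        ;;
    monitor|status)
        # The usage text advertises 'status'; treat it like 'monitor'.
        gtp_monitor
        ;;
    meta-data)
        # Print the agent meta-data defined above.
        meta_data
        exit $OCF_SUCCESS
        ;;
    usage|help)
        GTPMonitor_usage
        exit $OCF_SUCCESS
        ;;
    *)
        ocf_log err "$0 was called with unsupported arguments: $*"
        exit $OCF_ERR_UNIMPLEMENTED
        ;;
esac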

---------------------------------------------------------------------------------
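
Pacemaker expects a stop action to report success when the service is already down, so a stop that fails after the process has been killed is treated as a failed stop. The sketch below shows one way the stop logic above could be made idempotent. It is an illustration under assumptions, not the agent's actual code: the helper names gtp_is_running, gtp_run_stop_script and gtp_stop_idempotent are hypothetical, and detecting the process with pgrep on the binary path seen in the ps output is an assumption. Note also that pcs status lists the resource as lsb:gtp_agent, so the script the cluster really calls may be an init script rather than this OCF agent; the same expectation applies to an LSB stop.

```shell
# Stand-ins for the OCF return codes so the sketch also runs outside a cluster;
# inside the agent, ocf-shellfuncs already provides them.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"

# Hypothetical helper: succeeds while a gtpproxy process exists.
# Assumption: the binary path from the ps output identifies the process.
gtp_is_running() {
    pgrep -f /opt/Roamware/binaries/gtpproxy/bin/gtpproxy >/dev/null 2>&1
}

# Hypothetical helper: the same stop command the agent already uses.
gtp_run_stop_script() {
    su - roamware -c "
        . /opt/Roamware/setall_gtpproxy.sh
        bash /opt/Roamware/scripts/operations/gtpproxy/StopGTPProxy.sh"
}

gtp_stop_idempotent() {
    # Already stopped (for example after kill -9): report success, not an error.
    if ! gtp_is_running; then
        return "$OCF_SUCCESS"
    fi

    if ! gtp_run_stop_script; then
        echo "stop script reported an error; checking the process state" >&2
    fi

    # Judge the outcome by the process state, not only by the script's exit code.
    # A real agent would probably poll for a while before giving up.
    if gtp_is_running; then
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}
```

In the agent, the stop branch of the case statement would call gtp_stop_idempotent in place of gtp_stop.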

Here is the issue I am facing. When I stop the application manually or kill it, I expect the cluster to restart it automatically as long as the fail-count limit has not been reached; the limit is left at its default, which is INFINITY. Instead, PCS restarts the application or fails over to the secondary node only after I run pcs resource cleanup <resource id>. The error below blocks the resource until it is cleaned up manually.

/root> ps -ef |grep gtp
root 19822 9777 0 11:32 pts/11 00:00:00 grep --color=auto gtp
roamware 23791 1 0 Jan06 ? 00:00:23 /opt/Roamware/binaries/gtpproxy/bin/gtpproxy -c /opt/Roamware/binaries/gtpproxy/config/GTPProxy.cfg
/root> kill -9 23791
/root> ps -ef |grep gtp
root 20131 9777 0 11:32 pts/11 00:00:00 grep --color=auto gtp
/root> pcs status
Cluster name: proxycluster
Status of pacemakerd: 'Pacemaker is running' (last updated 2023-01-13 11:32:19 +05:30)
Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.4-5.el8_7.2-dc6eb4362e) - partition with quorum
  * Last updated: Fri Jan 13 11:32:19 2023
  * Last change: Fri Jan 6 15:43:49 2023 by hacluster via crmd on ha1
  * 2 nodes configured
  * 2 resource instances configured (1 BLOCKED from further action due to failure)

Node List:
  * Online: [ ha1 ha2 ]

Full List of Resources:
  * Resource Group: GTPGroup:
    * gtp_agent (lsb:gtp_agent): FAILED ha1 (blocked)
    * VIP (ocf::heartbeat:IPaddr2): Stopped

Failed Resource Actions:
  * gtp_agent_stop_0 on ha1 'error' (1): call=30, status='complete', last-rc-change='Fri Jan 13 11:32:18 2023', queued=0ms, exec=353ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
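
The blocked state shown above can also be picked out from a script. The following is a rough sketch, assuming the pcs 0.10 command set shipped with EL8 (pcs resource failcount show and pcs resource cleanup). The function names list_blocked_resources and show_and_cleanup are hypothetical, and PCS is a made-up override variable so the commands can be dry-run with PCS=echo.

```shell
# Command used to talk to the cluster; override with PCS=echo for a dry run.
PCS="${PCS:-pcs}"

# Read `pcs status` text on stdin and print the ID of every resource
# whose line carries the "(blocked)" flag.
list_blocked_resources() {
    awk '/\(blocked\)/ {
        for (i = 1; i <= NF; i++) {
            # Skip the leading "*" bullet; the next field is the resource ID.
            if ($i != "*") { print $i; break }
        }
    }'
}

# Show the fail count recorded for one resource, then clear its failure history.
show_and_cleanup() {
    local resource_id="$1"
    "$PCS" resource failcount show "$resource_id"
    "$PCS" resource cleanup "$resource_id"
}
```

For the output above, piping pcs status into list_blocked_resources prints gtp_agent, and show_and_cleanup gtp_agent then displays and clears its failure record. Cleanup only clears the recorded failure; if the stop action keeps failing, the resource is likely to be blocked again.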


Any help is appreciated.
 
  

