I am trying to set up Heartbeat + softdog on a SLES 10 installation so that the machine reboots automatically if it ever hangs hard. I have not found much documentation on using softdog, and the mentions of it on www.linux-ha.org are sparse (in my opinion).
I'm working on a test machine to get this set up. I did a "baseline" install of SLES 10 (basically just accepting all defaults during installation), then installed Heartbeat after the machine booted from the hard drive for the first time. I am not setting up a cluster, but I was told that Heartbeat could easily trigger the watchdog to reboot my machine, so my "cluster" is a cluster of one machine. My ha.cf is pretty simple:
Code:
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 120
autojoin any
crm true
bcast eth0
watchdog /dev/watchdog
node testserver
respawn root /sbin/evmsd
apiauth evms uid=hacluster,root
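As a sanity check on the timing values above, here is a minimal sketch that parses ha.cf-style text and flags values that look inconsistent with each other. The function names (`parse_ha_cf`, `timing_warnings`) are hypothetical and not part of Heartbeat, and the rules it applies (warntime above keepalive, deadtime above warntime, initdead at least twice deadtime) are rule-of-thumb assumptions rather than anything Heartbeat enforces. It also only understands plain integer seconds.

```python
# Hypothetical helpers, not part of Heartbeat. They only look at the text
# of an ha.cf file; they do not talk to Heartbeat or to the watchdog.

SAMPLE_HA_CF = """\
logfile /var/log/ha-log
logfacility local0
keepalive 2
deadtime 30
warntime 10
initdead 120
autojoin any
crm true
bcast eth0
watchdog /dev/watchdog
node testserver
respawn root /sbin/evmsd
apiauth evms uid=hacluster,root
"""


def parse_ha_cf(text):
    """Return {directive: [argument strings]}, skipping blanks and # comments."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        args = parts[1].strip() if len(parts) > 1 else ""
        directives.setdefault(parts[0], []).append(args)
    return directives


def timing_warnings(directives):
    """Apply rule-of-thumb checks (assumptions) to the timing directives."""
    def seconds(name):
        value = directives.get(name, [""])[0]
        # Assumption: only plain integer seconds are handled here.
        return int(value) if value.isdigit() else None

    keepalive = seconds("keepalive")
    warntime = seconds("warntime")
    deadtime = seconds("deadtime")
    initdead = seconds("initdead")

    problems = []
    if keepalive is not None and warntime is not None and warntime <= keepalive:
        problems.append("warntime should be larger than keepalive")
    if warntime is not None and deadtime is not None and deadtime <= warntime:
        problems.append("deadtime should be larger than warntime")
    if deadtime is not None and initdead is not None and initdead < 2 * deadtime:
        problems.append("initdead is usually at least twice deadtime")
    return problems


if __name__ == "__main__":
    parsed = parse_ha_cf(SAMPLE_HA_CF)
    print("watchdog device:", parsed.get("watchdog"))
    print("timing warnings:", timing_warnings(parsed))
```

Under these assumed rules the configuration above produces no warnings, so the timing values themselves do not look like the problem.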
I'm not trying to set up any failover of services or anything, so I don't have any resources set up. With this in place, I rebooted my machine, and it seems to come up just fine.
It is at this point that I have questions. How can I test my setup to be sure that it is configured properly and works the way I want it to? I looked in the log file and the only mention I see of the watchdog is a line like the following:
Code:
heartbeat[3549]: 2008/05/19_16:40:52 ERROR: WDIOC_SETTIMEOUT: Failed to set watchdog timer to 31 seconds.: Invalid argument
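The 31 in this message is one more than the deadtime of 30 in ha.cf, which suggests that Heartbeat derives the watchdog timeout from deadtime (an inference from these two numbers, not something confirmed by documentation). "Invalid argument" is EINVAL, meaning the driver behind /dev/watchdog rejected the request. A rough way to see which timeouts the driver accepts is to issue the same ioctl by hand. The sketch below assumes the standard Linux watchdog ioctl numbering on an x86-style architecture; the function name `try_set_timeout` is made up for illustration.

```python
import fcntl
import os
import struct


def _ioc(direction, nr, size=4):
    # Linux ioctl number layout on common architectures such as x86
    # (an assumption; some architectures use a different layout):
    # direction (2 bits) | size (14 bits) | type (8 bits) | number (8 bits)
    return (direction << 30) | (size << 16) | (ord("W") << 8) | nr


WDIOC_SETTIMEOUT = _ioc(3, 6)  # _IOWR('W', 6, int)
WDIOC_GETTIMEOUT = _ioc(2, 7)  # _IOR('W', 7, int)


def try_set_timeout(device, seconds):
    """Ask the watchdog driver for a timeout; return (ok, detail).

    WARNING: opening a watchdog device arms the timer. The "V" written
    before closing is the magic-close character that asks the driver to
    disarm again, but a driver loaded with nowayout ignores it and the
    machine will reboot. Only run this on a test machine.
    """
    try:
        fd = os.open(device, os.O_WRONLY)
    except OSError as exc:
        return False, "open failed: %s" % exc
    try:
        reply = fcntl.ioctl(fd, WDIOC_SETTIMEOUT, struct.pack("i", seconds))
        # The driver writes back the timeout it actually applied.
        return True, struct.unpack("i", reply)[0]
    except OSError as exc:
        return False, "WDIOC_SETTIMEOUT failed: %s" % exc
    finally:
        os.write(fd, b"V")  # magic close: request disarm before closing
        os.close(fd)
```

Calling `try_set_timeout("/dev/watchdog", 31)` as root, with Heartbeat stopped (watchdog drivers typically allow only one process to hold the device), should reproduce the EINVAL if the driver itself refuses 31 seconds. Trying a few other values would show whether the driver has a narrower allowed range or does not support setting the timeout at all.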
This makes me think that I have something configured wrong, but as I mentioned above, I have not been very successful in finding much detailed documentation about what I'm trying to do here.
Am I even barking up the right tree? Is there an easier/better way to monitor general system health and reboot if there is an issue?