Weird issue! More than 100 servers are going down if they reboot!

Yakooza · 09-30-2018, 05:26 AM

Hi,

First, my Linux knowledge is basic.

I have around 100 servers with OVH. Some run CentOS 6.x, most run CentOS 7.x, and they use KVM virtualization. I recently found out that if any of them reboots, it goes into a 'Kernel Panic' (that is what the support said) and does not come back up. Unfortunately, I do not have KVM over IP access, and from the Debian rescue mode I could not locate any relevant logs.

I looked at the logs below but did not find anything relevant.
Here are the last entries from each:

/var/log/messages

Code:

Sep 29 06:58:58 server01 named[1584]:   validating @0x7f51f80597f0: x SOA: no valid signature found
Sep 29 06:58:58 server01 named[1584]:   validating @0x7f51e40028e0: 1x NSEC: no valid signature found
Sep 29 06:58:58 server01 named[1584]: error (network unreachable) resolving 'xA/IN': 2001:41d0:1:1982::1#53
Sep 29 06:58:58 server01 named[1584]: error (network unreachable) resolving 'x': 2001:41d0:1:4a81::1#53
Sep 29 07:00:45 server01 init: tty (/dev/tty1) main process (2495) killed by TERM signal
Sep 29 07:00:45 server01 init: tty (/dev/tty2) main process (2497) killed by TERM signal
Sep 29 07:00:45 server01 init: tty (/dev/tty3) main process (2499) killed by TERM signal
Sep 29 07:00:45 server01 init: tty (/dev/tty4) main process (2501) killed by TERM signal
Sep 29 07:00:45 server01 init: tty (/dev/tty5) main process (2503) killed by TERM signal
Sep 29 07:00:45 server01 init: tty (/dev/tty6) main process (2505) killed by TERM signal
Sep 29 07:00:54 server01 ntpd[26063]: ntpd exiting on signal 15

Code:

root@rescue:/mnt/var/log# tail dmesg
parport0: PC-style at 0x378, irq 5 [PCSPP]
ppdev: user-space parallel port driver
tun: Universal TUN/TAP device driver, 1.6
tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
EXT3-fs (sda2): using internal journal
kjournald starting.  Commit interval 5 seconds
EXT3-fs (sda1): using internal journal
EXT3-fs (sda1): mounted filesystem with ordered data mode
EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts:
Adding 4193276k swap on /dev/sda3.  Priority:-1 extents:1 across:4193276k SS

Code:

root@rescue:/mnt/var/log# tail boot.log
Starting ksmtuned:                                         [  OK  ]
Starting crond:                                            [  OK  ]
Starting atd:                                              [  OK  ]
Starting libvirtd daemon:                                  [  OK  ]
-
Starting php-fpm: Done...
Starting nginx: Done...
Starting MySQL.. SUCCESS!
RTNETLINK answers: No such process
RTNETLINK answers: No such file or directory

I could not find anything on the Internet. Any suggestion would be helpful.

Also, the servers do not all share the same configuration.

Thank you.

Ser Olmy · 09-30-2018, 06:01 PM

The log displayed by dmesg is volatile; it's held in a ring buffer in kernel memory until a process like klogd writes it to disk. All dmesg will tell you is what happened since last reboot. The interesting stuff should be in the logs, but it's entirely possible that whatever causes the kernel panic also prevents the system from dumping anything related to the issue to disk.

I'm not familiar with OVH, but perhaps it would be possible to set up a virtual serial port? Perhaps one that connects to another VM? If so, you could redirect the console to the serial port and capture the actual panic screen.

Do the systems panic when shutting down as well, or does it only happen when you try to reboot?

During troubleshooting, you might want to add "panic=s" (where s is a number of seconds) to the kernel command line. It causes the system to reboot automatically after s seconds in case of a kernel panic, and I figure since you can't see the panic screen anyway, the system might as well just reboot.
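
Before paying for console access, it may be worth searching the persisted logs from rescue mode for crash markers. The following is a minimal sketch, assuming the root filesystem is mounted at /mnt as in the rescue prompts earlier in the thread. The marker list and the function names are illustrative assumptions, not an exhaustive or official set. As noted above, a panic may never reach the disk, so an empty result proves nothing.

```python
import re
from pathlib import Path

# Strings the kernel commonly prints around a crash.
# This list is an assumption for illustration, not an exhaustive set.
PANIC_PATTERN = re.compile(r"Kernel panic|Oops|\bBUG:|Call Trace|soft lockup")


def find_panic_lines(text):
    """Return (line_number, line) pairs for lines matching PANIC_PATTERN."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if PANIC_PATTERN.search(line)
    ]


def scan_log(path):
    """Read one log file, tolerating undecodable bytes, and return its matches."""
    return find_panic_lines(Path(path).read_text(errors="replace"))


if __name__ == "__main__":
    # Path assumed from the rescue-mode prompts in this thread.
    log = Path("/mnt/var/log/messages")
    if log.exists():
        for number, line in scan_log(log):
            print(f"{number}: {line}")
```

Rotated files next to it can be checked the same way by passing their paths to scan_log.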

Yakooza · 09-30-2018, 06:07 PM

Quote:

Originally Posted by Ser Olmy

The log displayed by dmesg is volatile; it's held in a ring buffer in kernel memory until a process like klogd writes it to disk. All dmesg will tell you is what happened since last reboot. The interesting stuff should be in the logs, but it's entirely possible that whatever causes the kernel panic also prevents the system from dumping anything related to the issue to disk.

I'm not familiar with OVH, but perhaps it would be possible to set up a virtual serial port? Perhaps one that connects to another VM? If so, you could redirect the console to the serial port and capture the actual panic screen.

Do the systems panic when shutting down as well, or does it only happen when you try to reboot?

During troubleshooting, you might want to add "panic=s" (where s is a number of seconds) to the kernel command line. It causes the system to reboot automatically after s seconds in case of a kernel panic, and I figure since you can't see the panic screen anyway, the system might as well just reboot.

Thank you, I will ask their support about a virtual serial port. They do offer KVM over IP, but it is $30 - $40 per day. That is not a problem, but I would like to get a better idea before I go down that path.

Also, is there a section here for hiring a Linux expert (I could not find one), or anywhere else I can do that?

Thanks

Yakooza · 10-02-2018, 11:29 PM

Any suggestions for finding out what is going on? I requested KVM access 2 days ago. Still nothing.

Their cheap brand Soyoustart is not good!

syg00 · 10-03-2018, 01:18 AM

Without decent messages, we are more blind than you are. I don't use hosting but if a headless box won't boot, I'm dead in the water until I connect screen and keyboard, so for you that KVM is essential.
If everything is falling over on boot, that sounds like an infrastructure problem. Have you rebooted these images successfully before? What did you change? If you changed nothing, chase them for any changes they made.

Yakooza · 10-03-2018, 05:53 PM

Quote:

Originally Posted by syg00

Without decent messages, we are more blind than you are. I don't use hosting but if a headless box won't boot, I'm dead in the water until I connect screen and keyboard, so for you that KVM is essential.
If everything is falling over on boot, that sounds like an infrastructure problem. Have you rebooted these images successfully before? What did you change? If you changed nothing, chase them for any changes they made.

You were right.
OVH finally gave me KVM access, 2 days after I paid for it. I do not know what happens if something urgent comes up!! That is why I plan to move to a different provider that at least has basic support and IPMI access. With the KVM I found out that the engineer who left a month ago had messed with /etc/fstab on all of the servers using Ansible.

That is what was preventing the servers from booting. Now that I have found the problem, I have to plan how to fix it on all the servers.

Thanks
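
For anyone facing the same cleanup, a structural check of each server's fstab from rescue mode can flag obviously broken entries before the next reboot. The thread does not say what Ansible changed, so the following is only a sketch under assumptions: it checks the field layout described in fstab(5), cannot tell whether a device or UUID actually exists, and the function name and the /mnt/etc/fstab path are illustrative.

```python
from pathlib import Path


def check_fstab(text):
    """Return (line_number, problem) pairs for entries that look malformed."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        fields = line.split()
        # fstab(5): device, mount point, type, options, then optional dump and pass
        if not 4 <= len(fields) <= 6:
            problems.append((number, f"expected 4 to 6 fields, found {len(fields)}"))
            continue
        mount_point = fields[1]
        # Swap entries conventionally use "swap" or "none" as the mount point.
        if not mount_point.startswith("/") and mount_point not in ("none", "swap"):
            problems.append((number, f"suspicious mount point {mount_point!r}"))
        for extra in fields[4:]:
            if not extra.isdigit():
                problems.append((number, f"dump/pass field {extra!r} is not a number"))
    return problems


if __name__ == "__main__":
    # Assumes the root filesystem is mounted at /mnt in rescue mode.
    fstab = Path("/mnt/etc/fstab")
    if fstab.exists():
        for number, problem in check_fstab(fstab.read_text()):
            print(f"line {number}: {problem}")
```

A clean result only means the file is well formed. Comparing it against a known-good copy or against the output of blkid is still needed to confirm the entries point at real devices.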