How to troubleshoot why my server is slow?
Hello,
I have a three-server setup for an OpenStack deployment. All three nodes are (almost) identical in which OpenStack services are installed on them; the node in question runs a few more control-plane API services related to networking. They all have 2 CPUs and 48 GB of RAM and run CentOS 7.5 Minimal.

Now my problem is that this machine is super slow. It takes several minutes to SSH into the machine, and running a command (# docker stats) takes about 10 minutes to produce output. This problem just started a couple of days ago. I thought it had something to do with the RAM, but I ran a test and the physical RAM itself seems to be fine. I am confused about how to troubleshoot this, since both of the other nodes (with the same software on them) aren't slow like this node is.

After restarting the machine it works just like it should for an hour or two; after that it starts slowing down, and top shows the memory usage and buff/cache constantly going up. For reference, the server has been running all night and free -m is showing this: PHP Code:
EDIT: The architecture is very specific to OpenStack. I am running mariadb, nova, heat, cinder/ceph, ironic, glance, keystone, and rabbitmq on all nodes. Node-1 is the only one that should have the core neutron APIs running on it; that's the only difference between the software installed on each node. The cloud is not 'in use' right now, as I am still trying to get everything finalized. Two VMs are running at the moment for testing the cloud. |
Which processes does top show using CPU & RAM?
|
Is your cloud in production? Is it running any workloads? A few words about the architecture might help.
Use vmstat to confirm that there is paging activity. The first node uses 10 GB of buffer cache, the other 1.4 GB. Find out where the difference comes from - it has something to do with file I/O. How do these "identical" nodes differ? Perhaps you run cinder-volume on the first node, with a file backend, and your instances use volumes a lot? Run top to get an idea of which processes use the most CPU and memory. The CPU users might also be the heaviest file I/O users. |
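As a minimal sketch of confirming paging without relying on the vmstat tool being installed: the kernel's cumulative swap counters in /proc/vmstat are the same data vmstat's si/so columns are computed from.

```shell
# pswpin/pswpout are pages swapped in/out since boot.
# Sample this twice a few seconds apart; if the numbers keep climbing,
# the machine is actively paging, which matches the observed slowdown.
grep -E '^(pswpin|pswpout) ' /proc/vmstat
```

If both counters stay flat between samples, the growing buff/cache is probably harmless page cache rather than memory pressure.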
Also check the top command to troubleshoot the performance,
or see the link below: http://www.dowdandassociates.com/blo...x-top-command/ Did this start recently? Any change on the server, network load, bandwidth, or applications? |
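For a snapshot that can be pasted into the thread (interactive top output is awkward to copy), ps can sort by memory the same way - a sketch using standard procps options:

```shell
# Top 10 processes by resident memory; the %MEM and RSS columns show
# which processes actually hold RAM, as opposed to page cache.
ps aux --sort=-%mem | head -n 10
```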
Get a top report from when all but 300 MB of RAM is being used; this one shows most of it available. I don't know dockerd.
|
Disk problems can also cause extreme slowness. I assume you already checked the logs for disk drive errors?
Also, the machine is swapping, and that will slow things down. How is the network load? https://linux.die.net/man/1/nload |
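A quick sketch for telling real memory pressure apart from harmless cache growth, assuming the kernel exposes MemAvailable (mainline 3.14+, backported to the CentOS 7 kernel):

```shell
# Falling MemAvailable together with shrinking SwapFree means genuine
# memory pressure; stable MemAvailable with growing buff/cache does not.
grep -E '^(MemAvailable|SwapTotal|SwapFree):' /proc/meminfo
```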
The cloud is deployed and working; there are a few networks that were made with neutron (OpenStack's networking SDN), but I am only running two VMs at the moment. One of them is running on the slow node and the other is running on a node that works just fine. This issue started only a few days ago, and I can't think of anything that changed between when the machine was working normally and now, when it's super slow.

top only shows me the total usage of RAM, and the programs themselves don't show any significant RAM usage either. Everything is running in docker containers, and when I use 'docker stats' the highest usage by a container is only around 1 GB. There are 50 containers running and most of them use around 100 MB of RAM; the containers with high usage are around 500 MB, but that's only 1 or 2 containers.

Over time the RAM usage just keeps going up, which leads me to believe the system is letting the disk cache take over unused RAM; however, the system gets slower and slower while that happens. This makes me think the disk cache isn't releasing RAM when something else needs it, but I don't know how to check or confirm this.
P.S. This setup is not in use yet. I am an intern who was given the project of setting up this cloud, and I had never worked with Linux before this internship either, so I am a super newbie. I appreciate all the help, thanks! |
|
Let's see /proc/meminfo - use [code] tags not [php]; might be worth changing the tags above too ...
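For reference, these are the fields most relevant to a growing buff/cache (a sketch; post the whole file, but these are what I'd look at first):

```shell
# Cached and SReclaimable are reclaimable by the kernel under pressure;
# Dirty/Writeback pages must be flushed to disk first, so large values
# there plus a slow disk would explain stalls.
grep -E '^(MemTotal|MemFree|Buffers|Cached|SReclaimable|Dirty|Writeback):' /proc/meminfo
```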
|
Or have you checked the cron scheduler for any malware or processes lurking around?
/etc/cron.hourly/ /etc/crontab /etc/cron.daily/ |
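A sketch for reviewing those locations in one pass (with /etc/cron.d added as another standard drop-in directory):

```shell
# Dump each cron location that exists: files are printed, directories listed.
# Missing paths are silently skipped, so this is safe to run anywhere.
for p in /etc/crontab /etc/cron.hourly /etc/cron.daily /etc/cron.d; do
    if [ -f "$p" ]; then
        echo "== $p =="; cat "$p"
    elif [ -d "$p" ]; then
        echo "== $p =="; ls -l "$p"
    fi
done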
Here is the output from my machine after it's been running all night. It is super slow right now.
Code:
[root@node-1 ~]# top
Code:
MemTotal:       49278556 kB
|
Given the swap usage as well, I'd bet the disk response time is crap and is tying up the whole system. Just a guess tho' ... Get something like sysstat to see the disk response time. |
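Until sysstat is installed, the raw counters iostat computes its numbers from are already in /proc/diskstats - a rough busy-check as a sketch (field 13 is cumulative milliseconds the device spent doing I/O):

```shell
# Sample 'ms doing I/O' per device twice, 2 seconds apart. A per-device
# delta approaching 2000 ms means that disk was busy nearly 100% of the
# interval, i.e. it is the bottleneck.
awk '{ print $3, $13 }' /proc/diskstats > /tmp/io_before
sleep 2
awk '{ print $3, $13 }' /proc/diskstats > /tmp/io_after
diff /tmp/io_before /tmp/io_after || true
```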
Are you able to stop this process: 2001 root 20 0 3058504 9116 28 S 15.9 0.0 5:25.09 docker-containe
and then check the server's performance? |
It looks like the high load average is caused by processes waiting to do disk I/O - hence the buffer cache size of 9 GB. These are processes in state D (uninterruptible sleep); your top output shows the DB, nova-conductor and a large number of Neutron API servers.
Why is the DB process so large? Can you check what's in the DB? And why are so many API requests being made to Neutron (at least that's my guess, seeing the Neutron servers waiting for disk I/O)? Check their logs. Perhaps you use DEBUG logging; switching this off would improve the situation, but in the end you need to understand where all the Neutron activity comes from. There might be other processes in the D state as well; top should be able to filter for them. Figure out what they are, there might be more clues. Also compare the top and vmstat output of this server with the two others. Are you deploying Neutron on this one controller only? Finally, use ask.openstack.org and the OpenStack mailing list for other opinions. |
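The D-state hunt suggested above can be sketched with ps instead of filtering in interactive top (processes often sit in D only briefly, so run it a few times):

```shell
# List processes currently in uninterruptible sleep (state D), i.e. the
# ones blocked on disk I/O. Empty output is a good sign; repeated hits on
# the same PIDs point at the I/O-bound culprits.
ps -eo state,pid,comm | awk '$1 ~ /^D/ { print }'
```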