Health Check
1. Introduction
When users report that a server is slow or that they cannot connect, a few commands give a quick picture of the system's health. The first and foremost step is a connectivity check from the local machine (laptop or desktop): ping the server by name. If the ping succeeds, connect to the server and run the basic commands described in the following sections.
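The connectivity check can be wrapped in a small helper. This is only a sketch: the function name check_host and the PING_CMD override are made up for illustration, and the -W timeout flag assumes the Linux iputils version of ping.

```shell
# check_host HOST: report whether HOST answers ping.
# PING_CMD is a hypothetical override hook (handy for dry runs);
# it defaults to the system ping command.
check_host() {
    local host="$1"
    # -c 3: send three probes; -W 2: wait up to 2 seconds per reply
    if "${PING_CMD:-ping}" -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "reachable: $host"
    else
        echo "unreachable: $host"
        return 1
    fi
}
```

For example, check_host server1 prints "reachable: server1" when the host replies, and returns a non-zero status otherwise so it can be used in scripts.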
2. uptime command
uptime is used to get how long the system has been running
[root@control ~]# uptime
 20:19:16 up 1:11, 1 user, load average: 0.00, 0.00, 0.00
The output shows how many days or hours the server has been up. Check the load average, which is reported for the last 1, 5, and 15 minutes. If the values are higher than normal for this server, find the application responsible. When the load average is high, use the top, sar, and free commands for further debugging.
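What counts as "high" depends on the server. One common rule of thumb, used here as an assumption rather than a fixed rule, is to compare the 1-minute load average with the number of CPUs. The helper name load_status is hypothetical.

```shell
# load_status LOAD NCPU
# Prints "high" when LOAD exceeds NCPU, otherwise "ok".
# The "load above CPU count" threshold is an assumed rule of thumb.
load_status() {
    awk -v load="$1" -v cpus="$2" \
        'BEGIN { if (load + 0 > cpus + 0) print "high"; else print "ok" }'
}

# Example on a live system (1-minute value from /proc/loadavg):
#   read -r load1 _ < /proc/loadavg
#   load_status "$load1" "$(nproc)"
```

A sustained "high" result is a hint to look further with top, not a diagnosis on its own.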
3. top command
top command is used to display Linux processes
[root@control ~]# top
top - 20:22:09 up 1:14, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 141 total, 1 running, 140 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.2 sy, 0.0 ni, 93.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1829.6 total, 107.9 free, 1516.0 used, 205.8 buff/cache
MiB Swap: 2220.0 total, 1675.7 free, 544.2 used. 154.7 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  831 rabbitmq  20   0 1741920  49668   3732 S   6.2   2.7   0:26.86 beam.smp
    1 root      20   0  176272   5552   3880 S   0.0   0.3   0:00.80 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq
Several fields are worth noting:
%Cpu(s) – overall CPU usage, broken down by state (us, sy, id, wa and so on)
id – percentage of time the CPU is idle
zombie – number of zombie processes, that is, processes that have exited but whose parent has not yet collected their exit status
MiB Mem – physical memory usage of the server
MiB Swap – swap usage of the server
%CPU – CPU usage of an individual process
%MEM – memory usage of an individual process
PID – ID of the process that is consuming CPU or memory
TIME+ – total CPU time the process has used since it started
Tasks – number of running, sleeping, stopped, and zombie processes
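For scripting, top can be run once in batch mode (top -b -n 1) and its summary parsed. The sketch below pulls the idle percentage out of the %Cpu(s) line. The function name cpu_idle_from_top is made up, and the parsing assumes the summary layout shown in the output above, which can differ between top versions.

```shell
# cpu_idle_from_top: read `top -b -n 1` output on stdin and print the
# idle CPU percentage from the "%Cpu(s):" summary line.
# Assumes the "93.8 id" field layout of recent procps-ng top.
cpu_idle_from_top() {
    grep '^%Cpu' | tr ',' '\n' | awk '$2 == "id" { print $1 }'
}

# Example on a live system:
#   top -b -n 1 | cpu_idle_from_top
```

A consistently low idle value points to CPU pressure, and the per-process %CPU column then shows which process is responsible.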
4. disk free command
The df (disk free) command lists the mounted filesystems. Check the Use% column: if a filesystem is at or close to 100%, find and clear the files that take up the most space.
[root@control ~]# df -Ph
Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             900M     0  900M   0% /dev
tmpfs                915M  204K  915M   1% /dev/shm
tmpfs                915M  8.5M  907M   1% /run
tmpfs                915M     0  915M   0% /sys/fs/cgroup
/dev/mapper/cl-root   29G  6.3G   23G  22% /
/dev/sda1            976M  185M  724M  21% /boot
tmpfs                183M     0  183M   0% /run/user/1001
tmpfs                183M     0  183M   0% /run/user/1000
-P prints the POSIX output format (one line per filesystem)
-h prints sizes in human-readable units such as 1K, 200M, 1G, 1T
-T also displays the filesystem type, such as ext4, xfs, or tmpfs
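Checking Use% by eye works for a handful of mounts; for more, the df -P output can be filtered. A minimal sketch, assuming a hypothetical helper name full_filesystems and mount points without spaces:

```shell
# full_filesystems LIMIT: read `df -P` output on stdin and print
# "mountpoint use%" for every filesystem at or above LIMIT percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # strip the percent sign
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Example on a live system:
#   df -P | full_filesystems 90
```

The -P format matters here because it keeps each filesystem on a single line, which makes the column positions predictable.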
5. ps command
The ps command lists running processes. For example, if the httpd process is consuming a lot of CPU or memory, run the command below to find its process IDs and parent process.
[devops@server1 ~]$ ps -ef | grep httpd
root       920     1  0 11:07 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache    1067   920  0 11:07 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache    1068   920  0 11:07 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache    1069   920  0 11:07 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache    1073   920  0 11:07 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
devops    2346  2322  0 11:08 pts/0    00:00:00 grep --color=auto httpd
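Note that the last line of the output above is the grep command itself. One way to avoid that, sketched here with a hypothetical helper called procs_named, is to match only on the command column of ps -ef (column 8) instead of the whole line:

```shell
# procs_named NAME: read `ps -ef` output on stdin and print the lines
# whose command (column 8) contains NAME. Matching on the command
# column skips the "grep ... NAME" line that a plain grep would match.
procs_named() {
    awk -v name="$1" 'index($8, name) > 0'
}

# Example on a live system:
#   ps -ef | procs_named httpd
```

On systems with procps-ng, pgrep -a httpd gives a similar list of PIDs and command lines without the extra pipe.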
6. dmesg command
The dmesg command prints messages from the kernel. Whenever a device driver reports a change, the kernel logs a message, so dmesg helps confirm whether anything has changed on the server recently.
Examples: a disk was added or removed, a network interface went up or down, or an application consumed so much memory that the kernel's OOM (out of memory) killer was invoked.
[root@control ~]# dmesg | tail
[ 7.177177] RAPL PMU: hw unit of domain psys 2^-0 Joules
[ 7.394515] intel_pmc_core: initialized
[ 8.733504] snd_intel8x0 0000:00:05.0: white list rate for 1028:0177 is 48000
[ 8.846091] EXT4-fs (dm-2): recovery complete
[ 8.846094] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
[ 8.872804] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[ 12.491963] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
[ 12.504901] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
[ 12.505406] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[ 12.505654] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s3: link becomes ready
The most recent hardware changes usually appear at the end of the dmesg output. For example, a network disconnection, or a memory or storage device dropping out, shows up there.
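On a busy server the kernel log is long, so it helps to filter for the events mentioned above. The pattern list in this sketch is an assumption based on typical kernel message wording, not a complete catalogue, and the function name kernel_events is hypothetical.

```shell
# kernel_events: read dmesg output on stdin and keep only lines that
# look like OOM kills, link state changes, or I/O errors.
# The patterns are illustrative; extend them for your hardware.
kernel_events() {
    grep -Ei 'out of memory|oom-killer|link is (up|down|not ready)|i/o error'
}

# Example on a live system (-T prints human-readable timestamps):
#   dmesg -T | kernel_events | tail
```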
7. Check the OS system log file
The system log file collects messages from many sources: user authentication, errors, failures, session logs, and custom logs. Which messages go to which file is defined in the syslog daemon configuration (for example /etc/rsyslog.conf), while /etc/logrotate.conf controls how the log files are rotated.
The Red Hat OS family writes system logs to /var/log/messages, and the Debian OS family writes them to /var/log/syslog. A few more files are useful for troubleshooting.
[root@control log]# egrep -i 'error|fail' /var/log/messages
May 18 02:38:52 control augenrules[951]: failure 1
May 18 02:38:52 control augenrules[951]: failure 1
May 18 02:38:53 control rngd[984]: Failed to init entropy source hwrng
May 18 02:38:57 control systemd[1659]: Failed to fully start up daemon: Permission denied
May 18 02:38:57 control systemd[1]: user@1001.service: Failed with result 'protocol'.
May 18 02:38:57 control systemd[1]: Failed to start User Manager for UID 1001.
May 18 02:39:03 control vboxdrv.sh[983]: vboxdrv.sh: failed: Look at /var/log/vbox-setup.log to find out what went wrong.
May 18 02:39:03 control vboxdrv.sh[2179]: failed: Look at /var/log/vbox-setup.log to find out what went wrong.
May 18 02:39:03 control systemd[1]: vboxdrv.service: Failed with result 'exit-code'.
May 18 02:39:03 control systemd[1]: Failed to start VirtualBox Linux kernel module.
Adjust the keywords in the grep command to suit the problem you are chasing. Another important file is /var/log/secure, which records authentication events on the Red Hat OS family.
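When the grep returns many lines, a per-program count shows where to look first. A rough sketch, assuming the traditional syslog line layout shown above (program name in the fifth field); the function name errors_by_program is hypothetical.

```shell
# errors_by_program [FILE]: count lines containing "error" or "fail"
# per program name, busiest first. Reads FILE, or stdin if omitted.
# Assumes "Mon DD HH:MM:SS host program[pid]: message" lines.
errors_by_program() {
    grep -hEi 'error|fail' "$@" |
        awk '{
            prog = $5
            sub(/\[.*/, "", prog)    # drop "[pid]:" suffix
            sub(/:$/, "", prog)      # drop trailing colon if no pid
            count[prog]++
        }
        END { for (p in count) print count[p], p }' |
        sort -rn
}

# Example on a live system:
#   errors_by_program /var/log/messages
```

grep -E is used here as the equivalent of the egrep command in the example above.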
8. Memory Check
Check the memory usage of physical and swap memory using the free command
[root@control ~]# free
              total        used        free      shared  buff/cache   available
Mem:        1873552     1542732      140832       24132      189988      166516
Swap:       2273276      569344     1703932
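The columns worth noting:
total – total installed memory
used – memory currently in use
free – memory that is completely unused
buff/cache – memory used by kernel buffers and the page cache, which the kernel can reclaim when applications need it
available – estimate of the memory available for starting new applications without swapping
In the example above, the used column shows 1542732 KiB in use. The memory actually available to applications is the available value, 166516 KiB, rather than the free value, 140832 KiB, because part of buff/cache can be reclaimed. free -m reports the values in MiB and free -g reports them in GiB.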
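To turn the available column into a single number, the Mem: line can be parsed. A small sketch, assuming a free version that prints the available column as the seventh field (as in the output above); mem_available_pct is a hypothetical name.

```shell
# mem_available_pct: read `free` output on stdin and print the
# available memory as a percentage of total (one decimal place).
# Assumes "available" is the 7th field of the "Mem:" line.
mem_available_pct() {
    awk '/^Mem:/ { printf "%.1f\n", $7 / $2 * 100 }'
}

# Example on a live system:
#   free | mem_available_pct
```

For the sample output above this gives 8.9, meaning under 9% of physical memory is available without swapping.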
9. Check the read-only file-systems
In some scenarios, typically after detecting filesystem errors, the kernel remounts a filesystem read-only to avoid data corruption and prevent data loss.
[root@control ~]# grep -w ro /proc/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
Look for read-only filesystems in /proc/mounts. If a filesystem that should be writable is mounted read-only, investigate the cause and remount it if required. In the example, /sys/fs/cgroup is a tmpfs that is normally mounted read-only, so it is not a problem.
The -w option makes grep match ro only as a whole word.
Depending on the issue, a reboot is required in some cases. If the problem is related to an application or database, ask the respective team to restart the service.
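grep -w ro can also match the word ro in a device name or mount path. A slightly stricter check, sketched below with a hypothetical function name readonly_mounts, looks only at the mount options field (the fourth column of /proc/mounts):

```shell
# readonly_mounts [FILE]: print "mountpoint fstype" for every entry
# whose mount options include "ro". FILE defaults to /proc/mounts.
readonly_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2, $3; break }
    }' "${1:-/proc/mounts}"
}

# Example on a live system:
#   readonly_mounts
# Remounting a filesystem read-write once the cause is fixed
# (mount point shown is a placeholder):
#   mount -o remount,rw /mountpoint
```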
10. last command
The last command lists the most recent user logins. If an issue was triggered at a particular time, we can see which users were logged in at that time, which makes it easier to troubleshoot the cause. For example, if a service account or a user made a change that led to an overload or a configuration problem, we can identify it and find the fix quickly.
[root@control ~]# last | head
root     pts/0        192.168.43.98    Mon May 18 12:09   still logged in
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:08   still running
root     pts/0        192.168.43.98    Mon May 18 12:04 - crash  (00:04)
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:04   still running
devops   pts/0        192.168.43.98    Fri May 15 15:04 - 17:26  (02:22)
reboot   system boot  4.18.0-147.8.1.e Fri May 15 15:04 - 17:27  (02:23)
devops   pts/0        192.168.43.98    Thu May 14 19:47 - 00:49  (05:01)
reboot   system boot  4.18.0-147.8.1.e Thu May 14 19:40 - 00:49  (05:09)
root     pts/0        192.168.43.98    Tue May 12 10:19 - crash (2+09:21)
reboot   system boot  4.18.0-147.8.1.e Tue May 12 09:04 - 00:49 (2+15:45)
To check when the last reboot happened:
[root@control ~]# last reboot | head
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:08   still running
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:04   still running
reboot   system boot  4.18.0-147.8.1.e Fri May 15 15:04 - 17:27  (02:23)
reboot   system boot  4.18.0-147.8.1.e Thu May 14 19:40 - 00:49  (05:09)
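To narrow the last output down to the day of an incident, the lines can be filtered by date. This sketch assumes remote logins, where the host appears in the third column as in the output above; console logins without a host column would shift the fields. The function name logins_on is hypothetical.

```shell
# logins_on MONTH DAY: read `last` output on stdin and print the
# unique "user host" pairs for sessions that started on that date.
# Assumes the layout "user tty host Weekday Month Day time ...".
logins_on() {
    awk -v mon="$1" -v day="$2" \
        '$1 != "reboot" && $5 == mon && $6 == day { print $1, $3 }' |
        sort -u
}

# Example on a live system:
#   last | logins_on May 18
```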
11. Checking the console
If the server is not responding, check the server console and generate a vmcore for troubleshooting. If we raise a case with Red Hat or SUSE, they will ask for the vmcore to determine the exact cause of the issue. The core dump is generated only if kdump is configured; check its status with systemctl status kdump.
[root@control ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor prese>
   Active: active (exited) since Mon 2020-05-18 12:04:41 IST; 12s ago
  Process: 1369 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCES>
 Main PID: 1369 (code=exited, status=0/SUCCESS)

May 18 12:04:25 control.balapobi.com systemd[1]: Starting Crash recovery kernel>
May 18 12:04:41 control.balapobi.com kdumpctl[1369]: kexec: loaded kdump kernel
May 18 12:04:41 control.balapobi.com kdumpctl[1369]: Starting kdump: [OK]
May 18 12:04:41 control.balapobi.com systemd[1]: Started Crash recovery kernel >
By default the crash dump is written to /var/crash; the path can be changed in the kdump configuration. Keep the dump location on a separate filesystem, because in some scenarios the crash dump cannot be written if the filesystem is full.
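You can test crash dump generation with the commands below. They crash the kernel immediately, so run them only on a test system or in a maintenance window.
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger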
The system reboots automatically and writes the crash dump. You can find the crash files as below.
[root@control ~]# ls -l /var/crash/127.0.0.1-2020-05-18-02\:38\:31/
total 40592
-rw------- 1 root root 41527645 May 18 12:08 vmcore
-rw-r--r-- 1 root root    35904 May 18 12:08 vmcore-dmesg.txt
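After a crash, or before opening a vendor case, it is handy to list every vmcore that exists under the dump directory. A minimal sketch with a hypothetical helper name list_vmcores; the default path assumes the /var/crash location mentioned above.

```shell
# list_vmcores [DIR]: print the path of every file named "vmcore"
# under DIR (default /var/crash), sorted by path.
list_vmcores() {
    find "${1:-/var/crash}" -type f -name vmcore 2>/dev/null | sort
}

# Example on a live system:
#   list_vmcores
# To confirm a crash kernel is loaded before relying on kdump
# (prints 1 when loaded):
#   cat /sys/kernel/kexec_crash_loaded
```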
Hope this gave you some useful information. Feel free to share your ideas for further improvements.
Refer Basic commands https://www.redhat.com/sysadmin/basic-linux-commands
Refer lsblk commands https://computercarriage.com/2020/05/18/list-block-devices-using-lsblk-command/