Quick Health check on Linux servers v1.0

Health Check

1. Introduction

When users report that a server is slow or that connections are failing, a few commands give a quick picture of the system's state. The first and foremost step is a connectivity check from the local machine (laptop or desktop): ping the server name. If the ping succeeds, connect to the server and run the basic commands below.
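The connectivity check can be wrapped in a small function, shown here as a minimal sketch. The name check_host is hypothetical, and the -c and -W options assume the Linux (iputils) version of ping.

```shell
# Hypothetical helper: report whether a host answers ping.
# -c 3 sends three echo requests; -W 2 waits up to 2 seconds for each reply.
check_host() {
    host="$1"
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host is reachable"
    else
        echo "$host is NOT reachable"
        return 1
    fi
}
```

Call it with the server name, for example: check_host control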

2. uptime command

uptime is used to get how long the system has been running

[root@control ~]# uptime
20:19:16 up 1:11, 1 user, load average: 0.00, 0.00, 0.00

The output shows how many days or hours the server has been up. Check the three load average values, which cover the last 1, 5, and 15 minutes. If they are higher than the normal values for this server, find the application that is consuming resources; when the load average is high, use the top, sar, and free commands for further debugging.
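What counts as "high" depends on the number of CPUs. The sketch below compares the 1-minute load average with the CPU count. The function name load_status and the rule "load above CPU count means high" are assumptions for illustration, not a fixed standard.

```shell
# Hypothetical helper: compare a load average with the number of CPUs.
# $1 = load average (for example the 1-minute value), $2 = CPU count.
load_status() {
    awk -v avg="$1" -v ncpu="$2" 'BEGIN {
        # Assumed rule of thumb: more runnable tasks than CPUs means "high".
        if (avg + 0 > ncpu + 0) print "high"; else print "ok"
    }'
}
```

On a live server the two values can be taken from /proc/loadavg and nproc: load_status "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"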

3. top command

top command is used to display Linux processes

[root@control ~]# top
top - 20:22:09 up 1:14, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 141 total, 1 running, 140 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 6.2 sy, 0.0 ni, 93.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1829.6 total, 107.9 free, 1516.0 used, 205.8 buff/cache
MiB Swap: 2220.0 total, 1675.7 free, 544.2 used. 154.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
831 rabbitmq 20 0 1741920 49668 3732 S 6.2 2.7 0:26.86 beam.smp
1 root 20 0 176272 5552 3880 S 0.0 0.3 0:00.80 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_gp
4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 rcu_par_gp
6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H-kblockd
8 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq

Several fields are worth noting:

%Cpu(s) – current CPU usage, broken down by type (us, sy, ni, id, wa, hi, si, st)
id – percentage of time the CPU is idle
zombie – number of zombie processes, that is, processes that have exited but have not yet been reaped by their parent
MiB Mem – physical memory usage of the server
MiB Swap – swap usage of the server
%CPU – CPU usage of an individual process
%MEM – memory usage of an individual process
PID – process ID, used to identify the process that is consuming CPU or memory
TIME+ – total CPU time the process has used since it started
Tasks – number of processes in total and in the running, sleeping, stopped, and zombie states
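For scripted checks, top can run non-interactively in batch mode (top -b -n 1). As a rough sketch, the hypothetical function zombie_count below pulls the zombie count out of the Tasks line. It assumes the summary line looks like the one in the example above.

```shell
# Hypothetical helper: read `top -b -n 1` output on stdin and print the
# number that precedes the word "zombie" on the "Tasks:" summary line.
zombie_count() {
    awk '/^Tasks:/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^zombie/) { print $(i - 1); exit }
    }'
}
```

Usage: top -b -n 1 | zombie_count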

4. disk free command

The disk free command (df) displays the mounted filesystems. Check the Use% column to see whether any filesystem is full or close to 100%; if one is, find and clear the files that are consuming the most space.

[root@control ~]# df -Ph
Filesystem Size Used Avail Use% Mounted on
devtmpfs 900M 0 900M 0% /dev
tmpfs 915M 204K 915M 1% /dev/shm
tmpfs 915M 8.5M 907M 1% /run
tmpfs 915M 0 915M 0% /sys/fs/cgroup
/dev/mapper/cl-root 29G 6.3G 23G 22% /
/dev/sda1 976M 185M 724M 21% /boot
tmpfs 183M 0 183M 0% /run/user/1001
tmpfs 183M 0 183M 0% /run/user/1000

-P Use the POSIX output format (one line per filesystem)
-h Human-readable sizes such as 1K, 200M, 1G, 1T
-T Display the filesystem type, such as ext4, xfs, or tmpfs
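Instead of reading the Use% column by eye, the output can be filtered. This is a minimal sketch: the function name fs_over_threshold and the default limit of 90 are assumptions, and it expects df -P style output with mount points that contain no spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and print the mount point
# and Use% of every filesystem at or above the limit in $1 (default 90).
fs_over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)              # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6 " " $5
    }'
}
```

Usage: df -P | fs_over_threshold 80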

5. ps command

For example, if the httpd process is consuming too much CPU or memory, list its processes with the command below.

[devops@server1 ~]$ ps -ef |grep httpd
root 920 1 0 11:07 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 1067 920 0 11:07 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 1068 920 0 11:07 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 1069 920 0 11:07 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
apache 1073 920 0 11:07 ? 00:00:00 /usr/sbin/httpd -DFOREGROUND
devops 2346 2322 0 11:08 pts/0 00:00:00 grep --color=auto httpd
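The ps -ef listing shows which processes exist but not how much CPU or memory each one uses. As a sketch, ps can print those columns with -o and the result can be sorted. The function name top_cpu is hypothetical; pid, pcpu, pmem, and comm are standard ps output fields.

```shell
# Hypothetical helper: read `ps -eo pid,pcpu,pmem,comm` output on stdin and
# print the N processes using the most CPU (N = $1, default 5).
top_cpu() {
    # Drop the header line, sort numerically on the %CPU column, keep N rows.
    tail -n +2 | sort -k2,2nr | head -n "${1:-5}"
}
```

Usage: ps -eo pid,pcpu,pmem,comm | top_cpu 5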

6. dmesg command

The dmesg command prints messages from the kernel. The kernel logs a message whenever something changes in the hardware or device drivers, so use it to see whether any recent change has been made on the server.
Examples are disks being added or removed, a network interface going up or down, or an application consuming so much memory that the kernel's OOM (out-of-memory) killer was invoked.

[root@control ~]# dmesg | tail
[    7.177177] RAPL PMU: hw unit of domain psys 2^-0 Joules
[    7.394515] intel_pmc_core:  initialized
[    8.733504] snd_intel8x0 0000:00:05.0: white list rate for 1028:0177 is 48000
[    8.846091] EXT4-fs (dm-2): recovery complete
[    8.846094] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)
[    8.872804] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
[   12.491963] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
[   12.504901] IPv6: ADDRCONF(NETDEV_UP): enp0s3: link is not ready
[   12.505406] e1000: enp0s3 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
[   12.505654] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s3: link becomes ready

The most recent hardware changes usually appear at the end of the dmesg output. For example, a network disconnection, or a memory or storage device going away, will show up there.
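On a busy server the interesting lines are easy to miss, so it can help to filter the kernel messages. The sketch below uses a hypothetical function, kernel_warnings, and the keyword list is an assumption that should be extended for your own hardware and drivers.

```shell
# Hypothetical helper: read kernel messages on stdin and keep only the lines
# that hint at trouble. The keyword list is an assumed starting point.
kernel_warnings() {
    grep -Ei 'out of memory|i/o error|link is down|link is not ready|read-only'
}
```

Usage: dmesg | kernel_warnings | tail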

7. Check the OS system log file

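The system log file holds a lot of information about user authentication, errors, failures, and session logs, plus any custom logs we define. What gets written there is decided by the syslog daemon's configuration (for example /etc/rsyslog.conf on systems running rsyslog), while /etc/logrotate.conf controls how the log files are rotated.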

The Red Hat OS family writes logs to /var/log/messages and the Debian OS family writes them to /var/log/syslog. A few more files are also useful for troubleshooting.

[root@control log]# egrep -i 'error|fail' /var/log/messages
May 18 02:38:52 control augenrules[951]: failure 1
May 18 02:38:52 control augenrules[951]: failure 1
May 18 02:38:53 control rngd[984]: Failed to init entropy source hwrng
May 18 02:38:57 control systemd[1659]: Failed to fully start up daemon: Permission denied
May 18 02:38:57 control systemd[1]: user@1001.service: Failed with result 'protocol'.
May 18 02:38:57 control systemd[1]: Failed to start User Manager for UID 1001.
May 18 02:39:03 control vboxdrv.sh[983]: vboxdrv.sh: failed: Look at /var/log/vbox-setup.log to find out what went wrong.
May 18 02:39:03 control vboxdrv.sh[2179]: failed: Look at /var/log/vbox-setup.log to find out what went wrong.
May 18 02:39:03 control systemd[1]: vboxdrv.service: Failed with result 'exit-code'.
May 18 02:39:03 control systemd[1]: Failed to start VirtualBox Linux kernel module.

Adjust the keywords in the grep pattern to suit the issue being investigated. Another important file is /var/log/secure, which holds authentication-related messages.
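When the grep returns many lines, a per-program count shows where to look first. This is a rough sketch: the function name errors_by_program is hypothetical, and it assumes the traditional syslog layout "Mon DD HH:MM:SS host program[pid]: message" seen in the example above.

```shell
# Hypothetical helper: read syslog lines on stdin, keep those containing
# "error" or "fail" (case-insensitive), and count them per program name.
errors_by_program() {
    grep -Ei 'error|fail' |
        awk '{
            name = $5                 # fifth field, e.g. "systemd[1]:"
            sub(/\[.*/, "", name)     # drop "[pid]:" if present
            sub(/:$/, "", name)       # drop a trailing colon otherwise
            count[name]++
        }
        END { for (n in count) print count[n], n }' |
        sort -k1,1nr -k2,2
}
```

Usage: errors_by_program < /var/log/messages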

8. Memory Check

Check the memory usage of physical and swap memory using the free command

[root@control ~]# free
              total        used        free      shared  buff/cache   available
Mem:        1873552     1542732      140832       24132      189988      166516
Swap:       2273276      569344     1703932

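Columns worth noting:

total – total installed memory
used – memory currently in use
free – memory that is completely unused
buff/cache – memory used by kernel buffers and the page cache, most of which can be reclaimed when needed
available – estimate of the memory available for starting new applications without swapping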

In the above example the used column shows 1542732 KiB in use. The memory actually available to applications is 166516 KiB (the available column), not the 140832 KiB shown in the free column. Use free -m to get the values in MiB and free -g to get them in GiB.
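The available figure is easier to judge as a share of total memory. The sketch below computes that percentage; the function name mem_available_pct is hypothetical, and it assumes a free version that prints the available column as the seventh field of the Mem: line, as in the example above.

```shell
# Hypothetical helper: read `free` output on stdin and print the "available"
# column as a percentage of total physical memory, to one decimal place.
mem_available_pct() {
    awk '/^Mem:/ { printf "%.1f\n", $7 * 100 / $2 }'
}
```

Usage: free | mem_available_pct
With the values from the example (166516 available out of 1873552 total) this prints 8.9.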

9. Check the read-only file-systems

In a few scenarios, such as filesystem or disk errors, the kernel remounts a filesystem read-only in order to avoid data corruption and protect against data loss.

[root@control ~]# grep -w ro /proc/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0

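Check /proc/mounts for read-only filesystems. If a filesystem that should be writable is mounted read-only, investigate the cause and remount it if required. Some entries, such as the tmpfs on /sys/fs/cgroup in the example above, are read-only by design.

The -w option makes grep match ro only as a whole word.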
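grep -w looks for the word ro anywhere on the line. A slightly stricter check, sketched below, looks only at the mount options, which are the fourth field of each /proc/mounts line. The function name readonly_mounts is hypothetical.

```shell
# Hypothetical helper: read /proc/mounts-style lines on stdin and print the
# device and mount point of every entry whose option list contains "ro".
readonly_mounts() {
    awk '{
        n = split($4, opts, ",")      # fourth field: comma-separated options
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2; break }
    }'
}
```

Usage: readonly_mounts < /proc/mounts
A filesystem found this way can be remounted read-write with mount -o remount,rw followed by the mount point, once the underlying cause has been dealt with.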

Depending on the issue, a reboot is required in a few cases. If the problem is related to an application or database, ask the respective team to restart the service.

10. last command

The last command lists the most recently logged-in users. If an issue was triggered at a particular time, we can see which users were logged in at that time, which makes it easier to troubleshoot what may have caused it. For example, if a service account or user made changes that led to an overload or a configuration change, we can find the fix more quickly.

[root@control ~]# last | head
root     pts/0        192.168.43.98    Mon May 18 12:09   still logged in
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:08   still running
root     pts/0        192.168.43.98    Mon May 18 12:04 - crash  (00:04)
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:04   still running
devops   pts/0        192.168.43.98    Fri May 15 15:04 - 17:26  (02:22)
reboot   system boot  4.18.0-147.8.1.e Fri May 15 15:04 - 17:27  (02:23)
devops   pts/0        192.168.43.98    Thu May 14 19:47 - 00:49  (05:01)
reboot   system boot  4.18.0-147.8.1.e Thu May 14 19:40 - 00:49  (05:09)
root     pts/0        192.168.43.98    Tue May 12 10:19 - crash (2+09:21)
reboot   system boot  4.18.0-147.8.1.e Tue May 12 09:04 - 00:49 (2+15:45)

To check when the last reboot happened:

[root@control ~]# last reboot | head
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:08   still running
reboot   system boot  4.18.0-147.8.1.e Mon May 18 12:04   still running
reboot   system boot  4.18.0-147.8.1.e Fri May 15 15:04 - 17:27  (02:23)
reboot   system boot  4.18.0-147.8.1.e Thu May 14 19:40 - 00:49  (05:09)

11. Checking the console

If the server is not responding, check the console of the server and generate a vmcore for troubleshooting. If we raise a case with Red Hat or SUSE, they will ask for the vmcore to find the exact reason for the issue. The core dump is generated only if kdump has been configured. You can check its status with systemctl status kdump.

[root@control ~]# systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor prese>
   Active: active (exited) since Mon 2020-05-18 12:04:41 IST; 12s ago
  Process: 1369 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCES>
Main PID: 1369 (code=exited, status=0/SUCCESS)
May 18 12:04:25 control.balapobi.com systemd[1]: Starting Crash recovery kernel>
May 18 12:04:41 control.balapobi.com kdumpctl[1369]: kexec: loaded kdump kernel
May 18 12:04:41 control.balapobi.com kdumpctl[1369]: Starting kdump: [OK]
May 18 12:04:41 control.balapobi.com systemd[1]: Started Crash recovery kernel >

The crash dump is generated in /var/crash by default; the path can be changed in the kdump configuration. The dump location should be on a separate filesystem, since in a few scenarios the crash dump will not be generated if the filesystem is full.
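Besides systemctl status kdump, the kernel itself reports whether a crash kernel is loaded. The sketch below reads /sys/kernel/kexec_crash_loaded, which contains 1 when one is loaded. The function name crash_kernel_loaded and the optional path argument are assumptions for illustration.

```shell
# Hypothetical helper: report whether a crash (kdump) kernel is loaded.
# Reads /sys/kernel/kexec_crash_loaded by default; a different file can be
# passed as $1, which is mainly useful for testing.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    if [ "$(cat "$state_file" 2>/dev/null)" = "1" ]; then
        echo "crash kernel loaded"
    else
        echo "crash kernel NOT loaded"
        return 1
    fi
}
```

Usage: crash_kernel_loaded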

You can test the crash dump by executing the commands below. They crash the running kernel immediately, so use them only on a test system.

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

The system reboots automatically and the crash dump is created automatically. You can find the crash files as shown below.

[root@control ~]# ls -l /var/crash/127.0.0.1-2020-05-18-02\:38\:31/
total 40592
-rw------- 1 root root 41527645 May 18 12:08 vmcore
-rw-r--r-- 1 root root    35904 May 18 12:08 vmcore-dmesg.txt

Hope you found some useful information here. You can share your ideas for further improvements.

Refer to basic commands: https://www.redhat.com/sysadmin/basic-linux-commands

Refer to the lsblk command: https://computercarriage.wordpress.com/2020/05/18/list-block-devices-using-lsblk-command/

ComputerCarriage
