Troubleshooting Unexpected Host Reboots


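### Check resource usage

Cloud platforms and traditional IDC hosts usually come with a monitoring platform.

Review the historical CPU, memory, disk I/O, and bandwidth metrics to determine whether abnormal resource consumption caused the reboot.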


Alternatively, log in to the machine:

  • Check memory usage with the free command
$ free
              total        used        free      shared  buff/cache   available 
Mem:        3880172      187996     1974484         544     1717692     3407556 
Swap:             0           0           0
  • Use the top command to check whether the processes using the most memory or CPU look abnormal
$ top
top - 18:17:43 up 26 days,  7:19,  1 user,  load average: 0.00, 0.01, 0.05
Tasks:  94 total,   1 running,  93 sleeping,   0 stopped,   0 zombie 
%Cpu(s):  0.2 us,  0.2 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 
KiB Mem :  3880172 total,  1973048 free,   189016 used,  1718108 buff/cache 
KiB Swap:        0 total,        0 free,        0 used.  3406504 avail Mem   
PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND    
	1 root      20   0   43576   3964   2580 S   0.0  0.1   1:26.70 systemd    
	2 root      20   0       0      0      0 S   0.0  0.0   0:00.68 kthreadd    
	4 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H    
	6 root      20   0       0      0      0 S   0.0  0.0   0:04.58 ksoftirqd/0    
	7 root      rt   0       0      0      0 S   0.0  0.0   0:07.54 migration/0    
	8 root      20   0       0      0      0 S   0.0  0.0   0:00.00 rcu_bh    
	9 root      20   0       0      0      0 S   0.0  0.0   2:26.47 rcu_sched   
	10 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 lru-add-drain   
	11 root      rt   0       0      0      0 S   0.0  0.0   0:06.27 watchdog/0   
	12 root      rt   0       0      0      0 S   0.0  0.0   0:05.24 watchdog/1   
	13 root      rt   0       0      0      0 S   0.0  0.0   0:07.68 migration/1   
	14 root      20   0       0      0      0 S   0.0  0.0   0:04.44 ksoftirqd/1   
	16 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/1:0H   
	18 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kdevtmpfs   
	19 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 netns   
	20 root      20   0       0      0      0 S   0.0  0.0   0:00.48 khungtaskd   
	21 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback   
	22 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kintegrityd   
	23 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset   
	24 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset   
	25 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 bioset   
	26 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kblockd
  • Count TCP connections per remote address with netstat (the command below lists the ten addresses with the most connections, in any state)
$ netstat -ant | awk '{split($5, arr, ":"); print arr[1]}' | sort | uniq -c | sort -nr | head
			12 0.0.0.0      
			10 58.33.27.210      
			7 180.164.153.113      
			1 169.254.0.55
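
The one-liner above counts sockets in every state, including listeners. If only established connections matter, a filter on the state column can be added. The following is a minimal sketch, not part of the original procedure: the function name count_established is made up for illustration, and it assumes the Linux net-tools column layout (state in column 6) and IPv4 addresses only.

```shell
#!/usr/bin/env bash
# count_established: read `netstat -ant` output on stdin and print
# "<count> <remote IPv4>" for ESTABLISHED connections, busiest first.
# Assumption: net-tools layout, i.e. $1 = proto, $5 = foreign address, $6 = state.
count_established() {
  awk '$1 == "tcp" && $6 == "ESTABLISHED" {
         split($5, a, ":")      # a[1] = remote IP, a[2] = remote port
         n[a[1]]++
       }
       END { for (ip in n) print n[ip], ip }' |
    sort -k1,1nr -k2,2
}

# Usage (on the host being checked):
#   netstat -ant | count_established | head
```

An address that suddenly holds hundreds of established connections is worth checking against the time of the reboot.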

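Use iostat, vmstat, or lsof to inspect I/O on the system disk and find processes with abnormal reads or writes (iostat and vmstat report device-level and system-level activity; lsof shows which processes have which files open).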
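
To attribute disk activity to individual processes, the per-process counters in /proc/<pid>/io can be read directly. This is a rough sketch under a few assumptions: the function name top_io_procs and its optional arguments are invented here, the counters are cumulative since each process started (not a rate), and reading other users' processes normally requires root.

```shell
#!/usr/bin/env bash
# top_io_procs [proc_root] [count]
# Print "read_bytes write_bytes pid comm" per process, largest writers first.
# proc_root defaults to /proc; it is a parameter only so the function can be
# pointed at a test directory.
top_io_procs() {
  local root=${1:-/proc} d pid rb wb comm
  for d in "$root"/[0-9]*; do
    [ -r "$d/io" ] || continue            # skip processes we may not read
    pid=${d##*/}
    rb=$(awk '$1 == "read_bytes:"  {print $2}' "$d/io" 2>/dev/null)
    wb=$(awk '$1 == "write_bytes:" {print $2}' "$d/io" 2>/dev/null)
    comm=$(cat "$d/comm" 2>/dev/null)
    [ -n "$rb" ] && [ -n "$wb" ] && echo "$rb $wb $pid $comm"
  done | sort -k2,2nr | head -n "${2:-10}"
}

# Usage (as root):
#   top_io_procs
```

Running it twice a few seconds apart and comparing the numbers gives a crude view of which process is currently writing.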

### Check the Linux system logs

System logs are usually found under /var/log:

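/var/log/messages general system and service error messages of the Linux operating system
/var/log/secure security-related log messages
/var/log/maillog mail-related log messages
/var/log/cron log messages from scheduled tasks
/var/log/spooler log messages related to UUCP and news devices
/var/log/boot.log messages about daemons starting and stopping
/var/log/wtmp permanent record of every user login and logout and of system startup and shutdown events
/var/run/utmp information about the users currently logged in
/var/log/btmp failed login attempts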
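  • In most cases the file to analyze is the operating system log, /var/log/messages (on Debian or Ubuntu the equivalent file is /var/log/syslog)

Run grep -E -i "panic|error|exception|shutdown" /var/log/messages

and check whether the output shows anything abnormal.

For example, an entry like the one below means that someone shut the system down deliberately, from the console or with the power button: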

Mar 01 23:12:34 hostname shutdown: shutting down for system halt

You can pull out more related entries with:

$ grep -E -iv ': starting|kernel: .*: Power Button|watching system buttons|Stopped Cleaning Up|Started Crash recovery kernel' \
    /var/log/messages /var/log/syslog /var/log/apcupsd* \
    | grep -E -iw 'recover[a-z]*|power[a-z]*|shut[a-z ]*down|rsyslogd|ups'

After an unexpected power loss or a hardware failure, the file systems are not unmounted cleanly, so on the next boot you will see entries like these:

EXT4-fs ... INFO: recovery required ...  
Starting XFS recovery filesystem ... 
systemd-fsck: ... recovering journal 
systemd-journald: File /var/log/journal/.../system.journal corrupted or uncleanly shut down, renaming and replacing.

When a user shuts the machine down with the power button, you get entries like these:

systemd-logind: Power key pressed. 
systemd-logind: Powering Off... 
systemd-logind: System is powering down.
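
The patterns quoted in this section can be checked in one pass. The sketch below is only an illustration: the function name classify_shutdown and the labels it prints are made up, and the match strings are limited to the sample log lines shown above, so real logs may need more patterns.

```shell
#!/usr/bin/env bash
# classify_shutdown: read log text on stdin and report which of the
# patterns from this section appear in it.
classify_shutdown() {
  awk '
    /Power key pressed/                                       { power = 1 }
    /shutting down for system halt|System is powering down/   { clean = 1 }
    /recovery required|Starting XFS recovery|recovering journal|uncleanly shut down/ { unclean = 1 }
    END {
      if (power)   print "power-key"
      if (clean)   print "clean-shutdown"
      if (unclean) print "unclean-recovery"
      if (!power && !clean && !unclean) print "no-match"
    }'
}

# Usage:
#   classify_shutdown < /var/log/messages
```

An "unclean-recovery" result with no "clean-shutdown" points toward a power loss, hardware fault, or kernel crash rather than an operator action.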

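### Analyze reboot history and logins

Use last reboot to list recent system boots. On each line the first time is the boot time (in this sample it apparently has to be reduced by 8 hours to get the real time, which would also explain the negative durations) and the second is the current time: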

reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 21:56 - 14:40  (-7:-15) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 21:49 - 14:40  (-7:-8) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 21:41 - 14:40  (-7:00) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 21:25 - 14:40  (-6:-44) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 21:15 - 14:40  (-6:-34) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 20:57 - 14:40  (-6:-16) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 20:53 - 14:40  (-6:-13) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 20:49 - 14:40  (-6:-8) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 20:39 - 14:40  (-5:-58) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 18:41 - 14:40  (-4:00) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 17:57 - 14:40  (-3:-16) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 17:23 - 14:40  (-2:-42) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 17:11 - 14:40  (-2:-30) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 16:54 - 14:40  (-2:-13) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 16:27 - 14:40  (-1:-46) 
reboot   system boot  3.10.0-957.5.1.e Fri Dec  4 15:44 - 14:40  (-1:-3)

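Options of the last command:

-a	Show the hostname or IP address in the last column
-d	Translate the IP address of remote logins back into a hostname
-f	Read the given file instead of the default /var/log/wtmp; pointing it at /var/log/btmp shows failed login attempts, including failed ssh logins
-i	Show the IP address in numbers-and-dots notation instead of the hostname
-o	Read an old-type wtmp file (written by linux-libc5 applications)
-n	<number> or -<number>: limit the output to that many lines
-w	Display full user and domain names in the output
-R	Suppress the hostname field
-t	Show the state of logins as of the given time (YYYYMMDDHHMMSS)
-x	Show system shutdown entries and run level changes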
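
With -x, last also prints shutdown records, which makes it possible to tell whether a boot followed an orderly shutdown. The sketch below is one way to do that and is not from the original article: the function name boot_history and its clean/unclean/unknown labels are invented, and it assumes the usual newest-first output with the date in fields 5 to 8.

```shell
#!/usr/bin/env bash
# boot_history: read `last -x` output (newest first) on stdin.
# For each boot record, print its timestamp and:
#   clean   - a shutdown record exists between this boot and the previous one
#   unclean - no shutdown record before this boot (crash, power loss, reset)
#   unknown - oldest boot in the file, nothing earlier to compare against
boot_history() {
  awk '
    $1 == "reboot" {
      if (pending != "") print pending, (clean ? "clean" : "unclean")
      pending = $5 " " $6 " " $7 " " $8
      clean = 0
      next
    }
    $1 == "shutdown" && pending != "" { clean = 1 }
    END { if (pending != "") print pending, (clean ? "clean" : "unknown") }
  '
}

# Usage:
#   last -x | boot_history
```

Boots reported as "unclean" are the ones worth correlating with the system log and with any dumps under /var/crash.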

Use last -n 5 -a -i to review the most recent logins and rule out an intrusion:

root     pts/5        Fri Dec  4 14:42   still logged in    58.33.27.210 
root     pts/3        Fri Dec  4 14:38   still logged in    58.33.27.210 
root     pts/3        Fri Dec  4 14:23 - 14:23  (00:00)     123.123.6.205 
root     pts/4        Fri Dec  4 14:12 - 14:43  (00:31)     192.168.0.67 
root     pts/3        Fri Dec  4 14:08 - 14:20  (00:12)     1.202.240.26

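### Investigate with kdump and crash

  • kdump

kdump is a kernel crash dump mechanism: when the kernel crashes, it saves the contents of system memory for later analysis. kdump is built on kexec.

crash is a tool for interactively analyzing a running Linux system or the core dump left behind by a kernel crash.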

[Figures: how kdump works]

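  • crash

crash was developed by Red Hat engineers and is mainly used to analyze Linux kernel dump files offline. It integrates gdb and is very capable: it can show stack traces, the dmesg log, and kernel data structures, disassemble code, and more. crash supports the dump formats produced by several tools, such as kdump, LKCD, netdump, and diskdump, and it can also analyze kernel dumps generated on Xen and KVM virtual machines. crash can debug a live system as well: just run crash with no dump file and it reads the running kernel's memory (on Ubuntu, for example, through /proc/kcore).

These two tools only help with an unexpected reboot if the problem is going to happen again, because kdump has to be in place before the crash occurs.

Once configured, kdump saves the system memory at the moment of the kernel crash for later analysis; by default it writes the crash data under /var/crash/.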
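
Before waiting for the next crash, it helps to confirm that kdump is actually armed. The check below is a sketch under stated assumptions: the function name kdump_ready is invented, and it relies on two things that hold on typical Linux systems, namely that memory is reserved with a crashkernel= kernel parameter and that /sys/kernel/kexec_crash_loaded reads 1 once the capture kernel is loaded.

```shell
#!/usr/bin/env bash
# kdump_ready [cmdline_file] [loaded_flag_file]
# Both arguments default to the real kernel interfaces; they are parameters
# only so the check can be exercised against test files.
kdump_ready() {
  local cmdline=${1:-/proc/cmdline} flag=${2:-/sys/kernel/kexec_crash_loaded}
  if ! grep -q 'crashkernel=' "$cmdline" 2>/dev/null; then
    echo "no crashkernel= on the kernel command line"
    return 1
  fi
  if [ "$(cat "$flag" 2>/dev/null)" != "1" ]; then
    echo "crash kernel not loaded"
    return 1
  fi
  echo "kdump looks ready"
}

# Usage:
#   kdump_ready || echo "fix the kdump setup before the next crash"
```

If either check fails, no vmcore will be written when the kernel panics, so there will be nothing for crash to analyze.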

Analyze the dump with the crash command:

crash /usr/lib/debug/lib/modules/xxx/vmlinux /var/crash/xxx/vmcore

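To debug a dump file, crash takes two arguments on the command line: the debug kernel and the dump file. The dump file is the file written when the kernel crashed; the debug kernel is installed by the kernel debug-info package, whose name varies slightly between distributions. Taking RHEL as an example:

The debug kernel file only exists after the kernel-debuginfo package has been installed, and its version must match the kernel that produced the dump.

It is usually installed with yum:

$ yum install http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-common-x86_64-3.10.0-957.10.1.el7.x86_64.rpm
$ yum install http://debuginfo.centos.org/7/x86_64/kernel-debuginfo-3.10.0-957.10.1.el7.x86_64.rpm
$ crash /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux /var/crash/127.0.0.1-2019-07-21-17\:07\:17/vmcore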
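
Typing the two paths by hand is error-prone, so a small helper can assemble the command. This is a sketch only: latest_vmcore and crash_cmd are hypothetical names, the kernel release has to be supplied by you (it must be the release of the kernel that crashed, not necessarily the one running now), and the directory layout is assumed to match the /var/crash/<host-timestamp>/vmcore pattern shown above.

```shell
#!/usr/bin/env bash
# latest_vmcore [crash_dir]: print the most recently modified vmcore.
latest_vmcore() {
  local dir=${1:-/var/crash}
  ls -1dt "$dir"/*/vmcore 2>/dev/null | head -n 1
}

# crash_cmd <kernel-release> [crash_dir]: print the crash command line for
# the newest dump. The path is printed unquoted, so quote or escape it if
# the directory name contains characters such as ':'.
crash_cmd() {
  local release=$1 core
  core=$(latest_vmcore "${2:-/var/crash}")
  [ -n "$core" ] || { echo "no vmcore found" >&2; return 1; }
  printf 'crash /usr/lib/debug/lib/modules/%s/vmlinux %s\n' "$release" "$core"
}

# Usage:
#   crash_cmd 3.10.0-514.el7.x86_64
```

Printing the command instead of running it leaves room to verify that the vmlinux file for that release is actually installed.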

After running the command you will see the summary that crash prints for the dump:

crash 7.1.5-2.el7 
Copyright (C) 2002-2016  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation 
Copyright (C) 1999-2006  Hewlett-Packard Co 
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited 
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K. 
Copyright (C) 2005, 2011  NEC Corporation 
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc. 
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc. 
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

			 KERNEL: /usr/lib/debug/lib/modules/3.10.0-514.el7.x86_64/vmlinux    
			 DUMPFILE: /var/crash/127.0.0.1-2019-07-21-17:07:17/vmcore  [PARTIAL DUMP]        
			 CPUS: 56        
			 DATE: Sun Jul 21 17:07:00 2019      
			 UPTIME: 8 days, 03:43:48 
			 LOAD AVERAGE: 1.98, 1.73, 1.73       
			 TASKS: 4444    
			 NODENAME: hyhive     
			 RELEASE: 3.10.0-514.el7.x86_64     
			 VERSION: #1 SMP Tue Nov 22 16:42:41 UTC 2016     
			 MACHINE: x86_64  (2600 Mhz)      
			 MEMORY: 255.9 GB       
			 PANIC: "BUG: unable to handle kernel paging request at ffff8800fc14dfb0"         
			 PID: 21793     
			 COMMAND: "sh"        
			 TASK: ffff883f97751f60  [THREAD_INFO: ffff8830e1b50000]         
			 CPU: 49       
			 STATE: TASK_RUNNING (PANIC) 
crash>

Inside crash, enter help to list all of its subcommands:

crash> help 
*              files          mach           repeat         timer
alias          foreach        mod            runq           tree
ascii          fuser          mount          search         union
bt             gdb            net            set            vm
btop           help           p              sig            vtop
dev            ipcs           ps             struct         waitq
dis            irq            pte            swap           whatis
eval           kmem           ptob           sym            wr
exit           list           ptov           sys            q
extend         log            rd             task

crash version: 7.1.9-2   gdb version: 7.6 
For help on any command above, enter "help <command>". 
For help on input options, enter "help input". 
For help on output options, enter "help output".

For example:

bt shows the kernel stack backtrace:

crash> bt
PID: 3320   TASK: ffff88017092dee0  CPU: 0   COMMAND: "qemu-kvm-2.6"
 #0 [ffff88007b2aba48] machine_kexec at ffffffff8105c4cb
 #1 [ffff88007b2abaa8] __crash_kexec at ffffffff81104a32
 #2 [ffff88007b2abb78] crash_kexec at ffffffff81104b20
 #3 [ffff88007b2abb90] oops_end at ffffffff816880f8
 #4 [ffff88007b2abbb8] no_context at ffffffff8167829a
 #5 [ffff88007b2abc08] __bad_area_nosemaphore at ffffffff81678330
 #6 [ffff88007b2abc50] bad_area_nosemaphore at ffffffff8167849a
 #7 [ffff88007b2abc60] __do_page_fault at ffffffff8168afbe
 #8 [ffff88007b2abcc0] do_page_fault at ffffffff8168b165
 #9 [ffff88007b2abcf0] page_fault at ffffffff81687388
    [exception RIP: unknown or invalid address]
    RIP: 00007ffd81487700  RSP: ffff88007b2abda0  RFLAGS: 00010002
    RAX: ffff880175733e38  RBX: 0000000075733f28  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 0000000000000003  RDI: ffff880175733e38
    RBP: ffff88007b2abde0   R8: 0000000000000000   R9: 0000000000000000
    R10: 00000000000103c0  R11: 0000000000000293  R12: ffffffff81a93648
    R13: 000055deb2c80ef8  R14: 0000000000000000  R15: 0000000000000003
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#10 [ffff88007b2abda0] __wake_up_common at ffffffff810ba588
#11 [ffff88007b2abde8] __wake_up at ffffffff810bd4f9
#12 [ffff88007b2abe20] __vga_put at ffffffff8143535e
#13 [ffff88007b2abe48] vga_put at ffffffff814356ff
#14 [ffff88007b2abe70] vfio_pci_vga_rw at ffffffffc04a8e54 [vfio_pci]
#15 [ffff88007b2abed0] vfio_pci_rw at ffffffffc04a55c1 [vfio_pci]
#16 [ffff88007b2abee0] vfio_pci_read at ffffffffc04a594c [vfio_pci]
#17 [ffff88007b2abef0] vfio_device_fops_read at ffffffffc048b233 [vfio]
#18 [ffff88007b2abf00] vfs_read at ffffffff81200b9c
#19 [ffff88007b2abf30] sys_pread64 at ffffffff81201c32
#20 [ffff88007b2abf80] system_call_fastpath at ffffffff8168fe49
    RIP: 00007feac42e1fc3  RSP: 00007feab9b6b908  RFLAGS: 00000206
    RAX: 0000000000000011  RBX: ffffffff8168fe49  RCX: 0000000000000025
    RDX: 0000000000000001  RSI: 00007feab9b6b700  RDI: 000000000000002b
    RBP: 0000000000000001   R8: 0000000000000008   R9: 00000000000000ff
    R10: 00000800000003da  R11: 0000000000000293  R12: 0000000000000000
    R13: 000000000000001a  R14: 0000000000000001  R15: 0000559178fb02b0
    ORIG_RAX: 0000000000000011  CS: 0033  SS: 002b

ps shows the state of the processes in the system (lines marked with > are the tasks that were active on a CPU):

crash> ps | grep RU
      0      0   0  ffffffff819c9480  RU   0.0       0       0  [swapper/0]
      0      0   1  ffff880177609fa0  RU   0.0       0       0  [swapper/1]
>     0      0   2  ffff88017760af70  RU   0.0       0       0  [swapper/2]
      0      0   3  ffff88017760bf40  RU   0.0       0       0  [swapper/3]
>  1889   1880   3  ffff8801727b3f40  RU   0.3  386192   21456  X
>  3320   3200   0  ffff88017092dee0  RU  47.7 3914004 2986980  qemu-kvm-2.6
>  3652   1879   1  ffff880053795ee0  RU   0.0  106640    2640  qemu-img
crash> ps | grep 3200
   2276   1914   0  ffff880063320000  IN   0.0  784512    3096  gmain
   3200   1879   1  ffff880174086eb0  IN   0.0   76728    1792  uniqb-runtime
   3311   3200   1  ffff88017705cf10  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3316   3200   1  ffff88004f888fd0  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3319   3200   0  ffff88004f88bf40  IN  47.7 3914004 2986980  qemu-kvm-2.6
>  3320   3200   0  ffff88017092dee0  RU  47.7 3914004 2986980  qemu-kvm-2.6
   3321   3200   3  ffff88017092af70  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3322   3200   1  ffff880170929fa0  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3323   3200   0  ffff88017092cf10  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3325   3200   3  ffff8800631d2f70  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3331   3200   0  ffff8801745daf70  IN  47.7 3914004 2986980  threaded-ml
   3333   3200   3  ffff88004f88af70  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3334   3200   2  ffff880174082f70  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3337   3200   2  ffff880174080fd0  IN  47.7 3914004 2986980  qemu-kvm-2.6
   3651   3200   3  ffff880077a30fd0  UN   0.0  142040    1680  sum

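vm shows the virtual memory information of the current context.

Analyzing the abnormal processes and stack traces in crash lets you dig further toward the real cause of the unexpected reboot.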
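
When several dumps have to be compared, the same handful of subcommands can be run non-interactively. The sketch below assumes that your crash build accepts -i <file> (run the commands in a file) and -s (skip the startup banner); both are documented in the crash man page as far as I know, but check your version. The function name write_crash_script and the choice of subcommands are illustrative.

```shell
#!/usr/bin/env bash
# write_crash_script <file>: write a list of crash subcommands, one per line.
# All of them appear in the help listing above.
write_crash_script() {
  cat > "$1" <<'EOF'
sys
bt
log
ps
exit
EOF
}

# Usage (paths are placeholders for your own vmlinux and vmcore):
#   write_crash_script /tmp/crash.cmds
#   crash -s -i /tmp/crash.cmds <vmlinux> <vmcore> > crash-report.txt
```

The resulting text report can be attached to a ticket or compared between crashes to see whether the same task and call chain show up each time.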


Author: Kevin
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kevin when reposting.