AIX System Crash Analysis Tutorial (PPT)
- Format: pptx
- Size: 1.63 MB
- Pages: 61
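Basic Fault Isolation Methods for RS/6000 Servers

I. Defining the fault
- Establish exactly what problem the system has.
- What can the system do now, and what can it not do?
- When did the fault occur?
- Was anything done that differs from normal operation?
- Is there a pattern to the fault? Does it occur at fixed times or at random? How often?
- Is one machine failing, or several? Are the symptoms the same?
- Were any changes made recently, such as installing new hardware or software, or changing system settings?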
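II. Collecting fault information
1) Collecting fault information is essential for judging and diagnosing the cause of a fault and for repairing the system.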
2) System error log (errlog)
- The errdemon process starts automatically at system boot.
- It records hardware, software, and other operational information.
- The log file is /var/adm/ras/errlog; it can be backed up or copied to another machine for analysis.

Using the errpt command (ordinary users can run it as well)

    # errpt | more              list short error summaries

    ERROR_ID  TIMESTAMP   T C RESOURCE_NAME ERROR_DESCRIPTION
    192AC071  0723100300  T O errdemon      Error logging turned off
    0E017ED1  0720131000  P H mem2          Memory failure
    9DBCFDEE  0701000000  T O errdemon      Error logging turned on
    038F2580  0624131000  U H scdisk0       UNDETERMINED ERROR
    AA8AB241  0405130900  T O OPERATOR      OPERATOR NOTIFICATION

- TIMESTAMP: MMDDHHMMYY (month, day, hour, minute, year)
- T (type): P permanent; T temporary; U unknown. Permanent errors deserve attention.
- C (class): H hardware; S software; O operator; U undetermined

    # errpt -d H                list all hardware errors
    # errpt -d S                list all software errors
    # errpt -aj ERROR_ID        list the detailed report for one error
    # errpt -aj 0502f666        the ERROR_ID may be given in upper or lower case

Example:

    LABEL:            SCSI_ERR1
    ID:               0502F666
    Date/Time:        Jun 19 22:29:51
    Sequence Number:  95
    Machine ID:       123456789012
    Node ID:          host1
    Class:            H
    Type:             PERM
    Resource Name:    scsi0
    Resource Class:   adapter
    Resource Type:    hscsi
    Location:         00-08
    VPD:                                    (Vital Product Data)
        Device Driver Level (00)
        Diagnostic Level (00)
        Displayable Message.........SCSI
        EC Level....................C25928
        FRU Number..................30F8834
        Manufacturer................IBM97F
        Part Number.................59F4566
        Serial Number (00002849)
        ROS Level and ID (24)
        Read/Write Register Ptr (0120)
    Description
    ADAPTER ERROR
    Probable Causes
    ADAPTER HARDWARE CABLE
    CABLE TERMINATOR DEVICE
    Failure Causes
    ADAPTER
    CABLE LOOSE OR DEFECTIVE
    Recommended Actions
    PERFORM PROBLEM DETERMINATION PROCEDURES
    CHECK CABLE AND ITS CONNECTIONS
    Detail Data
    SENSE DATA
    0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

3) LED codes on the control panel: an 8-digit code. The system fault light is usually lit at the same time.
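The errpt summary format described above is easy to post-process. The following is a minimal sketch, assuming a POSIX shell; `decode_errpt_ts` and `permanent_hw_errors` are hypothetical helper names introduced here for illustration, not AIX commands. The first splits a MMDDHHMMYY timestamp into its fields, the second keeps only summary lines whose type is P and class is H.

```shell
#!/bin/sh
# Hypothetical helpers for reading `errpt` summary output (not part of AIX).

# Decode a summary TIMESTAMP (MMDDHHMMYY) and print it as YY-MM-DD HH:MM.
decode_errpt_ts() {
    ts=$1
    # Accept exactly ten digits, reject anything else.
    case $ts in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "invalid timestamp: $ts" >&2; return 1 ;;
    esac
    mm=$(printf '%s' "$ts" | cut -c1-2)    # month
    dd=$(printf '%s' "$ts" | cut -c3-4)    # day
    hh=$(printf '%s' "$ts" | cut -c5-6)    # hour
    mi=$(printf '%s' "$ts" | cut -c7-8)    # minute
    yy=$(printf '%s' "$ts" | cut -c9-10)   # two-digit year
    printf '%s-%s-%s %s:%s\n' "$yy" "$mm" "$dd" "$hh" "$mi"
}

# Read an errpt summary on stdin and print only permanent hardware errors
# (third column T == "P", fourth column C == "H").
permanent_hw_errors() {
    awk '$3 == "P" && $4 == "H"'
}
```

On an AIX host this could be used as `errpt | permanent_hw_errors`, and `decode_errpt_ts 0720131000` would print the date fields of the memory failure entry in the sample above.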
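Hard Lessons: Thirteen Painful Outage Cases

01 Multiple cluster outages caused by improper NTP settings on AIX

This happened some time ago. A friend called about a customer with three Oracle RAC clusters running on AIX servers, two sets locally and two sets in a same-city data center. One evening after an equipment relocation, two of the RAC clusters, one local and one at the same-city site, suddenly rebooted entirely, and at the same moment.
Network, servers, storage, and database were each covered by a different maintenance vendor, so the finger-pointing began.
Each vendor set out to prove from its own side that it was not at fault.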
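Before going on site I also leaned toward a failure of the Oracle network heartbeat, with the reboot triggered while CRS was contending for the voting disk.
But as the representative of the server vendor, I investigated only at the AIX level and found no obvious cause.
I put together a timeline of when each host went down and compared it against the Oracle event logs.
Nothing turned up at first.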
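The dump produced by the crash was sent to IBM. IBM later issued a report which, based on the dump contents, identified cssd as the process that triggered the crash.
The Oracle DBA then focused on the log of that process and found that, around the time of the crash, the system time had suddenly jumped by more than 40 seconds.
The DBA confirmed that cssd reboots the system when the time changes by too much, and suspected time synchronization.
On inspection, all three AIX RAC clusters used the same NTP server, yet one of them had no problem.
Comparing the differences showed that the hosts of the unaffected cluster had time synchronization configured through xntpd.
The affected hosts instead ran the ntpdate command directly to update the time, scheduled periodically through crontab.
The log /var/adm/cron/log showed that the scheduled job ran at the same time as the cssd failure.
Checking the time server showed that its clock had drifted substantially after the relocation. Synchronization through xntpd does not force a sync when the offset is large; ntpdate has no such limit and applies the change directly.
As a result, cssd detected an excessive time jump and brought the hosts down.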
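Lesson learned: when configuring time synchronization, use the xntpd service rather than calling ntpdate directly from a scheduled job. ntpdate is blunt: if a fault produces a large time offset, it applies the change at once, which can cause application problems with unpredictable consequences.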
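The check that exposed the problem in this case can be sketched as follows. This is a minimal sketch, assuming a POSIX shell; `find_ntpdate_jobs` and `offset_exceeds` are hypothetical helper names, and the threshold passed to `offset_exceeds` is an assumption to be chosen per site, since the source does not state how much of a jump cssd tolerates. How the offset itself is measured is left out.

```shell
#!/bin/sh
# Hypothetical helpers for auditing time-sync setup (not AIX commands).

# Read crontab text on stdin and print the active (uncommented) lines
# that invoke ntpdate directly.
find_ntpdate_jobs() {
    awk '!/^[ \t]*#/ && /ntpdate/'
}

# Succeed (exit 0) when the absolute value of an offset in seconds
# is larger than the given limit; fail otherwise.
# Usage: offset_exceeds OFFSET LIMIT
offset_exceeds() {
    awk -v o="$1" -v l="$2" 'BEGIN { if (o < 0) o = -o; exit !(o > l) }'
}
```

On a host, `crontab -l | find_ntpdate_jobs` would list entries like the one that caused the outage here, and a guard such as `offset_exceeds "$offset" 1 && echo "refusing to step clock"` could stop a scheduled job from applying a large jump.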
Core dump analysis of an AIX crash caused by aio (2008.03.27)

A while ago a customer's S7A host crashed several times and produced core dump files. The following is the process of finding the cause with the crash command.

    # pwd
    /
    # hostname
    s7a01
    # cd /var/adm/ras
    # ls -l                                 find the name of the core file
    total 395133
    -rw-rw-r--   1 root  system       4226 Apr 02 2003  BosMenus.log
    -rw-r--r--   1 root  system          2 Jan 07 2000  SRCSemID
    -rw-------   1 root  system       8192 May 20 13:35 bootlog
    -rw-r--r--   1 root  system       8388 Apr 02 2003  bosinst.data
    -rw-rw-r--   1 root  system      16384 Apr 02 2003  bosinstlog
    --w-------   1 root  system          2 May 16 15:47 bounds
    -rw-r--r--   1 bin   bin        197206 Jan 01 1970  codepoint.cat
    -rw--w--w-   1 root  system      16384 May 20 15:52 conslog
    --w-------   1 root  system         21 May 16 15:47 copyfilename
    -rw-r--r--   1 root  system      57078 Apr 02 2003  devinst.log
    -rw-r--r--   1 root  system      83319 May 20 14:00 diag_log
    -rw-------   1 root  system       8192 May 16 15:49 dumpsymplog
    -rw-r--r--   1 root  system     151552 May 20 15:52 errlog
    -rw-r--r--   1 root  system     151552 Apr 22 2004  errlog0422.log
    -r--r--r--   1 bin   bin        103968 Jan 07 2000  errtmplt
    -rw-r--r--   1 root  system       7949 Apr 02 2003  image.data
    -rw-r--r--   1 root  system       8192 May 20 13:21 nimlog
    -rw-rw-rw-   1 root  system    1334264 Jan 20 2000  trcfile
    -rw-------   1 root  system  200136704 May 16 15:47 vmcore.0

    # crash vmcore.0                        open the vmcore.0 file
    Using /unix as the default namelist file.
    2 dump routines failed. The following were recorded:
    0x0141cbe8 failed with rc=14
    0x01422764 failed with rc=14

    > stat                                  show the state at the time of the crash
    sysname: AIX
    nodename: s7a01
    release: 3
    version: 4
    machine: 000AAD014C00
    time of crash: Tue May 16 15:05:18 TAIST 2006
    age of system: 22 hr., 51 min.
    xmalloc debug: disabled
    abend code: 300                         the error code; this value is the key
    csa: 0x2ff3b400
    exception struct:
    dar: 0x00000000
    dsisr: 0x00000000:
    srv: 0x00000000
    dar2: 0x00000000
    dsirr: 0x00000000: (errno) "Error 0"

Abend code 300 is a data storage interrupt (DSI), which matches the DSI_PROC entry in the error log further down.

    > trace -m
    Skipping first MST
    MST STACK TRACE:
    0x2ff3b400 (excpt=00000004:0a000000:00000000:00000004:00000106) (intpri=11)
        IAR: .compare_and_swap+2c (0000a4ec): stw r9,0x0(r4)
        LR: .[aiopin:untie_knot]+a8 (0143d7a8)
        2ff3a2e0: .[aio.ext:qlioreq]+b0 (014376ec)
        2ff3a340: .[aio.ext:listio]+128 (01438f5c)
        2ff3b3c0: .sys_call_ret+0 (00003a6c)
        0001113a: lasttocentry+fead9 (00348001)
    0452-771: Cannot read return address at address 0x01892c0b.

    > le 0000a4ec
    No loader entry found for module address 0x0000a4ec
    No loader entry found for module named '0000a4ec'

    > le 0143d7a8
    LoadList entry at 0x04ea7980
    Module *start:0x00000000_0143bef0 Module filesize:0x00000000_0000228c
    Module *end:0x00000000_0143e17c
    *data:0x00000000_0143dbe8 data length:0x00000000_00000594
    Use-count:0x0001 load_count:0x0000 *file:0x00000000
    flags:0x00000262 TEXT DATAINTEXT DATA DATAEXISTS
    *exp:0x04ed8000 *lex:0x00000000 *deferred:0x00000000 expsize:0x6e6c732f
    Name: /usr/lib/drivers/aiopin
    ndepend:0x0001 maxdepend:0x0001
    *depend[00]:0x05039280
    *le_next: 04ea7680

    > le 014376ec
    LoadList entry at 0x04ea7680
    Module *start:0x00000000_014348c0 Module filesize:0x00000000_00007624
    Module *end:0x00000000_0143bee4
    *data:0x00000000_0143a4c0 data length:0x00000000_00001a24
    Use-count:0x0003 load_count:0x0001 *file:0x00000000
    flags:0x00000272 TEXT KERNELEX DATAINTEXT DATA DATAEXISTS
    *exp:0x051e3000 *lex:0x00000000 *deferred:0x00000000 expsize:0x6c696263
    Name: /etc/drivers/aio.ext
    ndepend:0x0002 maxdepend:0x0002
    *depend[00]:0x04ea7980
    *depend[01]:0x05039280
    *le_next: 04edb700

    > le 01438f5c
    LoadList entry at 0x04ea7680
    Module *start:0x00000000_014348c0 Module filesize:0x00000000_00007624
    Module *end:0x00000000_0143bee4
    *data:0x00000000_0143a4c0 data length:0x00000000_00001a24
    Use-count:0x0003 load_count:0x0001 *file:0x00000000
    flags:0x00000272 TEXT KERNELEX DATAINTEXT DATA DATAEXISTS
    *exp:0x051e3000 *lex:0x00000000 *deferred:0x00000000 expsize:0x6c696263
    Name: /etc/drivers/aio.ext
    ndepend:0x0002 maxdepend:0x0002
    *depend[00]:0x04ea7980
    *depend[01]:0x05039280
    *le_next: 04edb700

The lookups show that the crash is related to the module named /usr/lib/drivers/aiopin.

    > errpt                                 show the error log entries written at the crash
    LAST ERRORS READ BY ERRDEMON (MOST RECENT LAST):
    Tue May 16 15:05:18 TAIST: DSI_PROC data storage interrupt : processor
    Resource Name: SYSVMM
    0a000000 00000000 00000004 00000086
    LAST 3 ERRORS READ BY ERRDEMON (MOST RECENT FIRST):

    > od vmmerrlog 9 rpco
    > proc - 0
    SLT ST    PID   PPID   PGRP   UID  EUID  TCNT  NAME
      0 a       0      0      0     0     0     1  swapper
    FLAGS: swapped_in no_swap fixed_pri kproc
    Links: *child:0xe20030c0 *siblings:0x00000000 *uinfo:0x50004020(0x0038) *ganchor:0x00000000 *pgrpl:0x00000000 *ttyl:0x00000000
    Dispatch Fields: pevent:0x00000000 *synch:0xffffffff
    lock:0x00000000 lock_d:0x00000000
    Thread Fields: *threadlist:0xe6000000 threadcount:1
    active:1 suspended:0 local:0 terminating:0
    Scheduler Fields: fixed pri: 16 repage:0x00000000 scount:0 sched_pri:0 *sched_next:0x00000000 *sched_back:0x00000000 cpticks:3087
    msgcnt:0 majfltsec:0
    Misc: adspace:0x0003c00f kstackseg:0x00000000 xstat:0x0000
    *p_ipc:0x00000000 *p_dblist:0x00000000 *p_dbnext:0x00000000
    Signal Information:
    pending:hi 0x00000000,lo 0x00000000
    sigcatch:hi 0x00000000,lo 0x00000000 sigignore:hi 0xffffffff,lo 0xfff7ffff
    Statistics: size:0x00000000(pages) audit:0x00000000
    accounting page frames:0 page space blocks:0
    Number of virtual pages in use :0
    pctcpu:0 minflt:1987 majflt:7

    > thread - 0
    SLT ST    TID    PID  CPUID  POLICY PRI CPU  EVENT  PROCNAME
      0 s       3      0 unbound   FIFO  10  78         swapper
    t_flags: wakeonsig kthread
    Links: *procp:0xe2000000 *uthreadp:0x2ff3b400 *userp:0x2ff3b6e0 *prevthread:0xe6000000 *nextthread:0xe6000000, *stackp:0x00000000
    *wchan1(real):0x00000000 *wchan2(VMM):0x00000000 *swchan:0x00000000 wchan1sid:0x00000000 wchan1offset:0x00000000
    pevent:0x00000000 wevent:0x00000001 *slist:0x00000000
    Dispatch Fields: *prior:0xe6000000 *next:0xe6000000
    polevel:0x0000000a ticks:0x0c0f *synch:0xffffffff result:0x00000000
    *eventlst:0x00000000 *wchan(hashed):0x00000000 suspend:0x0001
    thread waiting for: event(s)
    Scheduler Fields: cpuid:0xffffffff scpuid:0xffffffff pri: 16 policy:FIFO affinity:0x0001 affinity_ts:0x3b6e31e cpu:0x0078 run_queue:34a900
    lpri: 0 wpri:127 time:0x00 sav_pri:0x10
    Misc: lockcount:0x00000000 ulock:0x00000000 *graphics:0x00000000 dispct:0x00031718 fpuct:0x00000001 boosted:0x0000
    userdata:0x00000000
    fsflags: 00000000 adsp_flags: 0000
    Signal Information: cursig:0x00 *scp:0x00000000
    pending:hi 0x00000000,lo 0x00000000 sigmask:hi 0x00000000,lo 0x00000000
    > q

    # lslpp -w /usr/lib/drivers/aiopin      find the fileset that owns the file
    File                                    Fileset               Type
    ----------------------------------------------------------------------------
    /usr/lib/drivers/aiopin                 bos.rte.aio           File

    # lslpp -ah bos.rte.aio                 the fileset is at level 4.3.3.1
    Fileset         Level     Action   Status     Date      Time
    ----------------------------------------------------------------------------
    Path: /usr/lib/objrepos
    bos.rte.aio
                    4.3.3.0   COMMIT   COMPLETE   01/01/70  08:29:52
                    4.3.3.1   COMMIT   COMPLETE   01/07/00  09:57:11
                    4.3.3.1   APPLY    COMPLETE   01/07/00  09:55:52
    Path: /etc/objrepos
    bos.rte.aio
                    4.3.3.0   COMMIT   COMPLETE   01/01/70  08:29:52
                    4.3.3.1   COMMIT   COMPLETE   01/07/00  09:57:11
                    4.3.3.1   APPLY    COMPLETE   01/07/00  09:55:53

So the crash is related to bos.rte.aio. The following was found on the IBM website:

    IY05599: AIO CRASH IN COMPARE_AND_SWAP 00/01/14 PTF PECHANGE

    APAR status
    Closed as program error.

    Error description
    When the parameter passed to the compare_and_swap() expected
    to be a pointer to an integer, but the code passed an integer.
    I/O on this address (small integer) caused the system crashed
    with DSI.

    Local fix

    Problem summary
    ****************************************************************
    * USERS AFFECTED:
    * All users with the following filesets at these levels
    * bos.rte.aio 4.3.3.1.
    ****************************************************************
    * PROBLEM DESCRIPTION:
    * When the parameter passed to the compare_and_swap()
    * expected to be a pointer to an integer, but the code
    * passed an integer. I/O on this address (small
    * integer) caused the system crashed with DSI.
    ****************************************************************
    * RECOMMENDATION:
    * Apply apar IY05599
    ****************************************************************

    Problem conclusion
    Corrected the parameter passed to compare_and_swap calls.

    Temporary fix

    Comments

    APAR information
    APAR number               IY05599
    Reported component name   AIX 4.3.0
    Reported component ID     5765C3403
    Reported release          430
    Status                    CLOSED PER
    PE                        YesPE
    HIPER                     NoHIPER
    Submitted date            1999-11-02
    Closed date               1999-11-08
    Last modified date        2000-10-17
    APAR is sysrouted FROM one or more of the following:
    APAR is sysrouted TO one or more of the following:

    Fix information
    Fixed component name      AIX 4.3.0
    Fixed component ID        5765C3403
    Applicable component levels
    R430 PSY U467596 UP99/12/21 I 1000

It is now confirmed that this machine needs the relevant patch applied to resolve the crashes for good.
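The step from stack-trace addresses to kernel extensions in this analysis is a range check against each module's start and end addresses. A minimal sketch of that check, assuming a POSIX shell with hexadecimal arithmetic support; `addr_in_module` and `which_module` are hypothetical helper names, and the module table is copied by hand from the `le` output above rather than read from a dump.

```shell
#!/bin/sh
# Hypothetical helpers that mimic the address-to-module lookup done with `le`.

# Succeed when ADDR lies within [START, END]. All values are hex without "0x".
# Usage: addr_in_module ADDR START END
addr_in_module() {
    addr=$((0x$1))
    start=$((0x$2))
    end=$((0x$3))
    [ "$addr" -ge "$start" ] && [ "$addr" -le "$end" ]
}

# Read "NAME START END" lines on stdin and print the first module whose
# range contains the target address. Fails if no module matches.
# Usage: which_module ADDR < table
which_module() {
    target=$1
    while read -r name start end; do
        if addr_in_module "$target" "$start" "$end"; then
            echo "$name"
            return 0
        fi
    done
    return 1
}
```

Fed a two-line table with the ranges shown above (aiopin from 0143bef0 to 0143e17c, aio.ext from 014348c0 to 0143bee4), `which_module 0143d7a8` reports /usr/lib/drivers/aiopin for the LR address, while 0000a4ec matches neither range, in line with the "No loader entry found" message for that address.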