客户的一个数据库工作时异常,无论是CPU使用率还是操作系统上内存使用,都比平时有较大的提升。
这是一个10.2.0.5 RAC for AIX环境,出问题的是节点2,节点1上目前正常。
登录主机,运行vmstat:
ERPDB2@/home/oracle>vmstat 2 10
System configuration: lcpu=14 mem=93183MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
4 0 18761773 144324 0 0 0 0 0 0 654 15458 3714 29 1 70 0
2 0 18762563 143534 0 0 0 0 0 0 1033 21523 4569 35 1 64 0
2 0 18766475 139622 0 0 0 0 0 0 2463 33805 8193 35 3 58 3
2 0 18767065 139032 0 0 0 0 0 0 3038 35103 10483 28 3 66 3
3 0 18768118 137979 0 0 0 0 0 0 1782 30526 6418 38 2 57 2
3 0 18768144 137953 0 0 0 0 0 0 2617 43525 8417 50 3 43 4
4 0 18768173 137923 0 0 0 0 0 0 2695 57761 8819 43 4 50 2
3 0 18768157 137939 0 0 0 0 0 0 2693 33832 7419 43 3 54 0
3 0 18768681 137415 0 0 0 0 0 0 1048 25207 4581 46 2 52 0
3 0 18767478 138617 0 0 0 0 0 0 950 34960 4568 49 2 48 1
系统中空闲内存不到平时的一半,CPU的IDLE比率也比平时要低一些。
通过ps –ef检查oracle用户进程:
ERPDB2@/home/oracle>ps -ef|grep ora
oracle 131128 1 0 Jun 06 - 0:01 ora_reco_cnderpdb2
oracle 135254 1 0 May 15 - 21:29 /u01/agent10g/perl/bin/perl /u01/agent10g/bin/emwd.pl agent /u01/agent10g/sysman/log/emagent.nohup
oracle 143398 1 0 Jun 06 - 3:12 oraclecnderpdb2 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
oracle 147644 225730 0 May 15 - 0:08 /u01/crs/bin/evmlogger.bin -o /u01/crs/evm/log/evmlogger.info -l /u01/crs/evm/log/evmlogger.log
oracle 151692 209252 0 May 15 - 0:00 /bin/sh -c cd /u01/crs/log/p5b1/cssd/oclsomon; ulimit -c unlimited; /u01/crs/bin/oclsomon || exit $?
oracle 155674 1 0 Jun 06 - 29:12 ora_arc0_cnderpdb2
oracle 163980 1 0 Jun 06 - 6:31 ora_pz99_cnderpdb2
oracle 167972 1 0 Jun 06 - 38:23 ora_mmnl_cnderpdb2
oracle 176168 1 0 Jun 06 - 6:58 ora_smon_cnderpdb2
oracle 184522 1 3 09:04:57 - 0:04 oraclecnderpdb2 (LOCAL=NO)
oracle 188568 1 0 Jun 06 - 170:28 ora_lms0_cnderpdb2
oracle 196612 1 0 09:03:15 - 0:05 oraclecnderpdb2 (LOCAL=NO)
oracle 204830 1 0 08:40:50 - 0:01 oraclecnderpdb2 (LOCAL=NO)
oracle 213246 1 0 Jun 06 - 0:00 ora_q000_cnderpdb2
oracle 229486 1 0 08:57:37 - 0:03 oraclecnderpdb2 (LOCAL=NO)
oracle 245984 1 0 08:45:02 - 0:03 oraclecnderpdb2 (LOCAL=NO)
oracle 249858 1 0 08:27:13 - 0:08 oraclecnderpdb2 (LOCAL=NO)
.
.
.
oracle 254624 1 0 08:40:50 - 0:22 oraclecnderpdb2 (LOCAL=NO)
oracle 258610 1 0 09:00:18 - 0:01 oraclecnderpdb2 (LOCAL=NO)
oracle 262670 1 0 09:03:58 - 0:05 oraclecnderpdb2 (LOCAL=NO)
oracle 267004 1 104 13:44:29 - 927:49 oraclecnderpdb2 (LOCAL=NO)
oracle 275192 1 0 08:51:19 - 0:01 oraclecnderpdb2 (LOCAL=NO)
oracle 295508 1 0 09:02:51 - 0:01 oraclecnderpdb2 (LOCAL=NO)
oracle 299676 1 0 08:48:07 - 0:53 oraclecnderpdb2 (LOCAL=NO)
.
.
.
oracle 697136 1 0 08:14:58 - 0:02 oraclecnderpdb2 (LOCAL=NO)
oracle 725768 1 0 08:09:49 - 0:10 oraclecnderpdb2 (LOCAL=NO)
oracle 730046 1 0 08:17:20 - 0:14 oraclecnderpdb2 (LOCAL=NO)
oracle 738172 1 0 08:03:46 - 0:18 oraclecnderpdb2 (LOCAL=NO)
oracle 742256 1 0 08:16:20 - 0:05 oraclecnderpdb2 (LOCAL=NO)
oracle 746320 1 0 08:11:00 - 0:14 oraclecnderpdb2 (LOCAL=NO)
oracle 754448 1 0 Jul 05 - 4:34 oraclecnderpdb2 (LOCAL=NO)
oracle 758660 1 0 08:19:12 - 0:13 oraclecnderpdb2 (LOCAL=NO)
oracle 766894 1 0 14:14:16 - 0:39 oraclecnderpdb2 (LOCAL=NO)
oracle 771046 1 0 Jul 05 - 12:45 oraclecnderpdb2 (LOCAL=NO)
oracle 774984 1 0 07:54:22 - 0:03 oraclecnderpdb2 (LOCAL=NO)
oracle 779134 1 0 08:20:44 - 2:36 oraclecnderpdb2 (LOCAL=NO)
oracle 787356 1 0 13:51:39 - 0:50 oraclecnderpdb2 (LOCAL=NO)
oracle 791516 1 0 07:56:18 - 0:55 oraclecnderpdb2 (LOCAL=NO)
检查进程的时候发现一个进程267004占用了100%的CPU,且CPU运行时间超过了900分钟也就是15个小时,而这个进程并非Oracle的后台进程,而是用户进程。
登录数据库检查系统状态,发现V$LOCK视图中存在一个长时间持有锁的会话:
ERPDB2@/home/oracle>sqlplus / as sysdba
SQL*Plus: Release 10.2.0.5.0 - Production on Wed Jul 6 09:13:44 2011
Copyright (c) 1982, 2010, Oracle. All Rights Reserved.
Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
SQL> set pages 100 lines 138
SQL> select * from v$lock where type not in ('MR', 'TO');
ADDR KADDR SID TY ID1 ID2 LMODE REQUEST CTIME BLOCK
---------------- ---------------- ---- -- --------- --------- ----- ------- ------- ------
0700000C89595068 0700000C89595088 1093 XR 4 0 1 0 0 2
0700000C89595110 0700000C89595130 1083 XR 0 0 1 0 2543466 2
0700000C895951B8 0700000C895951D8 1093 CF 0 0 2 0 2543466 2
0700000C89595260 0700000C89595280 1094 RT 2 1 6 0 2543460 2
0700000C89595308 0700000C89595328 1093 RS 25 1 2 0 2543463 2
0700000C89595458 0700000C89595478 1094 RT 2 0 6 0 2543463 2
0700000C895955A8 0700000C895955C8 1096 DM 1 0 4 0 2543459 2
0700000C89595650 0700000C89595670 1096 PW 1 0 3 0 2543458 2
0700000C895956F8 0700000C89595718 1096 RT 2 2 6 0 2543460 2
0700000C89596790 0700000C895967B0 1092 TS 16 2 3 0 2543459 2
0700000C89596838 0700000C89596858 1093 KT 4839 0 4 0 2543459 2
0700000C895AF8D0 0700000C895AF8F0 1065 CU 672471248 117440522 6 0 56181 0
12 rows selected.
其中SID为1065的会话持有了锁达到了56181秒,根据TYPE类型,这个一个和CURSOR有关的锁,检查会话和进程信息:
SQL> select sid, event, program
2 from v$session
3 where paddr in
4 (select addr
5 from v$process
6 where spid = 267004);
SID EVENT PROGRAM
---------- ----------------------------------- -------------------
1065 SQL*Net more data from client
SQL> select sid, event, seconds_in_wait from v$session where sid = 1065;
SID EVENT SECONDS_IN_WAIT
---------- ------------------------------ ---------------
1065 SQL*Net more data from client 56645
SQL> select sql_text
2 from v$sqltext_with_newlines
3 where sql_id =
4 (select sql_id
5 from v$session
6 where sid = 1065)
7 order by piece;
SQL_TEXT
----------------------------------------------------------------
select ' ' year,' ' month, '201103' yearmonth,'2011.03.31' vdate
,vtype,vno,ino,expl,0 vdc,acode,substr(acode,1,4) acode2 ,bco
de,substr(bcode,1,5) bcode2,ccode,dcode,ecode,gcode,substr(gcode
,1,7) gcode2,hcode,icode ,scode,max(vhtype) vhtype,fcode,rerate
,(case when substr(acode,1,4) in ('1125','1132','1152','2122',
'2132') then null else uerate end) uerate ,sum(rmbdebit) rmbdeb
it ,sum(rmbcredit) rmbcredit ,sum(rmb) rmb ,sum(fcydebit) fcy
debit ,sum(fcycredit) fcycredit ,sum(fcy) fcy ,(case when sub
str(acode,1,4) in ('1125','1132','1152','2122','2132') then null
else sum(usddebit) end) usddebit ,(case when substr(acode,1,4)
in ('1125','1132','1152','2122','2132') then null else sum(usdc
redit) end) usdcredit ,(case when substr(acode,1,4) in ('1125',
'1132','1152','2122','2132') then null else sum(usd) end) usd ,
sum(qtydebit) qtydebit ,sum(qtycredit) qtycredit ,sum(qty) qty
,'' lxzl ,'' vprepare , 0 vno2 from ( select to_char(year)
year,to_char(month) month,' ' vdate ,substr(ta.bcode,2,4) vtyp
e,' ' vno,0 ino,'???' expl,tb.dc vdc,ta.acode ,ta.bcode,ccod
e,dcode,ecode,hcode,icode,gcode,scode,' ' vhtype,fcode ,(case w
hen nvl(fcybalance,0)=0 then 0 else nvl(rmbbalance,0)/nvl(fcybal
ance,0) end) rerate ,(case when nvl(fcybalance,0) = 0 then 0
else nvl(usdbalance,0)/nvl(fcybalance,0) end) uerate ,0 rmbde
bit ,0 rmbcredit ,(nvl(rmbbalance,0)*tb.dc) rmb ,0 fcydebit
,0 fcycredit ,nvl(fcybalance,0)*tb.dc fcy ,0 usddebit ,0 usd
credit ,(nvl(usdbalance,0)*tb.dc) usd ,0 qtydebit ,0 qtycredi
t ,(nvl(qtybalance,0)*tb.dc) qty ,'' lxzl ,'' vprepare , 0 v
no2 from balance ta left outer join acode tb on ta.acode=tb.a
code where ( (year=2011 and month =3) ) and ((bcode='9100601
2209' and bitand(flag1,4)=4) or (bcode='91006012185' and bitand(
flag1,4)=4) or (bcode='91006011950' and bitand(flag1,4)=4) or (b
code='91006011936' and bitand(flag1,4)=4) or (bcode='91006011427
' and bitand(flag1,4)=4) or (bcode='90006011950' and bitand(flag
1,4)=4) or (bcode='90006011936' and bitand(flag1,4)=4) or (bcode
.
.
.
bcode like '01391%' or bcode like '91006%' or bcode like '02506%
' or bcode like '02306%' or bcode like '01399%' or bcode like '0
1396%' or bcode like '01387%' or bcode like '10006%' or bcode li
ke '02006%' or bcode like '01389%' or bcode like '01206%' or bco
de like '90006%' or bcode like '03006%' or bcode like '01394%' o
r bcode like '01393%' or bcode like '05006%' or bcode like '0138
6%' or bcode like '01306%') and ( bcode like '01306%' or bcode
like '01306%' or bcode like '04006%') and (ecode like '006013
060101%' or ecode like '006013060134%' or ecode like '0060400601
01%') and (ta.acode like '1243%') and (gcode like '140201500
7%') and 1=1
991 rows selected.
SQL> select sid, username, status from v$session where sid = 1065;
SID USERNAME STATUS
---------- ------------------------------ --------
1065 CNDERP ACTIVE
以前碰到过类似的情况,不过等待事件是SQL*Net more data to client,而且会话状态也是ACTIVE,现象同样是单个进程占用了100%的CPU,且一直处于运行状态。以前碰到的问题应该是网络闪断造成了通信异常所致。
而今天碰到的这个问题,应该和上面这个异常复杂的SQL有关。很可能Oracle在分析这个SQL的时候碰到了异常,因此一直处于SQL*Net more data from client的等待状态,这个SQL的大小已经接近64K了,如果在分析的过程中碰到了什么bug也并不奇怪。
这个进程属于用户进程,因此中止对于系统并没有影响,直接利用kill -9杀掉操作系统上的进程:
ERPDB2@/home/oracle>kill -9 267004
ERPDB2@/home/oracle>ps -ef|grep 267004
oracle 365340 529208 0 09:28:49 pts/1 0:00 grep 267004
ERPDB2@/home/oracle>vmstat 2 10
System configuration: lcpu=14 mem=93183MB
kthr memory page faults cpu
----- ----------- ------------------------ ------------ -----------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
2 0 15894079 2907698 0 10 0 0 0 0 1534 176158 6008 13 4 80 3
1 0 15891796 2909960 0 9 0 0 0 0 1137 18452 4937 11 2 85 2
2 0 15894685 2907066 0 2 0 0 0 0 1921 25201 6747 16 2 77 6
1 0 15892077 2909667 0 3 0 0 0 0 953 115275 4685 19 3 74 5
1 0 15892618 2909115 0 4 0 0 0 0 403 8784 3394 17 1 82 0
2 0 15894081 2907652 0 0 0 0 0 0 1193 19291 5034 29 1 69 0
4 0 15911090 2890642 0 0 0 0 0 0 2814 36575 8361 39 2 58 0
4 0 15921801 2879931 0 0 0 0 0 0 2941 41975 8528 52 3 45 0
4 0 15933411 2868320 0 0 0 0 0 0 1358 32741 5073 56 2 41 0
3 0 15941350 2860381 0 0 0 0 0 0 793 26134 4077 52 1 46 0
可以看到,杀掉问题会话后,系统空闲内存恢复正常,CPU的总体IDLE也有所提升。虽然这个问题解决了,但是如果这种复杂的SQL经常出现的话,那么再次碰到类似的问题也只是时间的问题。