1.问题描述
2012/05/21日 9:45左右,ESS前台报错:调用Tuxedo异 常,bea.jolt.ServiceException: TPELIMIT-a system limit has been reached”,之 后由亚联这边重启了JSL以及JREPSVR,重启后恢复正常。
2.问题分析
根据日志做以下重现
5月21日上午 8:51分 开始陆续出现一些服务超时
085138.bsstux8!BBL.21364768.1.0: CMDTUX_CAT:1836: WARN: Server(11469008) processing terminated with SIGKILL after SVCTIMEOUT
085138.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:541: WARN: Server GRPAMCBS4/26357 terminated
085138.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:557: INFO: Server GRPAMCBS4/26357 being restarted
该现象有可能因为业务计算执行时间超过目前设置的时间上限10分钟,8:51分Tux的Server GRPAMCBS4 被正常重启
085139.bsstux8!restartsrv.22675880.1.-2: 05-21-2012: Tuxedo Version 10.3.0.0, 64-bit
085139.bsstux8!restartsrv.22675880.1.-2: server GRPAMCBS4/26357: CMDTUX_CAT:580: INFO: A server process has restarted: 22741062
但之后9:03分出现Tux的 BBL中有服务状态被锁死的现象
090344.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:5005: WARN: USEM[0] lock appears stuck: held by pid=21364768
此后 GRPCSCRM2 进程号4456760 超时后无法再被重启,并且从BBL中清除
091030.bsstux8!BBL.21364768.1.0: CMDTUX_CAT:1836: WARN: Server(4456760) processing terminated with SIGKILL after SVCTIMEOUT
091032.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:541: WARN: Server GRPCSCRM2/14041 terminated
091032.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:557: INFO: Server GRPCSCRM2/14041 being restarted
091033.bsstux8!qcscrm1l1server.23986526.1.0: 05-21-2012: Tuxedo Version 10.3.0.0, 64-bit
091033.bsstux8!qcscrm1l1server.23986526.1.0: LIBTUX_CAT:262: INFO: Standard main starting
091043.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 4456760 died; removing from BB
至9:12分BBL中的服务信息再次被锁死,从日志信息来分析,该进程应该还是GRPCSCRM2
091203.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:5005: WARN: USEM[0] lock appears stuck: held by pid=6619352
091208.bsstux8!GWTDOMAIN.8519900.772.0: LIBGWT_CAT:1130: INFO: Disconnected from domain (domainid=<CUSTMANMWTC_158_01>)
091208.bsstux8!GWTDOMAIN.8519900.772.0: LIBGWT_CAT:1354: INFO: Retrying domain (domainid=<CUSTMANMWTC_158_01>) every 60 seconds
091213.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:5005: WARN: USEM[0] lock appears stuck: held by pid=6619352
091226.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:5005: WARN: USEM[0] lock appears stuck: held by pid=6619352
091228.bsstux8!GWTDOMAIN.8519900.772.0: LIBGWT_CAT:1128: INFO: Connection accepted from domain (domainid=<CUSTMANMWTC_158_01>)
091243.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:5005: WARN: USEM[0] lock appears stuck: held by pid=6619352
091300.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:541: WARN: Server GRPCSCRM2/14041 terminated
091330.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:557: INFO: Server GRPCSCRM2/14041 being restarted
至9:14分 GRPCSCRM2 再次被进程重启 至9:16分12秒重启开始
091430.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 4456760 died; removing from BB
091504.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:541: WARN: Server GRPCSCRM2/14041 terminated
091528.bsstux8!tcscrm1l1server.21627624.1.0: gtrid x0 x4fb65016 x3d7d8: LIBTUX_CAT:1387: ERROR: xa_commit returned XAER_NOTA
091612.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:557: INFO: Server GRPCSCRM2/14041 being restarted
9:16分22秒重启失败,原进程仍然存在于系统中,由此导致该服务不再可控,BBL已经无法控制该进程
091622.bsstux8!restartsrv.5112360.1.-2: 05-21-2012: Tuxedo Version 10.3.0.0, 64-bit
091622.bsstux8!restartsrv.5112360.1.-2: LIBTUX_CAT:577: ERROR: Unable to register because the slot is already owned by another process
9:28分开始所有通过Jolt调用的服务均被堵塞
092841.bsstux8!JSH.2687454.1.-2: JOLT_CAT:1043: "ERROR: tpacall() call failed, tperrno = 5"
092841.bsstux8!JSH.9503648.1.-2: JOLT_CAT:1043: "ERROR: tpacall() call failed, tperrno = 5"
092850.bsstux8!JSH.21299502.1.-2: JOLT_CAT:1043: "ERROR: tpacall() call failed, tperrno = 5"
092901.bsstux8!JSH.12124318.1.-2: JOLT_CAT:1043: "ERROR: tpacall() call failed, tperrno = 5"
9:51开始Tux 服务陆续开始自动重启
095106.bsstux8!BBL.21364768.1.0: CMDTUX_CAT:1836: WARN: Server(14483794) processing terminated with SIGKILL after SVCTIMEOUT
095106.bsstux8!JSH.21299502.1.-2: JOLT_CAT:1043: "ERROR: tpacall() call failed, tperrno = 5"
095106.bsstux8!JSH.8978438.1.-2: JOLT_CAT:1043: "ERROR: tpacall() call failed, tperrno = 5"
9:57分开始执行BBC命令,清除TUXEDO BB注册信息
095751.bsstux8!tmadmin.27591196.1.-2: TMADMIN_CAT:1330: INFO: Command: bbc
095751.bsstux8!restartsrv.6488454.1.-2: server GRPCSCRM3/24076: CMDTUX_CAT:580: INFO: A server process has restarted: 6423456
095751.bsstux8!tcscrm1l1server.20250878.1.0: 05-21-2012: Tuxedo Version 10.3.0.0, 64-bit
095751.bsstux8!tcscrm1l1server.20250878.1.0: LIBTUX_CAT:262: INFO: Standard main starting
095751.bsstux8!tcscrm1l1server.20250878.1.0: LIBTUX_CAT:476: WARN: Server 303/24077: client process 11993292: lost message
095751.bsstux8!tcscrm1l1server.20250878.1.0: LIBTUX_CAT:477: WARN: SERVICE=TCS_CRM1_L1SVC MSG_ID=0 REASON=server died
095751.bsstux8!restartsrv.6488454.1.-2: server GRPCSCRM3/24077: CMDTUX_CAT:580: INFO: A server process has restarted: 20250878
同时执行JSH的重启
095757.bsstux8!JSL.15139582.1.0: JOLT_CAT:1506: "INFO: Terminating Jolt administration services in preparation for shutdown"
095757.bsstux8!JSL.15139582.1.0: JOLT_CAT:1196: "INFO: Terminating handlers in preparation for shutdown"
095757.bsstux8!JSL.15139582.1.0: JOLT_CAT:1197: "INFO: Exiting system"
095757.bsstux8!JSH.8650974.1.-2: JOLT_CAT:1198: "WARN: Forced shutdown of client; user name 'ZXD_127_03'; client name 'R_ZXD_127_03'"
095757.bsstux8!JSH.8650974.1.-2: JOLT_CAT:1198: "WARN: Forced shutdown of client; user name 'WB_129_04'; client name 'R_WB_129_04'"
095757.bsstux8!JSH.21627256.1.-2: JOLT_CAT:1198: "WARN: Forced shutdown of client; user name 'WB_127_03'; client name 'R_WB_127_03'"
095757.bsstux8!JSH.21627256.1.-2: JOLT_CAT:1198: "WARN: Forced shutdown of client; user name 'WB_127_03'; client name 'R_WB_127_03'"
095757.bsstux8!JSH.17301636.1.-2: JOLT_CAT:1198: "WARN: Forced shutdown of client; user name 'WB_127_03'; client name 'R_WB_127_03'"
BBC命令出发了Tux对BB的信息清理
095758.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 12255400 died; removing from BB
095758.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 17892156 died; removing from BB
095758.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 4653940 died; removing from BB
095758.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 8716892 died; removing from BB
095758.bsstux8!BBL.21364768.1.0: LIBTUX_CAT:216: WARN: Process 4194608 died; removing from BB
10:00再次执行BBC
100058.bsstux8!tmadmin.17498802.1.-2: 05-21-2012: Tuxedo Version 10.3.0.0, 64-bit
100058.bsstux8!tmadmin.17498802.1.-2: TMADMIN_CAT:1330: INFO: Command: bbc
100058.bsstux8!qcscrm1l1server.35586630.1.0: ERROR: msgsnd err:(LIBTUX_CAT:669: ERROR: Message operation failed because of the invalid message queue identifier) errno=22,qid=35652015,buf=4569367528,bytes=293,flag=2048
100058.bsstux8!qcscrm1l1server.35586630.1.0: LIBTUX_CAT:1286: ERROR: tpreturn could not send reply TPEOS - operating system error
100059.bsstux8!qcscrm1l1server.24379702.1.0: ERROR: msgsnd err:(LIBTUX_CAT:669: ERROR: Message operation failed because of the invalid message queue identifier) errno=22,qid=15728912,buf=4569024456,bytes=293,flag=2048
100059.bsstux8!qcscrm1l1server.24379702.1.0: LIBTUX_CAT:1286: ERROR: tpreturn could not send reply TPEOS - operating system error
从上面的问题出现过程来看,由于 BBL在定期检查自身状态时,有时候会发生BBL被锁(由于有进程正在访���BB)的情况,当这个请求是user发起 时, BBL无法主动将其清理,必须在user lock释放后才能清除,重启。而9:14分开始的主动重启服务任务由于user lock始终存在,因 此无法成功,导致此后该服务的队列会持续积压,直至12分钟后将Jolt处理队列积满,自此Tuxedo无法再接收任何请求,致使前端业务无法执行。应用 方表示,在Tuxedo调用后台服务时有时候会使用动态链接库来进行高性能处理,且这些处理有时会相当耗时,因此 初步判断,有可能由于Tuxedo服务 执行时配置的SVCTIMEOUT已超时,但处理句柄在外部的动态链接库中,因此导致服务注册信息在BB中被清零,但BB无法彻底清除正在运行的服务进 程,导致服务自动重启后无法正常处理请求。从而造成后续的处理队列积累,又由于Jolt处理队列直接受被调服务队列积累的影响,进一步造成Jolt队列的 积压直至停止服务。
3.问题解决
从现象上看这个问题关键是BB无法主动清除正在运行的业务进程。症结是找出为什么这个进程会被锁住。
提出以下建议:
1)建议多监控日志中是否有LIBTUX_CAT:5005: WARN: USEM[0] lock appears stuck
这个警告,如果有立即对此进程进行kill -6 pid(被锁的进程号),保留core文件再用gdb工具进行debug(关于详细的排错方法见
https://supporthtml.oracle.com 上的文章How To Investigate Or Bypass LIBTUX_CAT:5005 Issues? [ID 1380149.1])。
2)由于设置了SVCTIMEOUT
,tuxedo会定时去主动剔除超时的服务进程,作为缓解的方法之一就是讲SVCTIMEOUT这个参数去掉,让其不主动清除超时的服务。这样的话,就需要手动对某些超时,坏死的服务进行重启。
3)在用建议1中间的方法取得相应的core以及相应的bb即时信息之后,立即清除bb重启相关服务。