1 现象
今天上午XX的weblogic域XXdomain中所有服务变得非常慢,无法访问。从操作系统看CPU使用正常。把域中所有服务重启之后,系统能够重新正常运行。
2 分析
2.1 thread dump分析
从取到的thread dump中看,以被管服务m123_7002的分析为例,当时154个线程:其中有84个线程空闲,有24个线程在等锁,有42个线程在做session复制,还有4个线程runable。
按照当前thread dump分析,系统应该可以正常接受服务请求,而不会造成服务挂起的现象。怀疑取thread dump的时间并不是服务出问题的时刻。
Thread dump的异常之处为:有42个线程在做session复制,24个等锁的线程所等的资源,都被做session复制的线程持有。
2.2服务日志分析
从日志来看,异常之处有一下几点:
1)因为需要复制的对象没有序列化,有些session复制失败
####<2010-11-22 上午08时51分05秒 CST> <Error> <Cluster> <linux-n83k> <m123_7002> <[ACTIVE] ExecuteThread: '60' for queue: 'weblogic.kernel.Default (self-tuning)'> <<anonymous>> <> <> <1290387065588> <BEA-000126> <All session objects should be serializable to replicate. Check the objects in your session. Failed to replicate non-serializable object.>
2)因“Could not find secondary on remote server”而session复制失败
####<2010-11-22 上午08时52分58秒 CST> <Error> <Kernel> <linux-n83k> <m123_7002> <[ACTIVE] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1290387178747> <BEA-000802> <ExecuteRequest failed
weblogic.utils.NestedError: Could not find secondary on remote server.
weblogic.utils.NestedError: Could not find secondary on remote server
at weblogic.servlet.internal.session.ReplicatedSessionData.syncSession(ReplicatedSessionData.java:591)
at weblogic.servlet.internal.session.ReplicatedSessionContext.sync(ReplicatedSessionContext.java:82)
at weblogic.servlet.internal.ServletRequestImpl$SessionHelper.syncSession(ServletRequestImpl.java:2521)
at weblogic.servlet.internal.ServletRequestImpl$SessionHelper.syncSession(ServletRequestImpl.java:2496)
at weblogic.servlet.internal.ServletResponseImpl$1.run(ServletResponseImpl.java:1319)
at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:121)
at weblogic.servlet.internal.ServletResponseImpl.send(ServletResponseImpl.java:1313)
at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1375)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
weblogic.cluster.replication.NotFoundException: Unable to find object -7302744785832416994
at weblogic.cluster.replication.ReplicationManager.getPrimary(ReplicationManager.java:611)
at weblogic.cluster.replication.ReplicationManager.updateSecondary(ReplicationManager.java:513)
at weblogic.servlet.internal.session.ReplicatedSessionData.syncSession(ReplicatedSessionData.java:577)
at weblogic.servlet.internal.session.ReplicatedSessionContext.sync(ReplicatedSessionContext.java:82)
at weblogic.servlet.internal.ServletRequestImpl$SessionHelper.syncSession(ServletRequestImpl.java:2521)
at weblogic.servlet.internal.ServletRequestImpl$SessionHelper.syncSession(ServletRequestImpl.java:2496)
at weblogic.servlet.internal.ServletResponseImpl$1.run(ServletResponseImpl.java:1319)
at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:121)
at weblogic.servlet.internal.ServletResponseImpl.send(ServletResponseImpl.java:1313)
at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1375)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
>
3)被管服务有被踢出集群的现象
####<2010-11-22 上午08时53分14秒 CST> <Info> <Cluster> <linux-n83k> <m123_7002> <[ACTIVE] ExecuteThread: '9' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1290387194521> <BEA-000112> <Removing m122_7001 jvmid:-2074781177021694794S:11.137.76.122:[7001,7001,-1,-1,-1,-1,-1]:piccntpdomain:m122_7001 from cluster view due to timeout.>
4)有jdbc连接泄露
####<2010-11-20 下午03时10分16秒 CST> <Warning> <JDBC> <linux-n83k> <m123_7002> <[ACTIVE] ExecuteThread: '11' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1290237016454> <BEA-001153> <Forcibly releasing inactive connection "weblogic.jdbc.wrapper.PoolConnection_com_informix_jdbc_IfxSqliConnect@10d32b9" back into the connection pool "claimDataSourceTest", currently reserved by: java.lang.Exception
at weblogic.jdbc.common.internal.ConnectionEnv.setup(ConnectionEnv.java:293)
at weblogic.common.resourcepool.ResourcePoolImpl.reserveResource(ResourcePoolImpl.java:306)
at weblogic.common.resourcepool.ResourcePoolImpl.reserveResource(ResourcePoolImpl.java:284)
at weblogic.jdbc.common.internal.ConnectionPool.reserve(ConnectionPool.java:466)
at weblogic.jdbc.common.internal.ConnectionPool.reserve(ConnectionPool.java:357)
at weblogic.jdbc.common.internal.ConnectionPoolManager.reserve(ConnectionPoolManager.java:83)
at weblogic.jdbc.common.internal.ConnectionPoolManager.reserve(ConnectionPoolManager.java:96)
at weblogic.jdbc.pool.Driver.connect(Driver.java:150)
at weblogic.jdbc.jts.Driver.getNonTxConnection(Driver.java:647)
at weblogic.jdbc.jts.Driver.connect(Driver.java:137)
at weblogic.jdbc.common.internal.RmiDataSource.getConnection(RmiDataSource.java:359)
at com.p6spy.engine.spy.P6DataSource.getConnection(P6DataSource.java:361)
at org.springframework.orm.hibernate3.LocalDataSourceConnectionProvider.getConnection(LocalDataSourceConnectionProvider.java:82)
at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:417)
at org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:144)
at org.hibernate.jdbc.JDBCContext.connection(JDBCContext.java:119)
at org.hibernate.transaction.JDBCTransaction.begin(JDBCTransaction.java:57)
at org.hibernate.impl.SessionImpl.beginTransaction(SessionImpl.java:1326)
at org.springframework.orm.hibernate3.HibernateTransactionManager.doBegin(HibernateTransactionManager.java:558)
at org.springframework.transaction.support.AbstractPlatformTransactionManager.getTransaction(AbstractPlatformTransactionManager.java:374)
at org.springframework.transaction.interceptor.TransactionAspectSupport.createTransactionIfNecessary(TransactionAspectSupport.java:263)
at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:101)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:89)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
at org.springframework.aop.framework.Cglib2AopProxy$DynamicAdvisedInterceptor.intercept(Cglib2AopProxy.java:635)
at cn.com.picc.claim.xfire.service.spring.PolicytpWebServiceSpringImpl$$EnhancerByCGLIB$$ceba46b3.addTpDescript(<generated>)
at sun.reflect.GeneratedMethodAccessor8293.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:307)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:198)
at $Proxy171.addTpDescript(Unknown Source)
at sun.reflect.GeneratedMethodAccessor8293.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.codehaus.xfire.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:59)
at org.codehaus.xfire.service.binding.ServiceInvocationHandler.sendMessage(ServiceInvocationHandler.java:320)
at org.codehaus.xfire.service.binding.ServiceInvocationHandler$1.run(ServiceInvocationHandler.java:86)
at org.codehaus.xfire.service.binding.ServiceInvocationHandler.execute(ServiceInvocationHandler.java:134)
at org.codehaus.xfire.service.binding.ServiceInvocationHandler.invoke(ServiceInvocationHandler.java:109)
at org.codehaus.xfire.handler.HandlerPipeline.invoke(HandlerPipeline.java:131)
at org.codehaus.xfire.transport.DefaultEndpoint.onReceive(DefaultEndpoint.java:64)
at org.codehaus.xfire.transport.AbstractChannel.receive(AbstractChannel.java:38)
at org.codehaus.xfire.transport.http.XFireServletController.invoke(XFireServletController.java:304)
at org.codehaus.xfire.transport.http.XFireServletController.doService(XFireServletController.java:129)
at org.codehaus.xfire.spring.remoting.XFireServletControllerAdapter.handleRequest(XFireServletControllerAdapter.java:67)
at org.codehaus.xfire.spring.remoting.XFireExporter.handleRequest(XFireExporter.java:48)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:875)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:807)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:571)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:511)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:763)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:856)
at weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:227)
at weblogic.servlet.internal.StubSecurityHelper.invokeServlet(StubSecurityHelper.java:125)
at weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:283)
at weblogic.servlet.internal.TailFilter.doFilter(TailFilter.java:26)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at ins.platform.filters.UrlFilter.doFilter(UrlFilter.java:36)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at ins.framework.web.CompressFilter.doFilter(CompressFilter.java:82)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at org.springframework.orm.hibernate3.support.OpenSessionInViewFilter.doFilterInternal(OpenSessionInViewFilter.java:198)
at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:76)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3242)
at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:121)
at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2010)
at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:1916)
at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1366)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:209)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
.>
5)Jdbc连接池中没有可用的连接
####<2010-11-22 上午08时51分04秒 CST> <Error> <HTTP> <linux-wanw> <m122_7001> <[ACTIVE] ExecuteThread: '117' for queue: 'weblogic.kernel.Default (self-tuning)'> <<WLS Kernel>> <> <> <1290387064036> <BEA-101017> <[weblogic.servlet.internal.WebAppServletContext@4a612616 - appName: 'webapp', name: 'webapp', context-path: '/piccclaim'] Root cause of ServletException.
org.springframework.jdbc.UncategorizedSQLException: Hibernate operation: Cannot open connection; uncategorized SQLException for SQL [???]; SQL state [null]; error code [0]; weblogic.common.resourcepool.ResourceLimitException: No resources currently available in pool claimDataSourceTest to allocate to applications, please increase the size of the pool and retry..; nested exception is weblogic.jdbc.extensions.PoolLimitSQLException: weblogic.common.resourcepool.ResourceLimitException: No resources currently available in pool claimDataSourceTest to allocate to applications, please increase the size of the pool and retry.
3 总结
1) 服务部署的应用中有没有序列化的对象,造成有的session服务复制。
2) 被管服务有被踢出集群的现象,也会造成有些session无法复制
3) 有jdbc连接泄露
目前无法分析weblogic出现问题的真正原因,但是通过日志分析,可以发现数据库连接池中已经没有可用的连接,没有可用连接的原因很可能是因为jdbc连接泄露造成。
建议通过某种方式,在系统下次出问题的时候,取到当时的thread dump,以便定位问题。
4 监控
该域下服务经常有挂起现象,从今天的监控看,异常现象主要所有两个:1)管理控制台在业务高峰时期会变得很慢
2)三个被管服务:m123_7003、m124_7002和m125_7003管理控制在管理控制台中无法监控,提示“This server is not currently reachable. No additional monitoring information is available.”。但是这三个被管服务上部署的应用能够正常运行。
3)123,125这两台服务器在业务高峰时期swap分区使用率接近100%,多台服务器的swap分区比内存小很多。
5 分析
经过观察,发现管理控制台上面并没有部署任何应用,在管理控制台变得很慢的时候,服务器的cpu使用率在80%以上,admin server的heap内存空闲率也在50%。通过thread dump中也没有看到admin server中有异常的操作。由此可以推断出控制台变慢的原因在于控制台从被管服务中获取数据慢。
被管服务在业务高峰期压力很大,造成和admin server通信中断,也造成了控制台中无法获取被管服务m123_7003、m124_7002和m125_7003的信息。
6 采取措施
修改123应用服务器的被管服务器m123_7003和124应用服务器的被管服务m124_7002的启动脚本,增加垃圾回收的日志,以便监控业务高峰时期的垃圾回收情况。
7 建议
增加应用服务器的swap 分区至内存的2-3倍。