DWR

Synchronized block in DefaultScriptSessionManager locking entire appserver

Details

  • Type: Bug Bug
  • Status: Resolved Resolved
  • Priority: Critical Critical
  • Resolution: Fixed
  • Affects Version/s: 2.0.rc3, 2.0.rc4, 2.0.rc5, 2.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.0.5, 2.0.6
  • Fix Version/s: 2.0.7
  • Component/s: Core
  • Description:
    Hide
    I am seeing a very troubling situation on our QA server. Every couple of days, our QA WebLogic cluster will fail with every thread blocked in DefaultScriptSessionManager. I have 50 threads, and here's how it seems to wind up:

    1 thread is blocked here:

    "[STUCK] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.lang.Object@164b23e BLOCKED org.directwebremoting.impl.DefaultScriptSessionManager.invalidate(DefaultScriptSessionManager.java:125)

    1 thread is blocked here (blocked, I believe, on the thread above):

     "[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.lang.Object@70207c BLOCKED
    org.directwebremoting.impl.DefaultScriptSession.isInvalidated(DefaultScriptSession.java:127)

    And 48 threads are blocked here:

    "[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.lang.Object@164b23e BLOCKED
    org.directwebremoting.impl.DefaultScriptSessionManager.checkTimeouts(DefaultScriptSessionManager.java:175)

    As you can see, both the invalidate() call and the checkTimeouts() call are blocking on the same java object. Looking in the code, both checkTimeouts and invalidate block on the same sessionLock member variable. This is highly problematic because the call to invalidate() happens after the synchronized block in checkTimeouts(). Here's what happens: (by the way, I think this is only a problem if you have multiple in-flight DWR requests for the same user)

    - Thread A calls checkTimeouts, makes it through the synchronized block and calls invalidate() on a session. That grabs the invalidLock on the session. Thread A then gets interrupted.
    - Thread B calls checkTimeouts and holds the sessionLock, and then calls into isInvalidated on the same session that Thread A is interrupted on. Thread B tries to acquire the invalidLock but can't, since thread A holds it.
    - Thread A resumes and calls the second line of invalidate, which is manager.invalidate(). That method tries to grab the sessionLock, but cannot since Thread B is holding it waiting for the invalidLock held by thread A.

    And deadlock ensues.

    The easist thing I can see to do is to move the for loop at the end of checkTimeouts() inside the synchronized block. That way it'd be guaranteed not to deadlock.
    Show
    I am seeing a very troubling situation on our QA server. Every couple of days, our QA WebLogic cluster will fail with every thread blocked in DefaultScriptSessionManager. I have 50 threads, and here's how it seems to wind up: 1 thread is blocked here: "[STUCK] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.lang.Object@164b23e BLOCKED org.directwebremoting.impl.DefaultScriptSessionManager.invalidate(DefaultScriptSessionManager.java:125) 1 thread is blocked here (blocked, I believe, on the thread above):  "[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.lang.Object@70207c BLOCKED org.directwebremoting.impl.DefaultScriptSession.isInvalidated(DefaultScriptSession.java:127) And 48 threads are blocked here: "[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" waiting for lock java.lang.Object@164b23e BLOCKED org.directwebremoting.impl.DefaultScriptSessionManager.checkTimeouts(DefaultScriptSessionManager.java:175) As you can see, both the invalidate() call and the checkTimeouts() call are blocking on the same java object. Looking in the code, both checkTimeouts and invalidate block on the same sessionLock member variable. This is highly problematic because the call to invalidate() happens after the synchronized block in checkTimeouts(). Here's what happens: (by the way, I think this is only a problem if you have multiple in-flight DWR requests for the same user) - Thread A calls checkTimeouts, makes it through the synchronized block and calls invalidate() on a session. That grabs the invalidLock on the session. Thread A then gets interrupted. - Thread B calls checkTimeouts and holds the sessionLock, and then calls into isInvalidated on the same session that Thread A is interrupted on. Thread B tries to acquire the invalidLock but can't, since thread A holds it. - Thread A resumes and calls the second line of invalidate, which is manager.invalidate(). That method tries to grab the sessionLock, but cannot since Thread B is holding it waiting for the invalidLock held by thread A. And deadlock ensues. The easist thing I can see to do is to move the for loop at the end of checkTimeouts() inside the synchronized block. That way it'd be guaranteed not to deadlock.

Activity

Hide
Joe Walker added a comment - 10/Apr/07 11:08 AM

From what I can see your analysis is correct - I've checked the change into CVS and hope to cut RC4 with this in later on today.

Show
Joe Walker added a comment - 10/Apr/07 11:08 AM From what I can see your analysis is correct - I've checked the change into CVS and hope to cut RC4 with this in later on today.
Hide
Jonas added a comment - 04/Aug/10 6:56 AM

Hi Joe, I have the even problem in my application. When I getting this version with the corretion (2.0.rc4)

Show
Jonas added a comment - 04/Aug/10 6:56 AM Hi Joe, I have the even problem in my application. When I getting this version with the corretion (2.0.rc4)
Hide
David Marginian added a comment - 04/Aug/10 9:24 PM

Jonas, the fix recommended previously does not solve the problem. Now we are making foreign calls with a lock held, which is a recipe for deadlock. I revisited this issue several weeks ago (but did not look up this Jira) due to a user's request. I made some changes that should resolve this problem. Please try the latest 2.x build here - http://ci.directwebremoting.org/bamboo/browse/DWR20-ALL-1/artifact. If this fix works this version will be released as 2.0.7. I have re-opened this issue.

Show
David Marginian added a comment - 04/Aug/10 9:24 PM Jonas, the fix recommended previously does not solve the problem. Now we are making foreign calls with a lock held, which is a recipe for deadlock. I revisited this issue several weeks ago (but did not look up this Jira) due to a user's request. I made some changes that should resolve this problem. Please try the latest 2.x build here - http://ci.directwebremoting.org/bamboo/browse/DWR20-ALL-1/artifact. If this fix works this version will be released as 2.0.7. I have re-opened this issue.
Hide
Jonas added a comment - 05/Aug/10 5:44 AM

David. this version don't solve my problem, because my app running in JVM 1.4, the code on build (http://ci.directwebremoting.org/bamboo/browse/DWR20-ALL-1/artifact) is 1.5.

Tks

Show
Jonas added a comment - 05/Aug/10 5:44 AM David. this version don't solve my problem, because my app running in JVM 1.4, the code on build (http://ci.directwebremoting.org/bamboo/browse/DWR20-ALL-1/artifact) is 1.5. Tks
Hide
David Marginian added a comment - 05/Aug/10 6:40 AM

It is version 2 and should be compatible with 1.3, if it is not I did something wrong, can you elaborate?

Show
David Marginian added a comment - 05/Aug/10 6:40 AM It is version 2 and should be compatible with 1.3, if it is not I did something wrong, can you elaborate?
Hide
David Marginian added a comment - 06/Aug/10 7:15 PM

Jonas, any progress?

Show
David Marginian added a comment - 06/Aug/10 7:15 PM Jonas, any progress?
Hide
Jonas added a comment - 06/Aug/10 7:25 PM

David, my app it's testing yet.

Show
Jonas added a comment - 06/Aug/10 7:25 PM David, my app it's testing yet.
Hide
David Marginian added a comment - 06/Aug/10 7:35 PM

You are testing? Please let us know as soon as you know something. Thanks.

Show
David Marginian added a comment - 06/Aug/10 7:35 PM You are testing? Please let us know as soon as you know something. Thanks.
Hide
Jonas added a comment - 07/Aug/10 11:10 AM

Yes, I'm testing.

Show
Jonas added a comment - 07/Aug/10 11:10 AM Yes, I'm testing.
Hide
Rob Moore added a comment - 02/Sep/11 8:01 AM

We have 2.0.7 in a production instance and are seeing a slew of ConcurrentModificationExceptions like the following:

java.util.ConcurrentModificationException
at java.util.HashMap$AbstractMapIterator.checkConcurrentMod(HashMap.java:122)
at java.util.HashMap$AbstractMapIterator.makeNext(HashMap.java:127)
at java.util.HashMap$ValueIterator.next(HashMap.java:212)
at org.directwebremoting.impl.DefaultScriptSessionManager.invalidate(DefaultScriptSessionManager.java:129)
at org.directwebremoting.impl.DefaultScriptSession.invalidate(DefaultScriptSession.java:109)
at org.directwebremoting.impl.DefaultScriptSessionManager.checkTimeouts(DefaultScriptSessionManager.java:196)

It seems that we can safely ignore this error so we've created a subclass that logs but ignores the exception for the moment but would like a better solution for the long term. It seems like that might be to use ConcurrentHashMap but I'm not sure what the policy is for JVM support is for 2.x.

Show
Rob Moore added a comment - 02/Sep/11 8:01 AM We have 2.0.7 in a production instance and are seeing a slew of ConcurrentModificationExceptions like the following: java.util.ConcurrentModificationException at java.util.HashMap$AbstractMapIterator.checkConcurrentMod(HashMap.java:122) at java.util.HashMap$AbstractMapIterator.makeNext(HashMap.java:127) at java.util.HashMap$ValueIterator.next(HashMap.java:212) at org.directwebremoting.impl.DefaultScriptSessionManager.invalidate(DefaultScriptSessionManager.java:129) at org.directwebremoting.impl.DefaultScriptSession.invalidate(DefaultScriptSession.java:109) at org.directwebremoting.impl.DefaultScriptSessionManager.checkTimeouts(DefaultScriptSessionManager.java:196) It seems that we can safely ignore this error so we've created a subclass that logs but ignores the exception for the moment but would like a better solution for the long term. It seems like that might be to use ConcurrentHashMap but I'm not sure what the policy is for JVM support is for 2.x.
Hide
David Marginian added a comment - 02/Sep/11 8:18 AM

Thanks Rob. In 2.x we cannot use ConcurrentMaps. For 3.x we require more recent versions of Java and we have already refactored the class in question to take advantage of the Java 1.5+ concurrency features.

We can fix this for you in 2.x though, can you create a separate issue for us? Thanks.

Show
David Marginian added a comment - 02/Sep/11 8:18 AM Thanks Rob. In 2.x we cannot use ConcurrentMaps. For 3.x we require more recent versions of Java and we have already refactored the class in question to take advantage of the Java 1.5+ concurrency features. We can fix this for you in 2.x though, can you create a separate issue for us? Thanks.
Hide
Rob Moore added a comment - 02/Sep/11 8:56 AM

Will do, David. Just a thought – it might be worth using a library that backports the concurrent APIs such as http://backport-jsr166.sourceforge.net – so that you can take advantage of the API.

Show
Rob Moore added a comment - 02/Sep/11 8:56 AM Will do, David. Just a thought – it might be worth using a library that backports the concurrent APIs such as http://backport-jsr166.sourceforge.net – so that you can take advantage of the API.
Hide
David Marginian added a comment - 02/Sep/11 9:03 AM

I am aware of that library. The problem is that adding it requires all pre 1.5 users to have it in their runtime and we have always tried to run with as few dependencies as possible.

Show
David Marginian added a comment - 02/Sep/11 9:03 AM I am aware of that library. The problem is that adding it requires all pre 1.5 users to have it in their runtime and we have always tried to run with as few dependencies as possible.
Hide
Rob Moore added a comment - 02/Sep/11 12:44 PM

Per David's request I've created DWR-536.

Show
Rob Moore added a comment - 02/Sep/11 12:44 PM Per David's request I've created DWR-536.
Hide
David Marginian added a comment - 02/Sep/11 12:51 PM

Thanks Rob.

Show
David Marginian added a comment - 02/Sep/11 12:51 PM Thanks Rob.

People

Dates

  • Created:
    09/Apr/07 7:22 PM
    Updated:
    02/Sep/11 12:51 PM
    Resolved:
    05/Oct/10 7:29 PM