Details
-
Documentation Required:No
-
- Description:
-
HideSee: [dwr-user] DeadLock in DWR
I replaced Object locks with source names of those locks in thread dump.
>> Check this:
>>
>> "pool-1-thread-2" Id=88 BLOCKED on java.lang.Object@##wakeUpCalledLock##
>> owned by
>> "btpool0-4" Id=55
>> at
>>
>>
>> org.directwebremoting.dwrp.JettyContinuationSleeper.wakeUp(JettyContinuationSleeper.java:111)
>> at
>> org.directwebremoting.dwrp.BasicAlarm.raiseAlarm(BasicAlarm.java:33)
>> - locked <##sleeperLock##> (
>> at
>>
>>
>> org.directwebremoting.dwrp.OutputAlarm$AlarmScriptConduit.addScript(OutputAlarm.java:95)
>> at
>>
>>
>> org.directwebremoting.impl.DefaultScriptSession.addScript(DefaultScriptSession.java:202)
>> - locked <##scriptLock##>
>>
>>
>>
>>
>>
>> "btpool0-4" Id=55 BLOCKED on java.lang.Object@##scriptLock## owned by
>> "pool-1-thread-2" Id=88
>> at
>>
>>
>> org.directwebremoting.impl.DefaultScriptSession.removeScriptConduit(DefaultScriptSession.java:285)
>> at
>> org.directwebremoting.dwrp.OutputAlarm.cancel(OutputAlarm.java:67)
>> at
>> org.directwebremoting.dwrp.PollHandler$1.run(PollHandler.java:189)
>> at
>>
>> org.directwebremoting.dwrp.JettyContinuationSleeper.goToSleep(JettyContinuationSleeper.java:80)
>> - locked <##wakeUpCalledLock##>
>> at
>> org.directwebremoting.dwrp.PollHandler.handle(PollHandler.java:211)
>> at
>> org.directwebremoting.servlet.UrlProcessor.handle(UrlProcessor.java:101)
>> at
>> org.directwebremoting.servlet.DwrServlet.doPost(DwrServlet.java:146)
>>
>>
>>
>>
>> So I think, the deadlock problem is in JettyContinuationSleeper.java,
>> maybe
>> "scriptlock" should be called before "wakeUpCalledLock". Or some code
>> refactoring is needed.
>>
>> What should I do now? is it in dwr 3.0.0 fixed? Is it good idea to turn
>> to
>> 3.0.0 or should I try to fix code on my own and wait till some fix will
>> be
>> released(in version 3.0.X).
>>
>> Or should I force calling DwrServlet.doPost(...) when I am calling
>> DefaultScriptSession.addScript(...) ? Will this slow application too
>> much,
>> when I have thousand of users connected?
>>
>> I need Jetty continuation, so I can't turn it off.
>>
>> thx for any help.ShowSee: [dwr-user] DeadLock in DWR I replaced Object locks with source names of those locks in thread dump. >> Check this: >> >> "pool-1-thread-2" Id=88 BLOCKED on java.lang.Object@##wakeUpCalledLock## >> owned by >> "btpool0-4" Id=55 >> at >> >> >> org.directwebremoting.dwrp.JettyContinuationSleeper.wakeUp(JettyContinuationSleeper.java:111) >> at >> org.directwebremoting.dwrp.BasicAlarm.raiseAlarm(BasicAlarm.java:33) >> - locked <##sleeperLock##> ( >> at >> >> >> org.directwebremoting.dwrp.OutputAlarm$AlarmScriptConduit.addScript(OutputAlarm.java:95) >> at >> >> >> org.directwebremoting.impl.DefaultScriptSession.addScript(DefaultScriptSession.java:202) >> - locked <##scriptLock##> >> >> >> >> >> >> "btpool0-4" Id=55 BLOCKED on java.lang.Object@##scriptLock## owned by >> "pool-1-thread-2" Id=88 >> at >> >> >> org.directwebremoting.impl.DefaultScriptSession.removeScriptConduit(DefaultScriptSession.java:285) >> at >> org.directwebremoting.dwrp.OutputAlarm.cancel(OutputAlarm.java:67) >> at >> org.directwebremoting.dwrp.PollHandler$1.run(PollHandler.java:189) >> at >> >> org.directwebremoting.dwrp.JettyContinuationSleeper.goToSleep(JettyContinuationSleeper.java:80) >> - locked <##wakeUpCalledLock##> >> at >> org.directwebremoting.dwrp.PollHandler.handle(PollHandler.java:211) >> at >> org.directwebremoting.servlet.UrlProcessor.handle(UrlProcessor.java:101) >> at >> org.directwebremoting.servlet.DwrServlet.doPost(DwrServlet.java:146) >> >> >> >> >> So I think, the deadlock problem is in JettyContinuationSleeper.java, >> maybe >> "scriptlock" should be called before "wakeUpCalledLock". Or some code >> refactoring is needed. >> >> What should I do now? is it in dwr 3.0.0 fixed? Is it good idea to turn >> to >> 3.0.0 or should I try to fix code on my own and wait till some fix will >> be >> released(in version 3.0.X). >> >> Or should I force calling DwrServlet.doPost(...) when I am calling >> DefaultScriptSession.addScript(...) ? Will this slow application too >> much, >> when I have thousand of users connected? >> >> I need Jetty continuation, so I can't turn it off. >> >> thx for any help.
Attachments
-
- 383_1.zip
- (3 kB)
- David Marginian
- 07/Oct/09 4:24 AM
-
- JettyContinuationSleeper.java
- (9 kB)
- Tim Peierls
- 04/Dec/09 7:26 AM
-
- JettyContinuationSleeper.java
- (9 kB)
- Tim Peierls
- 04/Dec/09 7:09 AM
Activity
This may provide some help here:
https://dwr.dev.java.net/servlets/ReadMsg?listName=dev&msgNo=423
It looks like we may have the same problem in all of our sleeper classes (Servlet3Sleeper, TomcatSleeper, etc.).
Yes, GrizzlyContinuationSleeper, JettyContinuationSleeper, Servlet3Sleeper, and TomcatSleeper all are at risk of deadlock because they hold a lock when calling foreign methods. ThreadWaitSleeper uses a CountDownLatch to avoid the need for explicit locking; a similar approach would work with the other implementations. It's going to be tricky to get right; I'll try to post sketches as comments to this issue.
The ThreadWaitSleeper proxy idea is probably overkill – just have a final CountDownLatch that you use if the continuation stuff doesn't work. They're very cheap to construct. Don't complicate things with a lazily constructed volatile field.
I notice that the interruption handling still hasn't been addressed in ThreadWaitSleeper. In the thread that David refers to above, in response to Joe's feeling that onAwakening should run even if the current thread is interrupted – something I'm still not sure about – I then suggested restoring the interrupt, running onAwakening and then restoring the interrupt again in case onAwakening cleared it. You should never swallow an interrupt – it's a sign that someone wants the current thread to exit. You should use it clean up and get out as fast as possible. A well-written onAwakening Runnable would also check the interrupt status at various points.
Thanks Tim.
The sketches would be helpful. I haven't worked with this particular section of code and I am not quite sure the CountDownLatch is safe (as it is a one shot deal). I noticed in a previous thread that you had remarked on this and Joe never responded (not sure if the silence meant it was ok, or if he never had a chance to look into it more). It seems to me like these methods could be called more than once, in that case what is our best alternative (CyclicBarrier?).
I also noticed that the problem with the interuption handling has not been resolved.
I would like to get Joe involved in this discussion (I will work on that).
AFAICT, Sleeper instances are created via ContainerAbstraction.createSleeper in BasePollHandler.handle and passed to a ScriptConduit and various Alarms that might call wakeUp. The call to goToSleep is the last thing that happens in BasePollHandler.handle, and this is the only place that goToSleep is called. So I'm pretty sure that Sleeper instances are used once only, which means that CountDownLatch ought to be fine. (Calling countDown more than once, via wakeUp, is not a problem.)
[Otherwise, yes, CyclicBarrier (or the upcoming Phaser) might be more appropriate.]
David, as promised I have taken a look at this and I think everything Tim suggests is fine. From what I can see Sleeper instances are one-shots and a new one is created for every poll. Also, when a Sleeper wakes up due to an alarm it always executes this code:
Runnable onAwakening = new Runnable()
{
public void run()
{
// Cancel all the alarms
for (Alarm alarm : alarms)
// We can't be used as a conduit to the browser any more
scriptSession.removeScriptConduit(conduit);
// Tell the browser to come back at the right time
try
catch (IOException ex)
{ log.warn("Failed to write reconnect info to browser"); } }
};
which indicates that there is always a browser/request roundtrip (discarding Sleeper) after every awakening.
Tim also suggests that we get out as fast as possible on a Thread interrupt. I guess the code above shows why Joe wants to run the onAwakening code even on an interrupt; because it sends instructions to the browser to make a new request.
Though, I am not sure what would actually trigger a Thread interrupt as we don't have any such calls in the codebase (apart from for restoring the interrupt flag). Maybe I am missing something?
Thanks, Mike – I now see why onAwakening is crucial clean-up code.
I still think the code should ensure that no interrupts are swallowed. It's hard to reason about whether interrupt might be called, and it's trivially easy to restore the interrupt status after onAwakening is called. An ounce of prevention ...
Right Tim, I also try to be safe and always do the "right" thing. My statement was mostly asking about if there were any interrupt sources I wasn't aware about.
Just checking: when you say that interrupts shouldn't be swallowed, do you still think it is ok to squelch the exception as long as we restore the interrupt flag and "get out" reasonably fast?
Btw, it would be great if someone could add a test case that triggers this problem. Being able to trigger it just by running testdwr on Jetty would make it a lot easier getting started on this bug.
Yes, we're forced to by the absence of "throws InterruptedException" on goToSleep(). By "swallowing" I meant denying anyone up the stack the opportunity to detect interruption either with catch IE or Thread.currentThread.isInterrupted().
And "reasonably fast" can be flexible – it could mean not starting anything new but finishing whatever you're working on, or it could mean dropping everything on the floor and logging that fact, or anything in between.
It might have been better to declare goToSleep() as throwing InterruptedException, since it's (potentially) a blocking call – but I guess that ship has sailed.
Regarding a test case: Do you really want to have a test case that creates a deadlock? There's no graceful way to get out of it.
Right, I don't think we will change the goToSleep() signature for 3.0 at least, but let's keep this in mind for the next major version.
Test cases and deadlocks: we have a separate area for "manual" tests, ie code that isn't fit to run in the normal test suite. I still like the idea of having code reproducing bugs in a test case format, even if we don't run them as part of the normal test suite.
I had a chance to take a look a this in more detail this evening and it seems to me we have a lot of unnecessary locking going on. Take JettyContinuationSleeper for example. The boolean fields can be made volatile. Also, continuation is final. It seems to me that we could remove most of if not all of the locks here. I am sure I am full of it, thoughts?
It's not enough to make the booleans volatile, since you want to be able to compare and set them atomically. But I think replacing them with (final) AtomicReference<State> fields would work. Then you could control the state transitions safely. I'm still trying to get my head around it, though.
I think the state space is captured better with a single state enum rather than with multiple booleans. Something like this:
enum State { INITIAL, // the state at construction time PRE_AWAKENED, // awakened before sleep SLEEPING, // sleeping using continuation BLOCKED, // sleeping by blocking using ThreadWaitSleeper FINAL, // awakened from sleep or during attempt to sleep }
and maybe some intermediate states (e.g., TRYING_CONTINUATION).
Get rid of the booleans and have a single field:
private final AtomicReference<State> state = new AtomicReference<State>(INITIAL);
Then when you want to make a state transition from state S1 to state S2, you use this pattern:
if (state.compareAndSet(S1, S2)) { // do things appropriate to leaving state S1 and entering state S2 }
Introduce intermediate states when transitions can fail or you want to make sure that your actions on entry to a state are completed before the rest of the world sees the object in the new state:
if (state.compareAndSet(S1, TRY_S2)) {
// do things appropriate to leaving state S1
if (ableToEnterS2()) { // try to enter state S2
// do things appropriate to entering state S2
state.set(S2); // no need to compareAndSet, since we're guaranteed to be in TRY_S2 here
} else {
// couldn't get in state S2, do something else
}
}
AtomicReference.compareAndSet has the memory effect of a volatile read and write, so it should be possible to ensure visibility of non-volatile fields without extra synchronization by proper placement of the writes and reads of those fields.
I have ignored this issue for too long. I took a stab at coming up with something this evening - even if it is just a start. Tim and Mike can you guys take a look at the attachment? I think it is better but it will probably have some issues that you guys will be able to spot. Now that some code has been written maybe it will be easier to fix this with your help. Thanks.
You're not using state.compareAndSet(S1, S2) to make state transitions. It's much easier to reason about what's going on when you structure everything with compareAndSet. Also I'm not sure that SleeperState completely captures the possible states.
I feel terrible about not doing more to help you here, but for the next few days I have no time to spare. If it can wait until next week, I really think I'll be able to contribute more.
Tim,
Don't feel bad, whenever you can help out - no rush. It is not that I disregarded your advice but it seemed tricky to implement that way. My goal wasn't to get it right, it was to get something started so we could come up with a final/solid solution through discussion. I may take another stab at it this week and post something else, then when you have more time we can work out it all out.
Thanks
> Tim and Mike can you guys take a look at the attachment?
I still have very limited time, but if any of you could share your test setup project that triggers the bug, then I would hopefully be able to review and comment on the patch.
BrowserMob was just recently bit by this bug. Would the stack dump be useful, or are we now at the point of just needing the time to work on a fix?
Needing time to work on a fix, I have looked at it quite a bit. All of the sleepers are affected so the fix is not simple.
@Patrick
Any information helping to trigger the bug would be interesting to have. Once we think we have a fix it would be nice to have some way to verify it, other than running for X hours and saying it is probably ok ![]()
I've attached a modified version of JettyContinuationSleeper that uses an AtomicReference to a state enum to handle the different possible state transitions. There is no explicit synchronization; the visibility is ensured by the use of AtomicReference methods that have the same memory consistency effects as read-volatile or write-volatile.
The states and transitions are as follows:
INITIAL -> goToSleep -> ABOUT_TO_SLEEP
INITIAL -> wakeUp -> PRE_AWAKENED
PRE_AWAKENED -> goToSleep -> FINAL
PRE_AWAKENED -> wakeUp -> PRE_AWAKENED
ABOUT_TO_SLEEP -> [Jetty continuations available] -> SLEEPING
ABOUT_TO_SLEEP -> [Jetty continuations not available] -> BLOCKED
ABOUT_TO_SLEEP -> wakeUp -> (spins until SLEEPING or BLOCKED)
SLEEPING -> wakeUp -> RESUMING
BLOCKED -> wakeUp -> FINAL
RESUMING -> resume -> FINAL
RESUMING -> wakeUp -> RESUMING
FINAL -> wakeUp -> FINAL
Any transition not listed here will result in an IllegalStateException. This should probably be part of the documentation.
The proxy and onAwakening fields are @GuardedBy("state") – they are written before a write-volatile, and read after a read-volatile.
David, I've ignored SleeperState.java, because I think it's better to keep the state enum inside JettyContinuationSleeper – it's intimately tied to the implementation of that class.
Could you folks take a look at it, fix my style mistakes, and, ideally, try it out? I have no way to do a real test of it at this point.
Another version of JettyContinuationSleeper attached with minor improvement – wakeUp doesn't call itself recursively, it just has a do-while loop.
Great Tim! I will take a look next chance I get.
I have checked in Tim's recommended changes. I am still in the process of testing these changes and applying a similar pattern to the other sleepers. If any of the watchers are still interested in this issue and want to experiment with the latest build you can grab it here:
http://ci.directwebremoting.org/bamboo/browse/DWR-TRUNK-33/artifact
Any help, feedback would be appreciated.
Tim, I have finally gotten to re-factoring all of the sleeper classes based on your initial pattern. I have tested the Jetty and Tomcat versions (still haven't gotten to test Servlet3, Grizzly). If you can take a look and let me know if I missed something that would be great. Thanks.
David, I'll look at these, but I noticed that you worked from the earlier of the two JettyContinuationSleeper impls I attached to this issue. It's not huge deal, but the more recent one doesn't use a recursive call to wakeUp().
TomcatSleeper has unnecessary test of onAwakening != null in case SLEEPING of wakeUp(). I would replace it with an assertion – since the only way you can get into SLEEPING state is to set onAwakening to a value that is tested for nullity in goToSleep(). Other than that, TomcatSleeper looks OK, although here, as with JettyContinuationSleeper, I think it would be better not recursively call wakeUp(), but instead use a loop.
GrizzlyContinuationSleeper looks OK to me.
Servlet3Sleeper doesn't make any attempt to use ThreadWaitSleeper; it just rethrows any Exception caught during the suspend attempt as RuntimeException. I guess that's OK, but that could leave the sleeper in the ABOUT_TO_SLEEP state – it should at the very least put it into some failure state.
I am resolving this as the deadlocks have been resolved by Tim's new design. However, the Servlet3 support is still not complete (the new sleeper will follow Tim's pattern).
Hi Davide,
I am sorry, that I didn't wrote earlier, I was too busy. In fact, I am
not sure about my fix. It is working with no problem, for my project,
but I am not sure if it is good universally. I went through the source
code, and I didn't find reason for calling synchronized
(wakeUpCalledLock) in goToSleep function in
org.directwebremoting.dwrp.JettyContinuationSleeper. I compared it to
other classes, which are used when Jetty continuation is not enabled,
and I think other classes used to have such wakeUpCalledLock too, and
it was changed probably for the same type of deadlock. In my opinion,
JettyContinuationSleeper was forgotten to be fixed like the others.
But I didn't spend much time on this, so maybe I am missing something,
and this synchronization need to be there, but if it is so, than the
code need much larger refactoring.
So my simple fix is from:
public void goToSleep(Runnable awakening)
{
synchronized (wakeUpCalledLock)
{
this.onAwakening = awakening;
if (wakeUpCalled)
{ onAwakening.run(); return; }
try
{ continuation.suspend(Integer.MAX_VALUE); }
catch (Exception ex)
{ Continuation.rethrowIfContinuation(ex); log.warn("Exception", ex); proxy = new ThreadWaitSleeper(); proxy.goToSleep(onAwakening); }
}
}
to
public void goToSleep(Runnable awakening)
{
this.onAwakening = awakening;
if (wakeUpCalled)
{ onAwakening.run(); return; } }
synchronized (wakeUpCalledLock) ///this is moved synchronized
block, so onAwakening.run(); is not locked now.
{
try
{ continuation.suspend(Integer.MAX_VALUE); }catch (Exception ex)
{ Continuation.rethrowIfContinuation(ex); log.warn("Exception", ex); proxy = new ThreadWaitSleeper(); proxy.goToSleep(onAwakening); }}
}
I think, that wakeUpCalledLock don't need to be in goToSleep function
at all. But I tested only this version.
Sorry that I didn't add this to Jira directly, I was lack on time.