[Open-FCoE] deadlock in testing (Re: [RFC PATCH V3 11/13] fcoe: use RTNL mutex to serialize create/destroy and protect hostlist)

Joe Eykholt jeykholt at cisco.com
Thu Jul 9 23:44:02 UTC 2009


Chris Leech wrote:
> On Thu, Jul 09, 2009 at 03:47:57PM -0700, Joe Eykholt wrote:
>> Chris Leech wrote:
>>> I ran a parallel create/destroy/remove test overnight and something
>>> deadlocked.  Running with lockdep enabled gives a reproducible warning,
>>> but I'm having trouble making sense of it.  I'm not sure I understand
>>> what the "events" lock is here.
>> I'm not sure why it says "events" either.  I think it has something
>> to do with flush_work() calling lock_map_acquire/release to indicate
>> that the work items it will wait on may need locks.
>>
>> I see one problem, though.  fcoe_ctlr_destroy() is doing a
>> flush_work and it uses the general work thread.  So does
>> linkwatch_event(), which needs rtnl_lock().  So the flush_work()
>> may hang forever if there's a linkwatch_event queued.  Shoot.
>>
>> Ways to fix it:
>> 1) have FIP use its own work queue.
>> 2) separately flush the FIP work queue while not holding rtnl_lock.
>> 3) go back to using a separate mutex for fcoe create/delete, but
>>     use rtnl_lock for the hostlist to protect the notification.
>> 4) something better?
> 
> What about switching from flush_work to cancel_work_sync?  We don't care
> if the queued work gets run or not, just that it's off the queue.

No, we really need it flushed so it doesn't call up into libfc after
we free the lport.

> From what I can tell, flush_work attempts to move the work struct to the
> front of the queue in order to prevent these kinds of deadlocks.  But,
> the worker thread might already by blocked on the mutex before
> flush_work gets called.
> 
> cancel_work_sync will wait for completion if that work struct is
> currently being run, otherwise it will delete it off the queue.  So it
> either gets flushed, or if the work thread is blocked we just remove our
> work items and keep going.
> 
> ... testing ...
> 
> Which seems to fix the linkwatch problem, but there are more issues
> buried in here.  If we flush fip->recv_work under rtnl, it has the
> potential to call fip->update_mac and fcoe_update_src_mac also wants to
> take rtnl.

Yeah.  All this is leading me to thing the solution 3 above is the best.

	Joe




More information about the devel mailing list