[Open-FCoE] deadlock in testing (Re: [RFC PATCH V3 11/13] fcoe: use RTNL mutex to serialize create/destroy and protect hostlist)

Chris Leech christopher.leech at intel.com
Thu Jul 9 23:40:31 UTC 2009

On Thu, Jul 09, 2009 at 03:47:57PM -0700, Joe Eykholt wrote:
> Chris Leech wrote:
> > I ran a parallel create/destroy/remove test overnight and something
> > deadlocked.  Running with lockdep enabled gives a reproducible warning,
> > but I'm having trouble making sense of it.  I'm not sure I understand
> > what the "events" lock is here.
> I'm not sure why it says "events" either.  I think it has something
> to do with flush_work() calling lock_map_acquire/release to indicate
> that the work items it will wait on may need locks.
> I see one problem, though.  fcoe_ctlr_destroy() is doing a
> flush_work and it uses the general work thread.  So does
> linkwatch_event(), which needs rtnl_lock().  So the flush_work()
> may hang forever if there's a linkwatch_event queued.  Shoot.
> Ways to fix it:
> 1) have FIP use its own work queue.
> 2) separately flush the FIP work queue while not holding rtnl_lock.
> 3) go back to using a separate mutex for fcoe create/delete, but
>     use rtnl_lock for the hostlist to protect the notification.
> 4) something better?

What about switching from flush_work to cancel_work_sync?  We don't care
if the queued work gets run or not, just that it's off the queue.

>From what I can tell, flush_work attempts to move the work struct to the
front of the queue in order to prevent these kinds of deadlocks.  But,
the worker thread might already by blocked on the mutex before
flush_work gets called.

cancel_work_sync will wait for completion if that work struct is
currently being run, otherwise it will delete it off the queue.  So it
either gets flushed, or if the work thread is blocked we just remove our
work items and keep going.

... testing ...

Which seems to fix the linkwatch problem, but there are more issues
buried in here.  If we flush fip->recv_work under rtnl, it has the
potential to call fip->update_mac and fcoe_update_src_mac also wants to
take rtnl.

More information about the devel mailing list