[Open-FCoE] deadlock in testing (Re: [RFC PATCH V3 11/13] fcoe: use RTNL mutex to serialize create/destroy and protect hostlist)

Chris Leech christopher.leech at intel.com
Thu Jul 9 22:54:14 UTC 2009


On Thu, Jul 09, 2009 at 03:47:57PM -0700, Joe Eykholt wrote:
> Chris Leech wrote:
> > I ran a parallel create/destroy/remove test overnight and something
> > deadlocked.  Running with lockdep enabled gives a reproducible warning,
> > but I'm having trouble making sense of it.  I'm not sure I understand
> > what the "events" lock is here.
> 
> I'm not sure why it says "events" either.  I think it has something
> to do with flush_work() calling lock_map_acquire/release to indicate
> that the work items it will wait on may need locks.
> 
> I see one problem, though.  fcoe_ctlr_destroy() is doing a
> flush_work and it uses the general work thread.  So does
> linkwatch_event(), which needs rtnl_lock().  So the flush_work()
> may hang forever if there's a linkwatch_event queued.  Shoot.

Thanks, that must be it.  I knew it had something to do with a destroy and
a linkwatch event firing at the same time; I just couldn't piece together
the deadlock scenario.

> Ways to fix it:
> 1) have FIP use its own work queue.
> 2) separately flush the FIP work queue while not holding rtnl_lock.
> 3) go back to using a separate mutex for fcoe create/delete, but
>     use rtnl_lock for the hostlist to protect the notification.
> 4) something better?
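
To make the scenario concrete, here is a rough userspace model of it.  It is
only a sketch, not kernel code: struct wq_model, wq_model_queue() and
flush_would_deadlock() are names made up for illustration, and "fip_work" in
the usage note is a placeholder for whatever work item fcoe_ctlr_destroy()
flushes.  It assumes a single worker thread that runs items strictly in
queue order, so a flush has to wait for everything queued ahead of the
target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_MAX 8

/* One queued work item; needs_rtnl means its handler takes rtnl_lock(). */
struct work_model {
    const char *name;
    bool needs_rtnl;
};

/* Model of one worker thread draining its items strictly in FIFO order. */
struct wq_model {
    struct work_model items[WQ_MAX];
    size_t len;
};

/* Append an item to the tail of the modeled queue. */
void wq_model_queue(struct wq_model *q, const char *name, bool needs_rtnl)
{
    assert(q->len < WQ_MAX);
    q->items[q->len].name = name;
    q->items[q->len].needs_rtnl = needs_rtnl;
    q->len++;
}

/*
 * Would flushing items[target] hang?  The worker must finish every item
 * queued ahead of the target, then the target itself.  If the flusher
 * holds RTNL and any of those handlers wants RTNL, the worker blocks on
 * the flusher while the flusher waits on the worker.
 */
bool flush_would_deadlock(const struct wq_model *q, size_t target,
                          bool flusher_holds_rtnl)
{
    if (!flusher_holds_rtnl)
        return false;
    for (size_t i = 0; i <= target && i < q->len; i++)
        if (q->items[i].needs_rtnl)
            return true;
    return false;
}
```

Queue "linkwatch_event" (needs_rtnl = true) followed by "fip_work" on one
shared queue, then flush index 1 with RTNL held: the model reports a
deadlock.  Put "fip_work" alone on its own queue (option 1), or flush with
flusher_holds_rtnl = false (option 2), and it does not.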



