[Open-FCoE] deadlock in testing (Re: [RFC PATCH V3 11/13] fcoe: use RTNL mutex to serialize create/destroy and protect hostlist)
christopher.leech at intel.com
Thu Jul 9 22:54:14 UTC 2009
On Thu, Jul 09, 2009 at 03:47:57PM -0700, Joe Eykholt wrote:
> Chris Leech wrote:
> > I ran a parallel create/destroy/remove test overnight and something
> > deadlocked. Running with lockdep enabled gives a reproducible warning,
> > but I'm having trouble making sense of it. I'm not sure I understand
> > what the "events" lock is here.
> I'm not sure why it says "events" either. I think it has something
> to do with flush_work() calling lock_map_acquire/release to indicate
> that the work items it will wait on may need locks.
> I see one problem, though. fcoe_ctlr_destroy() is doing a
> flush_work and it uses the general work thread. So does
> linkwatch_event(), which needs rtnl_lock(). So the flush_work()
> may hang forever if there's a linkwatch_event queued. Shoot.
Thanks, that must be it. I knew it had something to do with a destroy and
a linkwatch event firing at the same time; I just couldn't put together
the deadlock scenario.
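
For illustration, the scenario Joe describes can be modelled in user space.
This is only a rough sketch, not kernel code: WorkQueue, rtnl, events,
fip_work and destroy_holding_rtnl are hypothetical stand-ins for the shared
work thread, rtnl_lock(), the FIP work item and the destroy path, and the
timeout exists only so the model reports the hang instead of hanging itself.

```python
import queue
import threading


class WorkQueue:
    """Toy single-threaded work queue, standing in for the shared work thread."""

    def __init__(self):
        self._items = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        # One worker runs items strictly in order, so a blocked item
        # stalls everything queued behind it.
        while True:
            fn, done = self._items.get()
            fn()
            done.set()

    def queue_work(self, fn):
        done = threading.Event()
        self._items.put((fn, done))
        return done


rtnl = threading.Lock()   # models rtnl_lock() (assumption: a plain mutex)
events = WorkQueue()      # models the general work thread shared by both users


def linkwatch_event():
    with rtnl:            # needs rtnl, so it blocks while destroy holds it
        pass


def fip_work():
    pass                  # the FIP work item itself needs no lock in this model


def destroy_holding_rtnl(timeout=0.5):
    """Return True if the flush finished, False if it timed out (the hang)."""
    with rtnl:
        events.queue_work(linkwatch_event)      # link event queued first
        fip_done = events.queue_work(fip_work)  # FIP work lands behind it
        return fip_done.wait(timeout)           # models flush_work()
```

Under these assumptions destroy_holding_rtnl() returns False: the worker is
stuck in linkwatch_event() waiting for rtnl, so the FIP item behind it never
runs while the caller still holds the lock.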
> Ways to fix it:
> 1) have FIP use its own work queue.
> 2) separately flush the FIP work queue while not holding rtnl_lock.
> 3) go back to using a separate mutex for fcoe create/delete, but
> use rtnl_lock for the hostlist to protect the notification.
> 4) something better?
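
Options 1 and 2 above can be sketched in the same kind of user-space model.
Again this is a hedged illustration rather than the actual fix: fip_wq,
destroy_private_queue and destroy_flush_unlocked are hypothetical names, and
a plain mutex and toy worker thread stand in for rtnl_lock() and the kernel
work queues.

```python
import queue
import threading


class WorkQueue:
    """Toy single-threaded work queue; items run strictly in order."""

    def __init__(self):
        self._items = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, done = self._items.get()
            fn()
            done.set()

    def queue_work(self, fn):
        done = threading.Event()
        self._items.put((fn, done))
        return done


rtnl = threading.Lock()   # models rtnl_lock()
events = WorkQueue()      # models the shared general work thread
fip_wq = WorkQueue()      # option 1: hypothetical FIP-private work queue


def linkwatch_event():
    with rtnl:            # linkwatch needs rtnl
        pass


def fip_work():
    pass


def destroy_private_queue(timeout=0.5):
    """Option 1: flush a FIP-only queue while still holding rtnl."""
    with rtnl:
        events.queue_work(linkwatch_event)      # blocked, but on another queue
        fip_done = fip_wq.queue_work(fip_work)
        return fip_done.wait(timeout)           # not stuck behind linkwatch


def destroy_flush_unlocked(timeout=0.5):
    """Option 2: queue under rtnl, but flush only after dropping it."""
    with rtnl:
        events.queue_work(linkwatch_event)
        fip_done = events.queue_work(fip_work)
    # rtnl is released here, so linkwatch_event can finish and fip_work runs.
    return fip_done.wait(timeout)
```

In this model both functions return True. Option 1 removes the ordering
dependency on linkwatch entirely, while option 2 keeps the shared queue and
relies on the flush never being issued with rtnl held.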