[Open-FCoE] deadlock in testing (Re: [RFC PATCH V3 11/13] fcoe: use RTNL mutex to serialize create/destroy and protect hostlist)

Chris Leech christopher.leech at intel.com
Tue Jul 14 18:03:16 UTC 2009

On Mon, Jul 13, 2009 at 03:52:39PM -0700, Joe Eykholt wrote:
> All we need to do is cease using the interface.  We can destroy the
> instance later.   The netdev layer should (IMO) protect against that.
> I looked a bit at the code, and it seems like it would work to hold
> the netdev from fcoe_create to either fcoe_destroy or the unregister
> notification.

I think I've got the netdev reference counting right, and doing a
workqueue destroy by calling schedule_work in the netdev notifier
works fine (with the change to use cancel_work_sync instead of
flush_work for the two FIP work structs)

> > c) The destroy can't be made asynchronous using schedule_work(), because
> > it then attempts to flush the very workqueue thread it's running in.  It
> > might be possible if it is safe to change all those flushes to
> > cancel_work_sync().
> Agreed.   I don't know how hard it would be to make all of the work
> cancel-able, but that's a good thing to look into.

I've got this working now, and the locking looks good but there are a
few details I need to look after today.  My patches messed up the
ordering between unregistering the packet type handlers and destroying
the FIP controller, so my overnight test ran into a BUG where I think a
scheduled work struct was freed.  Also, I need to make sure that the FIP
packet queue is freed after the receive work function gets canceled.

I'm going to fix those up and repost the updated patch set.

> An alternative would be to refcount the fcoe instance, and do a get when
> scheduling a work item, and a put in the work function when finished.
> That way the instance would stay around until the work completes.
> The work item would have to take a lock (rtnl_lock) and make
> sure the instance is still "alive".
> Then we could do the flush on module unload after dropping the locks,
> and on other fcoe_destroys we wouldn't have to flush before returning.

That's an interesting idea, I'll keep it in mind as I'm trying to work
out these last details.  It looks like when the workqueue is run it
explicitly supports freeing the work struct from within the work
function, so this could work.

Thanks for all your help on this Joe.

More information about the devel mailing list