[Open-FCoE] BUG in fcoe_hostlist_remove() - and test script

Chris Leech christopher.leech at intel.com
Wed Jul 8 17:11:01 UTC 2009


On Tue, Jul 07, 2009 at 01:47:17PM -0700, Joe Eykholt wrote:
> Joe Eykholt wrote:
> > With the fcoe-next tree and the patch set I'm building, I hit this
> > BUG() at line 1862 in fcoe.c:
> > 
> > int fcoe_hostlist_remove(const struct fc_lport *lp)
> > {
> >         struct fcoe_softc *fc;
> > 
> >         write_lock_bh(&fcoe_hostlist_lock);
> >         fc = fcoe_hostlist_lookup_softc(fcoe_netdev(lp));
> >         BUG_ON(!fc);
> > 
> > This is called from fcoe_destroy(), and I guess two of those were
> > active at the same time as both 'fcoeadm -d' and 'rmmod' might be trying
> > to remove the instance.
> > 
> > I think the list locking should be simplified so that any 
> > create/delete/lookup
> > just uses the same lock, and holds it across a the entire operation.
> > 
> > I think there may be unapplied patches floating out there to fix this,
> > but it'd be really nice if they got applied.
>
> This may be because even though fcoe_exit() removed every instance of fcoe it found
> on the list, there was still one instance that had been removed by fcoe_destroy(), but
> is still in the process of being deleted ... it still had transport attributes.
> So calling fc_release_transport() caused this BUG.
> 
> BTW, before this I changed fcoe_exit() to lock the list, which it wasn't doing before.
> 
> Maybe we should hold the lock across the entire fcoe_destroy() and fcoe_exit() functions,
> up to where the scsi_host is deleted.   I don't think this would cause a problem because
> it only serializes create / delete and netdev notifications.
> 
> Comments?  Are these issues fixed by other patches that are about to be integrated?
> 
> I know my fixes would have some conflict with Chris's NPIV changes, but I haven't
> reviewed them enough to know whether they fix the problems.
> 
> Again, the reason I'm interested in this is that my test script shows these
> problems more now than before for some reason, so it's holding up my rport / discovery
> patches.

Hey Joe,

I've got patches mixed in with my NPIV work to fix this, even though
it's not directly related to NPIV support.

The version I sent out previously added a mutex that was held across
the entire create and destroy calls.  I had missed the module exit
function, but I've got that covered now too.

That leaves an interesting race where create or destroy calls are
waiting on the mutex while the module exit routine is running.  It turns
out that the sysfs files created for module parameters aren't removed
until after the modules exit function is called.  So I added a check to
see if the scsi transport template is registered at the top of the
create and destroy functions, so they can just exit if the fcoe module
is either not initialized yet or is about to be unloaded (John spotted
that one in testing and suggested the fix).

I think that should take care of all these issues.  I've still got a
little work around link state management before enabling NPIV, but I'll
send the latest version of all my cleanup/prep-work patches today
(including this fix).  They've all been reviewed and tested, so I think
they're ready to go.

- Chris




More information about the devel mailing list