[Open-FCoE] BUG in fcoe_hostlist_remove() - and test script

Joe Eykholt jeykholt at cisco.com
Tue Jul 7 20:47:17 UTC 2009


Joe Eykholt wrote:
> With the fcoe-next tree and the patch set I'm building, I hit this
> BUG() at line 1862 in fcoe.c:
> 
> int fcoe_hostlist_remove(const struct fc_lport *lp)
> {
>         struct fcoe_softc *fc;
> 
>         write_lock_bh(&fcoe_hostlist_lock);
>         fc = fcoe_hostlist_lookup_softc(fcoe_netdev(lp));
>         BUG_ON(!fc);
> 
> This is called from fcoe_destroy(), and I guess two of those were
> active at the same time as both 'fcoeadm -d' and 'rmmod' might be trying
> to remove the instance.
> 
> I think the list locking should be simplified so that any create/delete/lookup
> just uses the same lock, and holds it across the entire operation.
> 
> I think there may be unapplied patches floating out there to fix this,
> but it'd be really nice if they got applied.
> 
> Otherwise, my tests keep hitting this and every time we rebase I have
> to come up with some temporary workaround.
> 
> Here's a test script, which I would like everyone to use to bang on
> the create/delete/rmmod issues.  It uses fcc but I commented that
> out and you could use the equivalent fcoeadm command.
> 
> Note the fcoeadm commands are done in parallel for all specified nics.
> Do this on a machine with at least 4 threads and a list of 2 or more nics.
> 
> #! /bin/bash
> 
> nics="eth0 eth4"
> count=1000
> bs=64k
> iopass=true
> 
> while :
> do
>         modprobe fcoe
>         for nic in $nics
>         do
>                 fcoeadm -c $nic &
>                 :
>         done
>         wait
>         sleep 10
>         # fcc   # note could do fcoeadm -l or something here.
> 
>         if $iopass
>         then
>                 for disk in /dev/sd[b-z]
>                 do
>                         dd if=$disk of=/dev/null bs=$bs \
>                                 count=$count iflag=direct &
>                         :
>                 done
>                 iopass=false
>                 echo waiting for i/o bs $bs count $count
>                 date
>                 wait
>                 date
>                 sleep 5
>         else
>                 iopass=true
>                 sleep 10
>         fi
> 
>         for nic in $nics
>         do
>                 fcoeadm -d $nic &
>                 :
>         done
>         while :
>         do
>                 rmmod fcoe libfcoe libfc && break
>                 sleep 1
>         done
>         sleep 10
> done
> 
> 
>     Joe
> 

After a few attempts to fix these races between fcoe_destroy() and fcoe_exit(),
I hit this crash:


[ 1099.675885] host14: rport fffffc: Remove port
[ 1099.675886] host14: rport fffffc: Port entered LOGO state from Ready state
[ 1099.675894] host14: rport fffffc: Delete port
[ 1099.675898] host14: lport 6a0000: Received a 3 event for port (fffffc)
[ 1099.675903] host14: rport fffffc: Received a LOGO response
[ 1099.675905] host14: rport fffffc: Received a LOGO response, but in state Delete
[ 1099.675909] host14: lport 6a0000: Entered LOGO state from Ready state
[ 1099.676856] kernel BUG at include/linux/transport_class.h:92!
[ 1099.676856] invalid opcode: 0000 [#1] SMP
[ 1099.676856] last sysfs file: /sys/module/fcoe/parameters/destroy
[ 1099.676856] CPU 2
[ 1099.676856] Modules linked in: fcoe(-) libfcoe libfc st act_skbedit cls_basic sch_multiq scsi_transport_fc e1000e ixgbe mdio [last unloaded: libfc]
[ 1099.676856] Pid: 9650, comm: rmmod Not tainted 2.6.31-rc2-rp7 #1 X7DB8
[ 1099.676856] RIP: 0010:[<ffffffffa004a612>]  [<ffffffffa004a612>] fc_release_transport+0x1c/0x64 [scsi_transport_fc]
[ 1099.676856] RSP: 0018:ffff88013181bea8  EFLAGS: 00010286
[ 1099.676856] RAX: 00000000fffffff0 RBX: ffff880131853000 RCX: 0000000000000000
[ 1099.676856] RDX: 0000000000000880 RSI: 0000000000000001 RDI: ffffffff8150750b
[ 1099.676856] RBP: ffff88013181beb8 R08: 0000000000000000 R09: 0000000000000000
[ 1099.676856] R10: ffffffff8150aff0 R11: ffff88013faf1380 R12: ffffffff81820048
[ 1099.676856] R13: 0000000000000880 R14: 0000000000000000 R15: 0000000000000000
[ 1099.676856] FS:  00007fdd018546f0(0000) GS:ffff880028088000(0000) knlGS:0000000000000000
[ 1099.676856] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1099.676856] CR2: 000000000114b1ac CR3: 0000000131920000 CR4: 00000000000006e0
[ 1099.676856] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1099.676856] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1099.676856] Process rmmod (pid: 9650, threadinfo ffff88013181a000, task ffff88011d116100)
[ 1099.676856] Stack:
[ 1099.676856]  0000000000000000 0000000000000040 ffff88013181bee8 ffffffffa014e088
[ 1099.676856] <0> ffff88013181bec8 ffff88013181bec8 0000000000000000 ffffffffa01508e0
[ 1099.676856] <0> ffff88013181bf78 ffffffff8106e551 ffff8800656f6366 ffffffff810dc77f
[ 1099.676856] Call Trace:
[ 1099.676856]  [<ffffffffa014e088>] fcoe_exit+0xe0/0xe9 [fcoe]
[ 1099.676856]  [<ffffffff8106e551>] sys_delete_module+0x1d3/0x249
[ 1099.676856]  [<ffffffff810dc77f>] ? path_put+0x1d/0x21
[ 1099.676856]  [<ffffffff81087a88>] ? audit_syscall_entry+0x114/0x140
[ 1099.676856]  [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[ 1099.676856] Code: 01 00 00 e8 3a fb ff ff 5a 48 89 d8 5b c9 c3 55 48 89 e5 53 48 89 fb 48 8d bf 90 00 00 00 48 83 ec 08 e8 3a 7f 24 e1 85 c0 74 04 <0f> 0b eb fe 48 89 df e8 2a 7f 24 e1 85 c0 74 04 0f 0b eb fe 48
[ 1099.676856] RIP  [<ffffffffa004a612>] fc_release_transport+0x1c/0x64 [scsi_transport_fc]
[ 1099.676856]  RSP <ffff88013181bea8>
[ 1099.947728] ---[ end trace 9e34a642b96a5faa ]---

This may be because, even though fcoe_exit() removed every fcoe instance it found
on the list, one instance had already been taken off the list by fcoe_destroy() but
was still in the process of being deleted, so it still had transport attributes.
Calling fc_release_transport() at that point triggered this BUG.
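
The failure mode described above can be modeled in a few lines of userspace C.
This is only a sketch under assumptions: the demo_* names are hypothetical and
are not kernel APIs. It assumes that the check at transport_class.h:92 fires
because the underlying container unregister call refuses to proceed while
devices are still attached. RAX in the trace is 00000000fffffff0, which is -16
(-EBUSY) as a 32-bit value, so that reading seems plausible but is not
confirmed here.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of a transport container: it counts the devices
 * (here, fcoe instances with transport attributes) still attached to it.
 */
struct demo_container {
	int attached;
};

/* An instance gains its transport attributes. */
void demo_attach(struct demo_container *c)
{
	c->attached++;
}

/* An instance finishes teardown and drops its transport attributes. */
void demo_detach(struct demo_container *c)
{
	if (c->attached > 0)
		c->attached--;
}

/*
 * Unregister refuses with -EBUSY while anything is still attached.
 * In this model, a caller that treats a non-zero return as fatal
 * plays the role of the BUG() seen in the trace.
 */
int demo_unregister(const struct demo_container *c)
{
	return c->attached ? -EBUSY : 0;
}
```

In this model, an exit path that only walks the list misses an instance that
was already unlinked but has not yet called demo_detach(), and the unregister
call then fails.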

BTW, before this I changed fcoe_exit() to lock the list, which it wasn't doing before.

Maybe we should hold the lock across the entire fcoe_destroy() and fcoe_exit() functions,
up to where the scsi_host is deleted.   I don't think this would cause a problem because
it only serializes create / delete and netdev notifications.
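
A minimal userspace sketch of the "one lock held across the whole operation"
idea follows. It is an illustration, not the driver code: the demo_* names and
struct demo_port are hypothetical stand-ins, and a pthread mutex stands in for
whatever lock the driver would use. The point is that lookup, unlink, and
teardown all happen under the same lock, so a second remover sees "not found"
and returns an error instead of tripping a BUG_ON().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a per-netdev fcoe instance. */
struct demo_port {
	struct demo_port *next;
	const char *ifname;
};

static struct demo_port *demo_list;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold demo_lock. Returns the link pointing at the match. */
static struct demo_port **demo_find(const char *ifname)
{
	struct demo_port **pp;

	for (pp = &demo_list; *pp; pp = &(*pp)->next)
		if (strcmp((*pp)->ifname, ifname) == 0)
			return pp;
	return NULL;
}

/* Create: duplicate check and insertion happen under one lock hold. */
int demo_create(struct demo_port *port)
{
	int rc = 0;

	assert(port && port->ifname);
	pthread_mutex_lock(&demo_lock);
	if (demo_find(port->ifname)) {
		rc = -EEXIST;
	} else {
		port->next = demo_list;
		demo_list = port;
	}
	pthread_mutex_unlock(&demo_lock);
	return rc;
}

/* Destroy: lookup, unlink, and teardown happen under one lock hold. */
int demo_destroy(const char *ifname)
{
	struct demo_port **pp;
	int rc = -ENODEV;

	assert(ifname);
	pthread_mutex_lock(&demo_lock);
	pp = demo_find(ifname);
	if (pp) {
		struct demo_port *port = *pp;

		*pp = port->next;
		port->next = NULL;
		/* Teardown of the instance would go here, still locked. */
		rc = 0;
	}
	pthread_mutex_unlock(&demo_lock);
	return rc;
}
```

With this shape, two concurrent callers of demo_destroy("eth0") are serialized:
one returns 0 and the other returns -ENODEV. A module-exit path could loop over
the list under the same lock and would never observe a half-deleted entry.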

Comments?  Are these issues fixed by other patches that are about to be integrated?

I know my fixes would have some conflict with Chris's NPIV changes, but I haven't
reviewed them enough to know whether they fix the problems.

Again, the reason I'm interested in this is that my test script shows these
problems more now than before for some reason, so it's holding up my rport / discovery
patches.

	Thanks,
	Joe


