[Open-FCoE] libfc: exchange circular lock dependency detected.

Vasu Dev vasu.dev at linux.intel.com
Wed Jul 15 21:20:58 UTC 2009


On Fri, 2009-07-10 at 16:21 -0700, Vasu Dev wrote:
> On Fri, 2009-07-10 at 15:21 -0700, Joe Eykholt wrote:
> > This is with the latest fcoe-next.git tree.
> > 
> > I see this circular dependency message when running my tests.  I haven't
> > completely figured it out, but I think there must be cases where ex lock
> > is held while em lock is acquired, 
> 
> That is the case in fc_exch_em_alloc() for new exch allocation: the exch
> lock is held along with the em lock while adding the new exch to the
> exches array.
> 
> > and that's a problem.
> 
> This shouldn't be the issue here since fc_exch_em_alloc() is the only
> case where these two locks are held together, so there is no possibility
> of circular locking due to a reversed order of taking these two locks
> somewhere else.
> 
> > It may also be caused by flushing fc_exch_timeout() work while holding the
> > em lock it sometimes needs.
> > 
> > I also noticed that fc_exch_timeout is calling fc_exch_rrq while holding
> > an ex_lock and that does a new fc_exch_alloc() ... which gets the em_lock,
> > and a different ex_lock so that maybe what it's complaining about.
> > 
> 
> Yeah, that is what this log trace is suggesting, and I also got the same
> trace with my exch pool patches (pasted below; it also suggested that
> exactly the above call flow could lead to circular locking on the exch
> lock). However, I don't see how, since fc_exch_rrq() is called with an
> exch lock held, but that same exch's lock won't be held again when
> sending out the RRQ in that call flow. The exch lock held before
> fc_exch_rrq() and the one taken during fc_exch_seq_send() in that flow
> are on two different exch instances, but perhaps the checking engine is
> deducing that the same exch lock is held twice on the same instance. So
> this is perhaps a false alarm from the lock checking engine, or I'm
> missing something here. Just to avoid this nasty warning, perhaps the
> exch lock should be dropped before calling fc_exch_rrq(), but I need to
> think more about that since it might have other issues.
> 
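To illustrate the per-class point: lockdep tracks ordering between lock
*classes*, not lock instances. All ex_locks share one class and the em/pool
locks share another, so the checker sees "em_lock held while taking an
ex_lock" from the alloc path and "ex_lock held while taking the em_lock"
from the RRQ path, and flags the cycle even though two different exch
instances are involved. A minimal userspace sketch of that bookkeeping
(illustrative names only, not kernel or libfc code):

```c
#include <stdio.h>

/* Minimal sketch of class-based dependency tracking as lockdep does it.
 * Illustrative only: two classes stand in for all ex_lock instances and
 * all em/pool lock instances. */

enum lock_class { EX_LOCK, EM_LOCK, NCLASSES };

static int held_while_taking[NCLASSES][NCLASSES];
static const char *class_name[NCLASSES] = { "ex_lock", "em_lock" };

/* Record that class 'held' was held while acquiring class 'taken';
 * return 1 if this closes a two-class cycle (lockdep would warn). */
int record_acquire(enum lock_class held, enum lock_class taken)
{
    held_while_taking[held][taken] = 1;
    if (held_while_taking[taken][held]) {
        printf("possible circular locking dependency: %s <-> %s\n",
               class_name[held], class_name[taken]);
        return 1;
    }
    return 0;
}
```

Feeding it the two paths from the trace reproduces the complaint:
record_acquire(EM_LOCK, EX_LOCK) from the fc_exch_em_alloc() path records
one edge, and record_acquire(EX_LOCK, EM_LOCK) from the
fc_exch_timeout() -> fc_exch_rrq() path then reports the cycle.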

Dropping the exch lock before calling fc_exch_rrq() seems to be the best
fix, with one caveat: the fc_exch_reset case must be handled before an RRQ
retry is attempted. I handled the fc_exch_reset case by checking additional
state flags after grabbing the exch lock again to schedule the RRQ retry.

The fix is:

diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
index a074562..c261893 100644
--- a/drivers/scsi/libfc/fc_exch.c
+++ b/drivers/scsi/libfc/fc_exch.c
@@ -445,9 +445,9 @@ static void fc_exch_timeout(struct work_struct *work)
        e_stat = ep->esb_stat;
        if (e_stat & ESB_ST_COMPLETE) {
                ep->esb_stat = e_stat & ~ESB_ST_REC_QUAL;
+               spin_unlock_bh(&ep->ex_lock);
                if (e_stat & ESB_ST_REC_QUAL)
                        fc_exch_rrq(ep);
-               spin_unlock_bh(&ep->ex_lock);
                goto done;
        } else {
                resp = ep->resp;
@@ -1672,14 +1672,14 @@ static void fc_exch_rrq(struct fc_exch *ep)
        struct fc_lport *lp;
        struct fc_els_rrq *rrq;
        struct fc_frame *fp;
-       struct fc_seq *rrq_sp;
        u32 did;

        lp = ep->lp;

        fp = fc_frame_alloc(lp, sizeof(*rrq));
        if (!fp)
-               return;
+               goto retry;
+
        rrq = fc_frame_payload_get(fp, sizeof(*rrq));
        memset(rrq, 0, sizeof(*rrq));
        rrq->rrq_cmd = ELS_RRQ;
@@ -1695,13 +1695,20 @@ static void fc_exch_rrq(struct fc_exch *ep)
                       fc_host_port_id(lp->host), FC_TYPE_ELS,
                       FC_FC_FIRST_SEQ | FC_FC_END_SEQ | FC_FC_SEQ_INIT, 0);

-       rrq_sp = fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, NULL, ep,
-                                 lp->e_d_tov);
-       if (!rrq_sp) {
-               ep->esb_stat |= ESB_ST_REC_QUAL;
-               fc_exch_timer_set_locked(ep, ep->r_a_tov);
+       if (fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, NULL, ep, lp->e_d_tov))
+               return;
+
+retry:
+       spin_lock_bh(&ep->ex_lock);
+       if (ep->state & (FC_EX_RST_CLEANUP | FC_EX_DONE)) {
+               spin_unlock_bh(&ep->ex_lock);
+               /* drop hold for rec qual */
+               fc_exch_release(ep);
                return;
        }
+       ep->esb_stat |= ESB_ST_REC_QUAL;
+       fc_exch_timer_set_locked(ep, ep->r_a_tov);
+       spin_unlock_bh(&ep->ex_lock);
 }
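The locking pattern the patch uses can be sketched in userspace (the names
ST_DONE, ST_REC_QUAL, send_rrq and exch_timeout here are hypothetical
stand-ins, not the actual libfc code): drop the object's lock before the
call that may take other locks of the same class, and on the retry path
re-take the lock and re-check the state flags, since the exchange may have
been reset or completed while the lock was dropped.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace sketch of the fix's locking pattern (hypothetical names). */

#define ST_DONE     0x1   /* exchange reset/completed, e.g. fc_exch_reset */
#define ST_REC_QUAL 0x2   /* recovery qualifier held, RRQ needed */

struct exch {
    pthread_mutex_t lock;
    unsigned int state;
};

/* Stand-in for the RRQ send: in the real code this may allocate a new
 * exchange, taking the em/pool lock and a different ex_lock of the same
 * lock class. Returns false to simulate a failure that needs a retry. */
static bool send_rrq(struct exch *ep)
{
    (void)ep;
    return false;
}

void exch_timeout(struct exch *ep)
{
    pthread_mutex_lock(&ep->lock);
    unsigned int st = ep->state;
    ep->state &= ~ST_REC_QUAL;
    /* drop the lock BEFORE the call that does nested locking */
    pthread_mutex_unlock(&ep->lock);

    if (!(st & ST_REC_QUAL))
        return;
    if (send_rrq(ep))
        return;

    /* retry path: re-take the lock and re-check for a reset that may
     * have happened while the lock was dropped */
    pthread_mutex_lock(&ep->lock);
    if (ep->state & ST_DONE) {
        pthread_mutex_unlock(&ep->lock);
        return;                 /* exchange went away; drop the retry */
    }
    ep->state |= ST_REC_QUAL;   /* keep qualifier; real code rearms timer */
    pthread_mutex_unlock(&ep->lock);
}
```

The re-check after re-locking is what covers the fc_exch_reset caveat: a
reset that raced with the unlocked window is seen as the done/cleanup flag,
and the retry is abandoned instead of rearming a timer on a dead exchange.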


I tested this fix by triggering ABTS/RRQ and saw no more circular locking
warnings, but I was not able to trigger the circular locking warning
without the fix either. I haven't changed the kernel configuration since I
got the previously sent log of the circular locking warning on the same
kernel, so it is the same kernel config with all lock checking enabled. I
also tried without the fix after a reboot, but still didn't see the same
warning, so I'm not sure how to force the lock checking code to always
check for circular locking on the exch lock in the fc_exch_rrq() call
flow. Any suggestions?
 
Let me know your comments on the above fix. I'll submit it against the
fcoe-fixes tree once that tree is updated to the .31 kernel; it is
currently on 2.6.30-rc4.
	
	Vasu

> Here is a similar log with the exch pool patches, which coincidentally
> happened today. I was wondering whether it was caused by my patches, but
> your response confirms that wasn't the case, since it happens without
> them as well.
> 
>  =======================================================
> [239459.637422] [ INFO: possible circular locking dependency detected ]
> [239459.637426] 2.6.31-rc2-upstream #3
> [239459.637428] -------------------------------------------------------
> [239459.637431] events/0/51 is trying to acquire lock:
> [239459.637433]  (&pool->lock){+.-...}, at: [<ffffffffa02bf4d2>]
> fc_exch_em_alloc+0x6d/0x20c [libfc]
> [239459.637448]
> [239459.637449] but task is already holding lock:
> [239459.637451]  (&ep->ex_lock){+.-...}, at: [<ffffffffa02c0037>]
> fc_exch_timeout+0x24/0xea [libfc]
> [239459.637463]
> [239459.637464] which lock already depends on the new lock.
> [239459.637465]
> [239459.637467]
> [239459.637468] the existing dependency chain (in reverse order) is:
> [239459.637470]
> [239459.637471] -> #1 (&ep->ex_lock){+.-...}:
> [239459.637476]        [<ffffffff81066e19>] check_prev_add+0x146/0x1c7
> [239459.637483]        [<ffffffff810672e3>] validate_chain+0x449/0x4e3
> [239459.637487]        [<ffffffff8106766f>] __lock_acquire+0x2f2/0x363
> [239459.637492]        [<ffffffff810677aa>] lock_acquire+0xca/0xee
> [239459.637496]        [<ffffffff8150348c>] _spin_lock_bh+0x31/0x40
> [239459.637501]        [<ffffffffa02bf548>] fc_exch_em_alloc+0xe3/0x20c
> [libfc]
> [239459.637510]        [<ffffffffa02bf6aa>] fc_exch_alloc+0x39/0x57
> [libfc]
> [239459.637519]        [<ffffffffa02bfab7>] fc_exch_seq_send+0x2a/0x1cb
> [libfc]
> [239459.637528]        [<ffffffffa02c1002>] fc_elsct_send+0x4f3/0x509 
> [239459.637537]        [<ffffffffa02c1ad4>] fc_lport_enter_flogi
> +0xb1/0xc8 [libfc]
> [239459.637546]        [<ffffffffa02c1dd7>] fc_linkup+0x5b/0x68 [libfc]
> [239459.637556]        [<ffffffffa02d6c66>] fcoe_ctlr_link_up+0x8b/0xa4
> [libfcoe]
> [239459.637562]        [<ffffffffa02ddc91>] fcoe_if_create+0x403/0x444
> [fcoe]
> [239459.637570]        [<ffffffffa02ddd75>] fcoe_create+0xa3/0xd1 [fcoe]
> [239459.637577]        [<ffffffff81054fbd>] param_attr_store+0x22/0x30
> [239459.637582]        [<ffffffff81055010>] module_attr_store+0x21/0x25
> [239459.637587]        [<ffffffff81127f0d>] sysfs_write_file+0xf6/0x132
> [239459.637594]        [<ffffffff810d665b>] vfs_write+0xa9/0x106
> [239459.637600]        [<ffffffff810d676e>] sys_write+0x45/0x69
> [239459.637604]        [<ffffffff8100baeb>] system_call_fastpath
> +0x16/0x1b
> [239459.637611]        [<ffffffffffffffff>] 0xffffffffffffffff
> [239459.637622]
> [239459.637623] -> #0 (&pool->lock){+.-...}:
> [239459.637627]        [<ffffffff81066d19>] check_prev_add+0x46/0x1c7
> [239459.637632]        [<ffffffff810672e3>] validate_chain+0x449/0x4e3
> [239459.637636]        [<ffffffff8106766f>] __lock_acquire+0x2f2/0x363
> [239459.637641]        [<ffffffff810677aa>] lock_acquire+0xca/0xee
> [239459.637645]        [<ffffffff8150348c>] _spin_lock_bh+0x31/0x40
> [239459.637649]        [<ffffffffa02bf4d2>] fc_exch_em_alloc+0x6d/0x20c
> [libfc]
> [239459.637658]        [<ffffffffa02bf6aa>] fc_exch_alloc+0x39/0x57
> [libfc]
> [239459.637667]        [<ffffffffa02bfab7>] fc_exch_seq_send+0x2a/0x1cb
> [libfc]
> [239459.637675]        [<ffffffffa02bff05>] fc_exch_rrq+0xe6/0x10c
> [libfc]
> [239459.637684]        [<ffffffffa02c006d>] fc_exch_timeout+0x5a/0xea
> [libfc]
> [239459.637693]        [<ffffffff81052764>] run_workqueue+0x112/0x20c
> [239459.637697]        [<ffffffff81052948>] worker_thread+0xea/0xfb
> [239459.637701]        [<ffffffff81056bde>] kthread+0x85/0x8d
> [239459.637706]        [<ffffffff8100cbfa>] child_rip+0xa/0x20
> [239459.637710]        [<ffffffffffffffff>] 0xffffffffffffffff
> [239459.637714]
> [239459.637714] other info that might help us debug this:
> [239459.637716]
> [239459.637719] 3 locks held by events/0/51:
> [239459.637721]  #0:  (events){+.+.+.}, at: [<ffffffff81052713>]
> run_workqueue+0xc1/0x20c
> [239459.637728]  #1:  (&(&ep->timeout_work)->work){+.+...}, at:
> [<ffffffff81052713>] run_workqueue+0xc1/0x20c
> [239459.637735]  #2:  (&ep->ex_lock){+.-...}, at: [<ffffffffa02c0037>]
> fc_exch_timeout+0x24/0xea [libfc]
> [239459.637747]
> [239459.637747] stack backtrace:
> [239459.637751] Pid: 51, comm: events/0 Not tainted 2.6.31-rc2-upstream
> #3
> [239459.637753] Call Trace:
> [239459.637758]  [<ffffffff8106692d>] print_circular_bug_tail+0x71/0x7c
> [239459.637763]  [<ffffffff81066d19>] check_prev_add+0x46/0x1c7
> [239459.637768]  [<ffffffff810672e3>] validate_chain+0x449/0x4e3
> [239459.637772]  [<ffffffff8106766f>] __lock_acquire+0x2f2/0x363
> [239459.637777]  [<ffffffff810677aa>] lock_acquire+0xca/0xee
> [239459.637785]  [<ffffffffa02bf4d2>] ? fc_exch_em_alloc+0x6d/0x20c
> [libfc]
> [239459.637792]  [<ffffffff810a66d7>] ? mempool_alloc+0x5d/0x114
> [239459.637796]  [<ffffffff8150348c>] _spin_lock_bh+0x31/0x40
> [239459.637804]  [<ffffffffa02bf4d2>] ? fc_exch_em_alloc+0x6d/0x20c
> [libfc]
> [239459.637813]  [<ffffffffa02bf4d2>] fc_exch_em_alloc+0x6d/0x20c
> [libfc]
> [239459.637819]  [<ffffffff813e0bb0>] ? __alloc_skb+0x41/0x160
> [239459.637827]  [<ffffffffa02bf6aa>] fc_exch_alloc+0x39/0x57 [libfc]
> [239459.637836]  [<ffffffffa02bfab7>] fc_exch_seq_send+0x2a/0x1cb
> [libfc]
> [239459.637844]  [<ffffffffa02bef86>] ? fc_exch_rrq_resp+0x0/0x104
> [libfc]
> [239459.637853]  [<ffffffffa02bff05>] fc_exch_rrq+0xe6/0x10c [libfc]
> [239459.637861]  [<ffffffffa02c0037>] ? fc_exch_timeout+0x24/0xea
> [libfc]
> [239459.637870]  [<ffffffffa02c006d>] fc_exch_timeout+0x5a/0xea [libfc]
> [239459.637878]  [<ffffffffa02c0013>] ? fc_exch_timeout+0x0/0xea [libfc]
> [239459.637883]  [<ffffffff81052764>] run_workqueue+0x112/0x20c
> [239459.637887]  [<ffffffff81052713>] ? run_workqueue+0xc1/0x20c
> [239459.637891]  [<ffffffff81052948>] worker_thread+0xea/0xfb
> [239459.637895]  [<ffffffff81056f1f>] ? autoremove_wake_function
> +0x0/0x34
> [239459.637900]  [<ffffffff8105285e>] ? worker_thread+0x0/0xfb
> [239459.637904]  [<ffffffff81056bde>] kthread+0x85/0x8d
> [239459.637908]  [<ffffffff8100cbfa>] child_rip+0xa/0x20
> [239459.637913]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
> [239459.637917]  [<ffffffff81056b38>] ? kthreadd+0x107/0x128
> [239459.637922]  [<ffffffff81056b59>] ? kthread+0x0/0x8d
> [239459.637926]  [<ffffffff8100cbf0>] ? child_rip+0x0/0x20
> 
> 
> 
> > Is this explained by recent changes in this area or am I just seeing this due to
> > changes in my configuration (I have tapes again) and/or debug features?
> > 
> > 
> > [  116.871020] =======================================================
> > [  116.872004] [ INFO: possible circular locking dependency detected ]
> > [  116.872004] 2.6.31-rc2-rp7 #1
> > [  116.872004] -------------------------------------------------------
> > [  116.872004] events/1/16 is trying to acquire lock:
> > [  116.872004]  (&mp->em_lock){+.-...}, at: [<ffffffffa00570bd>] fc_exch_alloc+0xa1/0x26e [libfc]
> > [  116.872004]
> > [  116.872004] but task is already holding lock:
> > [  116.872004]  (&ep->ex_lock){+.-...}, at: [<ffffffffa005814a>] fc_exch_timeout+0x25/0x20f [libfc]
> > [  116.872004]
> > [  116.872004] which lock already depends on the new lock.
> > [  116.872004]
> > [  116.872004]
> > [  116.872004] the existing dependency chain (in reverse order) is:
> > [  116.872004]
> > [  116.872004] -> #1 (&ep->ex_lock){+.-...}:
> > [  116.872004]        [<ffffffff81066e10>] __lock_acquire+0xa5e/0xbe6
> > [  116.872004]        [<ffffffff81067051>] lock_acquire+0xb9/0xdd
> > [  116.872004]        [<ffffffff81508971>] _spin_lock_bh+0x31/0x3d
> > [  116.872004]        [<ffffffffa0057138>] fc_exch_alloc+0x11c/0x26e [libfc]
> > [  116.872004]        [<ffffffffa0057f79>] fc_exch_seq_send+0x2a/0x1d6 [libfc]
> > [  116.872004]        [<ffffffffa005888e>] fc_elsct_send+0x4a8/0x4be [libfc]
> > [  116.872004]        [<ffffffffa0059be9>] fc_lport_enter_flogi+0xb4/0xcd [libfc]
> > [  116.872004]        [<ffffffffa0059eeb>] fc_linkup+0x5b/0x68 [libfc]
> > [  116.872004]        [<ffffffffa006da1a>] fcoe_ctlr_link_up+0x8b/0xa4 [libfcoe]
> > [  116.872004]        [<ffffffffa0074d1f>] fcoe_create+0x7ce/0x856 [fcoe]
> > [  116.872004]        [<ffffffff81054e48>] param_attr_store+0x25/0x35
> > [  116.872004]        [<ffffffff81054e9d>] module_attr_store+0x21/0x25
> > [  116.872004]        [<ffffffff81124af2>] sysfs_write_file+0xe4/0x119
> > [  116.872004]        [<ffffffff810d554c>] vfs_write+0xab/0x105
> > [  116.872004]        [<ffffffff810d566a>] sys_write+0x47/0x6e
> > [  116.872004]        [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
> > [  116.872004]        [<ffffffffffffffff>] 0xffffffffffffffff
> > [  116.872004]
> > [  116.872004] -> #0 (&mp->em_lock){+.-...}:
> > [  116.872004]        [<ffffffff81066d04>] __lock_acquire+0x952/0xbe6
> > [  116.872004]        [<ffffffff81067051>] lock_acquire+0xb9/0xdd
> > [  116.872004]        [<ffffffff81508971>] _spin_lock_bh+0x31/0x3d
> > [  116.872004]        [<ffffffffa00570bd>] fc_exch_alloc+0xa1/0x26e [libfc]
> > [  116.872004]        [<ffffffffa0057f79>] fc_exch_seq_send+0x2a/0x1d6 [libfc]
> > [  116.872004]        [<ffffffffa0058276>] fc_exch_timeout+0x151/0x20f [libfc]
> > [  116.872004]        [<ffffffff81052600>] worker_thread+0x1fa/0x30a
> > [  116.872004]        [<ffffffff81056a41>] kthread+0x88/0x90
> > [  116.872004]        [<ffffffff8100cbfa>] child_rip+0xa/0x20
> > [  116.872004]        [<ffffffffffffffff>] 0xffffffffffffffff
> > [  116.872004]
> > [  116.872004] other info that might help us debug this:
> > [  116.872004]
> > [  116.872004] 3 locks held by events/1/16:
> > [  116.872004]  #0:  (events){+.+.+.}, at: [<ffffffff810525a9>] worker_thread+0x1a3/0x30a
> > [  116.872004]  #1:  (&(&ep->timeout_work)->work){+.+...}, at: [<ffffffff810525a9>] worker_thread+0x1a3/0x30a
> > [  116.872004]  #2:  (&ep->ex_lock){+.-...}, at: [<ffffffffa005814a>] fc_exch_timeout+0x25/0x20f [libfc]
> > [  116.872004]
> > [  116.872004] stack backtrace:
> > [  116.872004] Pid: 16, comm: events/1 Not tainted 2.6.31-rc2-rp7 #1
> > [  116.872004] Call Trace:
> > [  116.872004]  [<ffffffff8106603c>] print_circular_bug_tail+0x71/0x7c
> > [  116.872004]  [<ffffffff81066d04>] __lock_acquire+0x952/0xbe6
> > [  116.872004]  [<ffffffff81067051>] lock_acquire+0xb9/0xdd
> > [  116.872004]  [<ffffffffa00570bd>] ? fc_exch_alloc+0xa1/0x26e [libfc]
> > [  116.872004]  [<ffffffffa0056c84>] ? fc_exch_rrq_resp+0x0/0x106 [libfc]
> > [  116.872004]  [<ffffffff81508971>] _spin_lock_bh+0x31/0x3d
> > [  116.872004]  [<ffffffffa00570bd>] ? fc_exch_alloc+0xa1/0x26e [libfc]
> > [  116.872004]  [<ffffffffa00570bd>] fc_exch_alloc+0xa1/0x26e [libfc]
> > [  116.872004]  [<ffffffffa0056c84>] ? fc_exch_rrq_resp+0x0/0x106 [libfc]
> > [  116.872004]  [<ffffffffa0057f79>] fc_exch_seq_send+0x2a/0x1d6 [libfc]
> > [  116.872004]  [<ffffffffa0058276>] fc_exch_timeout+0x151/0x20f [libfc]
> > [  116.872004]  [<ffffffff81052600>] worker_thread+0x1fa/0x30a
> > [  116.872004]  [<ffffffff810525a9>] ? worker_thread+0x1a3/0x30a
> > [  116.872004]  [<ffffffff81506d7c>] ? thread_return+0x3e/0xc3
> > [  116.872004]  [<ffffffffa0058125>] ? fc_exch_timeout+0x0/0x20f [libfc]
> > [  116.872004]  [<ffffffff81056d84>] ? autoremove_wake_function+0x0/0x38
> > [  116.872004]  [<ffffffff81052406>] ? worker_thread+0x0/0x30a
> > [  116.872004]  [<ffffffff81056a41>] kthread+0x88/0x90
> > [  116.872004]  [<ffffffff8100cbfa>] child_rip+0xa/0x20
> > [  116.872004]  [<ffffffff8103b1f8>] ? finish_task_switch+0x3b/0xd5
> > [  116.872004]  [<ffffffff8100c5bc>] ? restore_args+0x0/0x30
> > [  116.872004]  [<ffffffff810569b9>] ? kthread+0x0/0x90
> > [  116.872004]  [<ffffffff8100cbf0>] ? child_rip+0x0/0x20
> > _______________________________________________
> > devel mailing list
> > devel at open-fcoe.org
> > http://www.open-fcoe.org/mailman/listinfo/devel
> 



