[Open-FCoE] libfc: exchange circular lock dependency detected.

Joe Eykholt jeykholt at cisco.com
Fri Jul 17 20:47:05 UTC 2009


Vasu Dev wrote:
> On Fri, 2009-07-10 at 16:21 -0700, Vasu Dev wrote:
>> On Fri, 2009-07-10 at 15:21 -0700, Joe Eykholt wrote:
>>> This is with the latest fcoe-next.git tree.
>>>
>>> I see this circular dependency message when running my tests.  I haven't
>>> completely figured it out, but I think there must be cases where ex lock
>>> is held while em lock is acquired, 
>> That is the case of fc_exch_em_alloc() for new exch allocation, the exch
>> lock is held along with em lock while adding new exch to exches array.

But that's not what I meant.  I meant the ex lock is already held when
spin_lock() is called for the em_lock.

>>> and that's a problem.
>> This shouldn't be the issue here since fc_exch_em_alloc() is the only
>> case when these two locks held together, so no possibility of circular
>> locking due to any messed up order of taking these two locks together
>> somewhere else.
>>
>>> It may also be caused by flushing fc_exch_timeout() work while holding the
>>> em lock it sometimes needs.
>>>
>>> I also noticed that fc_exch_timeout is calling fc_exch_rrq while holding
>>> an ex_lock and that does a new fc_exch_alloc() ... which gets the em_lock,
>>> and a different ex_lock so that maybe what it's complaining about.
>>>
>> Yeah this what this log trace is suggesting and I also got same trace
>> with my exch pool patches ( pasted below, that also suggested exact
>> above call flow could lead to circular locking on exch_lock). However I
>> don't see how since fc_exch_rrq will be called with exch lock held but
>> that same exch's lock won't be held again when sending out fc_exch_rrq
>> in that call flow. The exch lock held before fc_exch_rrq and during
>> fc_exch_seq_send() in that flow are on two different exch instances but
>> perhaps checking engine is deducing as if same exch lock is held twice
>> on same exch instance. So perhaps false alarm from lock checking engine
>> or I'm missing something here, but just to avoid this nasty warning,
>> perhaps exch lock should be dropped before calling fc_exch_rrq() but I
>> need to think more about that since that might have some other issues.
>>
> 
> Dropping exch lock before calling  fc_exch_rrq() seems to be the best
> fix but with one caveat to handle fc_exch_reset case before RRQ retry
> attempted, I handled fc_exch_reset case by checking additional flags in
> case of grabbing exch lock again to schedule RRQ retry.
> 
> The fix is:-
> 
> diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> index a074562..c261893 100644
> --- a/drivers/scsi/libfc/fc_exch.c
> +++ b/drivers/scsi/libfc/fc_exch.c
> @@ -445,9 +445,9 @@ static void fc_exch_timeout(struct work_struct *work)
>         e_stat = ep->esb_stat;
>         if (e_stat & ESB_ST_COMPLETE) {
>                 ep->esb_stat = e_stat & ~ESB_ST_REC_QUAL;
> +               spin_unlock_bh(&ep->ex_lock);
>                 if (e_stat & ESB_ST_REC_QUAL)
>                         fc_exch_rrq(ep);
> -               spin_unlock_bh(&ep->ex_lock);
>                 goto done;
>         } else {
>                 resp = ep->resp;
> @@ -1672,14 +1672,14 @@ static void fc_exch_rrq(struct fc_exch *ep)
>         struct fc_lport *lp;
>         struct fc_els_rrq *rrq;
>         struct fc_frame *fp;
> -       struct fc_seq *rrq_sp;
>         u32 did;
> 
>         lp = ep->lp;
> 
>         fp = fc_frame_alloc(lp, sizeof(*rrq));
>         if (!fp)
> -               return;
> +               goto retry;
> +
>         rrq = fc_frame_payload_get(fp, sizeof(*rrq));
>         memset(rrq, 0, sizeof(*rrq));
>         rrq->rrq_cmd = ELS_RRQ;
> @@ -1695,13 +1695,20 @@ static void fc_exch_rrq(struct fc_exch *ep)
>                        fc_host_port_id(lp->host), FC_TYPE_ELS,
>                        FC_FC_FIRST_SEQ | FC_FC_END_SEQ | FC_FC_SEQ_INIT, 0);
> 
> -       rrq_sp = fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, NULL, ep,
> -                                 lp->e_d_tov);
> -       if (!rrq_sp) {
> -               ep->esb_stat |= ESB_ST_REC_QUAL;
> -               fc_exch_timer_set_locked(ep, ep->r_a_tov);
> +       if (fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, NULL, ep, lp->e_d_tov))
> +               return;
> +
> +retry:
> +       spin_lock_bh(&ep->ex_lock);
> +       if (ep->state & (FC_EX_RST_CLEANUP | FC_EX_DONE)) {
> +               spin_unlock_bh(&ep->ex_lock);
> +               /* drop hold for rec qual */
> +               fc_exch_release(ep);
>                 return;
>         }
> +       ep->esb_stat |= ESB_ST_REC_QUAL;
> +       fc_exch_timer_set_locked(ep, ep->r_a_tov);
> +       spin_unlock_bh(&ep->ex_lock);
>  }
> 
> 
> I tested this fix by triggering ABTS/RRQ and no more circular lock
> warning but not able to trigger circular locking warning without above
> fix also. I've not changed kernel configuration since I got previously
> sent log of circular locking warning on same kernel, so same kernel
> config with all lock checking enabled. I also tried without the fix
> after reboot but still no more same warning, so I'm not sure how to
> force lock checking code to always check checking circular locking for
> exch loch in fc_exch_rrq call flow, any suggestion ?

You may need a tape drive to see RRQs in normal course, or need to
simulate errors by dropping frames.   You could change code to
force RRQs to be sent anyway on other kinds of timeouts (just do
them unconditionally in a test-only version of the code).

> Let me know your comments on above fix. I'll provide this fix against
> fcoe-fixes tree once that tree is updated to .31 kernel which is
> currently on 2.6.30-rc4.
> 	
> 	Vasu

This looks OK to me.

	Joe



More information about the devel mailing list