[Open-FCoE] libfc: exchange circular lock dependency detected.

Vasu Dev vasu.dev at linux.intel.com
Fri Jul 17 22:56:53 UTC 2009


On Fri, 2009-07-17 at 13:47 -0700, Joe Eykholt wrote:
> Vasu Dev wrote:
> > On Fri, 2009-07-10 at 16:21 -0700, Vasu Dev wrote:
> >> On Fri, 2009-07-10 at 15:21 -0700, Joe Eykholt wrote:
> >>> This is with the latest fcoe-next.git tree.
> >>>
> >>> I see this circular dependency message when running my tests.  I haven't
> >>> completely figured it out, but I think there must be cases where ex lock
> >>> is held while em lock is acquired, 
> >> That is the case of fc_exch_em_alloc() for new exch allocation, the exch
> >> lock is held along with em lock while adding new exch to exches array.
> 
> But that's not what I meant.  I meant the ex lock is already held when
> spin_lock() is called for the em_lock.
> 
> >>> and that's a problem.
> >> This shouldn't be the issue here since fc_exch_em_alloc() is the only
> >> case when these two locks held together, so no possibility of circular
> >> locking due to any messed up order of taking these two locks together
> >> somewhere else.
> >>
> >>> It may also be caused by flushing fc_exch_timeout() work while holding the
> >>> em lock it sometimes needs.
> >>>
> >>> I also noticed that fc_exch_timeout is calling fc_exch_rrq while holding
> >>> an ex_lock and that does a new fc_exch_alloc() ... which gets the em_lock,
> >>> and a different ex_lock so that maybe what it's complaining about.
> >>>
> >> Yeah this what this log trace is suggesting and I also got same trace
> >> with my exch pool patches ( pasted below, that also suggested exact
> >> above call flow could lead to circular locking on exch_lock). However I
> >> don't see how since fc_exch_rrq will be called with exch lock held but
> >> that same exch's lock won't be held again when sending out fc_exch_rrq
> >> in that call flow. The exch lock held before fc_exch_rrq and during
> >> fc_exch_seq_send() in that flow are on two different exch instances but
> >> perhaps checking engine is deducing as if same exch lock is held twice
> >> on same exch instance. So perhaps false alarm from lock checking engine
> >> or I'm missing something here, but just to avoid this nasty warning,
> >> perhaps exch lock should be dropped before calling fc_exch_rrq() but I
> >> need to think more about that since that might have some other issues.
> >>
> > 
> > Dropping exch lock before calling  fc_exch_rrq() seems to be the best
> > fix but with one caveat to handle fc_exch_reset case before RRQ retry
> > attempted, I handled fc_exch_reset case by checking additional flags in
> > case of grabbing exch lock again to schedule RRQ retry.
> > 
> > The fix is:-
> > 
> > diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> > index a074562..c261893 100644
> > --- a/drivers/scsi/libfc/fc_exch.c
> > +++ b/drivers/scsi/libfc/fc_exch.c
> > @@ -445,9 +445,9 @@ static void fc_exch_timeout(struct work_struct *work)
> >         e_stat = ep->esb_stat;
> >         if (e_stat & ESB_ST_COMPLETE) {
> >                 ep->esb_stat = e_stat & ~ESB_ST_REC_QUAL;
> > +               spin_unlock_bh(&ep->ex_lock);
> >                 if (e_stat & ESB_ST_REC_QUAL)
> >                         fc_exch_rrq(ep);
> > -               spin_unlock_bh(&ep->ex_lock);
> >                 goto done;
> >         } else {
> >                 resp = ep->resp;
> > @@ -1672,14 +1672,14 @@ static void fc_exch_rrq(struct fc_exch *ep)
> >         struct fc_lport *lp;
> >         struct fc_els_rrq *rrq;
> >         struct fc_frame *fp;
> > -       struct fc_seq *rrq_sp;
> >         u32 did;
> > 
> >         lp = ep->lp;
> > 
> >         fp = fc_frame_alloc(lp, sizeof(*rrq));
> >         if (!fp)
> > -               return;
> > +               goto retry;
> > +
> >         rrq = fc_frame_payload_get(fp, sizeof(*rrq));
> >         memset(rrq, 0, sizeof(*rrq));
> >         rrq->rrq_cmd = ELS_RRQ;
> > @@ -1695,13 +1695,20 @@ static void fc_exch_rrq(struct fc_exch *ep)
> >                        fc_host_port_id(lp->host), FC_TYPE_ELS,
> >                        FC_FC_FIRST_SEQ | FC_FC_END_SEQ | FC_FC_SEQ_INIT, 0);
> > 
> > -       rrq_sp = fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, NULL, ep,
> > -                                 lp->e_d_tov);
> > -       if (!rrq_sp) {
> > -               ep->esb_stat |= ESB_ST_REC_QUAL;
> > -               fc_exch_timer_set_locked(ep, ep->r_a_tov);
> > +       if (fc_exch_seq_send(lp, fp, fc_exch_rrq_resp, NULL, ep, lp->e_d_tov))
> > +               return;
> > +
> > +retry:
> > +       spin_lock_bh(&ep->ex_lock);
> > +       if (ep->state & (FC_EX_RST_CLEANUP | FC_EX_DONE)) {
> > +               spin_unlock_bh(&ep->ex_lock);
> > +               /* drop hold for rec qual */
> > +               fc_exch_release(ep);
> >                 return;
> >         }
> > +       ep->esb_stat |= ESB_ST_REC_QUAL;
> > +       fc_exch_timer_set_locked(ep, ep->r_a_tov);
> > +       spin_unlock_bh(&ep->ex_lock);
> >  }
> > 
> > 
> > I tested this fix by triggering ABTS/RRQ and no more circular lock
> > warning but not able to trigger circular locking warning without above
> > fix also. I've not changed kernel configuration since I got previously
> > sent log of circular locking warning on same kernel, so same kernel
> > config with all lock checking enabled. I also tried without the fix
> > after reboot but still no more same warning, so I'm not sure how to
> > force lock checking code to always check checking circular locking for
> > exch loch in fc_exch_rrq call flow, any suggestion ?
> 
> You may need a tape drive to see RRQs in normal course, or need to
> simulate errors by dropping frames. 

Exactly this is how I triggered ABTS/RRQ, I wrote a test patch to cause
a frame drop by writing few values from sysfs in my each test with or
w/o above fix and I did notice RRQ going out in wireshark after
ep->r_a_tov (10 sec) of ABTS due to frame drop. 

As expected I didn't see circular lock warning with above fix but on
that day I also did't see circular lock warning without above fix also
though RRQ was called using a frame drop and that was strange to me
since I didn't change kernel config, I had same config as it was in my
last post with to get posted warning log. So I was wondering any way to
force lock checking engine to always throw warning without above fix
when RRQ is called, so that I can be sure that with above fix there is
no more same warning but without the above fix the warning always
present. 

Anyway today I repeated same test again after reboot and I do see
warning once w/o the fix and I don't see warning with above fix in any
case (after reboot or not). So all looks good and it looks like lock
checking engines checks or throw warning only once after reboot.

>   You could change code to
> force RRQs to be sent anyway on other kinds of timeouts (just do
> them unconditionally in a test-only version of the code).
> 
> > Let me know your comments on above fix. I'll provide this fix against
> > fcoe-fixes tree once that tree is updated to .31 kernel which is
> > currently on 2.6.30-rc4.
> > 	
> > 	Vasu
> 
> This looks OK to me.

Thanks for reviewing, the patch is out for validation and I'll post the
patch once validated.

> 
> 	Joe




More information about the devel mailing list