[Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT from certain ELS

Abhijeet Joglekar ajoglekar at nuovasystems.com
Wed Jan 28 18:04:28 UTC 2009


> -----Original Message-----
> From: devel-bounces at open-fcoe.org [mailto:devel-bounces at open-fcoe.org]
> On Behalf Of Robert Love
> Sent: Tuesday, January 27, 2009 2:01 PM
> To: Vasu Dev
> Cc: devel at open-fcoe.org
> Subject: Re: [Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT
> from certain ELS
> 
> On Tue, 2009-01-27 at 12:24 -0800, Robert Love wrote:
> > On Thu, 2009-01-22 at 16:53 -0800, Vasu Dev wrote:
> > > Abhijeet Joglekar wrote:
> 
> <snip>
> 
> > > > BTW, before applying Chris's LS_RJT retry patch, a reject to a Plogi
> > > > from a libFC initiator to another libFC initiator wasn't resulting in
> > > > a retry, so the rogue port would get deleted right away after 1 plogi
> > > > try and so we were not hitting this issue. After I applied the patch,
> > > > and increased the number of plogi retries, I started hitting this
> > > > issue.
> > > >
> > >
> > > This patch increased the probability of hitting this issue, but the
> > > issue was already there because rogue rports are not tracked on any
> > > list from which they could later be purged when the libfc stack is
> > > unloading. I mean we were calling fc_rport_error in case the
> > > fc_frame_alloc or elsct_send failed even before this latest patch
> > > from Chris.
> >
> > I'm a bit confused about the scenario. The "transition state" that you
> > guys have been talking about, are you talking about the rogue state (1)
> > or the time between an RTV response and a fc_remote_port_add() (2)?
> >
> > (1)
> > If it's the rogue state then rogue ports are bound to exchanges and
> > unloading the module should cause an EM reset to send the CLOSED event
> > to the rport. Is there a reference counting problem here?
> >
> > (2)
> > If you mean after a valid RTV response, but before the
> > fc_remote_port_add() to the FC transport class then the retry timer
> > shouldn't fire. The rogue rport wouldn't be bound to anything, but we'd
> > be in process context and we'd then try adding to the transport class.
> >
> > If this is the scenario, then I think we care about fc_host locking and
> > not the disc->rports list. It would be a timing issue as to whether the
> > real rport was added to the transport before the fc_host was freed or
> > after. For either case I would guess that there is locking in the FC
> > transport to prevent problems, but maybe there is a defect.
> >
> > I have a patch-set that adds the rogue rport to the disc->rports list
> > just after it's created, but I'm not sure that it solves your problem.
> >
> > Can you help me understand the scenario a little better?
> >
> I talked to Vasu and he explained that the critical piece of information
> is that a timeout is occurring and therefore there is scheduled work, but
> the rport isn't bound to anything while the timer is ticking. I have
> patches that add rogue rports to the disc->rports list, and they'll
> likely fix the problem, but I haven't been able to reproduce the
> scenario yet. I'm not sure I'll be able to reproduce this without
> hacking the code a bit to force a retry.
> 


Sorry I couldn't get back earlier; I was sick and not checking email
much for the last 2 days.

The above description is accurate: the remote port is not tracked while
in the rogue state, and that, coupled with the fact that there were
timeouts pending for it (in my case, Plogi retries), meant that when I
unloaded the module, the retry exchanges didn't get cleaned up.
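
To make the tracking problem concrete, here is a minimal C sketch. It is
not libfc code: model_rport, model_disc, model_track() and
model_teardown() are hypothetical names made up for illustration, and
the retry timer is reduced to a single flag. The one assumption it
encodes is the one discussed above, namely that teardown can only cancel
pending retries for ports it can reach through a list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a remote port; not the real libfc struct. */
struct model_rport {
	bool retry_pending;        /* a retry timer/work item is scheduled */
	struct model_rport *next;  /* link on the tracking list, if tracked */
};

/* Hypothetical stand-in for the discovery context owning a port list. */
struct model_disc {
	struct model_rport *rports;
};

/* Track a port as soon as it is created, before any retry is scheduled. */
static void model_track(struct model_disc *disc, struct model_rport *rp)
{
	assert(disc != NULL && rp != NULL);
	rp->next = disc->rports;
	disc->rports = rp;
}

/*
 * Teardown walks the list and cancels pending retries. A port that was
 * never added to the list is invisible here, so its retry stays pending.
 * Returns the number of retries that were cancelled.
 */
static int model_teardown(struct model_disc *disc)
{
	int cancelled = 0;
	struct model_rport *rp;

	assert(disc != NULL);
	for (rp = disc->rports; rp != NULL; rp = rp->next) {
		if (rp->retry_pending) {
			rp->retry_pending = false;
			cancelled++;
		}
	}
	disc->rports = NULL;
	return cancelled;
}
```

In this model, a port passed to model_track() has its pending retry
cancelled by model_teardown(), while a port that was created but never
tracked keeps retry_pending set after teardown. That second case
corresponds to the rogue rport with Plogi retries still outstanding at
module unload.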

To reproduce the problem, try this - 

1) Have 2 libFC initiators talk to each other
2) Increase the retry value to something large (set to -1 for infinite
retries)
3) This would have the 2 libFC initiators keep Plogi'ing to each other
4) Now try to unload the libFC module

If you have the patch ready, please send it out. I was going to work on
this today, but instead, I will give your patch a shot first.

-- abhijeet


