[Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT from certain ELS
ajoglekar at nuovasystems.com
Wed Jan 28 18:04:28 UTC 2009
> -----Original Message-----
> From: devel-bounces at open-fcoe.org [mailto:devel-bounces at open-fcoe.org]
> On Behalf Of Robert Love
> Sent: Tuesday, January 27, 2009 2:01 PM
> To: Vasu Dev
> Cc: devel at open-fcoe.org
> Subject: Re: [Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT from certain ELS
> On Tue, 2009-01-27 at 12:24 -0800, Robert Love wrote:
> > On Thu, 2009-01-22 at 16:53 -0800, Vasu Dev wrote:
> > > Abhijeet Joglekar wrote:
> > > > BTW, before applying Chris's LS_RJT retry patch, a reject to a plogi
> > > > from a libFC initiator to another libFC initiator wasn't resulting in a
> > > > retry, so the rogue port would get deleted right away after 1 attempt,
> > > > and so we were not hitting this issue. After I applied the patch and
> > > > increased the number of plogi retries, I started hitting this issue.
> > > >
> > >
> > > This patch increased the probability of hitting this issue, but the
> > > issue was already there due to untracked rogue rports on any list to
> > > later purge them when the libfc stack is unloading. I mean we were calling
> > > fc_rport_error in case the fc_frame_alloc or elsct_send failed even
> > > before this latest patch from Chris.
> > I'm a bit confused about the scenario. The "transition state" that you
> > guys have been talking about, are you talking about the rogue state (1)
> > or the time between an RTV response and a fc_remote_port_add() (2)?
> > (1)
> > If it's the rogue state then rogue ports are bound to exchanges and
> > unloading the module should cause an EM reset to send the CLOSED event
> > to the rport. Is there a reference counting problem here?
> > (2)
> > If you mean after a valid RTV response, but before the
> > fc_remote_port_add() to the FC transport class then the retry timer
> > shouldn't fire. The rogue rport wouldn't be bound to anything, but would
> > be in process context and we'd then try adding to the transport class.
> > If this is the scenario, then I think we care about fc_host locking and
> > not the disc->rports list. It would be a timing issue as to whether the
> > real rport was added to the transport before the fc_host was freed or
> > after. For either case I would guess that there is locking in the FC
> > transport to prevent problems, but maybe there is a defect.
> > I have a patch-set that adds the rogue rport to the disc->rports list
> > just after it's created, but I'm not sure that it solves your problem.
> > Can you help me understand the scenario a little better?
> I talked to Vasu and he explained that the critical piece of information
> is that a timeout is occurring and therefore there is scheduled work, but
> the rport isn't bound to anything while the timer is ticking. I have
> patches that add rogue rports to the disc->rports list, and they'll
> likely fix the problem, but I haven't been able to reproduce the
> scenario yet. I'm not sure I'll be able to reproduce this without
> hacking the code a bit to force a retry.
Sorry I couldn't get back earlier; I was sick and not checking email
much for the last 2 days.
The above description is accurate: the remote port is not tracked while
in the rogue state, and that, coupled with the fact that timeouts were
pending for it (in my case, PLOGI retries), meant that the retry
exchanges didn't get cleaned up when I unloaded the module.
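To make the failure concrete, here is a minimal Python model of it. This is a sketch under stated assumptions, not libfc code: `Rport`, `Disc`, `create_rogue_rport` and `purge_on_unload` are hypothetical stand-ins. A rogue rport with an armed PLOGI retry is either left on no list (the old behaviour) or added to the discovery list right after creation (the idea behind the proposed disc->rports patches), and the unload path can only cancel retries for rports it can find on that list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rport:
    """Hypothetical stand-in for a remote port (not libfc's struct)."""
    port_id: int
    retry_pending: bool = False  # True while a PLOGI retry timer is armed


@dataclass
class Disc:
    """Hypothetical stand-in for the discovery context and its rports list."""
    rports: List[Rport] = field(default_factory=list)


def create_rogue_rport(disc: Disc, port_id: int, track: bool) -> Rport:
    """Create a rogue rport that already has a retry scheduled.

    track=False models the old behaviour: the rport is on no list, so only
    its pending timer refers to it. track=True models adding it to
    disc.rports right after creation.
    """
    rport = Rport(port_id, retry_pending=True)
    if track:
        disc.rports.append(rport)
    return rport


def purge_on_unload(disc: Disc) -> int:
    """Unload path: cancel pending retries for every rport on the list.

    Returns how many retries were cancelled. Rports that were never put
    on the list are invisible here.
    """
    cancelled = 0
    for rport in disc.rports:
        if rport.retry_pending:
            rport.retry_pending = False  # models cancelling the delayed work
            cancelled += 1
    disc.rports.clear()
    return cancelled


old, new = Disc(), Disc()
orphan = create_rogue_rport(old, 0x010203, track=False)
tracked = create_rogue_rport(new, 0x010203, track=True)
print(purge_on_unload(old), orphan.retry_pending)   # 0 True: retry outlives unload
print(purge_on_unload(new), tracked.retry_pending)  # 1 False: retry cancelled
```

In the untracked case the purge cancels nothing and the retry is still pending after "unload"; in the tracked case it is cancelled. A real implementation would also have to deal with a retry that is already executing when unload starts, which this model ignores.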
To reproduce the problem, try this:
1) Have 2 libFC initiators talk to each other
2) Increase the retry value to something large (set to -1 for infinite retries)
3) This will have the 2 libFC initiators keep PLOGI'ing each other
4) Now try to unload the libFC module
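Why step 2 keeps the initiators busy can be shown with a small Python sketch. It assumes a hypothetical `max_retries` setting where -1 means "retry forever", as described in step 2; the function names are illustrative, not libfc's. Every PLOGI is assumed to be answered with LS_RJT, so with -1 the loop only stops at the simulation's `cap`, leaving a retry pending whenever the unload in step 4 happens.

```python
def should_retry(retries_done: int, max_retries: int) -> bool:
    """Hypothetical policy: max_retries == -1 means retry forever."""
    return max_retries == -1 or retries_done < max_retries


def plogi_attempts(max_retries: int, cap: int = 1000) -> int:
    """Count PLOGIs sent when every one is answered with LS_RJT.

    'cap' only bounds this simulation; with max_retries == -1 a real
    initiator would keep retrying until something else stops it.
    """
    attempts = 1  # the initial PLOGI
    while attempts < cap and should_retry(attempts - 1, max_retries):
        attempts += 1  # LS_RJT received, so another PLOGI is sent
    return attempts


print(plogi_attempts(3))   # 4: the initial PLOGI plus 3 retries
print(plogi_attempts(-1))  # 1000: stopped only by the simulation cap
```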
If you have the patch ready, please send it out. I was going to work on
this today, but instead, I will give your patch a shot first.