[Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT fromcertain ELS
robert.w.love at linux.intel.com
Tue Jan 27 20:24:32 UTC 2009
On Thu, 2009-01-22 at 16:53 -0800, Vasu Dev wrote:
> Abhijeet Joglekar wrote:
> > The problem seems to be that outstanding exchanges on a rogue remote
> > port are not getting cleaned up when we unload the libFC module.
> > We add the remote port to the discovery object's peer list only when the
> > remote port moves from rogue -> real. During the transition though, the
> > rogue part is not part of any list. Thus, if a port is in this
> > transition state, and we try to remove the libFC module, the
> > fc_disc_stop_rports() function does not find the rogue port and so does
> > not flush its retry work.
> > My guess is that the exch_mgr_rest(0,0) is resetting and freeing all
> > exchanges, but just after that, retry work timeout fires for the rogue
> > port and it creates another exchange which hangs around.
> If this retry timer fires after libfc stack is fully unloaded then this
> will cause more serious issues like kernel crash than just exchange or
> stale rogue rport memory leak. Good catch. Not keeping track of rouge
> rport anywhere is the main issue here as you also mentioned below.
If this is the case then we're not canceling the timer correctly.
> > What was the original motivation behind not keeping the rogue port in
> > the disc object's peer list?
> Not sure why we left rogue port untracked, to me keeping all rport
> tracked whether rogue or real in single list make sense.
The only reason that rogue rports aren't on the discovery list is
because they should be short-lived and it just didn't seem necessary.
The rogue rports are bound to exchanges until they're converted to real
rports at which point they're maintained by the disc->rports list. I
don't have any objections to tracking the rogue rports on the
disc->rports list if it helps something.
> > Can we add the rogue port to the disc list
> > right when it is created?
> Yeap that would be right thing to do, I think this could be done around
> fc_rport_rogue_create calling and any way fc_rport_rogue_create calling
> needs to be fixed since this breaks libfc cross LP/RP/EM/DISC block
> calling. I mean function calls across these libfc blocks should be via
> lport->tt for portability but currently fc_disc.c and fc_lport.c
> directly calls fc_rport_rogue_create.
I've got a patch to fix this.
> Another reason to do around fc_rport_rogue_create calling is that
> currently only fc_disc_rport_callback add rport to list but
> fc_rport_rogue_create is also called from fc_lport.c with
> fc_lport_rport_callback and fc_lport_rport_callback shouldn't access
> fc_disc->rports directly to add a rport.
The dns rport isn't tracked by disc->rports, it's attached to the lport.
I don't think this is an issue.
> > Its trans_State would identify whether it's a
> > rogue or a real.
> Yeap and this could be used as required, for instance to figure out how
> to delete a created rport.
> > This way, lookup will find all remote ports (rogue or
> > real) and clean up exchanges on all.
> All make sense.
> > BTW, before applying Chris's LS_RJT retry patch, a reject to a Plogi
> > from a libFC initiator to another libFC initiator wasn't resulting in a
> > retry, so the rogue port would get deleted right away after 1 plogi try
> > and so we were not hitting this issue. After I applied the patch, and
> > increased the number of plogi retries, I started hitting this issue.
> This patch increased the probability of hitting this issue but this
> issue was already there due to untracked rogue rport on any list to
> later purge them when libfc stack is unloading. I mean we were calling
> fc_rport_error in case the fc_frame_alloc or elsct_send failed even
> before this latest patch from Chris.
I'm a bit confused about the scenario. The "transition state" that you
guys have been talking about, are you talking about the rogue state (1)
or the time between an RTV response and a fc_remote_port_add() (2)?
If it's the rogue state then rogue ports are bound to exchanges and
unloading the module should cause an EM reset to send the CLOSED event
to the rport. Is there a reference counting problem here?
If you mean after a valid RTV response, but before the
fc_remote_port_add() to the FC transport class then the retry timer
shouldn't fire. The rogue rport wouldn't be bound to anything, but we'd
be in process context and we'd then try adding to the transport class.
If this is the scenario, then I think we care about fc_host locking and
not the disc->rports list. It would be a timing issue as to whether the
real rport was added to the transport before the fc_host was freed or
after. For either case I would guess that there is locking in the FC
transport to prevent problems, but maybe there is a defect.
I have a patch-set that adds the rogue rport to the disc->rports list
just after it's created, but I'm not sure that it solves your problem.
Can you help me understand the scenario a little better?
More information about the devel