[Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT fromcertain ELS

Abhijeet Joglekar ajoglekar at nuovasystems.com
Wed Jan 21 23:55:04 UTC 2009


> -----Original Message-----
> From: Abhijeet Joglekar
> Sent: Wednesday, January 21, 2009 3:39 PM
> To: 'Robert Love'
> Cc: 'devel at open-fcoe.org'
> Subject: RE: [Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT
> fromcertain ELS
> 
> > -----Original Message-----
> > From: Abhijeet Joglekar
> > Sent: Wednesday, January 21, 2009 2:25 PM
> > To: 'Robert Love'
> > Cc: devel at open-fcoe.org
> > Subject: RE: [Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT
> > fromcertain ELS
> >
> > > -----Original Message-----
> > > From: Robert Love [mailto:robert.w.love at linux.intel.com]
> > > Sent: Wednesday, January 21, 2009 10:14 AM
> > > To: Abhijeet Joglekar
> > > Cc: devel at open-fcoe.org
> > > Subject: Re: [Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT
> > > fromcertain ELS
> > >
> > > On Tue, 2009-01-20 at 18:33 -0800, Abhijeet Joglekar wrote:
> > > > Chris,
> > > >
> > > > I am getting exchange leak, and hence a crash while unloading
libfc
> > > > module after applying this patch. Are you seeing this in your
tests?
> > > >
> > > I've created a fcoe-fixes.git tree that I'm going to use to send
> patches
> > > up to linux-scsi. It's based on Linus' tree and I rebased
yesterday
> > > (maybe the day before) to 2.6.29-rc2  and applied all of our (FCoE
> devel
> > > list) outstanding patches. There might be a leak since my test
scripts
> > > don't test for it, but my test scrips have done a lot of loading
and
> > > unloading of the modules and I don't see a crash.
> > >
> > > How are you noticing the leak?
> >
> > fc_destroy_exch_mgr() is complaining during module unload that all
> > exchanges in "libfc_em" are not free.
> >
> > I am not hitting the WARN_ON in fc_exch_mgr_free() where it checks
for
> the
> > total_exches to be zero. So the exch ptr is getting removed from the
> > manager exchs array and list, but probably has a non-zero reference
> count
> > and so does not get released. Maybe we need to put a WARN_ON() when
an
> ep
> > is removed from the mp list, but its reference count is non-zero and
> thus
> > cannot be freed.
> >
> > I am not current though on the libFC patches that have gone in after
the
> > 1st week of December, so might be missing some patch (bug fix) that
> maybe
> > got exposed after applying only the rport retry on LS_RJT patch.
> >
> > -- abhijeet
> 
> The problem seems to be that outstanding exchanges on a rogue remote
port
> are not getting cleaned up when we unload the libFC module.
> 
> We add the remote port to the discovery object's peer list only when
the
> remote port moves from rogue -> real. During the transition though,
the
> rogue part is not part of any list. Thus, if a port is in this
transition
> state, and we try to remove the libFC module, the
fc_disc_stop_rports()
> function does not find the rogue port and so does not flush its retry
> work.
> 
> My guess is that the exch_mgr_rest(0,0) is resetting and freeing all
> exchanges, but just after that, retry work timeout fires for the rogue
> port and it creates another exchange which hangs around.
> 
> What was the original motivation behind not keeping the rogue port in
the
> disc object's peer list? Can we add the rogue port to the disc list
right
> when it is created? Its trans_State would identify whether it's a
rogue or
> a real. This way, lookup will find all remote ports (rogue or real)
and
> clean up exchanges on all.
> 
> BTW, before applying Chris's LS_RJT retry patch, a reject to a Plogi
from
> a libFC initiator to another libFC initiator wasn't resulting in a
retry,
> so the rogue port would get deleted right away after 1 plogi try and
so we
> were not hitting this issue. After I applied the patch, and increased
the
> number of plogi retries, I started hitting this issue.
> 
> -- abhijeet

Also, IF its ok to add the rogue port to the discovery list right when
it is created, then the removal should also be changed so that port
hangs around in the list during the logoff process and is taken out from
the list in the fc_disc_rport_event() callback after getting a
RPORT_EV_STOP event from fc_rport_work (instead of being taken out right
in the beginning before initiating the logoff)

So, add early/remove late instead of the current model of add
late/remove early.



More information about the devel mailing list