[Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT from certain ELS

Love, Robert W robert.w.love at intel.com
Wed Jan 28 18:10:27 UTC 2009


Abhijeet Joglekar wrote:
>> -----Original Message-----
>> From: devel-bounces at open-fcoe.org
>> [mailto:devel-bounces at open-fcoe.org] On Behalf Of Robert Love
>> Sent: Tuesday, January 27, 2009 2:01 PM
>> To: Vasu Dev
>> Cc: devel at open-fcoe.org
>> Subject: Re: [Open-FCoE] [PATCH v2] libfc: rport retry on LS_RJT
>> from certain ELS
>> 
>> On Tue, 2009-01-27 at 12:24 -0800, Robert Love wrote:
>>> On Thu, 2009-01-22 at 16:53 -0800, Vasu Dev wrote:
>>>> Abhijeet Joglekar wrote:
>> 
>> <snip>
>> 
>>>>> BTW, before applying Chris's LS_RJT retry patch, a reject to a
>>>>> Plogi from a libFC initiator to another libFC initiator wasn't
>>>>> resulting in a retry, so the rogue port would get deleted right
>>>>> away after 1 plogi try and so we were not hitting this issue.
>>>>> After I applied the patch, and increased the number of plogi
>>>>> retries, I started hitting this issue. 
>>>>> 
>>>> 
>>>> This patch increased the probability of hitting this issue, but the
>>>> issue was already there: rogue rports are not tracked on any list
>>>> from which they could later be purged when the libfc stack is
>>>> unloading. I mean, we were calling fc_rport_error when
>>>> fc_frame_alloc or elsct_send failed even before this latest patch
>>>> from Chris.
>>> 
>>> I'm a bit confused about the scenario. The "transition state" that
>>> you guys have been talking about, are you talking about the rogue
>>> state (1) or the time between an RTV response and a
>>> fc_remote_port_add() (2)? 
>>> 
>>> (1)
>>> If it's the rogue state then rogue ports are bound to exchanges and
>>> unloading the module should cause an EM reset to send the CLOSED
>>> event to the rport. Is there a reference counting problem here?
>>> 
>>> (2)
>>> If you mean after a valid RTV response, but before the
>>> fc_remote_port_add() to the FC transport class then the retry timer
>>> shouldn't fire. The rogue rport wouldn't be bound to anything, but
>>> we'd be in process context and we'd then try adding to the
>>> transport class. 
>>> 
>>> If this is the scenario, then I think we care about fc_host locking
>>> and not the disc->rports list. It would be a timing issue as to
>>> whether the real rport was added to the transport before the
>>> fc_host was freed or after. For either case I would guess that
>>> there is locking in the FC transport to prevent problems, but maybe
>>> there is a defect. 
>>> 
>>> I have a patch-set that adds the rogue rport to the disc->rports
>>> list just after it's created, but I'm not sure that it solves your
>>> problem. 
>>> 
>>> Can you help me understand the scenario a little better?
>>> 
>> I talked to Vasu and he explained that the critical piece of
>> information is that a timeout is occurring and therefore there is
>> scheduled work, but the rport isn't bound to anything while the
>> timer is ticking. I have patches that add rogue rports to the
>> disc->rports list, and they'll likely fix the problem, but I haven't
>> been able to reproduce the scenario yet. I'm not sure I'll be able
>> to reproduce this without hacking the code a bit to force a retry.
>> 
> 
> 
> Sorry, I couldn't get back earlier; I was sick and not checking email
> much for the last 2 days.
> 
> The above description is accurate - the remote port is not tracked
> while in rogue state, and that coupled with the fact that there were
> timeouts pending for it (in my case, Plogi retries), meant that when
> I unloaded the module, retry exchanges didn't get cleaned up.
> 
> To reproduce the problem, try this -
> 
> 1) Have 2 libFC initiators talk to each other
> 2) Increase the retry value to something large (set to -1 for infinite
> retries)
> 3) This would have the 2 libFC initiators keep Plogi'ing to each other
> 4) Now try to unload the libFC module
> 
> If you have the patch ready, please send it out. I was going to work
> on this today, but instead, I will give your patch a shot first.
> 
I've got some patches almost ready to be sent out. I thought everything
was working last night, but I think I just exposed another defect. I'm
trying to resolve the second problem right now (if I'm right it's minor).
I'll send the patches out as an RFC later today. I'll probably want to
think a bit more about the implications of adding the rogue ports to the
list before I drop the RFC, but the patches should be good enough for
you to test.
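
For anyone following the thread, here is a toy model of the scenario
described above. It is only a sketch and not libfc code: the class and
function names (RemotePort, Discovery, retry_survives_unload) are made up
for illustration, and the boolean retry_pending stands in for the
scheduled retry work. It assumes that unload can only clean up rports it
can find on a list, which is the behaviour discussed in this thread.

```python
"""Toy model (not libfc code) of the unload problem discussed in this thread."""


class RemotePort:
    """A remote port that may have a PLOGI retry pending."""

    def __init__(self, name):
        self.name = name
        self.retry_pending = False

    def schedule_retry(self):
        # Stands in for the work scheduled after a reject or a timeout.
        self.retry_pending = True

    def close(self):
        # Stands in for cancelling the pending retry during teardown.
        self.retry_pending = False


class Discovery:
    """Holds the list of remote ports that teardown knows about."""

    def __init__(self, track_rogue):
        self.track_rogue = track_rogue
        self.rports = []

    def create_rogue(self, name):
        rport = RemotePort(name)
        if self.track_rogue:
            # Proposed change: put the rogue port on the list at creation.
            self.rports.append(rport)
        return rport

    def unload(self):
        # Teardown can only reach ports that are on the list.
        for rport in self.rports:
            rport.close()
        self.rports.clear()


def retry_survives_unload(track_rogue):
    """Return True if a retry is still pending after unload."""
    disc = Discovery(track_rogue)
    rport = disc.create_rogue("peer-initiator")
    rport.schedule_retry()
    disc.unload()
    return rport.retry_pending
```

In this model retry_survives_unload(False) returns True (the untracked
rogue port still has a retry pending after unload), while
retry_survives_unload(True) returns False. Whether tracking the rogue
ports has other implications in the real stack is the part I still want
to think about.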

> -- abhijeet



