[Open-FCoE] [PATCH 00/28] rport/lport/ns rework

Mike Christie michaelc at cs.wisc.edu
Wed Oct 1 20:23:45 UTC 2008


Robert Love wrote:
> The following series implements a change to the rport and lport state machines and
> applies a locking strategy to that state machine.
> 
> 1) A callback function is added to the LP for RP events (READY, FAILED, LOGO)
> 2) GNN_ID and GPN_ID are removed, rport state machine starts at PLOGI
>    - uses a dummy rport until PLOGI response
> 3) A READY state is added to the RP state machine
> 4) The RP and LP state machines locking has been given a strategy
> 5) RP only has one error/retry handling routine
> 6) LP only has one error/retry handling routine
> 7) Naming conventions are changed
>     rport (remote port) = fc_rport pointer
>     rdata (remote port private) = fc_rport_libfc_priv pointer
>     lport (local port) = fc_lport_pointer
> 8) All interaction with scsi_transport_fc happens in fc_rport.c. Users of
>    the RP state machine only create dummy rports and call rport_login(),
>    the RP state machine deals with adding to the transport layer and
>    deleting correctly.
> 
> There are some issues that I've found while working on this patch that will
> need to be fixed.
> 
> 1) probably not cancelling work threads appropriately
> 2) What happens if discovery fails? Currently it's silent.
>    Should discovery be pulled into the lport state machine?
> 3) reset is broken
> 4) There needs to be a little more commenting about the locking
> 5) Some of the #define variables need to be named better. In particular
>    I don't like LPORT_RPORT_EV_* or the function lport_rport_event().
> 

6). I think my mutex idea was a dud or maybe it needs two more changes. 
The mutex is a problem currently because fc_exch.c can call back into 
fc_rport.c or fc_lport.c by calling the ep->resp function. This is 
causing problems when fc_exch_timeout calls the resp handler from the 
timer context. We talked about this before and we can handle it by 
queueing the timeout in delayed work.

A new problem is due to the termiate_rport_io callback. If 
fc_rport.c:fc_rport_terminate_io is run from scsi_transport_fc.c's host 
work queue (work is queue by calling fc_queue_work), then it calls
exch_mgr_reset and this calls a ep->resp function which calls back into 
fc_rport.c then it could try to grab the rp_mutex. If another thread is 
calling fc_remote_port_add on the same rport, that thread will be 
holding the rp_mutex, and so when fc_remote_port_add calls fc_flush_work 
we will lock up because the flush call will be waiting on the workqueue 
that is calling fc_rport_terminate_io which is waiting for the mutex we 
are holding.



More information about the devel mailing list