[Open-FCoE] [RESUBMIT-v2 PATCH] libfc, fcoe: fixed locking issues with lport->lp_mutex around lport->link_status

Mike Christie michaelc at cs.wisc.edu
Thu Apr 16 00:27:54 UTC 2009


Vasu Dev wrote:
> 
> I don't see this much difference with or w/o qfull on above test, I see
> hardly any difference on slower tgts (2%) but yes big drop on RAM based
> fast tgt when qfull hits with drop as much as 50% on above test. So


I am just using a cisco switch with scst's qla2xxx target driver and a 
normal old sata disk for the scst backing storage. The qla2xxx target 
is using a 2 gig card.


> there is drop when qfull hits which make sense with faster tgt with your
> above test sending 256K blocks writes from 128 threads with current
> cmd_per_lun=32. Can currently available FC line rates keep up with this
> pace with this test's IO size with cmd_per_lun=32 limit?

It is only 32 commands that will probably be 256K to 512K (depends on 
how much merging we get).

Do you mean the FC connection or the FCOE/ethernet one? For normal old 
FC, 32 should be fine. With other midrange targets and normal old FC 
with lpfc/qla2xxx this works fine and was giving the best throughput.

If I take fcoe out of the picture and just do normal old FC with lpfc 
connected to scst, 32 works fine.

Also, if I use lpfc with fcoe support attached to the cisco setup, it 
works fine with a queue_depth of 32 or 30 (30 is the default).

> 
> I checked lpfc and qla FC HBA are using cmd_per_lun as 3, so tried same



They only set 3 as the initial limit. When scsi_scan adds a disk it will 
use cmd_per_lun for the initial setting, but then scsi-ml will call into 
the slave_configure callback, where the LLD increases it if it can.
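
From memory, the scan code seeds it with something roughly like:

        scsi_adjust_queue_depth(sdev, 0, sdev->host->cmd_per_lun);

and then whatever the LLD does in slave_configure wins.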

There is also ramp up code that can increase it. It depends on the 
driver, but it is usually not going to be 3 for most disks in decent 
targets in normal setups. I think for something like clusters or tape it 
is a different story.

Do a
cat /sys/block/sdX/device/queue_depth
to see what we ended up at.

I think for qla2xxx, we will start out with 32 by default. And for lpfc 
we get 30. It might depend on the kernel, but for disks we normally do 
not start with 3.


> and with that noticed qfull condition never hits and performance
> improved significantly on RAM based tgt, they performed +35% better
> relative to their performance with cmd_per_lun=32 with ignoring qfull
> check for your above test.

What type of numbers are you getting and what is the setup?


> 
> So maybe it is time to tune cmd_per_lun and it seems cmd_per_lun=3 would
> be better to align with other FC HBA setting and current FC netdev line
> rate. What you think ? Possibly also tune can_queue value from 1024 ? 
>   

cmd_per_lun of 3 is fine, but then you want to add some other code like 
lpfc and qla2xxx do. You also want something like a module param to 
start the queue_depth higher when slave_configure is called. See 
qla2xxx's ql2xmaxqdepth for example and its slave_configure use.
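
Roughly what I mean, as an untested sketch (fcoe_max_qdepth is a made up 
module param name; scsi_activate_tcq/scsi_deactivate_tcq are the normal 
helpers from scsi/scsi_tcq.h):

        static int fcoe_max_qdepth = 32;
        module_param(fcoe_max_qdepth, int, S_IRUGO);
        MODULE_PARM_DESC(fcoe_max_qdepth, "Initial per-lun queue depth");

        static int fcoe_slave_configure(struct scsi_device *sdev)
        {
                /* start higher than cmd_per_lun so there is room to ramp */
                if (sdev->tagged_supported)
                        scsi_activate_tcq(sdev, fcoe_max_qdepth);
                else
                        scsi_deactivate_tcq(sdev, fcoe_max_qdepth);
                return 0;
        }

and then hook it up as .slave_configure in the host template, the same 
way qla2xxx uses ql2xmaxqdepth.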




>>> We are hitting the qfull test (if I just remove it in fc_fcp.c then I 
>>> get 192 MB/s again) which is probably causing queuecommand to return 
>>> host busy. That may in turn cause IOs to sit in the queue until the plug 
>>> threshold is hit or until an IO is completed which would also cause the 
>>> blk/scsi queue to be run again. Without the patch we would fill up the 
>>> fcp/fcoe queue until the scsi_host_template->can_queue (we are using 
>>> 1024 right now) is hit or until the device queue depth is hit (I have 
>>> this set to just 32). With this patch I only get a couple IOs running.
>>>
>>> I think if you want to return SCSI_MLQUEUE_HOST_BUSY when qfull is set, 
>>> then you want to add some code to unplug the queue when it is not set.
> 
> I'm not sure what you meant by unplug the queue and how to do that ?
>  


See scsi_lib.c:scsi_run_queue. There is actually no nice interface to 
do it now. You basically want to do something like scsi_run_queue. You 
would want to export scsi_run_queue or add some code to wake the block 
softirq so the scsi queues get run.
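
Something along these lines, as a sketch only (fc_fcp_run_scsi_queues is 
a made up name; it assumes you either export scsi_run_queue or settle 
for kicking each device queue directly):

        static void fc_fcp_run_scsi_queues(struct Scsi_Host *shost)
        {
                struct scsi_device *sdev;

                /* crude stand-in for scsi_run_queue(): restart every
                 * device queue so queuecommand gets called again */
                shost_for_each_device(sdev, shost)
                        blk_run_queue(sdev->request_queue);
        }

You would call that from wherever qfull transitions back to 0.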

>>  
>>> When it clears it does not mean that a command has completed (a 
>>> command completion will force the blk/scsi queues to be run (this 
>>> will cause queuecommand to be called)), so you will not get the 
>>> blk/scsi queues running again quickly.
> 
> fcoe_watchdog will flush fcoe_pending_queue.qlen backlog and that will
> cause pending cmd completion but this timer is slow as Joe mentioned
> below so once qfull hits this leads to slower performance but can mostly
> avoid qfull condition with cmd_per_lun=3 as explained above.
> 
> In any case eh will also cause IO completion, will that force blk/scsi
> queues to run or any explicit api to trigger this quickly once
> SCSI_MLQUEUE_HOST_BUSY returned.


> 
>>> Another alternative is to not return SCSI_MLQUEUE_HOST_BUSY when qfull 
>>> is set. You can then just add some code in the t.eh_timed_out to check 
>>> if qfull was hit recently and if it was then give the command some more 
>>> time.
> 
> This solution won't prevent scsi-ml from sending more commands down to
> fcoe than FC line rate can handle, so this will cause
> fcoe_pending_queue.qlen backlog build up and frequent defer request in
> added t.eh_timed_out. However this can be added but preventing more
> fcoe_pending_queue.qlen backlog build will throttle traffic at source
> and would avoid need for extra steps such as this in t.eh_timed_out.
> 
> How about reducing can_queue along with SCSI_MLQUEUE_HOST_BUSY return,
> however that won't help for this single lun test here, so can we
> dynamically adjust cmd_per_lun also? If we can then that would be ideal

cmd_per_lun should probably be lower, like 3, and it should be static. 
You want to set the device's queue_depth dynamically. Also add some ramp 
up code like the other drivers have. For the ramp up you want to set the 
queue_depth using the adjust queue depth function 
(scsi_adjust_queue_depth), the same way slave_configure does.

Also you will want to add some code so that you can control the initial 
queue_depth we start ramping up from, because starting from 3 is too 
low. The slave_configure would set this initial value.
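
For the ramp up piece, a rough sketch of what I mean (untested; 
fcoe_ramp_up_qdepth is a made up name and fcoe_max_qdepth is the made up 
module param from the slave_configure sketch above):

        static void fcoe_ramp_up_qdepth(struct scsi_device *sdev)
        {
                /* call now and then from the completion path when we
                 * have not seen qfull/QUEUE_FULL for a while */
                if (sdev->queue_depth >= fcoe_max_qdepth)
                        return;

                if (sdev->ordered_tags)
                        scsi_adjust_queue_depth(sdev, MSG_ORDERED_TAG,
                                                sdev->queue_depth + 1);
                else
                        scsi_adjust_queue_depth(sdev, MSG_SIMPLE_TAG,
                                                sdev->queue_depth + 1);
        }

which is more or less the shape of the other drivers' ramp up code, if I 
remember right.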

I thought we used to end up only setting it to 3 for tape or when 
tagging is not supported, but I am not sure anymore. Looking over lpfc 
and qla2xxx, it does not look like that is the case.


> instead full stop by use of SCSI_MLQUEUE_HOST_BUSY return.
> 
> The fcoe still need to be fixed for can_queue ramp up, how about adding
> code in fcoe_watchdog to ramp up can_queue when qfull=0 on every run
> but still can_queue ramp up logic need to be fixed in libfc first.
> 
> If I remember correctly, Mike you mentioned once about implementing
> generic solution on can_queue ramp up and down for all HBA, not sure if
> I missed any recent update on that on linux-scsi mail list or somewhere
> else.


Yeah, I was going to do a rampup/rampdown for shost->can_queue for iscsi 
(it should also work for fcoe), then for all drivers that support it do 
a rampup/rampdown for starget->can_queue.

There was also another, separate part for ramping up when you get a 
QUEUE_FULL. The thread got muddled, but none of it is done yet. I just 
started the code to push the QUEUE_FULL ramp up to userspace.
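
The shost level piece is conceptually pretty simple, something like this 
(pure sketch, names and numbers made up, locking around can_queue 
glossed over):

        static void example_can_queue_ramp_up(struct Scsi_Host *shost,
                                              int max_can_queue,
                                              unsigned long last_busy)
        {
                /* only ramp back up once we have gone a while without
                 * the resource problem that made us cut can_queue */
                if (time_before(jiffies, last_busy + 10 * HZ))
                        return;

                shost->can_queue = min(shost->can_queue * 2, max_can_queue);
        }

and the matching ramp down would just cut it back when the queue fulls / 
host busy returns start showing up.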


The ramp up/ramp down solution for devices sounds good.


And lowering the queue_depth to 3 fixes the perf problem for me, but it 
just does not seem right compared to the other FC drivers and fcoe 
drivers like qla2xxx and lpfc. Returning SCSI_MLQUEUE_HOST_BUSY should 
not be this expensive. I feel like we are fixing a problem (we do need 
the rampup code like other FC drivers), but by doing so we are bypassing 
a bug in scsi-ml or maybe in the fcoe code.

I am fine with your idea. I would just like to know exactly what is 
going on, because the rampup/rampdown code is not going to be perfect, 
we are going to hit qfull eventually, and I do not want to take the 
performance hit.


