[Open-FCoE] [RESUBMIT-v2 PATCH] libfc, fcoe: fixed locking issues with lport->lp_mutex around lport->link_status

Vasu Dev vasu.dev at linux.intel.com
Sat Apr 18 01:44:27 UTC 2009


Mike, 

	You have given some very good pointers for getting the queue_depth issue
fixed in fcoe, and they will be very helpful, thanks. I couldn't spend much
time on this yet, but I will look into fixing fcoe for now until some
common scsi-ml solution is available for all LLDs to use. I noticed a
similar discussion started on the linux-scsi mailing list, but it doesn't
seem that any common solution will be available any time soon; as noted in
that discussion, the same thing was discussed two years ago. Therefore I
will go with an fcoe-only solution for now. A common algorithm for all FC
devices in the FC transport class could be another option, but let's get
this fixed in fcoe first.

	Vasu


On Wed, 2009-04-15 at 19:27 -0500, Mike Christie wrote:
> Vasu Dev wrote:
> > 
> > I don't see much difference with or without qfull on the above test; I see
> > hardly any difference on slower tgts (2%), but yes, a big drop on the RAM
> > based fast tgt when qfull hits, as much as 50% on the above test. So
> 
> 
> I am just using a cisco switch with scst's qla2xxx target driver with a 
> normal old sata disk for the scst backing storage. The qla2xxx is a 
> 2 gig card.
> 
> 
> > there is a drop when qfull hits, which makes sense with a faster tgt and
> > your above test sending 256K block writes from 128 threads with the
> > current cmd_per_lun=32. Can currently available FC line rates keep up with
> > this pace at this test's IO size with the cmd_per_lun=32 limit?
> 
> It is only 32 commands that will probably be 256K to 512K (depends on 
> how much merging we get).
> 
> Do you mean the FC connection or the FCOE/ethernet one? For normal old 
> FC, 32 should be fine.  With other midrange targets and normal old FC 
> with lpfc/qla2xxx this works fine and was giving the best throughput.
> 
> If I take fcoe out of the picture and just do normal old FC with lpfc 
> connected to scst, 32 works fine.
> 
> Also if I use lpfc with fcoe support, and attached to the cisco setup it 
> works fine with a queue_depth of 32 or 30 (30 is the default).
> 
> > 
> > I checked that the lpfc and qla FC HBAs are using cmd_per_lun of 3, so I
> > tried the same
> 
> 
> 
> They only set 3 as the initial limit. When scsi_scan adds a disk it will 
> use cmd_per_lun for the initial setting, but then scsi-ml will call into 
> the slave_configure callback where the LLD increases it if it can.
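> 
> The pattern in those drivers looks roughly like this (untested sketch,
> the depth of 32 is just an example value):
> 
> #include <scsi/scsi_device.h>
> #include <scsi/scsi_tcq.h>
> 
> static int example_slave_configure(struct scsi_device *sdev)
> {
> 	/* scsi_scan left the depth at cmd_per_lun; raise it here */
> 	if (sdev->tagged_supported)
> 		scsi_activate_tcq(sdev, 32);
> 	else
> 		scsi_deactivate_tcq(sdev, 32);
> 	return 0;
> }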
> 
> There is also ramp up code that can increase it. It depends on the 
> driver, but it is usually not going to be 3 for most disks in decent 
> targets in normal setups. I think for something like clusters or tape it 
> is a different story.
> 
> Do a
> cat /sys/block/sdX/device/queue_depth
> to see what we ended up at.
> 
> I think for qla2xxx, we will start out with 32 by default. And for lpfc 
> we get 30. It might depend on the kernel, but for disks we normally do 
> not start with 3.
> 
> 
> > and with that I noticed the qfull condition never hits and performance
> > improved significantly on the RAM based tgts; they performed +35% better
> > relative to their performance with cmd_per_lun=32 and the qfull check
> > ignored, for your above test.
> 
> What type of numbers are you getting and what is the setup?
> 
> 
> > 
> > So maybe it is time to tune cmd_per_lun, and it seems cmd_per_lun=3 would
> > be better to align with the other FC HBA settings and the current FC
> > netdev line rate. What do you think? Possibly also tune the can_queue
> > value from 1024?
> >   
> 
> cmd_per_lun of 3 is fine, but then you want to add some other code like 
> lpfc and qla2xxx do. You also want something like a module param to 
> start the queue_depth higher when slave_configure is called. See 
> qla2xxx's ql2xmaxqdepth for example and its slave_configure use.
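> 
> For fcoe that could look something like the below (untested, the
> fcoe_max_qdepth param name is made up):
> 
> #include <linux/module.h>
> #include <scsi/scsi_device.h>
> #include <scsi/scsi_tcq.h>
> 
> static unsigned int fcoe_max_qdepth = 32;
> module_param(fcoe_max_qdepth, uint, S_IRUGO);
> MODULE_PARM_DESC(fcoe_max_qdepth, "Initial queue depth set in slave_configure");
> 
> static int fcoe_slave_configure(struct scsi_device *sdev)
> {
> 	/* cmd_per_lun can stay at 3; the real depth comes from the param */
> 	if (sdev->tagged_supported)
> 		scsi_activate_tcq(sdev, fcoe_max_qdepth);
> 	else
> 		scsi_adjust_queue_depth(sdev, 0, fcoe_max_qdepth);
> 	return 0;
> }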
> 
> 
> 
> 
> >>> We are hitting the qfull test (if I just remove it in fc_fcp.c then I 
> >>> get 192 MB/s again) which is probably causing queuecommand to return 
> >>> host busy. That may in turn cause IOs to sit in the queue until the plug 
> >>> threshold is hit or until an IO is completed, which would also cause the 
> >>> blk/scsi queue to be run again. Without the patch we would fill up the 
> >>> fcp/fcoe queue until the scsi_host_template->can_queue (we are using 
> >>> 1024 right now) is hit or until the device queue depth is hit (I have 
> >>> this set to just 32). With this patch I only get a couple IOs running.
> >>>
> >>> I think if you want to return SCSI_MLQUEUE_HOST_BUSY when qfull is set, 
> >>> then you want to add some code to unplug the queue when it is not set.
> > 
> > I'm not sure what you meant by unplug the queue and how to do that ?
> >  
> 
> 
> See scsi_lib.c:scsi_run_queue. There is actually no nice interface to 
> do it now. You basically want to do something like scsi_run_queue. You 
> would want to export scsi_run_queue or add some code to wake the block 
> softirq to run the scsi queues.
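> 
> A crude version of that from the LLD side could be (untested, the
> helper and where you call it from are made up):
> 
> #include <linux/blkdev.h>
> #include <scsi/scsi_device.h>
> #include <scsi/scsi_host.h>
> 
> /* call this when qfull goes back to 0 */
> static void example_restart_queues(struct Scsi_Host *shost)
> {
> 	struct scsi_device *sdev;
> 
> 	/* kick each device's block queue so queuecommand gets called again */
> 	shost_for_each_device(sdev, shost)
> 		blk_run_queue(sdev->request_queue);
> }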
> 
> >>  
> >>> When it clears it does not mean that a command has completed (a 
> >>> command completion will force the blk/scsi queues to be run (this 
> >>> will cause the queuecommand to be called)), so you will not get the 
> >>> blk/scsi queues running again quickly.
> > 
> > fcoe_watchdog will flush the fcoe_pending_queue.qlen backlog and that will
> > cause pending cmd completion, but this timer is slow, as Joe mentioned
> > below, so once qfull hits this leads to slower performance. We can mostly
> > avoid the qfull condition with cmd_per_lun=3 as explained above.
> > 
> > In any case eh will also cause IO completion; will that force the blk/scsi
> > queues to run, or is there any explicit API to trigger this quickly once
> > SCSI_MLQUEUE_HOST_BUSY is returned?
> 
> 
> > 
> >>> Another alternative is to not return SCSI_MLQUEUE_HOST_BUSY when qfull 
> >>> is set. You can then just add some code in the t.eh_timed_out to check 
> >>> if qfull was hit recently and if it was then give the command some more 
> >>> time.
> > 
> > This solution won't prevent scsi-ml from sending more commands down to
> > fcoe than the FC line rate can handle, so it will cause the
> > fcoe_pending_queue.qlen backlog to build up and commands to be deferred
> > frequently in the added t.eh_timed_out. It can be added, but preventing
> > further fcoe_pending_queue.qlen backlog buildup will throttle the traffic
> > at the source and would avoid the need for extra steps such as this in
> > t.eh_timed_out.
> > 
> > How about reducing can_queue along with the SCSI_MLQUEUE_HOST_BUSY return?
> > However, that won't help for this single-lun test here, so can we
> > dynamically adjust cmd_per_lun also? If we can, then that would be ideal
> 
> cmd_per_lun should probably be lower, like 3, and it should be static. You 
> want to set the device's queue_depth dynamically. Also add some ramp up 
> code like the other drivers. For the ramp up you want to set the 
> queue_depth using the adjust queue depth function, like is done in the 
> slave_configure.
> 
> Also you will want to add some code so that you can set the initial 
> queue_depth we start ramping up from, because starting from 3 is too 
> low. The slave_configure would set this initially.
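> 
> The ramp up step itself can be small, something roughly like this
> (untested, not the actual lpfc/qla2xxx code):
> 
> #include <scsi/scsi_device.h>
> #include <scsi/scsi_tcq.h>
> 
> /* bump the depth by one when the device has been running clean */
> static void example_ramp_up(struct scsi_device *sdev, int max_depth)
> {
> 	if (sdev->queue_depth >= max_depth)
> 		return;
> 	scsi_adjust_queue_depth(sdev, scsi_get_tag_type(sdev),
> 				sdev->queue_depth + 1);
> }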
> 
> I thought we used to end up only setting it to 3 for tape or when 
> tagging is not supported, but I am not sure anymore. Looking over lpfc 
> and qla2xxx it does not look like the case.
> 
> 
> > instead of a full stop by use of the SCSI_MLQUEUE_HOST_BUSY return.
> > 
> > The fcoe code still needs to be fixed for can_queue ramp up; how about
> > adding code in fcoe_watchdog to ramp up can_queue when qfull=0 on every
> > run? The can_queue ramp up logic still needs to be fixed in libfc first,
> > though.
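> > 
> > For example, roughly (untested, the FCOE_MAX_CAN_QUEUE ceiling is made up
> > and just matches the current 1024 default):
> > 
> > #include <linux/kernel.h>
> > #include <scsi/libfc.h>
> > 
> > #define FCOE_MAX_CAN_QUEUE	1024
> > 
> > /* would be called from fcoe_watchdog after the pending queue flush */
> > static void fcoe_can_queue_ramp_up(struct fc_lport *lp)
> > {
> > 	if (lp->qfull)
> > 		return;
> > 	lp->host->can_queue = min(lp->host->can_queue * 2,
> > 				  FCOE_MAX_CAN_QUEUE);
> > }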
> > 
> > If I remember correctly, Mike, you once mentioned implementing a generic
> > solution for can_queue ramp up and down for all HBAs; I'm not sure if I
> > missed any recent update on that on the linux-scsi mailing list or
> > somewhere else.
> 
> 
> Yeah, I was going to do a rampup/rampdown for shost->can_queue for iscsi 
> (it should also work for fcoe), then for all drivers that support it do 
> a rampup/rampdown for starget->can_queue.
> 
> There was also another separate part for ramp up needed when you get a 
> QUEUE_FULL. The thread got muddled, but none of it is done yet. I just 
> started the code to push the QUEUE_FULL ramp up to userspace.
> 
> 
> The ramp up/ramp down solution for devices sounds good.
> 
> 
> And lowering the queue_depth to 3 fixes the perf problem for me, but it 
> just does not seem right compared to the other FC drivers and fcoe 
> drivers like qla2xxx and lpfc. Returning SCSI_MLQUEUE_HOST_BUSY should 
> not be this expensive. I feel like we are fixing a problem (we do need 
> the rampup code like other FC drivers), but by doing so we are bypassing 
> a bug in scsi-ml or maybe in the fcoe code.
> 
> I am fine with your idea. I would just like to know exactly what is 
> going on, because the rampup/rampdown code is not going to be perfect 
> and we are going to hit qfull eventually, and I do not want to get the 
> performance hit.



