[Open-FCoE] [RFC PATCH 1/2] fcoe, libfc: adds per cpu exches pool within an exchange manager (EM)

Vasu Dev vasu.dev at linux.intel.com
Wed Jul 8 18:23:05 UTC 2009


On Tue, 2009-07-07 at 11:57 -0700, Joe Eykholt wrote: 
> Vasu Dev wrote:
> > Adds a per cpu exches pool, for these reasons:-
> > 
> >  1. Currently an EM instance is shared across all cpus to manage
> >     all exches for all cpus. This requires taking em_lock across
> >     all cpus for each exch alloc, free, lookup and reset, for every
> >     IO or frame, which makes em_lock expensive. Having per cpu
> >     exches pools, each with its own per cpu pool lock, will likely
> >     reduce locking contention in the fast path for exch alloc,
> >     free and lookup.
> > 
> >  2. Per cpu exches pools will likely improve the cache hit ratio,
> >     since all frames of an exch will be processed on the same cpu
> >     on which the exch originated; the next patch makes this change
> >     using the per cpu exches pool added here.
> > 
> > The EM exch id range is divided equally across all cpus to allocate
> > an exches pool per cpu. The exch id space is divided such that the
> > lower bits of an exch id carry the cpu number on which the exch
> > originated; this makes it easy to direct all frames of an exch to
> > the same cpu on which the exch originated, by a simple bitwise AND
> > between the exch id of an incoming frame and cpu_mask. This requires
> > cpu_mask and cpu_order to be set from the max possible number of
> > cpus, nr_cpu_ids, rounded up to a power of 2; they are used in the
> > next patch to map between an exch id and the exch ptr index in the
> > exches pool array.
> > 
> > Increased the default FCOE_MAX_XID to 0x1000 from 0x07EF to make
> > more exch ids available per cpu after the above described exch id
> > range division across all cpus.
> > 
> > This patch sets up the per cpu exches pool struct fc_exches_pool
> > with all the fields required to manage exches in a pool, as prep
> > work for the next patch, which fully makes use of the exches pool
> > and removes em_lock.
> > 
> > RFC notes:-
> > 
> >  1. AFAIU these changes should not break fnic, but to be sure,
> >     testing with fnic is required. I'd appreciate it if someone
> >     with fnic could test these patches. I think fnic will also
> >     need additional changes to fully leverage the exches pool
> >     implementation.
> 
> I'll try it sometime.
>   I don't think fnic will leverage this because
> all FCP exchanges are offloaded, and it just uses one worker thread
> for ELS exchanges.
> 
> >  2. These patches work fine when offloads are not in use, but with
> >     offload, large IOs (> 1M writes) cause a BUG_ON in kfree
> >     called from the ixgbe driver. I'm debugging this bug and in the
> >     meanwhile I'd appreciate any review comments on these patches.
> 
> Does this patch work by itself? 

Yes. Effectively, the changes in this patch are only prep work for the
next patch, and the added exch pool code is only used in the next
patch, so there is no issue with this patch on its own.

In fact, both patches work fine when ixgbe offloads are disabled, but
with ixgbe offload enabled only small reads, small writes and large
reads work fine. A large write caused a persistent BUG_ON in kfree when
the transmitted skb was freed from the ixgbe driver. However, that was
only with the previous 2.6.30 fcoe-next kernel. I repeated the same
large write test with offload enabled on the latest fcoe-next
2.6.31-rc2 and I don't see that BUG_ON any more. I was hitting the
BUG_ON in slub.c kfree() at:-

        page = virt_to_head_page(x);
        if (unlikely(!PageSlab(page))) {
                BUG_ON(!PageCompound(page));
                put_page(page);
                return;
        }

The above findings make me believe that it was specific to other kernel
changes between .30 and the rc2, or to the ixgbe driver, so I doubt the
issue I was referring to was caused or surfaced by these patches.

> I don't see any change in the
> way XIDs are looked up but I may have missed something.
> If it doesn't work without both patches, then after the
> RFC period they should be put together as a single patch so they work
> with bisection.

This patch works fine on its own, so let us keep it as a separate prep
patch, so that the next patch, which is already ~200 lines of change,
won't be as complex to review.

> > 
> > These patches are based on fcoe-next tree top commit:-
> > 
> >  commit a92e688d433a84a689604ed6ec3e2c787271781d
> >  Author: Vasu Dev <vasu.dev at intel.com>
> >  Date:   Wed Jun 24 21:15:45 2009 -0700
> >  fcoe, libfc: adds offload EM per eth device with only single xid range per EM
> > 
> > Signed-off-by: Vasu Dev <vasu.dev at intel.com>
> > ---
> > 
> >  drivers/scsi/fcoe/fcoe.h      |    2 +-
> >  drivers/scsi/libfc/fc_exch.c  |   42 +++++++++++++++++++++++++++++++++++++++--
> >  drivers/scsi/libfc/fc_lport.c |   23 ++++++++++++++++++++++
> >  include/scsi/libfc.h          |    2 ++
> >  4 files changed, 66 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/scsi/fcoe/fcoe.h b/drivers/scsi/fcoe/fcoe.h
> > index 6905efc..5ef7556 100644
> > --- a/drivers/scsi/fcoe/fcoe.h
> > +++ b/drivers/scsi/fcoe/fcoe.h
> > @@ -38,7 +38,7 @@
> >  #define FCOE_MAX_OUTSTANDING_COMMANDS	1024
> >  
> >  #define FCOE_MIN_XID		0x0000	/* the min xid supported by fcoe_sw */
> > -#define FCOE_MAX_XID		0x07ef	/* the max xid supported by fcoe_sw */
> > +#define FCOE_MAX_XID		0x1000	/* the max xid supported by fcoe_sw */
> 
> Very minor point:  the XID at MAX_XID actually gets used; it's a maximum, not a limit.
> Having a pool of 4097 exchanges seems odd (pun intended).  It may lead people
> to believe that it's one more than the max. 

You are right, I'll fix this to 0xFFF.

However, the pool size would compute to the same value for both 0x1000
and 0xFFF with the added pool_xid calculation:

    pool_xid = (mp->max_xid - mp->min_xid + 1) / (lp->cpu_mask + 1);

though only when nr_cpus > 1.
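
For instance, here is a worked example assuming min_xid = 0 and two
possible cpus, so cpu_mask = 1:

    max_xid = 0x1000: pool_xid = 0x1001 / 2 = 0x800
    max_xid = 0x0FFF: pool_xid = 0x1000 / 2 = 0x800

With a single cpu (cpu_mask + 1 == 1) the division no longer truncates,
so the two max_xid values would give 0x1001 and 0x1000 respectively.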

> Understanding this is important
> for offloads.
> 
> >  unsigned int fcoe_debug_logging;
> >  module_param_named(debug_logging, fcoe_debug_logging, int, S_IRUGO|S_IWUSR);
> > diff --git a/drivers/scsi/libfc/fc_exch.c b/drivers/scsi/libfc/fc_exch.c
> > index f42695e..609590d 100644
> > --- a/drivers/scsi/libfc/fc_exch.c
> > +++ b/drivers/scsi/libfc/fc_exch.c
> > @@ -47,6 +47,14 @@ static struct kmem_cache *fc_em_cachep;        /* cache for exchanges */
> >   * fc_seq holds the state for an individual sequence.
> >   */
> >  
> > +struct fc_exches_pool {
> 
> I suggest we shorten this to fc_exch_pool.

I'll change this to fc_exch_pool.

> 
> > +	u16		next_xid;	/* next possible free exchange ID */
> > +	spinlock_t	lock;		/* exches pool lock */
> > +	struct list_head	ex_list;	/* allocated exchanges list */
> > +	u32		total_exches;	/* total allocated exchanges */
> 
> total_exches could be u16 as well, and if put just after next_xid
> would save 4 bytes per pool.
> 

Good improvement, I'll change as suggested.
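
With the rename and the u16 move, the struct would then look roughly
like this (just a sketch of what I plan; the exches pointer is
discussed further below):

struct fc_exch_pool {
	u16		next_xid;	/* next possible free exchange ID */
	u16		total_exches;	/* total allocated exchanges */
	spinlock_t	lock;		/* exch pool lock */
	struct list_head	ex_list;	/* allocated exchanges list */
	struct fc_exch **exches;	/* for exch pointers indexed by xid */
};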

> > +	struct fc_exch **exches;	/* for exch pointers indexed by xid */
> 
> Since exches are always just after the pool,
> this could be eliminated using an inline, saving 8 bytes per pool.
> This would also be faster to use since it's one less memory reference.
> 
> static inline struct fc_exch **fc_exch_pool_ptr(struct fc_exch_pool *pool)
> {
> 	return (struct fc_exch **)(pool + 1);
> }
> 

Good suggestion again, I'll change as suggested above to save the ptr
size in fc_exch_pool; the added ptr arithmetic will be faster than the
indirection, as you also said above. This func will be used to access
the exch ptr at a specific index in the pool, so I'll revise it to
something like this:

static inline struct fc_exch **fc_exch_ptr(struct fc_exches_pool *pool,
					   u16 index)
{
        return (struct fc_exch **)(pool + 1) + index;
}

> Another thought: the array of exchange pointers could be global,
> simplifying indexing, and just use a per-pool lock.  However, this
> would be bad for caching, unless the upper bits of the XID range
> were used as the CPU index.  So, what you have might be best.
> 

I agree with you. I think the indexing is not that complex, since it
requires only one shift by cpu_order and an em->min_xid adjustment.
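
For instance, the next patch's exch lookup for an incoming frame could
map an xid to its pool and index roughly like this (just an
illustrative sketch; the actual function name, locking and exch hold
details may differ):

static struct fc_exch *fc_exch_find(struct fc_lport *lp,
				    struct fc_exch_mgr *mp, u16 xid)
{
	struct fc_exches_pool *pool;
	struct fc_exch *ep = NULL;

	if ((xid >= mp->min_xid) && (xid <= mp->max_xid)) {
		/* low bits of the xid select the originating cpu's pool */
		pool = per_cpu_ptr(mp->pool, xid & lp->cpu_mask);
		spin_lock_bh(&pool->lock);
		/* remaining bits index into the pool's exch ptr array */
		ep = *fc_exch_ptr(pool, (xid - mp->min_xid) >> lp->cpu_order);
		spin_unlock_bh(&pool->lock);
	}
	return ep;
}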

> > +};
> > +
> >  /*
> >   * Exchange manager.
> >   *
> > @@ -66,6 +74,8 @@ struct fc_exch_mgr {
> >  	u32	total_exches;		/* total allocated exchanges */
> >  	struct list_head	ex_list;	/* allocated exchanges list */
> >  	mempool_t	*ep_pool;	/* reserve ep's */
> > +	u16		pool_max_xid;	/* max exchange id per pool */
> > +	struct fc_exches_pool *pool;	/* per cpu exches pool */
> >  
> >  	/*
> >  	 * currently exchange mgr stats are updated but not used.
> > @@ -1742,6 +1752,7 @@ static void fc_exch_mgr_destroy(struct kref *kref)
> >  	 */
> >  	WARN_ON(mp->total_exches != 0);
> >  	mempool_destroy(mp->ep_pool);
> > +	free_percpu(mp->pool);
> >  	kfree(mp);
> >  }
> >  
> > @@ -1761,6 +1772,10 @@ struct fc_exch_mgr *fc_exch_mgr_alloc(struct fc_lport *lp,
> >  {
> >  	struct fc_exch_mgr *mp;
> >  	size_t len;
> > +	u16 pool_xid;
> > +	size_t psize;
> > +	unsigned int cpu;
> > +	struct fc_exches_pool *pool;
> >  
> >  	if (max_xid <= min_xid || max_xid == FC_XID_UNKNOWN) {
> >  		FC_LPORT_DBG(lp, "Invalid min_xid 0x:%x and max_xid 0x:%x\n",
> > @@ -1793,10 +1808,31 @@ struct fc_exch_mgr *fc_exch_mgr_alloc(struct fc_lport *lp,
> >  	if (!mp->ep_pool)
> >  		goto free_mp;
> >  
> > +	/*
> > +	 * Setup per cpu exches pool with entire exchange
> > +	 * id range equally divided across all cpus.
> > +	 */
> > +	pool_xid = (mp->max_xid - mp->min_xid + 1) / (lp->cpu_mask + 1);
> > +	mp->pool_max_xid = pool_xid - 1;
> > +
> > +	/*
> > +	 * Allocate and initialize per cpu exches pool
> > +	 */
> > +	psize = pool_xid * sizeof(struct fc_exch *) + sizeof(*pool);
> > +	mp->pool = __alloc_percpu(psize, __alignof__(sizeof(struct fc_exch *)));
> 
> Don't take sizeof inside the alignof.  Just do: __alignof__(struct fc_exch *)
> If we have sizeof in there, it would be equivalent to __alignof__(size_t),
> which could have different alignment requirements than a pointer.
> 
> Actually, the alignment we want is that of the pool structure:
> 	__alignof__(struct fc_exches_pool)
> 

I'll fix this.
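
So with that fix the allocation call becomes:

	mp->pool = __alloc_percpu(psize, __alignof__(struct fc_exches_pool));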

> 
> > +	if (!mp->pool)
> > +		goto free_mempool;
> > +	for_each_possible_cpu(cpu) {
> > +		pool = per_cpu_ptr(mp->pool, cpu);
> > +		spin_lock_init(&pool->lock);
> > +		INIT_LIST_HEAD(&pool->ex_list);
> > +		pool->exches = (struct fc_exch **) (pool + 1);
> 
> I think we're not supposed to put a space after a cast.
> 

I'll fix this, though checkpatch was okay with that; or let's say you
work better than checkpatch in catching such style things :-).

> > +	}
> > +
> >  	kref_init(&mp->kref);
> >  	if (!fc_exch_mgr_add(lp, mp, match)) {
> > -		mempool_destroy(mp->ep_pool);
> > -		goto free_mp;
> > +		free_percpu(mp->pool);
> > +		goto free_mempool;
> >  	}
> >  
> >  	/*
> > @@ -1807,6 +1843,8 @@ struct fc_exch_mgr *fc_exch_mgr_alloc(struct fc_lport *lp,
> >  	kref_put(&mp->kref, fc_exch_mgr_destroy);
> >  	return mp;
> >  
> > +free_mempool:
> > +	mempool_destroy(mp->ep_pool);
> >  free_mp:
> >  	kfree(mp);
> >  	return NULL;
> > diff --git a/drivers/scsi/libfc/fc_lport.c b/drivers/scsi/libfc/fc_lport.c
> > index ef6d2bf..be3370b 100644
> > --- a/drivers/scsi/libfc/fc_lport.c
> > +++ b/drivers/scsi/libfc/fc_lport.c
> > @@ -1620,6 +1620,29 @@ int fc_lport_init(struct fc_lport *lport)
> >  		fc_host_supported_speeds(lport->host) |= FC_PORTSPEED_10GBIT;
> >  
> >  	INIT_LIST_HEAD(&lport->ema_list);
> > +
> > +	/*
> > +	 * Update cpu_mask and cpu_order, the cpu_mask is
> 
> s/Update/Initialize/
> s/, the/.  The/
> 

I'll fix this.

> > +	 * set for nr_cpu_ids rounded up to order of 2's
> > +	 * power and order is stored in cpu_order as
> > +	 * this is later required in mapping between an
> > +	 * exch id and array index in per cpu pool of
> > +	 * exches.
> > +	 *
> > +	 * This round up is required to align cpu_mask to
> > +	 * exchange id's lower bits such that all incoming
> > +	 * frames of an exchange gets delivered to same cpu
> > +	 * on which exchange originated by simple bitwise
> > +	 * AND operation between cpu_mask and exchange id.
> > +	 */
> > +	lport->cpu_mask = 1;
> > +	lport->cpu_order = 0;
> > +	while (lport->cpu_mask < nr_cpu_ids) {
> 
> Side note:  nr_cpu_ids is not quite the number of CPU IDs.
> It must be the max CPU ID plus one.  Minor distinction but the
> usage is correct here.
> 
> We should limit the number of pools somehow so that if nr_cpu_ids
> is set to 4096 (if MAXSMP is selected) we still divide the
> exchange space sanely.
> 

I'm using __alloc_percpu() to allocate the pools to get the max cache
benefit, and I'm not sure we could allocate fewer per cpu pools that
way. Instead, increasing MAX_XID should work fine for systems with many
cpus, but that won't limit the pools to fewer cpus. If we limit pools
to fewer cpus then we run into another issue: how to allocate exches on
cpus not having their own pool. Those cpus would have to use another
cpu's pool, and in turn we'd get locking contention on the pool lock
when allocating exches for them. Also, hashing with cpu_mask would
never make use of cpus without a pool for incoming frame processing.

So I think allocating pools for all cpus will work fine; in any case
the pools don't allocate the complete exch memory, just mainly their
pointers. It will be only approx 32KB of pool memory for a 4K exches
configuration on a 64bit system.
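
For reference, that estimate works out as follows on a 64bit system
with the 4K xid range above:

	0x1000 xids * sizeof(struct fc_exch *) = 4096 * 8 bytes = 32KB

across all pools combined, plus one small struct fc_exch_pool header
per cpu.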

> If there are 32 processors and 4K XIDs, that's only 128
> exchanges per pool.  Maybe that should be the limit or it
> should depend on the size of the XID range to begin with.
> 
> Also, note that cpu_mask and cpu_order could be globals (if renamed)
> initialized when libfc is loaded.  I don't know if that's better or not.
> As they are, they're the same for all lports.
> 

Makes sense to make them global, so that they will not get replicated
per lport and EM. I'll rename them to fc_cpu_mask and fc_cpu_order
along with making them global.
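
That would be something like this in fc_exch.c, with the init run once
at libfc module load (fc_cpu_mask_init is just an illustrative name
here):

static u16 fc_cpu_mask;		/* cpu mask for possible cpus */
static u16 fc_cpu_order;	/* 2's power to represent total possible cpus */

static void fc_cpu_mask_init(void)
{
	/* same round up of nr_cpu_ids to a power of 2 as in this patch */
	fc_cpu_mask = 1;
	fc_cpu_order = 0;
	while (fc_cpu_mask < nr_cpu_ids) {
		fc_cpu_mask <<= 1;
		fc_cpu_order++;
	}
	fc_cpu_mask--;
}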

> > +		lport->cpu_mask <<= 1;
> > +		lport->cpu_order++;
> > +	}
> > +	lport->cpu_mask--;
> > +
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL(fc_lport_init);
> > diff --git a/include/scsi/libfc.h b/include/scsi/libfc.h
> > index 3f1993a..6c69b3a 100644
> > --- a/include/scsi/libfc.h
> > +++ b/include/scsi/libfc.h
> > @@ -703,6 +703,8 @@ struct fc_lport {
> >  	struct libfc_function_template tt;
> >  	u8			link_up;
> >  	u8			qfull;
> > +	u16			cpu_mask;
> > +	u16			cpu_order;
> >  	enum fc_lport_state	state;
> >  	unsigned long		boot_time;
> >  
> > 
> > _______________________________________________
> > devel mailing list
> > devel at open-fcoe.org
> > http://www.open-fcoe.org/mailman/listinfo/devel
> 
> Looks good.  Thanks for doing this, Vasu.
> 
> 	Joe
> 

Thanks for all the good quality review comments.
Vasu




