August 6th, 2013 13:00

Ask the Expert: Isilon Performance Analysis

Welcome to this EMC Support Community Ask the Expert conversation. This is an opportunity to learn about and discuss the best practices for Isilon performance analysis, including:

  • Client work-flow considerations
  • Network considerations
  • How OneFS identifies and breaks down causes of latency or contention
  • Tools that help identify work-flow bottlenecks
  • Sizing and tuning protocol OPS to disk IOPS

This discussion begins on August 12 and concludes on August 23. Get ready by bookmarking this page or signing up to receive email notifications.

Your host:


John Cassidy has spent three decades developing and supporting complex solutions and simplifying problems. He brings the following to bear on OneFS performance and complex work-flow issues:

* Work-Flow profiling

* Simplification methods

* Measurement tools

* Deterministic Results

1.2K Posts

August 15th, 2013 23:00

John, thanks for one more wonderful reply, but is there a technical or transmission problem?

Your post ends somewhat abruptly, and another post where you mentioned kern.maxvnodes seems to be missing...(?)

-- Peter

12 Posts

August 16th, 2013 07:00

Hello John,

I had a colleague who had a performance issue that was mostly narrowed down to misaligned host partitions. After realigning the file system, the performance was much improved. Can you help me understand why this makes such a significant difference?

19 Posts

August 16th, 2013 09:00

The typical reason a misaligned partition on a guest OS creates a performance issue is that the blocks from partition table one, as they are executed on a datastore, straddle two stripe units of a protection group.

With +2:1 protection on a 5-node cluster, your N+M layout will be 8+2: 8 data blocks and 2 parity blocks.

  [ 128KB ] - SU1 <---+  If your P1 partition starts in the middle of stripe unit 1
  [ 128KB ] - SU2 <---+  and you perform a 128KB I/O, that I/O straddles SU-1 and
  [ 128KB ] - SU3        SU-2, meaning you see 2 I/Os instead of one.
  ...
  [ 128KB ] - SU8
  [ 128KB ] - P1
  [ 128KB ] - P2
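To make the arithmetic concrete, here is a rough sketch (plain awk run from any shell, not an Isilon tool) that counts how many stripe units a single 128KB I/O touches at an aligned offset versus the classic misaligned 63-sector offset (32256 bytes); the 128KB stripe-unit size and the offsets are illustrative assumptions:

  awk 'BEGIN {
    su = 128 * 1024;                      # assumed stripe-unit size in bytes
    io = 128 * 1024;                      # assumed I/O size in bytes
    n  = split("0 32256", offs, " ");     # aligned start vs. 63-sector misaligned start
    for (i = 1; i <= n; i++) {
      off   = offs[i];
      first = int(off / su);              # first stripe unit touched
      last  = int((off + io - 1) / su);   # last stripe unit touched
      printf "offset %6d bytes -> touches %d stripe unit(s)\n", off, last - first + 1;
    }
  }'

At offset 0 the I/O lands in exactly one stripe unit; at offset 32256 it spills into a second one, which is the extra I/O described above.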

19 Posts

August 16th, 2013 09:00

What performance test(s) are you running and what result are you seeing?

19 Posts

August 16th, 2013 11:00

If you fall below 2%, nothing happens. If you fall to the minimum supported level for GNA, 1.5%, GNA will be disabled. We will continue to write to the SSDs until they are full; however, we will no longer record the GNA metadata blocks. We will continue to READ the metadata blocks already stored for acceleration.
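As a rough illustration of those thresholds, here is a hedged back-of-the-envelope check (plain awk from any shell, not an isi command); the SSD and HDD capacities are made-up example numbers you would substitute with values from isi status -d:

  awk -v ssd_tb=6 -v hdd_tb=360 'BEGIN {
    ratio = 100 * ssd_tb / hdd_tb;          # SSD capacity as a percentage of HDD capacity
    printf "SSD:HDD ratio = %.2f%%\n", ratio;
    if (ratio >= 2.0)      print "OK: at or above the recommended 2% ratio";
    else if (ratio >= 1.5) print "Caution: below 2% but still above the 1.5% GNA minimum";
    else                   print "Below the 1.5% minimum: GNA would be disabled";
  }'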

132 Posts

August 16th, 2013 13:00

So are you saying that once GNA is enabled, even if we fall below 1.5% of total space, you essentially still get GNA?  I had thought that when you go below the threshold the metadata on the SSD would be evicted.  You are saying that is not the case?

What do you mean exactly when you say that "we will no longer record the GNA metadata blocks"?

19 Posts

August 16th, 2013 14:00

Peter,

I edited comment #15; hope it has the connected vnode context now. I am not sure what happened to my prior post.

Best,

John

19 Posts

August 16th, 2013 15:00

Andrew,

The following white paper talks more to GNA and other OneFS storage topics.

http://www.emc.com/collateral/hardware/white-papers/h8321-wp-smartpools-storage-tiering.pdf

Metadata blocks are files, directories and OneFS internal file system structures. GNA is a separate metadata block that allows nodes that don't have SSDs physically in them to store metadata on nodes that do. Think of GNA as another layer where we offer a means to leverage SSD hardware from nodes that don't have SSDs. The challenge as you grow the cluster is simply that, with GNA enabled, if we have too many nodes that are not sharing the work, we create a situation where the few SSD nodes get bullied by the many non-SSD nodes, which does impact overall cluster performance. When GNA falls to the minimum supported level, meaning we no longer have the right ratio of SSDs to SATA/SAS (i.e. we added more nodes that didn't have SSDs):

    We stop creating new GNA copy blocks as you create more files or directories on nodes that don't have SSDs.

Your metadata will still be written to the SSDs as per your metadata policy; we just won't create the GNA copies. You gain no benefit for new files or dirs that you create on non-SSD nodes, in that GNA copies are not being written, meaning all the metadata on non-SSD nodes is written to SAS/SATA disks.

Last point: we don't delete existing GNA blocks, meaning you still get benefit from the old blocks.

Best,

John

1.2K Posts

August 19th, 2013 00:00

Thanks a lot, John.

How do the vnodes fit into the Level1/Level2 caching mechanism?

Is it advisable to "tune" the vnode space?

Considering the SSD/GNA aspect (not so much caching) for large tree walks, why is there no speed-up when having only the directory's metadata (= links and filenames) - but not the files' metadata - on SSD?

Best,

Peter

1.2K Posts

August 19th, 2013 00:00

Using and enjoying GNA, we found that 2% is quite a "safe" limit.

For example, we see a 1% ratio of (consumed)SSD : (consumed)HDD for a pool with an average file size of 0.25 MB (actually on an X200/SATA+SSD pool with metadata on for all files; so not really GNA).

Summarizing several real GNA observations (X200/SATA+SSD -> 108NL), the SSD:HDD ratio seems to drop down to 0.2% for an average file size of about 2 MB, and to 0.1% for 50 MB and larger.

That's on 6.5.5; probably on 7.0 it comes out a bit different.

-- Peter

19 Posts

August 20th, 2013 15:00

Peter,

VNODES are part of the operating system side of OneFS, which is seeded from the BSD 7.x code base in 6.5.x through 7.0.x. VNODES play a role in L1 and L2 in that what is being tracked are file, directory or symlink level details. My point in mentioning kern.maxvnodes is that it does play a performance role with large file sets, in that the more direct pointers to data blocks you can cache in memory, the more optimal the performance. This kernel parameter scales the number of elements, per node, as memory increases in a system:

   4GB  =   250,000
   8GB  =   500,000
  16GB  = 1,000,000
  32GB+ = 2,000,000

Tuning is not recommended: on smaller memory footprints, storing too many vnodes will squeeze other memory areas due to pressure for available pages; on the larger memory footprints you need to be concerned about page operations that effectively poison the cache and burn CPU cycles managing a large list of cache lines.

E.g. you have a cluster with 500,000,000 (500 million) file/dir objects. If one of your shares is attached to a top level \\ALLMYFILES and all users were allowed to run Microsoft Index Scanner, anti-virus, or an iTunes media scan, these processes would start to walk the 500 million file/dir objects and you would reach high-water marks where managing a really large cache structure becomes expensive; increasing the limit leads to more poisoning, not more performance.

Tuning beyond 2 million objects per node requires some thought, and in most cases the ratio between available memory and the number of cache elements is adequate for the active file set. Meaning, when you have 500,000,000+ files or directories, it is more likely that the active file set is a smaller number, say 8 million files in any given day.
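If you want to see how a node actually sits relative to that ceiling, one hedged way (both are standard FreeBSD sysctls that OneFS inherits; exact availability may vary by OneFS version) is to compare the configured maximum with the vnodes currently allocated, per node:

  sysctl kern.maxvnodes   # configured per-node vnode ceiling
  sysctl vfs.numvnodes    # vnodes currently allocated on this node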

Best,

John

19 Posts

August 20th, 2013 15:00

Peter,

There are differences between 6.5 and 7.x, and likely 7.2 is just around the corner. In 7.x we added METADATA-WRITE; this is the ability to put all metadata from work-flow DIRS/FILES/LINKS onto the SSDs to optimize METADATA WRITE operations like file create, rename, and change attributes: size, ACLs, uid, gid, anti-virus scan hash...

Internally OneFS stores certain types of metadata that are part of the file system: b-trees (a part of DIRS/SNAPSHOTS/OneFS structures) and extension blocks, as needed when large numbers of snapshots are retained.

As we move forward in product development we will balance the metadata structures that live on SSD and better balance your application I/O performance with optimal I/O execution. Meaning, creating a file involves many OneFS metadata structure updates, especially as you add protection layers (Snapshots, SyncIQ). We will be tuning the engine to leverage the most out of SSDs on your behalf.

The issue with the 1.5% and 2% SSD ratio (or at least 20% of all nodes with SSD) in GNA is a balance between the following:

1) We want to ensure that the amount of free blocks on the SSDs is adequate for metadata needs. A file INODE at its smallest will be 512 bytes. However, as we add ACL->ACE AIMA items this will grow the inode, making INODES a little variable in size.

  isi status -d  # gives you a high-level overview of usage.

  isi statistics drive -nall --orderby=inode --long --top  # shows you, in real time, which drives have the most inodes.

  sysctl efs.lbm.drive_space  # gives you very granular per-drive values such as blkfree and usedino.

2) You want to ensure that no nodepool's SSD nodes become bullied to the point where a node begins to slow down overall cluster performance.

  e.g. X200, 6 SSD and 6 SATA  # this is not an invalid config; some of you may have it. But the sizing exercise considered the following:

The 6 SATA drives will make up some of the data set of the cluster. However, if GNA is enabled and, say, 3 of the X200s allow you to meet the 1.5% ratio rule, the concern with a very heavy metadata-write work-flow would be that you can drive the X200 to the point that all 4 CPUs are taxed at 100% in terms of user, system and irq time (no IDLE). This could starve the cluster of data on the 6 SATA spindles. Meaning that correct sizing of the SSD ratio to the work-flow is needed to avoid having too few SSD nodes; 4 nodes with 4 SSDs may be the better balancing act to keep the performance curve in that 'sweet spot'.
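As a hedged way to sanity-check the "at least 20% of nodes carry SSDs" side of that guideline (plain awk from any shell; the node counts are made-up examples to be replaced with your own):

  awk -v ssd_nodes=3 -v total_nodes=18 'BEGIN {
    pct = 100 * ssd_nodes / total_nodes;   # share of nodes that carry SSDs
    printf "%.1f%% of nodes have SSDs\n", pct;
    if (pct >= 20) print "Meets the 20% guideline";
    else           print "Too few SSD nodes for the work-flow; consider adding SSD nodes";
  }'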

Best,

John

1.2K Posts

August 20th, 2013 23:00

John,

thanks a lot for the new answers!

>  isi statistics drive -nall --orderby=inode --long --top


On our NL pool we see the Used% pretty balanced around 73% (we always made sure that MultiScan succeeds after disk swaps, node reboots etc.).

But the Inodes range from 590K to 2.4M per disk in that pool; is this kind of unbalance normal or should we take care of it?


-- Peter

19 Posts

August 21st, 2013 15:00

Peter,

"But the Inodes range from 590K to 2.4M per disk in that pool; is this kind of unbalance normal or should we take care of it?"


Ideally, the inodes should be balanced. The SmartPools process, if you have the license, runs every day by default at 22:00. This process should be balancing the inodes for you. Over the life of a file system you will delete files or purge aged files, meaning it's possible that you see this from time to time. However, jobs like


smartpools

setprotectplus

autobalancelin


when run, will balance the inodes across the disks.


If it were an issue, you might notice from isi statistics drive -nall --orderby=Queued | head -14 that some drives show as unusually busy. If the size of the In or Out I/O divided by the operations is <4K, it's likely that these are namespace/metadata operations.
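As a hedged illustration of that <4K rule of thumb (plain awk from any shell; the byte and operation counts are made-up numbers you would take from the In/Out and OPS columns of the isi statistics output):

  awk -v bytes=1835008 -v ops=900 'BEGIN {
    avg = bytes / ops;                               # average transfer size per operation
    printf "average I/O size = %.0f bytes\n", avg;
    if (avg < 4096) print "Likely namespace/metadata-heavy work-load";
    else            print "Likely regular data I/O";
  }'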


A quick way to see namespace counts:


isi perfstat


Looking at the Files #/s columns, Create/Remove are metadata writes and Lookup is reads.


Best,

John

1.2K Posts

August 22nd, 2013 03:00

> smartpools

> setprotectplus

> autobalancelin


> when run, will balance the inodes across the disks.


These jobs have been running successfully after past disk replacements and the addition of a new node, and have balanced the data usage pretty well.

But looking at the inode number per disk, one can clearly detect the disks that have been replaced: those are holding much fewer inodes. It seems like the inodes have been written through normal traffic, but not through explicit balancing. Does this indicate that something is broken here? Load distribution across disks is quite even, though.

(Maybe this issue is beyond the scope of this discussion; any advice on further steps, like whether to open a case, will be appreciated - thanks!)


-- Peter

