
August 28th, 2013 01:00

Isilon storage performance issue

Hi all,

We have a setup of four X200 storage nodes, where each node has 6 GB RAM and 12 x 7200 RPM SATA disks (except storage node #4, which has 11 disks), for a total of 47 disks.

We access the storage cluster from 4 compute nodes via NFS. Front-end networking from the compute nodes to the storage cluster goes through a 1 Gbps switch (each compute node can read and write to the storage cluster at 1 Gbps), and back-end networking uses QDR InfiniBand.

We ran a performance evaluation to find the maximum random read IOPS for 128K reads (we chose 128K because our application, Elasticsearch, reads data in 128K chunks).

We used the FIO tool to measure performance.

Our observations:

1. With 20 FIO jobs (20 parallel reader threads) running on a single compute node, we get a maximum of 850 IOPS at 128K (maxing out the 1 Gbps network link). Running the same FIO job on 2 compute nodes in parallel (40 reader threads total, 20 on each), performance drops to 673 and 691 IOPS respectively, 1364 IOPS in total. Running it on 4 compute nodes in parallel (80 reader threads total, 20 on each), performance drops further to 252, 255, 567 and 550 IOPS, 1624 IOPS in total. By contrast, 8K random reads from 4 compute nodes (80 reader threads, 20 per compute node) give 4602 IOPS.
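For reference, the FIO invocation was roughly of the following shape (a sketch only; the directory, file size and runtime shown here are placeholders rather than the exact job file we used):

fio --name=randread128k --directory=/mnt/isilon/fiotest --rw=randread --bs=128k --numjobs=20 --size=4g --ioengine=libaio --direct=1 --time_based --runtime=300 --group_reporting

The 8K test was the same job with --bs=8k.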


My question:

Why are we stuck at a maximum of 1624 IOPS for 128K random reads, when the 47 disks can clearly deliver much higher IOPS (with 8K random reads we get 4602 IOPS)?

I understand that when we read 128K chunks we are effectively reading more data, but in this case neither the network from compute to storage nor the CPU on the storage and compute nodes was the bottleneck. In all of the above cases we used an 8K disk block size on the Isilon, a random access pattern to optimize IOPS, and +2:1 protection.
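To put rough numbers on it (back-of-the-envelope only, derived from the figures above):

1624 IOPS x 128 KB ≈ 203 MB/s aggregate (128K test)
4602 IOPS x 8 KB ≈ 36 MB/s aggregate (8K test)

So at 128K the cluster is actually moving far more data despite the lower IOPS figure, yet ~203 MB/s is still well below the roughly 4 x 120 MB/s combined capacity of the four 1 Gbps client links.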

We also tried increasing the disk block size on the Isilon, but performance decreased as the block size increased.

It would be great if you could recommend tuning parameters to optimize IOPS for 128K chunks.


Regards,

Prakash

 

1.2K Posts

January 28th, 2014 00:00

Hi Damal,

This is roughly how I use these commands to investigate performance issues in new situations:


> isi statistics pstat


Bottom half: overall view to see what's going on. Compare Network -- Filesystem -- Disk throughputs to see whether they are consistent with what is expected for the workflow, as far as it is known. In a typical NAS workflow, network throughput should match filesystem throughput. Write throughput should match disk write throughput (mind the protection overhead). Read throughput matches disk read throughput for uncached workflows, or can be much higher with good caching. High disk activity without network or filesystem activity indicates some internal job is running (restripes etc.). High CPU without any network/filesystem/disk activity would be very strange (some process running wild, for example). Just illustrating how things can be learned from pstat.

The top half of pstat is protocol specific; it might need to be run separately for NFS and SMB. Again, a quick check for consistency: does the observed mix of reads/writes etc. make sense in the light of the assumed workflow?

> isi statistics system --nodes --top

A quick check whether the load is well balanced across the cluster, and how the protocols are used (SMB vs NFS etc.). It can also indicate where physical network bandwidth hits the max.

> isi statistics client --orderby=Ops --top

Who is causing the load? Use --orderby=In or Out or TimeAvg for throughputs or latencies, respectively.

> isi statistics client --orderby=Ops --top --long

With --long there are InAvg and OutAvg, which denote the request ("block") sizes (NOT the average of the In and Out rates!!). Small request sizes often indicate suboptimal configs on the client side.

> isi statistics drive -nall -t --long --orderby=OpsOut

Do the disk activities max out? Also try --orderby=OpsIn or TimeInQ. Do the disk activities match the assumed workload? Small SizeIn and SizeOut request sizes indicate metadata ops or small random IO; pure streaming reads/writes are usually up to 64k.

Disclaimer: those are just my personal favorites (plus isi statistics heat), and I might err on interpretations; I am not aiming to convince anyone.

Cheers

-- Peter

1.2K Posts

August 28th, 2013 04:00

Have you seen this? I'd recommend having a look:

Ask the Expert: Isilon Performance Analysis

Maybe a few things can be checked in advance (before tracking things down to disk level):

- double check that no background jobs are running and stealing CPU or IOPS

- with four clients, is the network traffic well balanced across the four Isilon nodes?

- are the actual NFS read/write sizes large enough for 128K? (server and client negotiate a match within their limits.)

- is the random access pattern really in effect?

- for 128K reads, one could also try the concurrency pattern...

-- Peter

August 28th, 2013 06:00

Thanks Peter for your valuable response!

- No background jobs were taking CPU/IOPS.

- Need to check (is looking at client connections on the Management UI dashboard the right way? We are using SmartConnect with the basic license; I think that means the round-robin policy).

- We mounted NFS on the compute nodes with rsize=131072,wsize=131072 (need to check whether the negotiated value is also 128K; see below for how I plan to check).

- We made the setting in the Management UI; can you suggest a way to check whether it is really in effect?

- Need to check (will update you once it is done).
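On the NFS request sizes: I plan to verify what the Linux clients actually negotiated by checking the active mount options, e.g.

nfsstat -m
grep nfs /proc/mounts

Both list the effective rsize/wsize per mount; they should show 131072 if the negotiation went through.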

1.2K Posts

August 29th, 2013 02:00

> - Need to check (is looking at client connections on the Management UI dashboard the right way? We are using SmartConnect with the basic license; I think that means the round-robin policy).

The WebGUI is OK, but IMO too slow for live monitoring.

On the command line interface (CLI):

isi nfs clients ls

isi perfstat

Other useful views for live monitoring (just to start with...)

isi statistics system --nodes --top

isi statistics client --orderby=Ops --top

isi statistics heat --top

isi statistics pstat

(these are 6.5 commands; on 7.0 the syntax might vary, or use isi_classic instead)

> - We made the setting in the Management UI; can you suggest a way to check whether it is really in effect?


isi get "filename"

isi get -DD "filename"


The latter shows the actual layout of the file on the cluster disks (in very verbose form, though it is not so easy to count the number of disks used). Usually files with "streaming" access should spread onto more disks, but on small (or fragmented?) clusters the difference between streaming/random/concurrency might appear minimal.


isi set -l {concurrency|streaming|random} -r g retune "filename"


will actually change the layout if needed (I trust this more than the WebUI). Even when it has finished, it might take a few more seconds until the changes show up in isi get -DD.


And if you have SmartPools enabled, make sure you allow settings per file instead of SmartPools ruling everything. (Or do use SmartPools, but then you would need to run a SmartPools job each time to implement changes.)


The access pattern (as listed by isi get) also affects prefetching; try out all three choices. You are doing random IO, but on 128K chunks -- certainly larger than 4k or 8k. And you have many concurrent accesses, so it's hard to predict. It is even possible to fine-tune prefetching beyond the presets for the three patterns, but I would keep that for later...

-- Peter

August 29th, 2013 08:00

Thanks Peter,

I will check the details and update you (currently the setup is not with me).

In the meantime, I have some questions to help me understand the performance bottleneck. It would be nice to get answers to the questions below:

Considering I have four X200 nodes, the filesystem block size is 8K, and data protection is +2:1 (2-disk or 1-node failure):

1. If I write a file of 768 KB (128K x 6), will it write 6 data stripe units and 2 parity stripe units (each stripe unit 128K, forming a protection group of 8 stripe units) across the 4 nodes, so that each node holds 2 stripe units (with at least one data stripe unit on each node)?

2. When it writes a 128K stripe unit (16 blocks) on a given node, which has 12 disks in my case, how many disks will it use? Will it use all 12 disks, or write the 128K on one disk only and use a second disk for the next stripe unit on that node, and so on?

3. What will happen if my file size is exactly 128K? Will it write the 128K stripe unit to 3 nodes (mirrored, to survive a 2-disk or 1-node failure) and create 2x overhead?

4. If my application opens a file over NFS, writes a 128K chunk, and keeps appending further 128K chunks to the same file, does OneFS stripe every 128K chunk as it arrives, or does it wait for multiple 128K chunks and do the file striping later so it can define the file layout properly?

5. If I change the filesystem block size from 8K to 32K, does that mean my stripe unit will now be 16 x 32K?

6. OneFS uses 16 contiguous blocks to create one stripe unit; can we change 16 to some other value?

7. Can we access the Isilon storage cluster from compute nodes (running RHEL) using the SMB protocol? I read in a performance benchmark from the storage council that SMB performance is almost double that of NFS in terms of IOPS.

Thanks & Regards,

Prakash

1.2K Posts

August 30th, 2013 03:00

A couple of thoughts and suggestions:

http://www.emc.com/collateral/hardware/white-papers/h10719-isilon-onefs-technical-overview-wp.pdf

is really worth reading to learn more about the filesystem layout. And questions similar to yours have been discussed here recently:

Isilon overhead for 174kb files

How many sizes does Isilon consume when writing 128kb file with N+2:1 and N+3:1

With all that information, it is really fun to examine a file's disk/block layout as reported by isi get -DD "file".

Furthermore:

> 5. If I change the filesystem block size from 8K to 32K, does that mean my stripe unit will now be 16 x 32K?

I don't think you can do so - which exact setting are you referring to?

> 6. OneFS uses 16 contiguous blocks to create one stripe unit; can we change 16 to some other value?

I couldn't imagine so, but the access pattern parameter controls whether a larger or smaller number of disks per node is used (under the constraint of the chosen protection level).

> 7. Can we access the Isilon storage cluster from compute nodes (running RHEL) using the SMB protocol? I read in a performance benchmark from the storage council that SMB performance is almost double that of NFS in terms of IOPS.

In benchmarks, SMB IOPS appear higher than NFS IOPS because the set of protocol operations is different even for identical workloads, not to mention when different workloads are used. You cannot compare the resulting values...

For your original test, you might max out on disk IOPS (xfers), but you could also get stuck at a certain rate of "application IOPS" while seeing little or no disk activity at all(!), because your data is mostly or entirely in the OneFS cache. Check the "disk IOPS" or xfers, including the average size per xfer, with

isi statistics drive -nall -t --long --orderby=OpsOut

and cache hit rates for data (level 1 & 2) with:

isi_cache_stats -v 2

In case of very effective caching the IOPS will NOT be limited by disk transfers (so all that filesystem block size reasoning doesn't apply).

Instead the limit is imposed by CPU usage, or network bandwidth, or by protocol (network + execution) latency, even if CPU or bandwidth is < 100%.

In the latter case, doing more requests in parallel should be possible (it seems you are right on that track anyway with multiple jobs).

To check protocol latencies, use "isi statistics client" as before and add --long:

isi statistics client --orderby=Ops --top --long

This will show latency times as:  TimeMax    TimeMin    TimeAvg   (also useful for --orderby=... !)

Good luck!

Peter

165 Posts

January 23rd, 2014 14:00

Peter, that is a great explanation.   

Could you explain the commands a little more? I mean, what do we have to look for that says something is wrong when we run these commands?

isi statistics system --nodes --top

isi statistics client --orderby=Ops --top

isi statistics pstat

isi statistics drive -nall -t --long --orderby=OpsOut

isi statistics client --orderby=Ops --top --long


Thanks,

Damal

165 Posts

January 28th, 2014 11:00

Peter, thank you very much for the explanation. I hope other people who visit this page will make corrections, if required.

> With --long there are InAvg and OutAvg, which denote the request ("block") sizes (NOT the average of the In and Out rates!!). Small request sizes often indicate suboptimal configs on the client side.

When you say small, is there a particular value we should look for? Any recommendations on what configuration has to be changed on the client side?

1.2K Posts

February 4th, 2014 03:00

Hi Damal:

For NFS, it's basically the max and preferred read and write sizes (512KB, 128KB, 512KB, 512KB respectively on the Isilon side). Just make sure that clients do not limit requests to smaller sizes via the NFS mount parameters rsize and wsize (which act as the maximum possible sizes). These might be set in the client's fstab or automount map; some systems may have other places to set global defaults.

Random small IOs will result in smaller request sizes, of course, as they can't get coalesced on the client side.
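For example (the server name and paths here are just placeholders), an fstab entry along these lines makes the large request sizes explicit instead of relying on defaults:

isilon-sc.example.com:/ifs/data  /mnt/isilon  nfs  vers=3,rsize=131072,wsize=131072,hard  0 0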

SMB1 is limited to 64KB requests, which has always been pretty bad, while recent SMB versions allow much larger requests. "Secure packet signing" reduces the allowed request sizes, though. For potential restrictions on the client side, please refer to the Windows (or other client OS, resp. Samba) docs.

Cheers

-- Peter

165 Posts

February 4th, 2014 08:00

Peter, thank you for the responses. It gives better insight into handling these issues.

Best Regards,

Yoga

31 Posts

November 17th, 2014 12:00

Fantastic summary, Peter. Thank you very, very much.

Amir
