
December 12th, 2014 05:00

Heavy CPU load

Hi folks,

We are facing heavy CPU load during business hours.

I have read the large performance thread and I am aware of the 1/3 network, 1/3 client, 1/3 cluster performance layout.

But when I see a 6-node cluster at >95% all day, it is a hard discussion with the client and the network guys...

The specific cluster is a 6-node X200 on OneFS 7.0.2.7 without patches, SMB2 only.

We are providing office file space, no media streaming etc.

Right now we try to monitor with these isi statistics commands:

HDD stats:

isi statistics query --nodes=all --stats=node.disk.xfers.rate.avg

isi statistics drive --nodes=all --type=sata

SMB stats:

isi statistics pstat --protocol=smb2

isi statistics query --nodes=all --stats=node.clientstats.active.smb2

isi statistics query --nodes=all --top -i 1 --stats=node.clientstats.connected.smb

ifs stats:

isi statistics heat --nodes=all --totalby=path

Client stats:

isi statistics client --nodes=all --orderby=ops

CPU stats:

isi statistics system --nodes

while true; do clear; date; isi_for_array -s top | grep -A 1 -i 'lwio'; sleep 5;done
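To keep a bit of history without IIQ, we also dump one of these counters to a file from time to time, along these lines (log path and interval are just examples):

while true; do date; isi statistics query --nodes=all --stats=node.clientstats.active.smb2; sleep 60; done >> /ifs/data/stats/smb2_active.log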

What we see is a medium IOPS load on the disks and medium to high queue stats on the disks.

We see heavy SMB2 activity with pstat: peaks of 20k ops/sec, of which 10-13k are create ops/sec.

At the same time lwio is at 300-350% across all CPUs.

Can anyone tell me commands to dig deeper?

Besides that, does anyone have an idea which workflow could create such heavy SMB2 stats?

P.S

InsightIQ is not an option right now.

/Sebastian


December 13th, 2014 06:00

Have a look at your clients with isi statistics client and the following options:

--protocols=smb2,lsass_in,lsass_out

--classes=create,delete,file_state

--orderby

--output

You can certainly add to this as you see fit.
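Put together, that would be something along the lines of:

isi statistics client --nodes=all --protocols=smb2,lsass_in,lsass_out --classes=create,delete,file_state --orderby=ops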

Curious why InsightIQ "isn't an option."  It's always an option - get an eval license if needed from your SE.

Remember Isilon is designed to scale out - if you're trying to service a workload for your business users that cannot be changed and is driving the cluster CPU to saturation, it's time to scale out.


December 12th, 2014 06:00

Those "creates" should show up in  your  isi statistics client & heat outputs.

Client statistics will help breaking it down per user and PC,

while heat statistics can go down do directory and file level.

I'd guess some app/download/update is just running wild,

possibly for multiple users.

hth

-- Peter


December 12th, 2014 09:00

Your statistics monitoring looks great.  From the description it looks like the network load is heavy, but not abnormally so.  A couple of other areas worth checking are the current job engine status and tasks: are any other modules running, such as a SyncIQ job, snapshot job or other operation?  How is the storage capacity looking, and is it balanced across the nodes?
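For example (output details vary by OneFS version):

isi status

isi job status

isi status gives per-node capacity and balance at a glance, while isi job status shows what the job engine is currently running.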

If there's nothing else seen there, I might browse into the log files and take a look at the messages log to see if there's anything unusual there - such as a drive stall or other behavior not obvious from the statistics.  /var/log/messages on any node.  Not saying there should be something in there, just worth a look in case.
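Something along these lines would do it (the grep pattern is just a starting point, adjust to taste):

isi_for_array -s 'egrep -i "stall|timeout" /var/log/messages | tail -n 20'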


December 15th, 2014 08:00

Hi Sebastian, thanks for your reply.  One thing to consider is the recommendations on page 11 of the IIQ 3.1 Installation Guide, w.r.t. configuring the name or IP.  As an alternative, you can specify the SIP (SmartConnect IP) instead of the SC zone name.  This reduces the load on your DNS server as it avoids name resolution for each query.

In addition, for your SIQ jobs, have you considered using the 'continuous' policy for certain directories?  If you are running the policy every 5 minutes, it might be easier to just set continuous mode.  This avoids running the SIQ policy with nothing to do (i.e. no changes over the last 5-minute interval).  Depends on your own preference and RPO, really.  Remember each SIQ job will invoke a snapshot on the source cluster, so with your 5-minute policy you are accumulating 12 x 24 = 288 snaps per day.  That's quite a bit more than 3 :-)

Cheers

Rob


December 15th, 2014 08:00

Thank you for the input.

I am going to monitor the background jobs too.

We have 3 snapshots a day and a SyncIQ job every 5 minutes.

Maybe we should monitor isi_job_d and isi_migrate?

Also I am going to monitor the client stats as Rob wrote.

Do you think an interval of -i 1 is too heavy?

InsightIQ is not an option because it is spamming our DNS servers with queries.

We tried to get it up and running in our environment by configuring the SmartConnect zone of the clusters, which resulted in heavy DNS traffic :-(

Configuring a node IP is not the best option either, because when that specific node is down there is no reporting :-(

Another problem is that a maximum of 8 clusters can be monitored by one InsightIQ instance, and we have a much higher cluster count.

/Sebastian


December 15th, 2014 23:00

Hi Rob,

I will keep the SIP solution in mind. It is not that easy to set up in our network, but never say never :-)

Right now I have to figure out a way without IIQ.

During the migration project to Isilon, the EMC solution architect mentioned that a SyncIQ job every 5 minutes produces a heavy workload. But that's a detail we are not able to change.

How could I monitor the SyncIQ impact, i.e. the isi_migrate job, on every node?

The continuous mode is a very interesting detail you wrote about. I will check whether this is an option for us on OneFS 7.0.2.7.

/Sebastian


December 16th, 2014 06:00

Hi Sebastian,

Sorry to hear it's difficult to set up an IP address in your network.

Anyway, the best way to measure node resource consumption is what you are already doing - isi statistics ...
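For isi_migrate specifically, a variant of your own top loop should do the trick:

while true; do clear; date; isi_for_array -s top | grep -i 'isi_migrate'; sleep 5; done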

Continuous mode SIQ is a feature in OneFS 7.1 and newer.
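On 7.1 and newer you set it on the policy schedule, along these lines (the policy name is a placeholder, check the CLI reference for your release):

isi sync policies modify MyPolicy --schedule when-source-modified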

But remember the key architectural point - OneFS is designed to scale out.  If your current cluster is reaching saturation of resources, due to workload growth to support a growing business, it's time to scale it out.

Cheers

Rob


December 16th, 2014 06:00

Very useful post.

Thanks a lot.


December 17th, 2014 07:00

Sebastian, did you find out where the 10,000+ creates/sec come from? That's actually a pretty high rate for a 6 x X200 cluster if metadata is on SATA drives, so I wondered whether the files get created at all...

If the files are created and kept, you might run into some surprises when crawling through the filesystem at a later point.

BUT if the files are NOT actually created (say due to a permission or path error), then the create ops DO show up in client/protocol statistics, but NOT in filesystem "heat" statistics -- pretty easy to check this way. Such a finding would indicate a useless operation storm from the clients, and would be pretty much in line with the observed HIGH CPU load while the disk IOPS rate stays MEDIUM.
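For example, compare the output of these two (same flags as earlier in this thread):

isi statistics client --nodes=all --protocols=smb2 --classes=create --orderby=ops

isi statistics heat --nodes=all --totalby=path

If the creates dominate the first but never show up in the second, the requests are failing before they reach the filesystem.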


Curious


-- Peter



December 21st, 2014 23:00

Hi Peter,

thanks for that info.

I tried

isi statistics client --nodes=all --protocols=smb2,lsass_out --classes=create,delete,file_state --orderby=ops

For some reason I get an error when I try to add lsass_in (invalid statistics key: 'node.clientstats.proto.lsass_in').

I compared the above output with

isi statistics heat --nodes=all --top

and I only see the classes other and namespace_read.

Is this an indicator that the clients are spamming the system with create ops which never get processed in the filesystem?

/Sebastian


December 25th, 2014 10:00

Hi Sebastian

(yeah, lsass_in doesn't work, but that's not relevant here)

If the creates are shown in the client statistics but not the heat statistics, that means the protocol requests are sent but cannot be executed within the filesystem. As I mentioned, that's either a permission problem, or attempts to create files in non-existent directories.

To get the filenames/paths in question, capture some network packets and decipher the SMB requests...
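For example with tcpdump on one node (interface name, client IP and output path are placeholders -- check ifconfig for your actual interfaces first):

tcpdump -i em0 -s 0 -c 20000 -w /ifs/data/smb_trace.pcap host <client-ip> and port 445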

Charting the traffic by client over time may also give some (indirect) clue as to which users/apps/updates are causing the storm, for example by comparing it with the schedules of recent software installs.

Cheers

-- Peter
