Isilon high LSASS CPU utilization

Question

Has anyone had issues with high CPU usage by LSASS on an Isilon cluster? We had a problem that we thought was related to a failing IB/NVRAM card, but we're still seeing high LSASS CPU utilization on this cluster and simply cannot understand why.

Right now it's running at 137% WCPU in "top". Then I see it drop off to nothing, build back up past 100% CPU utilization, hit a peak again, and start all over. This correlates directly with how much load the NFS process is putting on the cluster.

a) Is this a problem or just normal and

b) Should we be concerned about this?

kipcranford · Accepted Answer

Now my node is displaying about 250% WCPU during the same test. Again, nothing wrong, just loaded.

You can also look at "top -H" on the node, which will break out each thread (so you'll see all the individual lwio threads at work). This is probably more academically interesting, but it will allow you to see individual thread states, and this could help narrow down a problem if one exists.

And yet another way to look at top is with -P. Here's what I see on a node doing my current test:

CPU 0: 35.4% user, 0.0% nice, 56.7% system, 7.5% interrupt, 0.4% idle

CPU 1: 39.0% user, 0.0% nice, 55.1% system, 5.9% interrupt, 0.0% idle

CPU 2: 41.3% user, 0.0% nice, 55.9% system, 2.8% interrupt, 0.0% idle

CPU 3: 34.6% user, 0.0% nice, 57.1% system, 8.3% interrupt, 0.0% idle

Mem: 933M Active, 25G Inact, 20G Wired, 12G Buf, 1033M Free

Swap:

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND

16685 root 32 98 r150 327M 55336K ucond 1 287:30 220.21% lwio

You can see that the ~220% is roughly the sum of the per-CPU system percentages (56.7 + 55.1 + 55.9 + 57.1 ≈ 225%), which I believe is why FreeBSD can report over 100% utilization. Again, this doesn't solve your problem; I'm just reiterating that >100% WCPU isn't necessarily a problem.
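To make that arithmetic concrete, here is a minimal Python sketch that sums a column of the per-CPU lines from the "top -P" output quoted above. The sample text is pasted from this post; the names parse_top_p and column_total are made up for illustration and are not part of OneFS or any Isilon tooling.

```python
import re

# Per-CPU lines copied from the "top -P" output quoted in this post.
TOP_P_SAMPLE = """\
CPU 0: 35.4% user, 0.0% nice, 56.7% system, 7.5% interrupt, 0.4% idle
CPU 1: 39.0% user, 0.0% nice, 55.1% system, 5.9% interrupt, 0.0% idle
CPU 2: 41.3% user, 0.0% nice, 55.9% system, 2.8% interrupt, 0.0% idle
CPU 3: 34.6% user, 0.0% nice, 57.1% system, 8.3% interrupt, 0.0% idle
"""

CPU_LINE = re.compile(r"^CPU\s+(\d+):\s+(.*)$")
FIELD = re.compile(r"([\d.]+)%\s+(\w+)")


def parse_top_p(text):
    """Return {cpu_index: {"user": 35.4, "system": 56.7, ...}} (hypothetical helper)."""
    cpus = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line.strip())
        if not match:
            continue  # skip Mem:, Swap:, and process lines
        fields = {name: float(value) for value, name in FIELD.findall(match.group(2))}
        cpus[int(match.group(1))] = fields
    return cpus


def column_total(cpus, column):
    """Sum one column (e.g. "system") across every CPU (hypothetical helper)."""
    return sum(fields.get(column, 0.0) for fields in cpus.values())


cpus = parse_top_p(TOP_P_SAMPLE)
system_total = column_total(cpus, "system")
ceiling = 100.0 * len(cpus)  # each CPU contributes up to 100%

print(f"CPUs seen: {len(cpus)}")
print(f"Sum of system%: {system_total:.1f} (ceiling {ceiling:.0f})")
```

With four CPUs listed, any one column can sum to as much as 400%, so a figure over 100% only says that more than one CPU's worth of work is being done.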

Brian_Coulombe_ · Answer

Yes; however, when the IB/NVRAM card first showed signs of a problem (causing the node to panic and reboot, and EMC to say "You need to replace the card...NOW!"), LSASS WCPU use was at one point 430% before the card was replaced.

We do have a performance hit, but I think this has to do with more and more NFS clients being added to the cluster without the cluster being upgraded or expanded to accommodate them.

sjones51 · Answer

Hi Brian,

The LSASS CPU usage sounds normal under high load as you have described. It would only be an issue if you were seeing a performance hit. Additionally, it would not be related to an IB card failure, as IB handles back-end communication, not front-end client communication. If the IB card had failed completely, the node would go offline. Clients would still be connected, but they would have issues accessing the file system unless the IP address moved.

If you are having performance issues or trouble with clients accessing the file system, I would recommend opening a ticket with Isilon Support.

https://support.emc.com/servicecenter/createSR

sjones51 · Answer

Yes, high demand will cause high CPU usage. A bad IB/NVRAM card can cause a node panic. There is very little chance that they are related though. Sounds like you have two separate issues.

kipcranford · Answer

I'm running SpecSFS right now on a cluster (HD400 running OneFS 8.0.1) and the weighted CPU percentage is around 140% during the loading phases. This doesn't represent a problem per se; it's just an indicator of relatively high load, which I know to be true. So what you're seeing *could* be normal.

On the other hand, there could be a problem as well if some aspect of the protocol server isn't working correctly, or isn't working efficiently. For example, in some earlier OneFS releases parts of lwio were inefficient when doing NFSv4 identity management. This inefficiency would show up as higher than expected CPU utilization, which of course is only exacerbated by higher load.

So, the fact that WCPU is over 100% isn't something to worry about in and of itself. This could just be from high load. Or, it's load combined with some known (or as yet unknown) inefficiency in lwio. At the end of the day, if your users are impacted then it's best to get support involved so that someone can do more profiling of the system in order to get to the root of the issue.
