1 Rookie • 107 Posts

5354

September 1st, 2016 10:00

Isilon high LSASS CPU utilization

Has anyone had issues with high CPU usage by LSASS on an Isilon cluster?  We had an issue that we thought was related to a failing IB/NVRAM card, but we're still seeing high LSASS CPU utilization on this cluster and cannot understand why.

Right now it's running at 137% WCPU in the "top" output.  Then I see it drop off to nothing, build back up past 100% CPU utilization, hit a peak again, and start all over.  This correlates directly with how much load the NFS process is putting on the cluster.

a)  Is this a problem or just normal and

b)  Should we be concerned about this?

125 Posts

September 2nd, 2016 09:00

Now my node is displaying about 250% WCPU during the same test.  Again, nothing wrong, just loaded.

You can also look at "top -H" on the node, which will break out each thread (so you'll see all the individual lwio threads at work). This is probably more academically interesting, but it will allow you to see individual thread states, and this could help narrow down a problem if one exists.

And yet another way to look at top is with -P.  Here's what I see on a node doing my current test:

CPU 0: 35.4% user,  0.0% nice, 56.7% system,  7.5% interrupt,  0.4% idle

CPU 1: 39.0% user,  0.0% nice, 55.1% system,  5.9% interrupt,  0.0% idle

CPU 2: 41.3% user,  0.0% nice, 55.9% system,  2.8% interrupt,  0.0% idle

CPU 3: 34.6% user,  0.0% nice, 57.1% system,  8.3% interrupt,  0.0% idle

Mem: 933M Active, 25G Inact, 20G Wired, 12G Buf, 1033M Free

Swap:

  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND

16685 root            32  98 r150   327M 55336K ucond   1 287:30 220.21% lwio

You can see that the ~220% is pretty much the sum of the per-CPU system percentages, which I believe is why FreeBSD can report over 100% utilization.  Again, this doesn't solve your problem; I'm just reiterating that >100% WCPU isn't necessarily a problem.
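To make that arithmetic concrete, here is a minimal Python sketch that parses the `top -P` text quoted in this post, sums the per-CPU "system" column, and prints it next to the lwio WCPU figure. The helper names (`per_cpu_states`, `process_wcpu`) are hypothetical and exist only for this illustration, and the parsing assumes the exact column layout shown above. It only checks the numbers; it does not prove how top derives WCPU.

```python
import re

# Text copied from the `top -P` output quoted in this post.
TOP_OUTPUT = """\
CPU 0: 35.4% user,  0.0% nice, 56.7% system,  7.5% interrupt,  0.4% idle
CPU 1: 39.0% user,  0.0% nice, 55.1% system,  5.9% interrupt,  0.0% idle
CPU 2: 41.3% user,  0.0% nice, 55.9% system,  2.8% interrupt,  0.0% idle
CPU 3: 34.6% user,  0.0% nice, 57.1% system,  8.3% interrupt,  0.0% idle
  PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
16685 root            32  98 r150   327M 55336K ucond   1 287:30 220.21% lwio
"""

CPU_LINE = re.compile(r"^CPU\s+(\d+):\s+(.*)$")
FIELD = re.compile(r"([\d.]+)%\s+(\w+)")


def per_cpu_states(text):
    """Return {cpu_index: {state_name: percent}} for every 'CPU n:' line."""
    cpus = {}
    for line in text.splitlines():
        match = CPU_LINE.match(line)
        if match:
            fields = FIELD.findall(match.group(2))
            cpus[int(match.group(1))] = {name: float(pct) for pct, name in fields}
    return cpus


def process_wcpu(text, command):
    """Return the WCPU percent of the first process row whose COMMAND matches.

    Assumes WCPU is the second-to-last column, as in the output above.
    """
    for line in text.splitlines():
        cols = line.split()
        if len(cols) >= 2 and cols[-1] == command and cols[-2].endswith("%"):
            return float(cols[-2].rstrip("%"))
    return None


cpus = per_cpu_states(TOP_OUTPUT)
system_total = sum(states["system"] for states in cpus.values())
wcpu = process_wcpu(TOP_OUTPUT, "lwio")

print(f"sum of per-CPU system: {system_total:.1f}%")
print(f"lwio WCPU:             {wcpu:.2f}%")
print(f"ceiling for {len(cpus)} CPUs:    {len(cpus) * 100}%")
```

For this sample the system columns add up to 224.8%, close to the 220.21% WCPU reported for lwio, and well under the 400% ceiling that four CPUs allow.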

1 Rookie • 107 Posts

September 2nd, 2016 08:00

Yes, however, when the IB/NVRAM card showed the first indication of a problem (causing the node to panic and reboot, and EMC to say "You need to replace the card...NOW!"), LSASS WCPU was at one point 430% before the card was replaced.

We do have a performance hit, but I think this has to do with more and more NFS clients being added to the cluster without the cluster being upgraded or expanded to accommodate them.

252 Posts

September 2nd, 2016 08:00

Hi Brian,

The LSASS CPU usage sounds normal under high load as you have described. It would only be an issue if you were seeing a performance hit. Additionally, it would not be related to an IB card failure as IB handles backend communication, not front end client communication. If the IB card were to have failed completely, the node would go offline. Clients would still be connected, but they would have issues accessing the file system unless the IP address moved.

If you are having performance issues or trouble with clients accessing the file system, I would recommend opening a ticket with Isilon Support.

https://support.emc.com/servicecenter/createSR

252 Posts

September 2nd, 2016 09:00

Yes, high demand will cause high CPU usage. A bad IB/NVRAM card can cause a node panic. There is very little chance that they are related though. Sounds like you have two separate issues.

125 Posts

September 2nd, 2016 09:00

I'm running SpecSFS right now on a cluster (HD400 running OneFS 8.0.1) and the weighted CPU percentage is around 140% during the loading phases.  This doesn't represent a problem per se; it's just an indicator of relatively high load, which I know to be the case here.  So what you're seeing *could* be normal.

On the other hand, there could be a problem as well if some aspect of the protocol server isn't working correctly or efficiently.  For example, in some earlier OneFS releases parts of lwio were inefficient when doing NFSv4 identity management.  This inefficiency would show up as higher than expected CPU utilization, which of course is only exacerbated by higher load.

So, the fact that WCPU is over 100% isn't something to worry about in and of itself.  This could just be from high load.  Or, it's load combined with some known (or as yet unknown) inefficiency in lwio.  At the end of the day, if your users are impacted then it's best to get support involved so that someone can do more profiling of the system in order to get to the root of the issue.
