Unsolved
1 Rookie
•
13 Posts
0
823
April 25th, 2022 09:00
H500 cluster performance lacking.
We have two main datacenters: one in North Dakota, where we have an H400 cluster, and an H500 cluster at our corporate location in Kansas. Our H400 performs flawlessly and reliably, while our H500 is struggling and has wild transfer rates on LAN transfers. We get 400-600 MB/sec consistently at our H400 location, while at our H500 it's wildly intermittent. Transfers appear to happen in a sawtooth pattern where the transfer rates are up and down constantly during a transfer.
Here's what we know:
- Switch gear is exactly the same in both locations.
- We see the performance issues even within the data center, where the servers/PCs use the same switch as the Isilon... so minimal hops.
- We have applied the latest patch(es) to the H500 hardware to see if that was the issue.
- Machine-to-machine transfers on the same switch the H500 uses are stellar, so we know the switches are working.
Help! Support has been severely lacking and we have people who are having a hard time getting work done...
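If it helps when talking to support, here is a rough sketch (assuming a Linux client; the interface name eth0 is just a placeholder) that samples the client NIC's byte counters once per second during a copy, so the sawtooth can actually be captured and graphed:

import time

IFACE = "eth0"  # placeholder: replace with the client's actual interface

def read_bytes(iface):
    # /proc/net/dev: after the "iface:" prefix, field 0 is rx bytes, field 8 is tx bytes
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

prev_rx, prev_tx = read_bytes(IFACE)
while True:
    time.sleep(1)
    rx, tx = read_bytes(IFACE)
    print(f"rx {(rx - prev_rx) / 1e6:8.1f} MB/s   tx {(tx - prev_tx) / 1e6:8.1f} MB/s")
    prev_rx, prev_tx = rx, tx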



CendresMetaux
1 Rookie
•
62 Posts
0
May 7th, 2022 00:00
While you're at it, maybe you also want to do a line-by-line config comparison and "bad actor" behavior analysis on your frontend switching gear from the beginning:
- Check spanning tree config (priorities).
- Check for any related static MAC assignments (load balancing, virtual IP sticking, etc.).
- Do you use LACP? If yes, make sure there are no mishaps and that the subtype and config match on both ends.
- Check for port flapping (on all ports! Maybe it's a consumer, not the cluster, that misbehaves and brings noise into the network).
- Check switch CPU/memory usage (a good indicator of loops, flooding, transmit failures).
- Check for rx/tx errors on all ports (a quick client-side sketch is below).
Sawtooth all over the place, in my experience - if the storage is as performant as this, is not a usual suspect like a small NAS, and silly storage misconfigs can be ignored/excluded - tends to originate from the network and from networking components that do as they are (wrongly) told...
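For the rx/tx error check, a minimal client-side sketch (assuming a Linux host; the interface name is a placeholder) that flags any error/drop counter increase while a transfer runs - the equivalent counters on the switch ports would come from the switch CLI or SNMP (ifInErrors/ifOutErrors):

import time
from pathlib import Path

IFACE = "eth0"  # placeholder: the client's interface
COUNTERS = ["rx_errors", "tx_errors", "rx_dropped", "tx_dropped"]

def snapshot():
    base = Path("/sys/class/net") / IFACE / "statistics"
    return {c: int((base / c).read_text()) for c in COUNTERS}

prev = snapshot()
while True:
    time.sleep(5)
    cur = snapshot()
    deltas = {c: cur[c] - prev[c] for c in COUNTERS}
    if any(deltas.values()):
        print("counter increase:", deltas)
    prev = cur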
CendresMetaux
1 Rookie
•
62 Posts
0
May 7th, 2022 00:00
Oh, and perhaps there is a firewall between your consumers and the cluster access networks - or are the IP pools in your consumer networks/subnets? If there is a firewall (or router) in between, check its performance headroom, or, just to make sure it's not the bottleneck, create an export/IP pool on your consumer subnet and test that performance (see the sketch below)... Firewalls tend to take a deep hit if L4+ inspection (IPS/protocol analysis, inline AV scanning, etc.) is in play... Sometimes the maximum L2/L3 throughput is already the limit.
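A rough way to compare the two paths, assuming the test export is mounted somewhere on a client (the mount point and file size below are placeholders): run it once against the export that goes through the firewall and once against one on the consumer subnet. Note the read pass can be flattered by the client page cache, so use a file larger than the client's RAM or drop caches first:

import os, time

MOUNT = "/mnt/isilon_test"  # placeholder: wherever the test export is mounted
SIZE_MB = 2048              # placeholder: ideally larger than client RAM
CHUNK = b"\0" * (1024 * 1024)
path = os.path.join(MOUNT, "throughput_test.bin")

start = time.time()
with open(path, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())
print(f"write: {SIZE_MB / (time.time() - start):.0f} MB/s")

start = time.time()
with open(path, "rb") as f:
    while f.read(1024 * 1024):
        pass
print(f"read:  {SIZE_MB / (time.time() - start):.0f} MB/s")
os.remove(path)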
CendresMetaux
1 Rookie
•
62 Posts
0
May 7th, 2022 00:00
If - at least for the moment - we assume that:
- Frontend switching has the exact same config as the H400 regarding VLANs, SBR, IP pools, SmartConnect assignment rules, etc.
- Nodes are all physically OK and the OneFS software/config is to your spec, identical to the H400 cluster, and according to Dell best practices,
then I would first have a deeper look at the backend switching config. Pull a config dump from your H400 backend switches and compare it with what's on the H500 side (a quick diff sketch is below). Compare both the switches (TCP/IP or InfiniBand? Are both DCs the same?) and the cluster backend config (primary, secondary, and third subnet). Also check the physical components (which node goes into which switch port - no accidental crossings?).
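For the config comparison itself, a quick sketch using Python's difflib; the file names are placeholders for wherever the two config dumps were saved:

import difflib

# compare the saved backend switch configs line by line
with open("h400_backend_switch.cfg") as a, open("h500_backend_switch.cfg") as b:
    diff = difflib.unified_diff(a.readlines(), b.readlines(),
                                fromfile="h400", tofile="h500")
print("".join(diff) or "no differences found")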
If the backend is proper, then get back into frontend and OneFS analysis...
Just as a reference: with our two H500 clusters located in two datacenters, we got around 4.5 GB/s read and 2.3 GB/s write performance when each was only 4 nodes; now, with 8 nodes per cluster, it's almost twice as fast for both figures, and we tend to reach test-equipment limits before the clusters become the bottleneck...
Phil.Lam
3 Apprentice
•
625 Posts
0
May 8th, 2022 21:00
Have you done any metrics testing on the H500 before it went into production? A quick test would be an iperf3 run between a client and the H500. Do you have IIQ (InsightIQ)?
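A minimal sketch of such a test, assuming iperf3 is installed on both ends and "iperf3 -s" is already running on the other side; the target address, stream count, and duration are placeholders. The per-interval lines should make any sawtooth visible:

import json, subprocess

TARGET = "10.0.0.10"  # placeholder: a node IP or SmartConnect name

# run a 30-second, 4-stream test and capture iperf3's JSON report
result = subprocess.run(
    ["iperf3", "-c", TARGET, "-P", "4", "-t", "30", "--json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

# print per-second throughput to expose any sawtooth, then the average
for interval in report["intervals"]:
    s = interval["sum"]
    print(f"{s['start']:5.1f}-{s['end']:5.1f}s  {s['bits_per_second'] / 8e6:8.1f} MB/s")
print(f"average: {report['end']['sum_received']['bits_per_second'] / 8e6:.1f} MB/s")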