PowerScale: NDMP Performance Troubleshooting
Summary: When investigating Network Data Management Protocol (NDMP) performance issues on a PowerScale cluster, review the following key areas for possible causes.
Instructions
Newer OneFS 9.x releases include several NDMP performance enhancements. Verify the cluster's OneFS version and installed RUPs (roll-up patches) to ensure that the latest improvements are applied.
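For example, the release and patch level can be checked from any node's shell (a minimal sketch; patch-listing command availability varies by OneFS release):
# Show the OneFS release the cluster is running
isi version
# List installed patches/RUPs (availability varies by OneFS release)
isi upgrade patches list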
NDMP performance should be evaluated by analyzing three key system resources:
- CPU Utilization
- Disk I/O
- Network infrastructure
CPU Performance Analysis
For each node that is reported to be running slowly, check the isi_hw_status and top outputs.
- Identify Virtual Cores
From isi_hw_status, calculate virtual cores:
Virtual Cores = CPUs × Cores per CPU × 2 (if Hyperthreading is enabled)
Example:
PROC: Single-proc, Dual-HT-core → 1 × 2 × 2 = 4 virtual cores
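The processor line can be pulled directly from the node's hardware status (a minimal sketch, assuming the PROC: label shown above):
# Show the processor summary used for the virtual-core calculation
isi_hw_status | grep PROC
# Sample output: PROC: Single-proc, Dual-HT-core
# Virtual cores = 1 CPU x 2 cores x 2 (Hyperthreading) = 4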
- Check Load Averages
From the top output, review the 1-, 5-, and 15-minute load averages:
load averages: 4.71, 3.48, 3.09
If the load average exceeds the number of virtual cores, CPU load might be a contributing factor to NDMP performance issues. The recommendation is to reduce the number of active processes or redistribute the load to less heavily used nodes.
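Load averages can also be compared across all nodes at once; the following is a minimal sketch using the cluster-wide isi_for_array wrapper and standard FreeBSD commands:
# Collect 1-, 5-, and 15-minute load averages from every node
isi_for_array 'uptime'
# Or query the kernel directly on the local node
sysctl -n vm.loadavg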
Disk Performance Analysis
Steps:
- Review Drive Statistics
For each node that is reported to be running slowly, check the isi statistics drive output and examine the Queued column. A value:
- > 1.0 indicates queuing
- > 1.5 suggests significant performance degradation
Example:
Queued: 2.3 → High I/O wait on the spindle
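Because a brief spike is less meaningful than a sustained queue, sample the statistics a few times; this is a minimal sketch, and the exact columns and flags of isi statistics drive vary by OneFS release:
# Sample per-drive statistics three times, 10 seconds apart, and watch the
# Queued column for sustained values above 1.0-1.5
for i in 1 2 3; do
    date
    isi statistics drive
    sleep 10
done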
- Check Storage Utilization
Ensure that disk usage is below 90%. High utilization can exacerbate performance issues.
Example:
Used: 63.2% ← Within acceptable range
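Capacity can be confirmed with the commands below (a minimal sketch; /ifs is the standard OneFS filesystem path):
# Cluster health and capacity summary
isi status
# Filesystem-level view of /ifs usage
df -h /ifs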
- Recommendations
If queuing is high, reduce I/O load, redistribute backups, or scale resources.
Network Performance Analysis (Three-Way NDMP Only)
Steps:
- Identify NDMP Connections
In the netstat output, locate the NDMP CONTROL connection (port 10000) and identify the corresponding DATA connection (typically listed above it).
Example:
tcp4 0 384563 172.19.220.31.23261 172.19.200.22.55621 ESTABLISHED ← DATA
tcp4 0      0 172.17.2.91.10000   172.19.200.22.55424 ESTABLISHED ← CONTROL
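To narrow the netstat output to NDMP traffic, filter on the control port first and then on the DMA's address; this sketch reuses the example DMA IP (172.19.200.22) shown above:
# Find NDMP control connections (local port 10000)
netstat -an | grep '\.10000 '
# List every connection involving the DMA / backup server
# (replace 172.19.200.22 with the actual DMA IP)
netstat -an | grep '172.19.200.22'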
- Analyze Send-Q
A high and stable Send-Q (for example, a six-digit value) indicates that data is being sent but not acknowledged, suggesting a bottleneck downstream of the cluster (on the network or at the backup server).
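To confirm whether the Send-Q is stable or draining, sample the same connection repeatedly; this is a minimal sketch with the example DMA IP as a placeholder:
# Sample the NDMP connections every 5 seconds and watch the Send-Q column
# (third field in the netstat output, as in the example above);
# replace 172.19.200.22 with the actual DMA IP
while true; do
    date
    netstat -an | grep '172.19.200.22'
    sleep 5
done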
- Check Backup Server
If the Send-Q remains high, confirm that the backup server (DMA host) is not the constraint: check its CPU, memory, and network interface utilization, and the write throughput of its target tape or disk devices.
- Recommendations
If the Data Management Application (DMA) is the bottleneck, the recommendation is to engage the DMA support team for further assistance.