Note: This topic is part of the Isilon Uptime Info Hub.
Written by Rash Vyas, KB article 126793.
PowerScale OneFS Cluster Utilization: Keep below 90%
- A general guideline is to keep cluster utilization below 90%. If a node has 8 TB or 10 TB drives, keep utilization at or below 80% for better performance. Performance may degrade beyond the recommended utilization level.
- When a drive fails at utilization above 90%, the system takes longer to smartfail (reprotect) the drive. Reprotection after a node failure at 90% or above may take a long time and, in some cases, may not complete if there is not enough free capacity.
- When capacity is above 90% and a drive fails, the load increases on the remaining drives and further degrades performance.
- During a drive failure, the Job Engine goes into degraded status to run FlexProtect or FlexProtectLin, the PowerScale OneFS maintenance job that re-protects data on the remaining drives. By default, no other maintenance jobs (for example, SnapshotDelete or Collect) can run in this state, although Support can modify the degraded status in certain cases to allow other jobs to run. As a result, utilization can increase rapidly as snapshots pending deletion accumulate.
- Verify that nodes, node pools, and disk pools are all below 90%. PowerScale OneFS has maintenance jobs to keep them balanced, but in some cases workflow requirements (for example, file pool policies, node additions, or data deletion) can push individual nodes, node pools, or disk pools above 90%.
- You will prevent many problems by keeping PowerScale OneFS system utilization below 90%. Depending on your sales cycle and the internal and external processes required to get nodes on-site and added to a cluster, start the conversation about adding nodes when utilization reaches 80 to 85%.
- For more information, refer to the Best Practices Guide for Maintaining Enough Free Space on PowerScale OneFS clusters.
A node failure in PowerScale OneFS does not automatically start the smartfail (reprotect) process; an administrator has to initiate it.
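The thresholds in this section can be expressed as a small check. The following Python sketch is illustrative only and is not part of OneFS: the function name, the status labels, and the pool names in the usage note are hypothetical, and the capacity figures would have to come from your own monitoring. It assumes the 90% ceiling for standard pools, the 80% ceiling for nodes with 8 TB or 10 TB drives, and the 80% point at which the capacity conversation should start.

```python
def utilization_status(used_tb: float, total_tb: float, large_drives: bool = False) -> str:
    """Classify utilization against the guideline thresholds in this article.

    Returns one of three hypothetical labels:
      "ok"   - below the point where action is suggested
      "plan" - 80% or more but below 90%: start planning added capacity
      "over" - at or beyond the recommended ceiling
    Set large_drives=True for nodes with 8 TB or 10 TB drives (80% ceiling).
    """
    if total_tb <= 0:
        raise ValueError("total_tb must be positive")
    pct = 100.0 * used_tb / total_tb
    if large_drives:
        # Guideline for large drives: keep utilization at or below 80%.
        return "over" if pct > 80.0 else "ok"
    if pct >= 90.0:
        # Guideline: keep utilization below 90%.
        return "over"
    if pct >= 80.0:
        return "plan"
    return "ok"
```

Because individual nodes, node pools, and disk pools can each exceed the ceiling while the cluster as a whole does not, a check like this would be applied to every pool separately, for example `utilization_status(82, 100)` for a hypothetical node pool that has used 82 TB of 100 TB.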
Protection Level: Keep node pools at the recommended protection level
- OneFS uses the Reed-Solomon algorithm for N+M protection. In the N+M data protection model, N represents the number of data-stripe units, and M represents the number of simultaneous node or drive failures (or a combination of node and drive failures) that the cluster can withstand without incurring data loss. N must be larger than M.
- OneFS 7.2.x and later recommends the protection level for each node pool in the Web Administration Interface under File System > Storage Pools > SmartPools. Re-evaluate the protection level every time you add a new node, and use the recommended protection level for each node pool.
- Changing the protection level can change capacity utilization because data needs to be protected at higher/new protection levels. If your utilization level is high, consult a Dell EMC Support expert to consider the impact on utilization.
- Nodes that are the same or equivalent (see the PowerScale OneFS Support and Compatibility Guide) are placed in the same node pool.
- Data is re-protected at the new protection level after the successful completion of a PowerScale OneFS maintenance job.
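To see why a change in protection level changes capacity utilization, it helps to look at the share of each stripe that holds protection data rather than user data. The Python sketch below is a simplified illustration, not a OneFS tool: the function name is hypothetical, and it assumes a full stripe of N data units plus M protection units, so the protection share is M / (N + M). Actual on-disk overhead depends on file sizes and pool layout.

```python
def protection_overhead(n: int, m: int) -> float:
    """Return the fraction of an N+M stripe used by the M protection units.

    n: number of data-stripe units (N)
    m: number of simultaneous failures tolerated (M)
    """
    if m < 1 or n <= m:
        # The N+M model described above requires N to be larger than M.
        raise ValueError("N+M protection requires N > M >= 1")
    return m / (n + m)
```

Under these assumptions, `protection_overhead(8, 2)` gives 0.2 (20% of the stripe), while `protection_overhead(4, 2)` gives one third, which shows how the same M costs more capacity when stripes are narrower.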
Target Code: Always stay on Target Code or Target -1
- Dell EMC releases PowerScale OneFS code as generally available (GA) after it has completed internal testing. Once code satisfies specific criteria, which include production time in the field, deployments across all supported node platforms, and other quality metrics, Dell EMC designates that code as Target Code. To ensure that PowerScale OneFS clusters are running the most stable and reliable version of OneFS, upgrade to the latest available Target Code for the OneFS family that meets your business needs. If you can’t upgrade to Target, stay at least at Target -1 code. There may be exceptions to these guidelines, and PowerScale OneFS experts such as Dell EMC Support or a Technical Account Manager (TAM) can recommend a specific code version that is not Target Code.
- Plan your upgrade and refer to KB article 178133: OneFS Upgrades - PowerScale OneFS Info Hub. Check with your account team or Dell EMC support to see if you can utilize the Dell EMC Remote Proactive Support (RPS) team to do a pre-health check and upgrade.
Patches: Review and install applicable patches
- Refer to the Current PowerScale OneFS Patches guide to find information about a patch for any version of OneFS.
- Look for patches applicable to your version of OneFS and workflow.
- Each patch lists a summary of the patch, what version of OneFS it applies to, and what MR version the bug is fixed in.
- Not all patches will be applicable to your workflow (for example, if you don’t use HDFS, you don’t need HDFS patches).
- Some patches may require a reboot of nodes, some patches may require just the restart of a few services, and some may be online.
- Download the patch file by clicking on the Patch-ID. The zip file contains the patch and a README file. The README provides full details on the impact of installing the patch and on the install procedure.
- You can also view this video to understand the patch and install process.
- Periodically check for new patches that apply to your system and are relevant to your workflow.
Firmware: Check firmware version every three months
- Dell EMC periodically releases node and drive firmware updates.
- Some firmware may have a patch or OneFS requirement. Always go through the Release Notes before upgrading firmware.
- Refer to Current PowerScale OneFS Software Releases to find the latest firmware versions.
Note: On 5th gen nodes (S210, X410, X210, NL410, HD400), there are also BMC firmware and CMC firmware that need to be upgraded. These updates include new features and resolve known issues that might be relevant to you. Node firmware requires a node reboot. BMC firmware requires a node reboot. Drive firmware is mostly online, with some exceptions for certain drive models and/or OneFS versions.
Dell EMC Technical and Security Advisories: Do not ignore
- Make sure you are registered on the Dell EMC Web site to receive EMC Technical Advisories (ETAs) and EMC Security Advisories (ESAs).
- ETAs and ESAs are specific to OneFS versions, so check whether you have a PowerScale OneFS cluster running an affected version of OneFS and whether that cluster is impacted.
- ETAs alert you about potential hardware or software issues that could cause serious negative impacts to a production environment, such as data loss, data unavailability, loss of system functionality, or anything that could result in a significant safety risk. The advisories include specific details about the issue and instructions to help prevent or alleviate the problem. To determine the impact of the ETA, read the severity rating description in the impact section of the ETA and impacted OneFS versions.
- ESAs alert you to potential security vulnerabilities and their remedies for Dell EMC products. The advisories include specific details about the issue and instructions to help prevent or alleviate the problem. Common Vulnerabilities and Exposures (CVEs) identify publicly known security concerns. A Dell EMC ESA can address one or more CVEs.
ESRS/Alerts: Connect your PowerScale OneFS systems to Dell EMC through ESRS and configure events by email
- With the release of OneFS 7.1, PowerScale OneFS products can utilize ESRS for remote connectivity (video).
- ESRS also allows remote support to gather logs and connect to devices securely. You can manage access to devices using ESRS policy manager.
- Without ESRS and email notifications, you could miss out on important events and FCOs since Dell EMC may not have any information about the device in your data center.
- Ask your Dell EMC Support representative for more information on configuring ESRS (at no cost).
Note: The ESRS version must be at 2.24 or higher. PowerScale OneFS events can go out via email, SNMP, or ESRS. Not all events generate a Service Request, so it is important to configure email notification for events.
PowerScale OneFS Maintenance Job: Verify that it is running
OneFS uses the Job Engine to schedule maintenance tasks, known as jobs. Some of these jobs are critical (for example, SnapshotDelete to delete expired snapshots, MultiScan or AutoBalance to keep utilization balanced across all nodes, and FlexProtect to reprotect data from failed devices). Regularly verify that jobs are running by checking the Job Engine status with the isi job status command.
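If you want to capture the Job Engine status on a schedule, a thin wrapper around the command is enough. The Python sketch below is an assumption-laden illustration rather than a supported tool: the function name is hypothetical, it assumes the script runs somewhere the isi command is on the PATH (for example, on a cluster node), and it returns the raw output without parsing it, because the output layout can differ between OneFS versions.

```python
import subprocess


def run_status_command(argv=("isi", "job", "status"), timeout=60):
    """Run a status command and return (exit_code, stdout).

    Returns (None, "") when the executable cannot be found, which is the
    expected result on a machine that is not a OneFS node. A command that
    exceeds the timeout raises subprocess.TimeoutExpired.
    """
    try:
        result = subprocess.run(
            list(argv), capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError:
        return None, ""
    return result.returncode, result.stdout
```

Saving the returned text with a timestamp gives you a simple history to review; how you store or alert on it is left open here.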
Read additional details about the PowerScale OneFS Job Engine in Ask the Expert: The What, Why and How of the PowerScale OneFS Job Engine. Also, refer to the following KB articles:
There are many configuration options available to control the number of jobs, workers, priority, and impact of each job. Change them only when directed by a Dell EMC engineer.
PowerScale OneFS Events: Verify that you and your cluster are receiving event alerts
The PowerScale OneFS cluster creates an event when it detects an issue. Not all events result in Dell EMC Service Requests (for example, quota exceeded or SyncIQ RPO events). It is important to keep an eye on those events and to ensure that you have configured a way (for example, SMTP or SNMP) to receive them.
- If you don't receive events, then check your SMTP and SNMP channels. If those are working, try resetting the CELOG (Clusterwide Event Log) database. See KB article 19983: OneFS: How to reset the CELOG database and clear events in OneFS 7.x
- Send a test event once a week to ensure your system is sending events.
- In OneFS 8.0 and above, you can put the CELOG into maintenance mode to avoid receiving alerts or triggering Dial-Home service requests while tests or planned activities are performed on the PowerScale OneFS cluster. See KB article 22784: [OneFS 8.0+] How to place the CELOG into Maintenance mode
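When events stop arriving, a quick first check of the SMTP channel is to confirm that the mail relay answers at all. The Python sketch below is one hedged way to do that from any host with network access to the relay; the function name is hypothetical, the host and port are whatever your cluster is configured to use, and a successful result shows only that the relay responds, not that the cluster's event notifications are configured correctly.

```python
import smtplib


def smtp_channel_reachable(host: str, port: int = 25, timeout: float = 10.0) -> bool:
    """Return True if an SMTP server at host:port answers a NOOP with code 250.

    Connection failures, timeouts, and SMTP protocol errors are all
    subclasses of OSError and are reported as False.
    """
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as server:
            code, _ = server.noop()
            return code == 250
    except OSError:
        return False
```

If this returns False for the relay the cluster uses, fix the mail path first; if it returns True and events are still missing, continue with the CELOG steps described above.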
Cluster Health Check: Gather logs and check health using InsightIQ to monitor performance
InsightIQ is Dell EMC software for monitoring performance and file system statistics, understanding how the system performs during normal operation, and finding out what types of files are being written to PowerScale OneFS. It is a single pane of glass for performance monitoring of different PowerScale OneFS systems.
Note: InsightIQ requires a license. Talk with your account representative to obtain one.