Data Domain: Best Practices for Data Migration on PowerProtect Data Domain Systems Using MTree Replication
Summary: This article discusses preparing to migrate data using MTree replication (MRepl) from legacy PowerProtect Data Domain (PPDD) systems that lack internal QAT card support, such as the DD9500 and DD9800. It is crucial to consider the current system workload to avoid unexpected side effects that could negatively impact the data migration. This article helps plan migration operations that require new MRepl context configurations with a legacy PPDD system as the source.
Instructions
With the introduction of 16G platforms, migrating specific MTrees from a legacy PPDD system to a newer system is a common requirement.
The migration process creates new MTree replication contexts. Consider the following to ensure minimal disruption:
- Current system workload from backup operations
- Differences in compression capabilities (for example, QAT card support)
- Sudden incorporation of new MRepl context configurations
- Hardware errors impacting the garbage collection (GC) process
To maintain data integrity and meet Service Level Agreements, the system might panic when certain operational thresholds are exceeded.
The panic mechanism triggers self-corrective actions to ensure that the system always operates reliably, but the resulting downtime can disrupt in-progress operations.
This article discusses these considerations and provides guidance on preventing unexpected downtime that might interfere with migration plans.
Current System Workload from Backup Operations:
Focus initially on current system operations. Before the migration, monitor key metrics. These include ongoing workloads, CPU utilization, memory usage, network status, and hardware alerts.
The objective is to keep the system operating within normal parameters.
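These metrics can be baselined from the DD OS command line before any migration context is created. The commands below are a representative, not exhaustive, set:

  system show performance          (recent CPU and I/O performance history)
  system show stats interval 2     (live system statistics refreshed every 2 seconds)
  alerts show current              (outstanding hardware and capacity alerts)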
Differences in compression capabilities:
While preparing for migration using MTree replication (MRepl), consider the disparity in compression capabilities between systems.
Some legacy systems lack an onboard compression card to assist with compression-related operations.
The DD9900, DD9400, or DD6900 systems allow attachment of an external QAT card to accelerate compression operations.
When a QAT card is not present (for example, on the DD9800 or DD9500), the system relies on CPU and memory resources for compression and decompression tasks.
When new replication contexts are configured without QAT support, the data must first be decompressed, which can cause a CPU usage spike during the replication initialization phase.
The source checks the destination to identify the type of compression card available.
When a 16G system (DD9910, DD9410, or DD6410) is the destination, the source must decompress data from the legacy gzfast format and then recompress it in the lz format.
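As a quick pre-migration check, the local compression type configured on each system can be verified from the CLI:

  filesys option show              (lists filesystem options, including the local compression type, such as gzfast or lz)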
Gradually Incorporate New MRepl Context Configurations:
During disaster recovery (DR), when replicating data from one Data Domain to another, replication jobs typically start after data ingestion has completed.
This ensures that the destination site receives all replicated data.
When new replication contexts are defined for migration, the source must handle a significant amount of data during replication initialization.
Because the destination does not yet hold deduplicated data, optimization is not possible, which results in an increased load on the source system.
To mitigate this while the system continues to process backup workloads (I/O), gradually incorporate the replication contexts associated with the migration.
Define a low replication throughput to limit the resources allocated to these migration-related replication contexts.
Once replication begins to build optimizations on the destination and the operational parameters are validated, add more replication (migration) contexts or increase the replication throughput on existing ones.
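As an illustration of this gradual approach, create and initialize a single context first, then validate the impact before adding more. The hostnames (legacy-dd, new-dd) and MTree name (mtree1) below are placeholders:

  replication add source mtree://legacy-dd/data/col1/mtree1 destination mtree://new-dd/data/col1/mtree1
  replication initialize mtree://new-dd/data/col1/mtree1     (run on the source; starts the initial seeding for this context)
  replication status                                         (confirms the context state before adding further contexts)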
The objective is to avoid triggering the system's protection mechanisms, which can lead to system panics that disrupt migrations.
Remember that system performance baselines reflect the workloads already in operation, not new workloads.
Configure throttling gradually during migration scenarios.
The "replication throttle add" command can be used to schedule a specific point in time and allocate a defined bandwidth (in Mbps) for throttling.
Initiate new replication jobs with a limited available bandwidth (lower throttle). Then, assess the impact on system operation.
Once the replication job is in progress, the throttle can be increased to provide additional bandwidth.
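For example, assuming illustrative day, time, and rate values (verify the exact <sched-spec> and <rate> syntax in the DD OS Command Reference for the installed release):

  replication throttle show                   (displays the configured throttle schedule)
  replication throttle add mon 0800 10Mbps    (schedules a limited throttle starting Monday at 08:00)
  replication throttle set current 50Mbps     (temporarily raises the limit until the next scheduled throttle event)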
It is also recommended to monitor system analytics, including CPU, memory, and network consumption, available in DD System Manager (DDSM).
Hardware errors impacting the garbage collection (GC) process:
Another factor that can degrade backup or replication performance is hardware failure, particularly during default garbage collection operations. Under normal operational conditions, the garbage collection mechanism on PPDD systems completes space recycling activities without impacting ingest, restore, or replication operations. In certain situations, the system offers options to define garbage collection throttling, giving system administrators additional control over when the system's cleaning processes run.
The default throttle configuration for garbage collection does not impact backups and restores. Most instances where an impact is observed are linked to hardware failures. For example, when drives require replacement, the system's ongoing I/O demands can slow backups and restores, consequently affecting overall GC operations.
The Data Domain Operating System provides comprehensive alert mechanisms for such hardware issues, proactively raising alerts when these conditions are detected. This helps backup operators promptly resolve hardware-related problems.
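When cleaning must be tuned, the schedule and throttle can be reviewed and adjusted from the CLI (command forms can vary slightly by DD OS release; the throttle value of 30 percent below is illustrative):

  filesys clean status             (shows whether cleaning is running and its progress)
  filesys clean show schedule      (displays the configured cleaning schedule)
  filesys clean set throttle 30    (limits cleaning to about 30 percent of available resources)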
Another important factor to consider is that replication activities are as important as backup and restore operations. By design, each platform provides a fixed number of streams for each job and can process concurrent operations within the defined limits to meet Service Level Agreements (SLAs).
Conclusion:
Successful data migration using MTree replication requires careful consideration of the following:
- Monitoring the current system workload from backup operations
- Understanding that legacy platforms, such as the DD9800 or DD9500, use a different compression algorithm (gzfast)
- Gradually incorporating new MRepl context configurations when they are created on a system under operation
- Closely monitoring the impact of the new workloads on the system
- Monitoring potential hardware errors, which can impact garbage collection (GC) operations
Following these best practices minimizes disruptions and maintains system stability.
Implementing these recommendations helps avoid unexpected downtime and facilitates a successful data migration.