Avamar: Packet Drops Due to MAC Flapping on Network Switch
Summary: This article describes a packet loss issue observed when Avamar Data Store nodes configured with Linux active-backup network bonding are connected to Cisco Catalyst switches. The behavior may also occur with other switch models. The issue is caused by the switch's MAC address learning behavior, most likely because the control-packet learning feature does not correctly handle the MAC address assignment used by Linux bonding, which results in intermittent traffic loss. ...
Symptoms
Avamar Data Store nodes configured with Linux active-backup bonds (mode=1) experience intermittent packet loss when connected to Cisco Catalyst switches due to their MAC learning behavior. The issue manifests as:
- Burst packet drops (25-60%) during ping tests
- Packet loss during network failover or bonding events (for example, when the secondary bond member is tested as primary)
- Individual interfaces show 0% packet loss when tested outside the bond
- Switch MAC address tables show both bond member MAC addresses appearing on the same port
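To quantify the burst drops listed above, the loss figure can be pulled out of the ping summary line. The following is a minimal sketch, assuming the usual Linux ping summary format ("N% packet loss"); the function name ping_loss_pct and the TARGET placeholder are illustrative and not part of any Avamar tooling.

```shell
#!/bin/sh
# Sketch: extract the packet-loss percentage from Linux ping output.
# Reads ping output on stdin and prints the number without the % sign.
# Assumes the summary line contains "<number>% packet loss".
ping_loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Hypothetical usage (TARGET is a placeholder for a peer node IP):
#   ping -c 100 "$TARGET" | ping_loss_pct
```

Running this several times through the bond and then against each member interface on its own makes the contrast described in the symptoms easier to see.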
Cause
Cisco Catalyst switches include a security feature called "control-packet learning" that affects how the switch learns and handles MAC addresses. When Linux active-backup bonds use a fixed MAC address (typically the primary eth MAC), the Catalyst switch may:
- See traffic from the active physical port but with a different MAC address
- Learn both bond member MACs on different ports
- Experience MAC flapping as the bond fails over or as the switch updates its MAC table
- Drop packets during MAC learning transitions
This behavior has been observed on Cisco Catalyst switches. Cisco Nexus switches handle Linux bonding MAC assignments differently and do not exhibit this issue.
The Linux bond's default behavior is to use a fixed MAC address (the primary eth's MAC) regardless of which eth is transmitting. This conflicts with the Catalyst switch's MAC learning expectations.
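To see which MAC policy a bond is currently using, the bonding driver's sysfs attributes can be read. This is a hedged sketch: it assumes the driver exposes mode and fail_over_mac under /sys/class/net/<bond>/bonding/, each as a name followed by a numeric value. The function name bond_mac_policy and its second argument (an alternate sysfs root, useful for dry runs) are hypothetical.

```shell
#!/bin/sh
# Sketch: print the bonding mode and fail_over_mac policy of a bond.
# $1 = bond name (default bond0)
# $2 = sysfs root (hypothetical parameter, default /sys/class/net)
bond_mac_policy() {
    bond="${1:-bond0}"
    root="${2:-/sys/class/net}"
    dir="$root/$bond/bonding"

    # Bail out if the bond or its bonding attributes are not present.
    if [ ! -r "$dir/mode" ] || [ ! -r "$dir/fail_over_mac" ]; then
        echo "no bonding attributes found for $bond" >&2
        return 1
    fi

    # Each file is assumed to hold "<name> <number>"; keep only the name.
    read -r mode rest < "$dir/mode"
    read -r fom rest < "$dir/fail_over_mac"
    echo "mode=$mode fail_over_mac=$fom"
}

# Hypothetical usage:
#   bond_mac_policy bond0
```

A bond in the default state described above would be expected to report fail_over_mac=none.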
Resolution
Setting the bonding option fail_over_mac=active resolves this issue. With this option, the bond uses the MAC address of the currently active eth rather than a fixed MAC, so the switch sees a consistent MAC address on each port.
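For nodes with several bond files, the edit can be made repeatable. The sketch below appends the option only when it is missing, assuming GNU sed (for -i) and a BONDING_MODULE_OPTS value enclosed in double quotes. The function name add_fail_over_mac is hypothetical, and the file should still be backed up first as described in the steps that follow.

```shell
#!/bin/sh
# Sketch: idempotently add fail_over_mac=active to BONDING_MODULE_OPTS.
# $1 = path to an ifcfg-bondN file. Assumes GNU sed and a double-quoted value.
add_fail_over_mac() {
    cfg="$1"

    # Leave the file alone if a fail_over_mac setting is already present.
    if grep -q '^BONDING_MODULE_OPTS=.*fail_over_mac=' "$cfg"; then
        return 0
    fi

    # Insert the option just before the closing double quote.
    sed -i '/^BONDING_MODULE_OPTS=/ s/"[[:space:]]*$/ fail_over_mac=active"/' "$cfg"

    # Return non-zero if the option did not land (for example, unexpected quoting).
    grep -q '^BONDING_MODULE_OPTS=.*fail_over_mac=active' "$cfg"
}

# Hypothetical usage:
#   add_fail_over_mac /etc/sysconfig/network/ifcfg-bond0
```

Checking for an existing setting first keeps the function safe to rerun and avoids overriding a value that was set deliberately.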
- Connect to the affected node and log in as root
- Back up the existing bond configuration:
cp -p /etc/sysconfig/network/ifcfg-bond0 /etc/sysconfig/network/x-ifcfg-bond.`date -I`
- Edit the file and add the fail_over_mac=active option to the bond configuration:
vi /etc/sysconfig/network/ifcfg-bond0
Change from:
BONDING_MODULE_OPTS="primary=eth1 mode=1 miimon=100"
To:
BONDING_MODULE_OPTS="primary=eth1 mode=1 miimon=100 fail_over_mac=active"
- Repeat for all bond files as required (bond0, bond3, etc.)
vi /etc/sysconfig/network/ifcfg-bond3
BONDING_MODULE_OPTS="primary=eth4 mode=1 miimon=100 fail_over_mac=active"
- Create the following script to restart network:
cd /tmp
vi restart_script.sh
- Copy and paste the following script and save the file:
#!/bin/sh
service network stop
modprobe -r bonding
service network start
- Run the script
chmod +x restart_script.sh
nohup ./restart_script.sh
- Verify bond status:
cat /proc/net/bonding/bond0
- Confirm "Currently Active Slave" is correct
- Confirm "MII Status: up"
- Run ping command:
ping -c 100 <target_node_ip>
- It should show 0% packet loss
- Check the switch MAC table (requires network admin):
show mac address-table interface <port>
- It should show only ONE MAC per port
- Repeat the steps above on each affected node
- Return the grid to its normal production state (resume backups, maintenance, etc.)
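The bond status checks in the verification steps above can be condensed into one line per bond. This is a sketch only: it assumes the usual /proc/net/bonding layout, where the bond-level fields appear before the first "Slave Interface:" entry, and the function name bond_summary is hypothetical. On driver versions that report it, the Bonding Mode line is expected to carry a "(fail_over_mac active)" suffix once the option is in effect; treat that as an assumption to confirm on the node.

```shell
#!/bin/sh
# Sketch: summarise bond-level status from a /proc/net/bonding/<bond> file.
# $1 = status file (default /proc/net/bonding/bond0)
# Only the section before the first "Slave Interface:" line is read, so the
# per-member "MII Status" lines do not overwrite the bond-level value.
bond_summary() {
    file="${1:-/proc/net/bonding/bond0}"
    awk -F': ' '
        /^Slave Interface:/        { exit }
        /^Bonding Mode:/           { mode = $2 }
        /^Currently Active Slave:/ { active = $2 }
        /^MII Status:/             { mii = $2 }
        END { printf "active=%s mii=%s mode=%s\n", active, mii, mode }
    ' "$file"
}

# Hypothetical usage, looping over every bond on the node:
#   for f in /proc/net/bonding/*; do echo "$f: $(bond_summary "$f")"; done
```

A healthy bond should report the expected active interface and mii=up.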