Example Log Entry:
May 20 19:12:20: %EX8PB:2 %MACAGT-2-HASH_COLLISION_LOG: Mac:00:02:e8:d6:58:20/Vlan:203 could not be added to L2 CAM on portpipe 2 linecard 2 due to hash collision. Total number of hash collisions: 30211
May 20 19:12:20: %EX8PB:2 %MACAGT-2-HASH_COLLISION_LOG: Mac:00:02:e8:d6:58:20/Vlan:203 could not be added to L2 CAM on portpipe 3 linecard 2 due to hash collision. Total number of hash collisions: 31979
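When many of these messages appear, it can help to pull the fields out programmatically. The following is a minimal sketch in Python: the pattern is derived only from the two sample lines above, so other releases or message variants may not match, and the function name `parse_collision` is made up for illustration.

```python
import re

# Pattern built from the sample log lines above (an assumption, not an
# official message format specification).
COLLISION_RE = re.compile(
    r"%MACAGT-2-HASH_COLLISION_LOG: "
    r"Mac:(?P<mac>[0-9a-fA-F:]{17})/Vlan:(?P<vlan>\d+) "
    r"could not be added to L2 CAM on portpipe (?P<portpipe>\d+) "
    r"linecard (?P<linecard>\d+) due to hash collision\. "
    r"Total number of hash collisions: (?P<total>\d+)"
)


def parse_collision(line):
    """Return the fields of a hash collision message, or None if no match."""
    match = COLLISION_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "mac": fields["mac"],
        "vlan": int(fields["vlan"]),
        "portpipe": int(fields["portpipe"]),
        "linecard": int(fields["linecard"]),
        "total": int(fields["total"]),
    }
```

Grouping the parsed records by linecard and portpipe shows which port pipes are reporting the collisions.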
How does it work:
In the switch CAM table, a specific number of entries is allocated to the "Host table". The Host table reserves a portion for ARP entries on /32 networks and a specific amount for all other entries.
For example, if there are 1024 index values and each points to an array of 8 memory locations, then each index value can hold eight entries. All 8 locations in an array can be ARP entries, but across all locations the total number of ARP entries cannot exceed the portion dedicated to ARP. These values vary between switch models.
When an ARP entry for an IP address is added to the switch’s CAM, the switching chip calculates an index value (0-1023) from the IP address, and the ARP entry is saved to the location that this hash algorithm points to.
In certain instances the hashing algorithm selects an index whose memory locations are all already in use, and a hash collision is encountered.
When an IP address encounters a hash collision, its ARP entry is not added to the CAM. Instead, the CPU has to load it into its software table. When traffic to that IP needs to be forwarded, the switch cannot do it in hardware, so the traffic is sent to the CPU and soft forwarded. This puts additional load on the CPU and tends to introduce latency on the affected path. In certain instances the amount of soft forwarding can exceed the CPU’s ability to process it and lead to packet loss.
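The behaviour described above can be sketched as a small Python model. It uses the example figures from this section (1024 index values, 8 locations each). The hash function is a stand-in (CRC32 of the address), because the hash the switching chip actually uses is not described here, and the class and method names are hypothetical. The separate cap on the ARP portion of the table is left out for brevity.

```python
import ipaddress
import zlib

NUM_BUCKETS = 1024   # index values, as in the example above
BUCKET_DEPTH = 8     # memory locations per index value


def bucket_index(ip):
    """Stand-in hash (assumption): CRC32 of the address, reduced to 0-1023."""
    return zlib.crc32(ipaddress.ip_address(ip).packed) % NUM_BUCKETS


class HostTable:
    """Toy model of a bucketed hardware table with a software fallback."""

    def __init__(self, index_fn=bucket_index):
        self.index_fn = index_fn
        self.buckets = [[] for _ in range(NUM_BUCKETS)]
        self.software = {}    # entries the CPU has to soft forward
        self.collisions = 0

    def add_arp(self, ip, mac):
        bucket = self.buckets[self.index_fn(ip)]
        if len(bucket) < BUCKET_DEPTH:
            bucket.append((ip, mac))      # stored in hardware
            return "hardware"
        # All locations for this index are in use: hash collision.
        self.collisions += 1
        self.software[ip] = mac
        return "software"
```

Passing an `index_fn` that always returns the same index forces the collision case: the first eight entries land in hardware and the ninth falls back to the software table, even though the rest of the table is empty.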
Workarounds for hash failures:
Upgrade to software that supports DUAL HASHING. Specific platforms post release 9.3 can perform dual hashing, and support is available for both L2 and L3 tables. The feature is enabled by default on all of those platforms running 9.3. Whenever a hash collision happens, the switch tries to re-hash and re-order the tables to accommodate the new entry.
Add a routing layer. For hash failures on a core switch, the best way to overcome this limitation is to use a Top-of-Rack (TOR) design and enable routing between the TORs and the core switches. Adding this routing layer between the individual hosts and the core relieves the core of having to learn every individual host ARP entry, which reduces the ARP table size on the core.
Reduce the ARP timeout. The default is 4 hours. Reducing the length of time ARP entries are retained allows new ARP entries to be introduced more frequently. It also forces all entries to cycle through faster and increases ARP traffic on the attached networks.
Distribute IP addresses in the connected L3 network. A mapping of all possible IP addresses in the important subnets to their corresponding hash values can be created, but it is extremely cumbersome to produce. IP addresses can then be redistributed to avoid hash failures. This is the least effective short-term fix available.
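To illustrate what the address mapping in the last workaround involves, here is a rough Python sketch. It assumes the hash function is known. The real hash is platform specific and not given here, so CRC32 is used purely as a placeholder, and the bucket sizes are the example values from earlier in this article. The function names are hypothetical.

```python
import ipaddress
import zlib
from collections import Counter

NUM_BUCKETS = 1024   # example value from this article
BUCKET_DEPTH = 8     # example value from this article


def bucket_index(ip):
    """Placeholder hash (assumption); substitute the platform's real one."""
    return zlib.crc32(ip.packed) % NUM_BUCKETS


def bucket_load(in_use):
    """Count how many in-use addresses map to each index value."""
    return Counter(bucket_index(ipaddress.ip_address(ip)) for ip in in_use)


def free_addresses(subnet, in_use):
    """Yield unused host addresses whose index still has a free location."""
    load = bucket_load(in_use)
    used = {ipaddress.ip_address(ip) for ip in in_use}
    for host in ipaddress.ip_network(subnet).hosts():
        if host not in used and load[bucket_index(host)] < BUCKET_DEPTH:
            yield host
```

Even with the hash in hand, this has to be repeated for every subnet and kept in step with address assignments, which is why the approach is described as cumbersome.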