Azure Local: Network Validation Precheck Fails in Fully Converged Topology when Storage Network Isolation is Configured on Top of Rack Switches
Summary: Deploying an Azure Local instance with the fully converged network topology fails the Network Validation prechecks. This occurs if storage network isolation is configured on the top-of-rack (ToR) switches according to best practices. ...
Symptoms
The Network Validation prechecks fail during deployment of an Azure Local instance with a fully converged network topology.
Cause
The failing precheck attempts to connect from the IP address assigned to storage NIC port 1 on each member to the IP addresses assigned to storage NIC port 2 on the other members of the instance. The attempt fails when VLAN segmentation on the ToR switches separates the two storage networks, because traffic originating on storage NIC port 1 of one member cannot cross over to storage NIC port 2 of another member. This segmentation strategy follows the Azure Local network design best practices documented by Microsoft.
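The failure can be illustrated with a toy reachability model. This is a minimal sketch under stated assumptions, not the actual precheck logic: the node names, switch names, VLAN IDs 711 and 712, and the rule that a frame is delivered only if both the source and destination ToR trunks allow its VLAN are all hypothetical simplifications. IP routing and inter-switch links are ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePort:
    node: str  # cluster member name (hypothetical)
    port: int  # storage NIC port number: 1 or 2
    vlan: int  # VLAN tag carried by this port's storage traffic (hypothetical ID)
    tor: str   # ToR switch the port is cabled to (hypothetical name)


def can_reach(src, dst, trunk_vlans):
    # Toy rule: the frame keeps src.vlan end to end, so both the trunk on
    # the source's switch and the trunk on the destination's switch must
    # allow that VLAN.
    return src.vlan in trunk_vlans[src.tor] and src.vlan in trunk_vlans[dst.tor]


node1_p1 = StoragePort("node1", 1, 711, "tor1")
node2_p1 = StoragePort("node2", 1, 711, "tor1")
node2_p2 = StoragePort("node2", 2, 712, "tor2")

# Isolation as described above: each ToR trunk carries one storage VLAN.
isolated = {"tor1": {711}, "tor2": {712}}

print(can_reach(node1_p1, node2_p1, isolated))  # True: port 1 to port 1
print(can_reach(node1_p1, node2_p2, isolated))  # False: port 1 to port 2
```

In this model the port 1 to port 2 attempt fails only because the second trunk does not carry the first storage VLAN, which matches the behavior described in this section.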
Resolution
Add both storage VLANs to the storage VLAN trunk on each ToR switch that connects the physical NIC ports (pNICs) of the Azure Local members. Normally, the switch port VLAN trunk on each ToR switch carries only one storage VLAN. With both storage VLANs on the trunk of each switch, the Network Validation precheck completes successfully.
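Before rerunning the deployment, it may help to compare the allowed VLANs on each trunk against the two storage VLANs. The following is a minimal sketch that works on data you have already collected from the switches. The switch names, the VLAN IDs (7 for management, 711 and 712 for storage), and the helper `missing_storage_vlans` are hypothetical and not part of any Azure Local or switch vendor tooling.

```python
def missing_storage_vlans(trunk_vlans, storage_vlans):
    """Return {switch: sorted list of storage VLANs absent from its trunk}."""
    required = set(storage_vlans)
    gaps = {}
    for switch, allowed in trunk_vlans.items():
        missing = required - set(allowed)
        if missing:
            gaps[switch] = sorted(missing)
    return gaps


STORAGE_VLANS = (711, 712)  # hypothetical storage VLAN IDs

# Allowed VLANs per ToR trunk, before and after the change (hypothetical data).
before = {"tor1": {7, 711}, "tor2": {7, 712}}
after = {"tor1": {7, 711, 712}, "tor2": {7, 711, 712}}

print(missing_storage_vlans(before, STORAGE_VLANS))  # {'tor1': [712], 'tor2': [711]}
print(missing_storage_vlans(after, STORAGE_VLANS))   # {}
```

An empty result means every trunk in the input carries both storage VLANs, which is the state this resolution calls for.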
Additional Information
Azure Local storage network traffic uses high-performance RDMA protocols that are sensitive to network latency. Minimizing switch hops reduces latency and therefore benefits RDMA traffic. Azure Local network design best practice uses one IP subnet and VLAN for traffic originating from storage NIC port 1 on each member and a different IP subnet and VLAN for traffic originating from storage NIC port 2. This segmentation prevents storage traffic from unnecessarily flowing between the two ToR switches and incurring extra switch hops.
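The one-subnet-and-VLAN-per-port plan can be sanity-checked with Python's standard `ipaddress` module. This is a sketch under assumptions: the subnets 10.71.1.0/24 and 10.71.2.0/24, the VLAN IDs, and the helper names are hypothetical placeholders for your own addressing plan.

```python
import ipaddress

# Hypothetical plan: one VLAN and subnet per storage NIC port number.
STORAGE_NETWORKS = {
    1: {"vlan": 711, "subnet": ipaddress.ip_network("10.71.1.0/24")},
    2: {"vlan": 712, "subnet": ipaddress.ip_network("10.71.2.0/24")},
}


def plan_is_segmented(networks):
    """True if the two storage networks use distinct VLANs and disjoint subnets."""
    a, b = networks[1], networks[2]
    return a["vlan"] != b["vlan"] and not a["subnet"].overlaps(b["subnet"])


def port_for_address(address, networks):
    """Return the storage port number whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for port, net in networks.items():
        if ip in net["subnet"]:
            return port
    return None


print(plan_is_segmented(STORAGE_NETWORKS))               # True
print(port_for_address("10.71.2.12", STORAGE_NETWORKS))  # 2
```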
For details, see "Host network requirements for Azure Local" on Microsoft Learn.
The fully converged network topology uses two pNICs for all Azure Local network traffic. A Switch Embedded Teaming (SET) virtual switch (vSwitch) is bound to these two pNICs. Three virtual NICs (vNICs) are presented to the host partition and connected to the vSwitch. Two vNICs carry storage network traffic, and the third carries management network traffic.
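The layout described above can be written down as a small data structure, which is sometimes useful when documenting or reviewing a design. This is only an illustrative sketch: the adapter names (`pNIC1`, `vMgmt`, `vSMB1`, and so on) and the `is_fully_converged` helper are hypothetical and do not reflect names required by Azure Local.

```python
# Hypothetical description of the fully converged host layout.
FULLY_CONVERGED = {
    "vswitch": {"teaming": "SET", "pnics": ["pNIC1", "pNIC2"]},
    "vnics": [
        {"name": "vMgmt", "role": "management"},
        {"name": "vSMB1", "role": "storage"},
        {"name": "vSMB2", "role": "storage"},
    ],
}


def is_fully_converged(topology):
    """Check the shape described in the text: 2 pNICs, 2 storage vNICs, 1 management vNIC."""
    roles = [vnic["role"] for vnic in topology["vnics"]]
    return (
        len(topology["vswitch"]["pnics"]) == 2
        and len(roles) == 3
        and roles.count("storage") == 2
        and roles.count("management") == 1
    )


print(is_fully_converged(FULLY_CONVERGED))  # True
```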
A description of the fully converged network topology can be found at the following link: