I have a 4000 series in production with two physical NICs and a group IP.
I have a 6000 series in a DR site with four physical NICs and a group IP.
4000 series NIC1: 10.0.12.2
4000 series NIC2: 10.0.12.4
4000 series Group IP: 10.0.12.3
6000 series NIC1: 10.10.12.5
6000 series NIC2: 10.10.12.6
6000 series NIC3: 10.10.12.8
6000 series NIC4: 10.10.12.9
6000 series Group IP: 10.10.12.7
All the physical NICs on both SANs can ping each other, back and forth without issue.
The 4000 series group IP can ping everything, back and forth without issue.
However, the 6000 series group IP cannot ping the physical NICs on the 4000. It can, however, ping the group IP of the 4000.
Other devices, such as workstations and servers, can ping everything on both sides, physical NICs and both group IPs. It's only the SANs themselves that are having the issue.
When I do a traceroute originating from the 6000 series group IP, destined to the 4000 series physical NICs, I can see it jump from DR switch to DR router, to Primary router, to Primary switch, then it times out.
When I do a traceroute the other way, originating from the 4000 series physical NICs, destined to the 6000 group IP, it times out immediately without making a single hop.
Any ideas or suggestions?
I'm not sure if you tested ping/traceroute from each specific eth interface on the source array to each specific interface on the target array. It could be that you have a missing ACL in the firewall or router. You can also try failing over the PS6000 to see if the problem exists on the other controller as well.
To ping out of each of the specific ETH port interfaces use the commands listed below:
ping "-I <source_ETH_IP> <dest_IP>"
(That is -I, a capital letter "eye", and be sure to include the quotes after "ping" through the end of the command.)
(The dest IP covers the other group, including its Group IP and ALL member eth interfaces; test every IP combination: Group IP; Member1 eth0, eth1, eth2, etc.; Member2 eth0, eth1, eth2, etc.)
Then do the same from the DR Site to the Production Site.
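To avoid missing a combination, the full test matrix can be generated ahead of time. This is a minimal sketch, assuming the addresses listed earlier in this thread (PS4000 NICs/Group IP as sources, PS6000 NICs/Group IP as targets); it only prints the commands, which you then paste into the array CLI one at a time:

```shell
#!/bin/sh
# Source interfaces on the PS4000 (production) and targets on the PS6000 (DR),
# taken from the addresses listed earlier in this thread.
SOURCES="10.0.12.2 10.0.12.4 10.0.12.3"                           # NIC1, NIC2, Group IP
TARGETS="10.10.12.5 10.10.12.6 10.10.12.8 10.10.12.9 10.10.12.7"  # NIC1-4, Group IP

# Print the full test matrix; paste each line into the array CLI.
for src in $SOURCES; do
  for dst in $TARGETS; do
    echo "ping \"-I $src $dst\""
  done
done
```

Swap SOURCES and TARGETS to generate the same matrix for the DR-to-production direction.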
To traceroute out each of the specific ETH port interface add a switch to choose the interface IP as shown below:
GrpName>support traceroute "-s [ETH port source IP] [destinationIP]"
Yes, I had used the commands you listed to fully ping and trace every different combination of interfaces between the two devices. As described above, the only paths that failed were from the 4000's physical NICs to the 6000's group IP.
Currently my best guess is corrupt or incomplete ARP tables on the 4000's physical NICs. Your thoughts?
Is there a way to re-arp on those NICs without causing downtime? Or is the only course of action to reset those NICs? Or reboot the entire SAN? Could I disable one NIC at a time, degrading performance but not causing a complete outage?
Alternatively, is there a way to manually failover from Controller1 to Controller2 on the SAN?
Thanks for your time!
The ARP cache is not exposed through a user-facing command, so you would need support to clear it.
You can also clear it by restarting the array: type "restart" at the CLI (via SSH/Telnet), or in the GUI select the member first, then on the Service tab, at the bottom, select Restart. This is also how you manually fail over the controller.
The restart/controller failover should only last 15-45 seconds, so as long as your iSCSI disk timeouts are configured properly per the support document "iSCSI Initiator and Operating System Considerations" (located on the firmware download page), you should be able to do this during a low-I/O period without any interruptions. If you are doing replication, you should pause all outbound and inbound replication for this group prior to the failover.
I'm hoping I can revive this question and get some more assistance.
I scheduled downtime last night and upgraded the firmware on both SANs to the latest, 5.2.6. Doing this included a restart of both SANs.
I can now successfully use "ping -I" from every interface to every other interface, in both directions.
Now the reason I'm doing this is to set up replication between both SANs.
I set both up as partners. This seems to work fine. If I restart one SAN, the other SAN notices the partner is down and sends me an email notification. So they appear to be able to see each other.
When I try to create replication on a volume from the PS4000 to the 6000, I get the message "Request to replication partner timed out".
When I try it the other way, replicating a volume on the 6000 to the 4000, the error I get is "IPC timeout".
Any ideas here? I have also run port scans from both ends using nmap to verify that iSCSI port 3260 is open in both directions, and I get State: Open for every single interface.
Any help you can provide would be greatly appreciated!
Perhaps I missed it, but can you show what the physical topology looks like as it traverses one site to another? Is there any NAT going on at all? Testing via ping will help determine continuity, but obviously is not an absolute for real traffic under the ports needed for replication.
Also, are jumbo frames enabled on each site within the SAN switchgear? And as traffic traverses the link, does it get knocked down to standard frames? One can modify this, but I believe it is something that should only be done by EqualLogic Technical Support, so you will have to defer to them for further assistance; that way they can determine whether it is really the issue. I'd look into making sure routes are configured properly (both ways). I'd also suggest doing your test replication straight from the Group Manager GUI; replicating guest iSCSI and such can introduce other issues.
You might find my 5 part series on Replication using an EqualLogic helpful. Part one is: vmpete.com/.../replication-with-an-equallogic-san-part-1
Hello, Sketchy00, thank you for your reply.
The 4000 is attached directly to a Cisco 3750 switch. That switch uses a Cisco 2821 router to connect via VPN to a colo site.
At the colo site, the 6000 is also connected to a Cisco 3750, and a Cisco 2921 router to connect to the VPN.
Both routers at both sites are configured as 'hubs' with multiple tunnels connecting our remote sites across the province.
Jumbo frames are enabled on the switches at both locations. The Group Manager GUI for both SANs tells me MTU size 9000 bytes on all interfaces (except the management ports).
As I've said, ping from all interfaces in either direction is working fine. I also just verified traceroute from all interfaces in each direction, and the paths are correct.
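One thing worth noting: a default ping carries a tiny payload, so it succeeding says nothing about whether full-size jumbo frames survive the VPN path. A 9000-byte MTU leaves room for an 8972-byte ICMP payload (9000 minus a 20-byte IP header and an 8-byte ICMP header). This sketch, assuming a Linux host with reachability to the SAN networks and the 6000 group IP (10.10.12.7) as a hypothetical target, computes that payload and prints a do-not-fragment probe command to run:

```shell
#!/bin/sh
# A 9000-byte MTU leaves room for an 8972-byte ICMP payload:
# 9000 - 20 (IP header) - 8 (ICMP header) = 8972.
MTU=9000
PAYLOAD=$((MTU - 20 - 8))
echo "Max ICMP payload for MTU $MTU: $PAYLOAD bytes"

# On a Linux host, send a full-size, non-fragmentable probe (DF bit set).
# A "message too long" error means something in the path (e.g. the VPN
# tunnel) is knocking jumbo frames down to a smaller MTU.
# Hypothetical target: the 6000 group IP.
echo "ping -M do -s $PAYLOAD 10.10.12.7"
```

If the full-size probe fails while a plain ping succeeds, that would point at the MTU mismatch Sketchy00 described.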
Thanks again for any help you can provide. I will begin reading your replication documentation to see if it gives me any ideas.