Equallogic SANs - connectivity issue

Question

I have a 4000 series in production with two physical NICs and a group IP.
I have a 6000 series in a DR site with four physical NICs and a group IP.

4000 series NIC1: 10.0.12.2
4000 series NIC2: 10.0.12.4
4000 series Group IP: 10.0.12.3

6000 series NIC1: 10.10.12.5
6000 series NIC2: 10.10.12.6
6000 series NIC3: 10.10.12.8
6000 series NIC4: 10.10.12.9
6000 series Group IP: 10.10.12.7

All the physical NICs on both SANs can ping each other, back and forth without issue.
The 4000 series group IP can ping everything, back and forth without issue.
However, the 6000 series group IP cannot ping the physical NICs on the 4000.  It can, however, ping the group IP of the 4000.

Other devices, such as workstations and servers, can ping everything on both sides, physical NICs and both group IPs.  It's only the SANs themselves that are having the issue.

When I do a traceroute originating from the 6000 series group IP, destined to the 4000 series physical NICs, I can see it jump from DR switch to DR router, to Primary router, to Primary switch, then it times out.

When I do a traceroute the other way, originating from the 4000 series physical NICs, destined to the 6000 group IP, it times out immediately without making a single hop. 

Any ideas or suggestions?

sketchy00 · Answer

Check the interconnects on your iSCSI switches vmpete.com/.../diagnosing-a-failed-iscsi-switch-interconnect-in-a-vsphere-environment

Joe S586 · Answer

I'm not sure whether you tested ping/traceroute from each specific eth interface on the source array to each specific interface on the target array. It could be that you have a missing ACL in the firewall or router. You can also try failing over the PS6000 to see whether the problem occurs on the other controller as well.

To ping out of each of the specific ETH port interfaces use the commands listed below:

GrpName>ping "-I [ETH port source IP] [destinationIP]"

(That is -I, as in capital letter "eye", and be sure to include the quotes after ping and at the end of the command.)

The destination IP is on the other group: its Group IP and ALL member eth interfaces. Test all IP combinations (Group IP; Member1 eth0, eth1, eth2, etc.; Member2 eth0, eth1, eth2, etc.).

Then do the same from the DR Site to the Production Site.
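
The full test matrix described above can be enumerated with a short script. This is only a sketch: the addresses are the ones given in the original question, the helper name `ping_commands` is hypothetical, and the order of the arguments inside the quotes (source IP, then destination IP) is an assumption that mirrors the traceroute form in this post. The script only prints commands to paste into each group's CLI; it does not contact the arrays.

```python
from itertools import product

# Addresses taken from the original question.
PROD = {
    "group": "10.0.12.3",
    "eth": ["10.0.12.2", "10.0.12.4"],
}
DR = {
    "group": "10.10.12.7",
    "eth": ["10.10.12.5", "10.10.12.6", "10.10.12.8", "10.10.12.9"],
}


def ping_commands(source, target):
    """Return one ping command per (source eth port, target address) pair.

    Assumption: the CLI takes the source IP and then the destination IP
    inside the quotes, mirroring the traceroute "-s" form.
    """
    destinations = [target["group"]] + target["eth"]
    return [
        f'ping "-I {src} {dst}"'
        for src, dst in product(source["eth"], destinations)
    ]


if __name__ == "__main__":
    print("# Run on the production group:")
    for cmd in ping_commands(PROD, DR):
        print(cmd)
    print("# Run on the DR group:")
    for cmd in ping_commands(DR, PROD):
        print(cmd)
```

With these addresses that is 10 tests from the production side and 12 from the DR side, which makes it easier to confirm that no combination was skipped.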

Traceroute Example:

To traceroute out of each specific ETH port interface, add a switch to choose the interface IP as shown below:

GrpName>support traceroute "-s [ETH port source IP] [destinationIP]"

-joe

Davan · Answer

Hi Joe,

Yes, I had used the commands you listed to fully ping and trace every different combination of interfaces between the two devices. As described above, the only paths that failed were from the 4000's physical NICs to the 6000's group IP.

Currently my best guess is corrupt or incomplete ARP tables on the 4000's physical NICs. Your thoughts?

Is there a way to re-arp on those NICs without causing downtime? Or is the only course of action to reset those NICs? Or reboot the entire SAN? Could I disable one NIC at a time, degrading performance but not causing a complete outage?

Alternatively, is there a way to manually failover from Controller1 to Controller2 on the SAN?

Thanks for your time!

Joe S586 · Answer

Clearing the ARP cache is not available as a user-facing command, so you would need support to clear it.

You can also clear it by restarting the array: type "restart" at the CLI (via SSH/Telnet), or in the GUI select the member first, then select restart at the bottom of the Service tab. This is also how you manually fail over the controller.

-joe

Joe S586 · Answer

The restart/controller failover should only last 15-45 seconds, so as long as your iSCSI disk timeouts are configured properly per the support document "iSCSI Initiator and Operating System Considerations" (located on the firmware download page), you should be able to do this during a low I/O period without any interruptions. If you are doing replication, you should pause any outbound and inbound replication to this group prior to the failover.

-joe

Davan · Answer

Thanks for your response, Joe.  It sounds like I'm going to have to schedule some downtime.

Davan · Answer

Hi Joe,

I'm hoping I can revive this question and get some more assistance.

I scheduled downtime last night and upgraded the firmware on both SANs to the latest, 5.2.6. Doing this included a restart of both SANs.

I can now successfully use "ping -I" from every interface to every other interface, in both directions.

Now the reason I'm doing this is to set up replication between both SANs.

I set both up as partners. This seems to work fine. If I restart one SAN, the other SAN notices the partner is down and sends me an email notification. So they appear to be able to see each other.

When I try to create replication on a volume from the PS4000 to the 6000, I get the message "Request to replication partner timed out"

When I try to do it the other way, replicate a volume on the 6000 to the 4000, the error I get is "IPC tmeout"

Any ideas here? I have also run port scans from both ends using nmap, to verify that iSCSI port 3260 is open in both directions, and I get State: Open for every single interface.

Any help you can provide would be greatly appreciated!
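
For readers who want to repeat the port check described above without nmap, here is a minimal Python sketch. The function names `port_open` and `scan` are hypothetical, and the check has to run from a host on the network rather than from the array itself. Like an nmap "open" result, it only shows that a TCP handshake to port 3260 completes; it says nothing about whether larger replication packets make it across the link.

```python
import socket


def port_open(host, port=3260, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def scan(targets, port=3260, timeout=3.0):
    """Map each target address to the result of port_open()."""
    return {ip: port_open(ip, port, timeout) for ip in targets}
```

Calling `scan(["10.10.12.5", "10.10.12.6", "10.10.12.7", "10.10.12.8", "10.10.12.9"])` from the production site, and the equivalent with the 10.0.12.x addresses from the DR site, would cover the interfaces listed in the original question.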

sketchy00 · Answer

Perhaps I missed it, but can you show what the physical topology looks like as it traverses one site to another? Is there any NAT going on at all? Testing via ping will help determine continuity, but obviously is not an absolute for real traffic under the ports needed for replication.

Also, are jumbo frames enabled on each site within the SAN switchgear? Then, as it traverses, does it get knocked down to standard frames? One can modify this, but I believe this is only something that should be done by EqualLogic Technical Support, so you will have to defer to them for further assistance. This is so that they can determine if it is really the issue or not. I'd look into making sure routes are configured properly (both ways). I'd also suggest that when you are doing your test replication, you do it straight from the GUI of the Group Mgr. Replicating guest iSCSI and such can introduce other issues.

You might find my 5 part series on Replication using an EqualLogic helpful. Part one is: vmpete.com/.../replication-with-an-equallogic-san-part-1

sketchy00 · Answer

Well, a site to site VPN along with frame sizes might be enough to trip it up. Keep us posted on what else you find. Have you opened a case with EqualLogic support?

The posts are a bit long winded, but thorough. Since that time, helping many others work out their replication issues, it almost has always come down to network matters (routing, NAT, VPN tunneling, etc.)

Davan · Answer

Hello, Sketchy00, thank you for your reply.

The 4000 is attached directly to a Cisco 3750 switch. That switch uses a Cisco 2821 router to connect via VPN to a colo site.

At the colo site, the 6000 is also connected to a Cisco 3750, and a Cisco 2921 router to connect to the VPN.

Both routers at both sites are configured as 'hubs' with multiple tunnels connecting our remote sites across the province.

Jumbo frames are enabled on the switches at both locations. The Group Manager GUI for both SANs tells me MTU size 9000 bytes on all interfaces (except the management ports).

As I've said, ping from all interfaces in either direction is working fine. I just verified that traceroute from all interfaces in either direction is correct as well.

Thanks again for any help you can provide. I will begin reading your replication documentation to see if it gives me any ideas.

Joe S586 · Answer

When configuring the partner name, ensure that it is entered correctly on both groups (partner names are case sensitive and must match exactly, so SAN_A is not san_a).

Fail over the member(s) one at a time to the other control module (CM). Once you fail over, re-test the ping/traceroute to ensure you have all paths set up correctly. Then try to set up the replication again.

If still having problems, I would suggest that you call into support so they can look at the diag files.

-joe

sketchy00 · Answer

Great catch on the partner name Joe.  I recall that stumping me on one occasion.  Perhaps the OP might have as well.

Davan · Answer

Thanks for the reply, Joe.

When you say Partner Name, do you mean Group Name? When setting up a replication partner it asks for Group Name, Group IP and Description, so I've been using the Group Name to set up replication partners. I have tried using the Member Name, and it seems to work until I try to configure replication on a volume. Then I get the message "No connection could be established. Verify that the partner IP address is correct". I've tried configuring the partner with all the different interface IPs, with no difference in results.

I have already attempted the failover to the backup controller on each SAN as well. No difference in results.

Joe S586 · Answer

Yes, the group name is the partner name.

Davan · Answer

Hello, it's me again!

I've just been advised by Cisco support that the jumbo frames on either end of the vpn are indeed being shrunk back to regular size packets as they're sent between sites.

As mentioned above by Sketchy00, could this be the source of my issue? And if so, is the solution to disable jumbo frames on either side?

I have verified with Cisco support that there is no NAT between these endpoints, everything is routed directly.

Thanks for your insight!
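
One way to reason about the frame-size question is to work out how large a single unfragmented packet can be at a given MTU, and how many pieces a jumbo-sized packet breaks into on a standard-MTU path. The sketch below assumes IPv4 with a 20-byte header (no options) and an 8-byte ICMP header; the function names are hypothetical, and the 1500-byte path MTU is an illustrative value, since the actual tunnel MTU is not stated in this thread.

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes


def max_icmp_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def fragments_needed(packet_len, path_mtu):
    """Number of IPv4 fragments for a packet of packet_len bytes on path_mtu."""
    if packet_len <= path_mtu:
        return 1
    data_len = packet_len - IPV4_HEADER
    # Every fragment except the last carries a multiple of 8 data bytes.
    per_fragment = (path_mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(data_len / per_fragment)


if __name__ == "__main__":
    print(max_icmp_payload(9000))        # 8972
    print(max_icmp_payload(1500))        # 1472
    print(fragments_needed(9000, 1500))  # 7
```

A don't-fragment ping from a host on the iSCSI network with an 8972-byte payload that succeeds within a site but fails across the VPN, while a 1472-byte payload succeeds, would be consistent with what Cisco support reported. The exact ping flags for payload size and don't-fragment depend on the operating system.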
