67 Posts

September 18th, 2011 03:00

What are you hoping to accomplish (fix conflicts, standardize your deployments, tune SAN parameters)?

If you're attempting to fine-tune, then start simple and layer your approach from there (proof of concept).  First, I would map out how the network is laid out (production, SAN, management) and how it's managed (VLANs, routing, ACLs, etc.).  Start from there...

37 Posts

September 19th, 2011 12:00

We are trying to reduce discards and errors on the interfaces so that the iSCSI traffic is more stable, with fewer drops, etc. We are using MPIO as well, but it's not reliable enough. We have the same setup across many similar systems that are not reporting these errors and discards, so we're trying to figure out which settings make the most sense.

Essentially, for the most reliable, stable connection with the highest performance opportunity, we're hoping Dell has done the due diligence and come up with a recommended configuration for these settings: one that results in the fewest dropped connections and moves data the fastest.

We're also trying to find out which NIC is the right one to purchase so we can put the iSCSI storage traffic on two NICs and the regular LAN access/management traffic on two others.

67 Posts

September 19th, 2011 23:00

Sounds like there are two objectives: stabilize the iSCSI connections and tune the solution for peak performance (advanced COMMs features as well as MPIO function and net performance)?

Let's start by looking to stabilize the connection(s) by understanding what you have.  What can you tell me about where these errored states are being reported - switchport interface counters, switch logs, OS netsh interface..., target stats, etc.?  Can you also share a depiction of your setup (either an explanation or a general drawing)?
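For the Windows side, something along these lines will pull the per-interface and per-protocol counters (just a suggested starting point - exact output varies by OS build and NIC driver):

C:\>netstat -e

C:\>netstat -s

C:\>netsh interface ipv4 show subinterfaces

The last one also shows the MTU configured per interface, which will matter once we start comparing it to the switch and array settings.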

I'll be asking quite a few more questions, from the initiator to the target, as we progress, but let's start with these first.  Bear with me.

37 Posts

September 21st, 2011 07:00

We have a flat network with gigabit switches interconnected in a hub/spoke topology. They are all in the same VLAN on the same subnet. Connected to the various switches are Windows Server 2008 R2 servers with two NICs each. Also attached to the switches are Dell EqualLogic SAN arrays using iSCSI connectivity. In this particular situation, the servers are attached to the same switch as the array and are in the same subnet.

Below are the stats for one of the switchports in question; the other is pretty much the same.

 Hardware is GigabitEthernet, address is 000c.db6d.5c8c (bia 000c.db6d.5c8c)

 Configured speed auto, actual 1Gbit, configured duplex fdx, actual fdx

 Configured mdi mode AUTO, actual MDIX

 Member of L2 VLAN ID 1, port is untagged, port state is FORWARDING

 STP configured to ON, priority is level0, flow control enabled

 mirror disabled, monitor disabled

 Not member of any active trunks

 Not member of any configured trunks

 Port name is Server1-NIC1

 IP MTU 10222 bytes

 300 second input rate: 21361776 bits/sec, 1206 packets/sec, 2.15% utilization

 300 second output rate: 45856168 bits/sec, 1684 packets/sec, 4.60% utilization

 39485349681 packets input, 66689472467389 bytes, 0 no buffer

 Received 241303 broadcasts, 914355 multicasts, 39484194023 unicasts

 0 input errors, 0 CRC, 0 frame, 0 ignored

 0 runts, 0 giants

 41517355665 packets output, 144012302456805 bytes, 0 underruns

 Transmitted 96183737 broadcasts, 122737070 multicasts, 41298434858 unicasts

 0 output errors, 0 collisions

---------------------------------------------

Here is the netstat output from the server:

C:\>netstat -s

IPv4 Statistics

 Packets Received                   = 1458824526

 Received Header Errors             = 0

 Received Address Errors            = 6427

 Datagrams Forwarded                = 0

 Unknown Protocols Received         = 0

 Received Packets Discarded         = 452371

 Received Packets Delivered         = 3317488819

 Output Requests                    = 1833484949

 Routing Discards                   = 0

 Discarded Output Packets           = 802

 Output Packet No Route             = 1

 Reassembly Required                = 457925

 Reassembly Successful              = 91585

 Reassembly Failures                = 0

 Datagrams Successfully Fragmented  = 0

 Datagrams Failing Fragmentation    = 0

 Fragments Created                  = 0

IPv6 Statistics

 Packets Received                   = 0

 Received Header Errors             = 0

 Received Address Errors            = 0

 Datagrams Forwarded                = 0

 Unknown Protocols Received         = 0

 Received Packets Discarded         = 9114

 Received Packets Delivered         = 9716

 Output Requests                    = 18832

 Routing Discards                   = 0

 Discarded Output Packets           = 0

 Output Packet No Route             = 4

 Reassembly Required                = 0

 Reassembly Successful              = 0

 Reassembly Failures                = 0

 Datagrams Successfully Fragmented  = 0

 Datagrams Failing Fragmentation    = 0

 Fragments Created                  = 0

ICMPv4 Statistics

                           Received    Sent

 Messages                  437364      439713

 Errors                    0           0

 Destination Unreachable   9119        11481

 Time Exceeded             0           0

 Parameter Problems        0           0

 Source Quenches           0           0

 Redirects                 0           0

 Echo Replies              0           428225

 Echos                     428245      0

 Timestamps                0           0

 Timestamp Replies         0           0

 Address Masks             0           0

 Address Mask Replies      0           0

 Router Solicitations      0           0

 Router Advertisements     0           0

ICMPv6 Statistics

                           Received    Sent

 Messages                  9114        9114

 Errors                    0           0

 Destination Unreachable   9114        9114

 Packet Too Big            0           0

 Time Exceeded             0           0

 Parameter Problems        0           0

 Echos                     0           0

 Echo Replies              0           0

 MLD Queries               0           0

 MLD Reports               0           0

 MLD Dones                 0           0

 Router Solicitations      0           0

 Router Advertisements     0           0

 Neighbor Solicitations    0           0

 Neighbor Advertisements   0           0

 Redirects                 0           0

 Router Renumberings       0           0

TCP Statistics for IPv4

 Active Opens                        = 13723304

 Passive Opens                       = 14495802

 Failed Connection Attempts          = 258

 Reset Connections                   = 257731

 Current Connections                 = 1690

 Segments Received                   = 3291832677

 Segments Sent                       = 1812636097

 Segments Retransmitted              = 17947801

TCP Statistics for IPv6

 Active Opens                        = 13

 Passive Opens                       = 9

 Failed Connection Attempts          = 4

 Reset Connections                   = 14

 Current Connections                 = 0

 Segments Received                   = 602

 Segments Sent                       = 591

 Segments Retransmitted              = 11

UDP Statistics for IPv4

 Datagrams Received    = 24746073

 No Ports              = 452332

 Receive Errors        = 39

 Datagrams Sent        = 2461345

UDP Statistics for IPv6

 Datagrams Received    = 0

 No Ports              = 9114

 Receive Errors        = 0

 Datagrams Sent        = 9114

And more:

C:\>netstat -e

Interface Statistics

                          Received            Sent

Bytes                    4148407736      2097489827

Unicast packets          1930810659      2414008888

Non-unicast packets        92435683          110666

Discards                      30186           30186

Errors                            0               0

Unknown protocols                 0

That's all I can get at this point.

67 Posts

September 22nd, 2011 20:00

Well, there is more here than just a discard problem (resets, fragmentation, retransmits).  Can you provide the switch stats for these ports, and if possible the switch logs for errored events (purge any company IP data before posting)?  I'm wondering if you're seeing these issues as a function of resource limitations on the switch (rcv/tx buffers)...  I assume the interfaces (ports, VLANs, virtual interfaces) are all set to support jumbo frames?  The EQL default is 9K, and that switchport is showing an IP MTU over 10K bytes.
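A quick way to sanity-check the jumbo path end to end (just a suggestion - substitute your array group IP for the placeholder) is a do-not-fragment ping from the server sized just under a 9000-byte MTU:

C:\>ping -f -l 8972 <array-group-IP>

8972 is 9000 minus the 20-byte IP header and 8-byte ICMP header. If that fails while "ping -f -l 1472" succeeds, something in the path isn't passing jumbo frames intact.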

Can you tell us what type switch this is and if flow control is enabled on the switch too?

INITIATOR:

       General Server Configuration

       -  IP MTU 10222 bytes

       -  flow control enabled

       General IPv4 Stats

       -  Received Address Errors            = 6427 <-- ?!

       -  Received Packets Discarded         = 452371 <--  Not surprised...

       -  Discarded Output Packets           = 802  <-- timed out...

       -  Reassembly Required                = 457925  <-- Fragmentation of frames?!

       -  Reassembly Successful              = 91585 <-- roughly 5 required for every 1 successful.  Not good.

       ICMP (IPv4) (Rcv/Trx)

       -  Messages                          437364       439713

       -  Destination Unreachable   9119        11481 <-- ?!

       ICMP (IPv6) (Rcv/Trx)

       -  Messages                  9114        9114

       -  Destination Unreachable   9114        9114

       TCP Statistics for IPv4

        -  Failed Connection Attempts          = 258

        -  Reset Connections                   =  257731  <-- !

        -  Segments Retransmitted              = 17947801 <--  wow!

Seen enough on this end...

37 Posts

October 1st, 2011 07:00

Hi. We had serious issues at work this week with SAN arrays dying, volumes dying, etc. You wouldn't believe it. Anyway, the MTU size on all our server NICs is set to 9000. The switches have an MTU value of 10218 or something close to that. The reassembly numbers are disconcerting; there must be something wrong here. Is it possible some of that data is due to traffic on the NIC that is not iSCSI, like RDP, SQL connections, etc.? We want to segregate them, but right now that's not an option for us, unfortunately.

67 Posts

October 2nd, 2011 04:00

Sorry to hear that.  It does sound like fragmentation is the problem.  Look at my previous post and see if you can provide some of that info.  To your question about the traffic not being entirely iSCSI: yes, that non-iSCSI traffic can show up in those counters too.

JP

37 Posts

October 6th, 2011 07:00

Thanks. I don't know if flow control is enabled on every switch. I had only ever read it should be on the host side. Should it be on every switch and are there any concerns with that?

Also, I don't understand how smaller traffic (MTU 1500) could be the reason for the fragmentation numbers. It would be wasted space, but 9000 or 10218 is more than enough to carry 1500-byte regular traffic, right?

67 Posts

October 6th, 2011 23:00

Flow control should only be deployed at edge/access switches facing hosts, for congestion management - never at the core.  As for a smaller frame (1518 bytes on the wire, 1500-byte MTU) passing through a switch with a jumbo frame setting (9K), that should not be a problem.
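To put a rough number on the fragmentation direction (an illustration only, assuming a 9000-byte jumbo datagram has to cross a hop that only passes a 1500-byte MTU): the 8980 bytes of payload get split into 7 fragments (six of 1480 bytes plus a 100-byte remainder, each with its own 20-byte IP header), and if any one fragment is dropped, the whole datagram fails reassembly. That kind of multiplication is consistent with the gap between Reassembly Required and Reassembly Successful in your netstat output. Small frames riding a jumbo-enabled switch don't fragment; it's the big frames hitting a small hop that do.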

Here are some things I have seen with interoperating switches (Cisco, Juniper, Brocade, Nortel, Foundry, and PowerConnect) while using jumbo frames...

- COMMs (server NIC) driver.  This might be a point to look at.  Set your NIC to MTU 1500, monitor traffic for a spell, and check your counters (see the netsh sketch after this list).  If the errors stop, that's one data point closer to understanding the problem.

- Switch-to-switch incompatibility.  I've seen some switches have resource problems where, when the MTU is set to a higher supported size and negotiated down to a lower one, resources on the switch (i.e. rcv/tx buffers) stay allocated for the larger size regardless of non-usage.  This might be an area of interest for you.  Try dialing your switch MTU down from that 10218-or-so setting to 9000 and check your counters.

- Switch MAC table corruption causing frames to be lost.  We found this to be true on some switches when utilization was high.  You might want to monitor your switch processor usage.  Again, just another data point to help you understand.

- Switch-to-switch PHY design.  I've actually run across an issue recently where a switch and a controller Ethernet interface had a preamble problem (interframe gap) resulting in packet losses, but here was the catch - it was more prevalent when jumbo frames were used.

- Not all switch metrics are the same - they're subject to interpretation.  Meaning a 9000-byte frame on one switch is not necessarily a 9000-byte frame on another.  Some vendors count only the payload and don't include the header and FCS...  Dial your switches down to the 9000 metric and set your initiator and target MTU to under that... say 8000-byte frames.  Check counters... another data point...
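For the NIC MTU test mentioned in the first bullet, a minimal sketch on Windows Server 2008 R2 (the interface name "iSCSI1" is just a placeholder - use whatever your iSCSI NIC is actually called):

C:\>netsh interface ipv4 show subinterfaces

C:\>netsh interface ipv4 set subinterface "iSCSI1" mtu=1500 store=persistent

The first command lists the current MTU per interface; the second drops the iSCSI NIC to 1500 so you can watch whether the discard/reassembly counters stop climbing.  Set it back to 9000 (or whatever jumbo value you settle on) the same way when you're done.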

This is really going to bite if it's just an outdated driver problem...  Also check to see if you have out-of-sequence frames; that would point to the switch fabric.

37 Posts

October 7th, 2011 08:00

So should I type "no flow-control" on our core (hub) switches? Only the firewall and other access switches are plugged into the core switches right now.

I've updated the NIC drivers, with no improvement in numbers.

I don't know if Brocade and Foundry switches can set a specific MTU size. I think if you turn on jumbo frames, it sets the MTU to whatever maximum each one supports.

If processor usage is low, is MAC table corruption still possible? How do I verify/fix that?

67 Posts

October 7th, 2011 22:00

Without understanding how your topology is really laid out, I can only tell you that flow control typically belongs at the edge, to help manage traffic bursts and such to your attached devices/appliances.

Can you share a topology/depiction of what you have (a 10,000-foot view) without giving up any of your trade secrets?

67 Posts

October 7th, 2011 22:00

Also, did you attempt to reduce the MTU on your server and check the counters?  That's relatively easy and quick to do...

37 Posts

October 11th, 2011 08:00

I'm going to wait a few more days, but there have been zero errors or discards since I installed the latest drivers and management suite from Broadcom's site. I saw in an iSCSI document that Dell recommends you get the driver from Broadcom's site, not from Dell's support site. Very non-intuitive, IMO. Anyway, we'll see.

37 Posts

October 11th, 2011 08:00

Can you point me to an article that discusses in more detail why not to turn on flow control at the core level?

67 Posts

October 11th, 2011 20:00

Glad to see the errors resolved after the driver update.  Told you it was going to bite!  There are many sources for flow control best practices.  Google "flow control at link layer"...  Keep in mind, Ethernet flow control operates at the link layer, while the core's job is routing at the network layer - make sense?...
