September 24th, 2015 11:00

High Latency for SSD on XtremIO: How do you troubleshoot?

Hi all,

In my case XtremIO is used for Oracle databases, and the Performance meter does go into the red for Bandwidth.

I have both FC and iSCSI connections; the database whose performance is in question is connected via iSCSI.

I know that should be a big red flag, but what I need to prove is that the latency peaks occur because the server is overrunning the 10GbE iSCSI connection, and that the latency is related to retransmits. Well, that is my theory until proven wrong.

Does anyone have any idea how to detect, on the XtremIO or on the server, whether a lot of network retransmits are taking place?
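
On the Linux side, one low-effort first check is the kernel's host-wide TCP counters, which `netstat -s` also summarises. The Python sketch below reads the `Tcp:` lines of `/proc/net/snmp` twice and reports how many segments were retransmitted in between, alongside how many were sent (`OutSegs`). The function names and the 10-second interval are illustrative choices, not XtremIO tooling. Two caveats: the counters cover every TCP connection on the host, not just iSCSI, and they only count segments this host retransmits, so retransmissions sent by the array toward the host still have to be found in a packet capture.

```python
import time


def read_tcp_counters(snmp_text: str) -> dict[str, int]:
    """Parse the header/value pair of 'Tcp:' lines from /proc/net/snmp."""
    rows = [line.split()[1:] for line in snmp_text.splitlines() if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(header, values)}


def retransmit_delta(before: dict[str, int], after: dict[str, int]) -> tuple[int, int]:
    """Return (retransmitted segments, sent segments) between two snapshots."""
    return (after["RetransSegs"] - before["RetransSegs"],
            after["OutSegs"] - before["OutSegs"])


def sample(interval_s: float = 10.0, path: str = "/proc/net/snmp") -> tuple[int, int]:
    """Snapshot the counters, wait, snapshot again, and return the deltas."""
    with open(path) as f:
        before = read_tcp_counters(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = read_tcp_counters(f.read())
    return retransmit_delta(before, after)
```

Running `retrans, sent = sample(10)` during a latency spike and again during a quiet period gives a rough comparison; a clear jump in `retrans` during spikes would support the retransmit theory, though it is not proof on its own.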

17 Posts

September 24th, 2015 11:00

I should define high latency: in this case, anything over 13,000 us (13 ms). When I look at the last 24 hours, I see peaks of 87,000 us (87 ms).

727 Posts

September 28th, 2015 20:00

Response time shown in the GUI (and CLI) is a calculated number rather than a measured one. The calculation puts the number of IOPS in the denominator, so if you have a very low number of IOPS (because of the nature of the workload), you will see artificially high latency numbers. Can you check the IOPS on the array when you see those high latencies?
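
To illustrate one way a computed average can mislead at low IO rates, here is a small Python sketch that assumes a simple per-interval formula: total response time divided by the number of IOs completed in the interval. That formula is an assumption for illustration only; the array's exact calculation is not given in this thread. The same single 200 ms outlier (a hypothetical value) barely moves the average in a busy interval but dominates a quiet one.

```python
def interval_average_latency_us(latencies_us: list[float]) -> float:
    """Hypothetical per-interval average: total response time / number of IOs."""
    if not latencies_us:
        return 0.0
    return sum(latencies_us) / len(latencies_us)


OUTLIER_US = 200_000.0                   # one slow IO of 200 ms (hypothetical)
quiet = [500.0] * 9 + [OUTLIER_US]       # 10 IOs completed in the interval
busy = [500.0] * 9_999 + [OUTLIER_US]    # 10,000 IOs completed in the interval

print(f"quiet interval: {interval_average_latency_us(quiet):,.0f} us")  # 20,450 us
print(f"busy interval:  {interval_average_latency_us(busy):,.0f} us")   # 520 us
```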

In any case, network retransmit scenarios need to be investigated by taking traces, and a performance analysis like this needs information about the workload you are running. Have you tried opening an SR with the Support team?

17 Posts

September 29th, 2015 09:00

I can try opening an SR. I will look at IOPS the next time I see high latencies. Can you give me an idea of what would be considered low IOPS? I just checked and I have a latency spike just over 18,000 us, and the IOPS at the time was over 15,000.

727 Posts

September 29th, 2015 20:00

18,000 us = 18 ms of latency at 15,000 IOPS. The performance characteristics will depend on a lot of factors, like:

  • Read/write mix
  • IO size
  • Concurrency
  • Number of X-Bricks
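
Two back-of-the-envelope checks fit the numbers above. Little's law relates IOPS and latency to the average number of IOs in flight, and IOPS multiplied by IO size gives bandwidth, which can be compared with the raw 10 Gbit/s line rate of one iSCSI link. The IO sizes in the loop are hypothetical, because the thread does not say what the Oracle workload issues, and usable bandwidth is lower than line rate once Ethernet, IP, TCP and iSCSI overhead is subtracted.

```python
LINK_BYTES_PER_S = 10e9 / 8  # raw 10GbE line rate in bytes/s, before protocol overhead


def outstanding_ios(iops: float, latency_s: float) -> float:
    """Little's law: average IOs in flight = completion rate x time per IO."""
    return iops * latency_s


def link_utilisation(iops: float, io_size_bytes: int) -> float:
    """Fraction of one raw 10GbE link consumed by this IO rate and size."""
    return iops * io_size_bytes / LINK_BYTES_PER_S


iops, latency_s = 15_000, 0.018
print(f"average IOs in flight: {outstanding_ios(iops, latency_s):.0f}")
for kib in (8, 64, 128):  # hypothetical IO sizes; not stated in the thread
    print(f"{kib:>3} KiB IOs -> {link_utilisation(iops, kib * 1024):.0%} of one 10GbE link")
```

At 8 KiB per IO, 15,000 IOPS uses roughly a tenth of one link; at 128 KiB it would need more than one 10GbE link can carry. So whether a single link is being overrun depends heavily on IO size.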

I would recommend discussing this with your account team, who would have the necessary expertise to look at your workload and give you an idea of the expected performance from your XtremIO array.

51 Posts

September 29th, 2015 22:00

Are your Oracle databases on a physical or virtual server?

Do you have any storage virtualisation like VPLEX?

Is the database on Linux or Windows?

Have you checked the latency on the client side?

How is the multipathing configured? Did you use a non-routable VLAN? What kind of switches?

Have the XtremIO recommendations been applied on the client side?

If it's a network issue, it will be easier to debug on the client side.

1 Message

October 20th, 2015 07:00

I'm running into what seems like the same type of issue. Our setup is Oracle 12c Grid + 11g home + OL6 + VMware 6 + Cisco UCS. The issue seems to be related to ingress TCP congestion, based on the number of retransmits from the X-Bricks that I can see using tcpdump/tshark on the OL6 host.
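
For anyone who wants to reproduce this kind of check, one approach is to capture iSCSI traffic on the host (for example `tcpdump -i <iface> -w iscsi.pcap port 3260`, where the interface and file name are placeholders) and let tshark's TCP analysis flag retransmissions. The Python sketch below shells out to `tshark` with the `tcp.analysis.retransmission` display filter and tallies retransmitted segments per source/destination pair; the helper names are hypothetical.

```python
import subprocess
from collections import Counter


def run_tshark(pcap_path: str) -> str:
    """Return src/dst IPs (tab-separated) of segments tshark flags as retransmissions."""
    cmd = [
        "tshark", "-r", pcap_path,
        "-Y", "tcp.analysis.retransmission",
        "-T", "fields", "-e", "ip.src", "-e", "ip.dst",
    ]
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def retransmissions_by_flow(tshark_output: str) -> Counter:
    """Count retransmitted segments per (source IP, destination IP) pair."""
    counts: Counter = Counter()
    for line in tshark_output.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0] and fields[1]:
            counts[(fields[0], fields[1])] += 1
    return counts
```

`retransmissions_by_flow(run_tshark("iscsi.pcap")).most_common(10)` lists the flows with the most retransmissions. If most of them originate from the X-Brick target addresses, that suggests segments are being lost (or ACKs delayed) somewhere on the path from the array to the host.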

Running btest or vdbench with 32 threads and a 128k block size against a single volume on a 20TB X-Brick gives around 900MB/s throughput, with around 2ms latency reported by iostat. In Oracle I'm using ASM normal redundancy to mirror 2 volumes on separate X-Brick clusters.

Performing a large table scan or RMAN backup produces 256k-512k block reads, and I see a maximum read throughput of around 150MB/s per volume with latency around 10-15ms, which obviously isn't acceptable.

Disabling tcp_window_scaling improves throughput to around 500MB/s per volume, though it degrades NFS throughput back to my NetApp filers. Increasing tcp_rmem only makes the issue worse, so it seems like a buffer issue elsewhere.
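
Both knobs are host-wide sysctls, so a change applies to every TCP connection on the box, which would be consistent with the NFS side effect. When iterating on settings like these, it helps to record the exact values next to each benchmark result. The Python sketch below only reads `net.ipv4.tcp_window_scaling` and the three-value `net.ipv4.tcp_rmem` (min, default, max receive buffer in bytes) from `/proc/sys`; the function names are illustrative, and changing the values is still done with `sysctl -w` as root.

```python
from pathlib import Path

IPV4_SYSCTLS = Path("/proc/sys/net/ipv4")


def parse_rmem(text: str) -> dict[str, int]:
    """tcp_rmem holds three integers: min, default and max receive buffer (bytes)."""
    minimum, default, maximum = (int(v) for v in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def snapshot(root: Path = IPV4_SYSCTLS) -> dict[str, object]:
    """Read the current window-scaling flag and receive-buffer limits."""
    return {
        "tcp_window_scaling": int((root / "tcp_window_scaling").read_text()),
        "tcp_rmem": parse_rmem((root / "tcp_rmem").read_text()),
    }
```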

We have also found that reducing the iSCSI sessions to one per Linux 10GbE interface restores throughput to acceptable levels, though this raises red flags, as having a single 10GbE connection to each cluster isn't a good idea either.

An SR has been open for the past few months, and I'd assume we have gone through all the best-practice items. The next step is a multi-point packet capture to try to narrow down the device or interface that is dropping packets.
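
One way to compare a multi-point capture is sketched in Python below, under a few assumptions. Each capture point's pcap is exported with tshark using absolute sequence numbers and only data-carrying segments, e.g. `tshark -r point_a.pcap -o tcp.relative_sequence_numbers:FALSE -Y "tcp.len > 0" -T fields -e ip.src -e ip.dst -e tcp.srcport -e tcp.dstport -e tcp.seq` (file names are placeholders). Segments that appear at the upstream point more often than downstream were likely dropped in between. This assumes both captures cover the same time window, and host-side captures can be distorted by segmentation offload (TSO/GRO), so switch SPAN ports may give cleaner data.

```python
from collections import Counter


def segment_counts(tshark_fields: str) -> Counter:
    """Count (src ip, dst ip, src port, dst port, seq) keys from tab-separated tshark output."""
    counts: Counter = Counter()
    for line in tshark_fields.splitlines():
        fields = tuple(line.split("\t"))
        if len(fields) == 5 and all(fields):
            counts[fields] += 1
    return counts


def missing_downstream(upstream: str, downstream: str) -> Counter:
    """Segments seen upstream more often than downstream; Counter '-' keeps positive counts."""
    return segment_counts(upstream) - segment_counts(downstream)
```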

17 Posts

November 2nd, 2015 14:00

urandom,

Our configurations do have pieces in common. Thank you for your input.

17 Posts

November 3rd, 2015 14:00

Aglidic, you asked:

Are your Oracle databases on a physical or virtual server? Physical.

Do you have any storage virtualisation like VPLEX? No.

Is the database on Linux or Windows? Linux.

Have you checked the latency on the client side? I will try. It is hard to pinpoint, because the spikes are short blips, but there are many of them.

How is the multipathing configured? PowerPath.

Did you use a non-routable VLAN? Yes.

What kind of switches? Nexus 7000.

Have the XtremIO recommendations been applied on the client side? I believe so, but if there is a specific doc I should reference, I will review the client settings.

727 Posts

November 4th, 2015 13:00

You can find the XtremIO recommended settings in the XtremIO 2.2.x - 4.0.1 Host Configuration Guide.

1 Message

January 19th, 2016 11:00

T_Koopman,

Did you ever resolve this issue?
