T_Koopman

17 Posts

4431

September 24th, 2015 11:00

High latency for SSD on XtremIO: how do you troubleshoot?

Hi all,

In my case XtremIO is used for Oracle databases and the Performance meter does go into the red for Bandwidth.

I have both FC and iSCSI connections; the database in question is connected via iSCSI.

I know that should be a big red flag, but what I need to prove is that the latency peaks occur because the server is overrunning the 10 Gb iSCSI connection and the latency is related to retransmits. Well, that is my theory until proven wrong.

Does anyone have any idea how to detect, on the XtremIO or on the server, whether a lot of network retransmits are taking place?

Responses(10)

T_Koopman

17 Posts

0

September 24th, 2015 11:00

I should define high latency: in this case, anything over 13,000 us. When I look at the last 24 hours, I see peaks of 87,000 us.

Kumar_A

727 Posts

0

September 28th, 2015 20:00

The response time shown in the GUI (and CLI) is a calculated number rather than a measured one. The calculation has the number of IOPS in the denominator, so if you have a very low number of IOPS (because of the nature of the workload), you will see artificially high latency numbers. Can you check the IOPS on the array when you see those high latencies?

In any case, network retransmit scenarios need to be investigated by taking traces, and performance analysis like this requires information about the workload you are running. Have you tried opening an SR with the Support team?

T_Koopman

17 Posts

0

September 29th, 2015 09:00

I can try opening an SR. I will look at IOPS the next time I see high latencies. Can you give me an idea of what would be considered low IOPS? I just checked, and I have a latency spike of just over 18,000 us; the IOPS at the time was over 15,000.

Kumar_A

727 Posts

0

September 29th, 2015 20:00

18,000 us = 18 ms of latency at 15,000 IOPS. The performance characteristics will depend on a lot of factors, such as:

Read/write mix
IO size
Concurrency
Number of X-Bricks

I would recommend discussing this with your account team who would have the necessary expertise to look at your workload and give you an idea of the expected performance from your XtremIO array.
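
As a rough cross-check of the numbers above, Little's law ties latency, IOPS, and the number of outstanding I/Os together. The thread does not give the exact formula the XtremIO GUI uses, so the sketch below only assumes a Little's-law style calculation (latency = outstanding I/Os / IOPS); the function names are made up for illustration.

```python
def implied_outstanding_ios(latency_us: float, iops: float) -> float:
    """Little's law: average I/Os in flight = throughput * time per I/O.

    latency_us is the average response time in microseconds,
    iops is the average number of I/Os completed per second.
    """
    return iops * latency_us / 1_000_000


def calculated_latency_us(outstanding_ios: float, iops: float) -> float:
    """Latency derived from queue depth and IOPS (ASSUMED formula, not
    confirmed to be what XtremIO actually reports)."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    return outstanding_ios * 1_000_000 / iops
```

Under that assumption, `implied_outstanding_ios(18_000, 15_000)` works out to about 270 I/Os in flight on average. It also shows why a low IOPS figure inflates a calculated latency: `calculated_latency_us(8, 100)` is far larger than `calculated_latency_us(8, 15_000)` for the same queue depth.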

Aglidic

51 Posts

1

September 29th, 2015 22:00

Is your Oracle database on a physical or a virtual server?

Do you have any storage virtualisation, such as VPLEX?

Is the database on Linux or Windows?

Have you checked the latency on the client side?

How is multipathing set up? Did you use a non-routable VLAN? What kind of switches?

Have the XtremIO recommendations been applied on the client side?

If it's a network issue, it will be easier to debug on the client side.
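
One way to check for retransmits on a Linux client is to sample the kernel's host-wide TCP counters in /proc/net/snmp and compare RetransSegs against OutSegs over an interval. This is a minimal sketch, assuming a Linux host; the function names are hypothetical. The counters cover all TCP traffic on the host (not only iSCSI) and only count segments the host itself retransmitted, so retransmits sent by the array toward the host would still need a packet capture.

```python
import time
from pathlib import Path


def parse_tcp_counters(snmp_text: str) -> dict:
    """Parse the two 'Tcp:' lines (names, then values) of /proc/net/snmp."""
    tcp_lines = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(tcp_lines) != 2:
        raise ValueError("expected a Tcp: header line and a Tcp: value line")
    names, values = tcp_lines
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_ratio(before: dict, after: dict) -> float:
    """Fraction of segments sent between two samples that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent > 0 else 0.0


def measure(interval_s: float = 10.0, path: str = "/proc/net/snmp") -> float:
    """Sample the counters twice, interval_s apart, and return the ratio."""
    snmp = Path(path)
    before = parse_tcp_counters(snmp.read_text())
    time.sleep(interval_s)
    after = parse_tcp_counters(snmp.read_text())
    return retransmit_ratio(before, after)
```

Calling `measure(10)` in a loop while the latency blips occur would show whether the host's retransmit ratio rises at the same moments; a short interval helps when the spikes are brief.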

_dev_urandom

1 Message

1

October 20th, 2015 07:00

I'm running into what seems like the same type of issue. Our setup is Oracle 12c Grid + 11g home + OL6 + VMware 6 + Cisco UCS. The issue seems to be related to ingress TCP congestion, based on the number of retransmits from the X-Bricks that I can see using tcpdump/tshark on the OL6 host.

Running btest or vdbench with 32 threads and a 128k block size against a single volume on a 20TB X-Brick gives around 900MB/s throughput with around 2ms latency as reported by iostat. In Oracle I'm using ASM normal redundancy to mirror 2 volumes on separate X-Brick clusters.

Performing a large table scan or RMAN backup produces 256k-512k block reads, and I see a maximum read throughput of around 150MB/s per volume with latency of around 10-15ms, which obviously isn't acceptable.

Disabling tcp_window_scaling improves throughput to around 500MB/s per volume, though it degrades NFS throughput to my NetApp filers. Increasing tcp_rmem only makes the issue worse, so it seems like a buffer issue elsewhere.

We have also found that reducing the iSCSI sessions to one per Linux 10GbE interface restores throughput to acceptable levels, though this raises red flags, as having a single 10GbE connection to each cluster isn't a good idea either.

An SR has been open for the past few months, and I'd assume that we have gone through all the best-practice items. The next step is to attempt a multi-point packet capture to try to narrow down the device or interface that is dropping packets.

T_Koopman

17 Posts

0

November 2nd, 2015 14:00

urandom,

Our configurations do have pieces in common. Thank you for your input.

T_Koopman

17 Posts

0

November 3rd, 2015 14:00

Aglidic, you asked:

Is your Oracle database on a physical or a virtual server? Physical.

Do you have any storage virtualisation, such as VPLEX? No.

Is the database on Linux or Windows? Linux.

Have you checked the latency on the client side? I will try. It is hard to pinpoint, because the spikes are short blips, but there are many of them.

How is multipathing set up? PowerPath.

Did you use a non-routable VLAN? Yes.

What kind of switches? Nexus 7000.

Have the XtremIO recommendations been applied on the client side? I believe so, but if you have a specific doc I should reference, I will review the client settings.

Kumar_A

727 Posts

0

November 4th, 2015 13:00

You can find the XtremIO recommended settings in the XtremIO 2.2.x - 4.0.1 Host Configuration Guide.

udevrandom

1 Message

0

January 19th, 2016 11:00

T_Koopman,

Did you ever resolve this issue?
