Jason_Zhou

1.2K messages

March 19, 2013, 19:00

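[Ask the Expert, Session 9] Analyzing and Resolving Performance Degradation on VNX Caused by Software Settings or Hardware Problems

Good news: a new round of the Chinese-language "Ask the Expert" event is about to begin.

The technical topic this time is analyzing and resolving performance degradation on VNX that may be caused by software settings or hardware problems. For two weeks starting Monday, March 25, 2013, we will discuss this topic with you and share experience and lessons learned.

The Q&A covers both VNX Block and VNX File. Typical scenarios and common questions that will be discussed include:

• Frequent LUN trespasses on the VNX, with a clear drop in SAN access speed, caused by incorrect host software configuration or other hardware problems

• SAN access timeouts on some VNX LUNs, caused by unsuitable VNX software configuration or other hardware problems

• VNX File settings that may affect NAS access performance

• Common CIFS access problems

All previously completed "Ask the Expert" events are listed in this summary thread.

Topic of this session: analyzing and resolving performance degradation on VNX that may be caused by software settings or hardware problems

Duration of this session: March 25 to April 6, 2013, two weeks. After the event ends this thread will be locked; follow-up questions can be asked in a new thread.

The two experts invited for this session are Davy Sun and John Zhou.

Davy Sun has 19 years of IT experience. From 1994 to 2003 he worked in the telecommunications industry and then at multinational companies. He joined the EMC post-sales technical support organization in 2004, where he is mainly responsible for technical support of the VNX, CX, and RPA product lines, including common fault handling and performance analysis for VNX, CX, and RPA.

John Zhou has 8 years of IT technical support experience. From 2005 to 2009 he supported Microsoft products. He joined the EMC Global Support Center in early 2009, where he supports Celerra storage and the File side of VNX storage. He is experienced in troubleshooting CIFS environments.

Talk with the experts and exchange ideas with your peers. Everyone is welcome to reply to this thread with questions and opinions on "analyzing and resolving performance degradation on VNX that may be caused by software settings or hardware problems". We look forward to your participation!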

Replies (33)

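gbzhuang

29 messages

March 28, 2013, 06:00

Understood. It looks like there are quite a few possible causes on both the software and the hardware side. You mentioned ALUA here. Could you describe ALUA mode in a bit more detail, along with the impact some of its settings may have? ALUA is enabled by default, or at least recommended, right? Are there special cases where it needs to be disabled? Thanks again!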

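yangbpet

39 messages

March 28, 2013, 07:00

Thank you, experts, for the detailed and practical information! There is this much to know about file system and volume layout alone. I have learned a great deal!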

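XuejunS

18 messages

March 29, 2013, 00:00

ALUA is enabled by default, or at least recommended, right?

Not necessarily. Check the failover mode setting of the host HBA under Connectivity Status on the CX. ALUA is enabled only if the failover mode is 4.

Are there special cases where it needs to be disabled?

Generally not. However, when an HBA or a switch SFP is in an unstable, intermittent state, LUNs may still trespass frequently. In that case you can disable ALUA and use only the HBA that is not affected. If the symptom disappears, you have identified the faulty component. Once the faulty component has been found, turning ALUA back on restores normal operation quickly.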

XuejunS

18 messages

March 29, 2013, 00:00

The benefits of the CLARiiON and VNX Asymmetric Active/Active (ALUA) feature are as follows:

1. ALUA avoids unavailability of boot from SAN during path failure.

Whenever an optimal path fails during a boot from SAN operation, the request-forwarding feature in ALUA means users do not need to explicitly trespass the LUN to the other SP. If the HBA BIOS is configured so that it can issue I/O to the non-optimal path through the peer SP, the I/O is routed through the upper redirector to the owning SP and the operating system boots successfully.

2. It reduces data unavailable situations due to misconfiguration of the host.

Host misconfiguration can occur on the user side when an application sends I/O to a non-optimal path, meaning an SP that does not own the LUN. In this case, depending on the failover software installed on the host, the CLARiiON or VNX storage system does not return an I/O error. Instead, the request-forwarding feature in ALUA mode routes the I/O to the SP that owns the LUN for processing. This automatically adjusts the optimal path setting for a LUN, which is beneficial in larger production environments where the chance of misconfiguration is higher.

3. It avoids LUN ownership thrashing situations that may occur in SnapView and cluster configurations.

ALUA avoids trespassing LUNs back and forth between the two SPs in a one-path-per-SP host configuration. For both cluster and SnapView configurations, with the introduction of the ALUA standard, the LUN does not trespass back and forth between the two SPs. Instead, it is owned by the SP through which a given host in ALUA mode sends the most I/O requests for that LUN.

4. It supports standard SCSI multipath interfaces.

Host failover software does not need to be concerned with CLARiiON-specific trespass commands, because CLARiiON and VNX implement the ALUA SCSI standard.

5. It masks certain CLARiiON and VNX back-end failures.

If a back-end failure, such as an LCC failure, occurs on the SP that currently owns the LUN, the request-forwarding feature in ALUA routes I/O through the lower redirector to the peer SP. The I/O acknowledgement is sent through the SP that owns the LUN, by using the lower redirector. No intervention by the failover software is required on the hosts, which masks certain back-end failures.

XuejunS

18 messages

March 29, 2013, 00:00

ALUA support:

1. Asymmetric logical unit access (ALUA) mode is a failover mode available with CX3 FLARE Operating Environment version 03.26 and later, CX4, and VNX.

2. Most recent OS versions now support ALUA. AIX requires 5.3 SP1 or later. For full details, see EMC Primus emc187614.

Changing the ALUA setting:

The failover mode setting applies to a server's HBA ports, and is configured through the CLARiiON Navisphere GUI or CLI. A restart is required after the change.

How ALUA works:

In short, when a problem on the external Fibre Channel path would otherwise force a trespass on the CX, the SP that owns the LUN does not change; that is, the LUN is not trespassed. Instead, the I/O goes to the peer SP and crosses the internal channel (CMI) between the two SPs. This consumes some CMI bandwidth (the CMI has plenty), but it greatly reduces the number of LUN trespasses and also balances the SP load. It avoids moving every LUN from the SP with the path problem to the other SP and overloading a single SP.

A more detailed explanation follows:

In a scenario where two storage processors exist (that is, SP A and SP B), assume that the default owner of the LUN is SP A. In ALUA mode, if the paths to SP A are dead, PowerPath can send the I/O via the active paths to SP B. The LUN is not trespassed, so SP A continues to be the owning SP.

After ample I/O has gone down the paths to SP B, the array will trespass the LUN. In the case of operation in the PAR mode, PowerPath trespasses the LUN when all paths to the owning SP are dead. Thus, the owning SP would be SP B as a result of an explicit trespass initiated by PowerPath at the host bootup/powermt config phase.

Asymmetric Active/Active introduces a new initiator Failover Mode (Failover Mode 4) where initiators are permitted to send I/O to a LUN regardless of which SP actually owns the LUN.

Manual trespass

When a manual trespass is issued (using Navisphere Manager or the CLI) to a LUN on an SP that is accessed by a host with Failover Mode 1, subsequent I/O for that LUN is rejected over the SP on which the manual trespass was issued. The failover software redirects I/O to the SP that owns the LUN.

A manual trespass operation changes the ownership of a given LUN owned by a given SP. If this LUN is accessed by an ALUA host (Failover Mode is set to 4) and I/O is sent to the SP that does not currently own the LUN, the I/O is redirected. In such a situation, the array changes the ownership of the LUN based on how many I/Os the LUN processes on each SP (a threshold of about 64,000 I/Os).

Path, HBA, switch failure

If a host is configured with Failover Mode 1 and all the paths to the SP that owns a LUN fail, the LUN is trespassed to the other SP by the host's failover software.

With Failover Mode 4, in the case of a path, HBA, or switch failure, when I/O routes to the non-owning SP, the LUN may not trespass immediately (depending on the failover software on the host). If the host does not trespass the LUN, FLARE will trespass the LUN to the SP that receives the most I/O requests for that LUN. The array does this by keeping track of how many I/Os a LUN processes on each SP. If the non-optimized SP processes 64,000 or more I/Os than the optimal SP, the array changes the ownership to the non-optimal SP, making it optimal.
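The counting rule described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class name, the SP labels, and the decision to reset the counters after an ownership change are hypothetical and are not taken from FLARE; only the 64,000 I/O difference comes from the text.

```python
from dataclasses import dataclass, field

# Threshold quoted in the post: the non-owning SP must process 64,000 or more
# I/Os than the owning SP before the array moves ownership.
IMPLICIT_TRESPASS_THRESHOLD = 64_000


@dataclass
class LunOwnershipModel:
    """Toy model of one LUN accessed in Failover Mode 4 (ALUA)."""
    owner: str = "SPA"
    io_count: dict = field(default_factory=lambda: {"SPA": 0, "SPB": 0})
    trespasses: int = 0

    def receive_io(self, sp: str, count: int = 1) -> None:
        """Record I/Os arriving on `sp`, then re-evaluate ownership."""
        self.io_count[sp] += count
        peer = "SPB" if self.owner == "SPA" else "SPA"
        difference = self.io_count[peer] - self.io_count[self.owner]
        if difference >= IMPLICIT_TRESPASS_THRESHOLD:
            self.owner = peer
            self.trespasses += 1
            # Assumption for this sketch: counters restart after a trespass.
            self.io_count = {"SPA": 0, "SPB": 0}


lun = LunOwnershipModel()
lun.receive_io("SPB", 63_999)   # forwarded to the owner, SPA still owns the LUN
owner_before = lun.owner
lun.receive_io("SPB", 1)        # the difference reaches the threshold
owner_after = lun.owner
print(owner_before, owner_after, lun.trespasses)
```

The point of the model is that a short burst of I/O on the non-owning SP is simply forwarded, while sustained I/O on that SP eventually moves ownership once.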

SP failure

In case of an SP failure for a host configured as Failover Mode 1, the failover software trespasses the LUN to the surviving SP.

With Failover Mode 4, if an I/O arrives from an ALUA initiator on the surviving SP (non-optimal), FLARE initiates an internal trespass operation. This operation changes ownership of the target LUN to the surviving SP since its peer SP is dead. Hence, the host (failover software) must have access to the secondary SP so that it can issue an I/O under these circumstances.

Single backend failure

Before FLARE Release 26, if the failover software was misconfigured (for example, a single attach configuration), a single back-end failure (for example, an LCC or BCC failure) would generate an I/O error since the failover software would not be able to try the alternate path to the other SP with a stable backend.

With release 26 of FLARE, regardless of the Failover Mode for a given host, when the SP that owns the LUN cannot access that LUN due to a back-end failure, I/O is redirected through the other SP by the lower redirector. In this situation, the LUN is trespassed by FLARE to the SP that can access the LUN. After the failure is corrected, the LUN is trespassed back to the SP that previously owned the LUN. See the "Enabler for masking back-end failures" section for more information.

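born_chen

1.8K messages

March 29, 2013, 03:00

We have a 5700. The processes of an application running on an NFS mount keep hanging for no apparent reason. Nothing unusual shows up on the storage side, except that one of the network ports occasionally drops packets.

Is there a better way to diagnose this?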

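Anonymous

5 Practitioner

274.2K messages

March 30, 2013, 21:00

Hello,

Regarding CAVA, files on CIFS shares are generally scanned for viruses in two situations:

1. Scan on write

CAVA starts a scan after a file has been modified and closed. If a file is opened but not modified, CAVA does not scan it when it is closed.

2. Scan on first read

CAVA uses the file's access time to decide whether to scan the file. The AV engine compares this access time with a reference time stored in the EMC CAVA service. If the file's access time is earlier than the reference time, the AV engine scans the file on read, before the CIFS client opens it.

When an update of the virus definition file is detected on the AV engine, CAVA updates the "scan on first read" access time.

So "scan on first read" happens only once per file after each virus definition update. As a result, "scan on write" generally occurs more often than "scan on first read".

While a virus scan is running, access to the file (read or write) is suspended until the scan completes. CAVA therefore does have some impact on CIFS access performance, especially write performance.

If the volume of access in the environment is high and the configured AV servers cannot keep up with the scan requests, users may notice a clear drop in access performance.

When the number of virus scan requests in progress exceeds the lowWaterMark or highWaterMark value defined in the viruschecker.conf configuration file, CAVA sends a log event to the VNX. The defaults are lowWaterMark=50 and highWaterMark=200. If such events show up frequently in the VNX log, the AV servers are not responding to scan requests in time.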
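The water mark check can be sketched in Python as follows. This is an illustration only, assuming a simple "greater than" comparison as described above; the function name and the returned labels are hypothetical, and only the parameter names and default values (50 and 200) come from the post.

```python
from typing import List


def watermark_events(pending_scans: int,
                     low_water_mark: int = 50,
                     high_water_mark: int = 200) -> List[str]:
    """Return the water marks exceeded by the number of scans in progress."""
    events = []
    if pending_scans > low_water_mark:
        events.append("lowWaterMark")
    if pending_scans > high_water_mark:
        events.append("highWaterMark")
    return events


for pending in (10, 51, 201):
    print(pending, watermark_events(pending))
```

If a monitoring script sees the high water mark reported repeatedly, that matches the situation described above where the AV servers cannot keep up.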

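You can use the CAVA Calculator and the CAVA sizing tool to determine how many AV servers the system needs.

The CAVA Calculator helps before installation, and after installation you can use it to run what-if analyses.

The CAVA sizing tool collects information from the running environment and recommends the number of CAVA servers required. If fault tolerance is a concern, configure at least two AV servers on the network. With two AV servers, file scanning continues even if one AV server goes offline or the VNX cannot reach it.

If there are multiple AV servers on the network, the VNX distributes scan tasks in round-robin fashion to balance the load across them. For example, if one AV server goes offline, the VNX spreads the scan load across the other available AV servers.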
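As a rough illustration of round-robin distribution that skips offline servers, here is a minimal Python sketch. The function, the server names, and the file names are hypothetical; this is not how the VNX implements scheduling internally, only the idea described above.

```python
from itertools import cycle


def distribute_scans(files, servers, online):
    """Assign each file to the next online AV server in round-robin order."""
    available = [s for s in servers if s in online]
    if not available:
        raise RuntimeError("no AV server available")
    rotation = cycle(available)
    return {name: next(rotation) for name in files}


# av2 is offline, so the load is spread across av1 and av3 only.
plan = distribute_scans(
    files=["a.doc", "b.xls", "c.ppt", "d.txt"],
    servers=["av1", "av2", "av3"],
    online={"av1", "av3"},
)
print(plan)
```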

For more information, see the "Using VNX™ Event Enabler" PDF document.

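Anonymous

5 Practitioner

274.2K messages

March 30, 2013, 22:00

In addition, here is a fairly important CAVA parameter:

In the viruschecker.conf configuration file, the "shutdown=" parameter specifies the shutdown action to take when no AV server is available.

The parameter accepts the following values:

shutdown=cifs

Stops CIFS when no AV server is available. (No Windows client can access any VNX share.)

If strict data security matters in your environment, enable this option to block client access to files when all AV servers are unavailable. If this option is not enabled and all AV servers are unavailable, clients can modify files without any virus checking.

Note: if fewer than two AV servers are configured, shutdown=cifs should be disabled.

shutdown=no

Keeps retrying the list of AV servers when no AV server is available. Two water marks (high and low) apply in this state, and an event log entry is sent when each is reached. Use the event log to take the appropriate corrective action on the Data Mover so that virus checking works properly.

shutdown=viruschecking

Stops virus checking when no AV server is available. (Windows clients can access VNX shares without virus checking.)

The default value of this parameter is shutdown=no.

Set this parameter according to your actual requirements.

If data security must be strictly controlled, consider shutdown=cifs, but be aware that CIFS shares may then become inaccessible when no AV server is available.

If data availability must be guaranteed, set shutdown=viruschecking. CIFS shares remain accessible even when the AV servers are unavailable, but the data on them will not be virus checked.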
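The trade-off between the three values can be summarized in a short Python sketch. It is a hedged illustration: the table and the function are hypothetical helpers that encode only what is stated above (the behavior of each value when no AV server is available, and the advice against shutdown=cifs with fewer than two AV servers). They do not read or validate a real viruschecker.conf.

```python
# What happens when no AV server is available, per the description above:
# value -> (action taken, whether CIFS shares stay accessible)
SHUTDOWN_BEHAVIOR = {
    "cifs": ("stop CIFS", False),
    "no": ("keep retrying the AV server list", True),
    "viruschecking": ("stop virus checking", True),
}


def check_shutdown_setting(shutdown: str, av_server_count: int):
    """Return (action, cifs_accessible, warnings) for a shutdown= value."""
    value = shutdown.lower()
    if value not in SHUTDOWN_BEHAVIOR:
        raise ValueError("unknown shutdown value: " + shutdown)
    action, cifs_accessible = SHUTDOWN_BEHAVIOR[value]
    warnings = []
    if value == "cifs" and av_server_count < 2:
        warnings.append("shutdown=cifs is not advised with fewer than two AV servers")
    return action, cifs_accessible, warnings


print(check_shutdown_setting("cifs", 1))
print(check_shutdown_setting("viruschecking", 2))
```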

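Anonymous

5 Practitioner

274.2K messages

March 30, 2013, 23:00

Hello,

Regarding your question about application processes on an NFS mount that keep hanging for no apparent reason: what exactly are the symptoms when the problem occurs?

Is the NFS export completely inaccessible, or is access just slow?

As for a diagnostic approach, I would first check the NFS client logs and the VNX logs to see whether there are any NFS-related errors.

If you suspect a performance problem, run server_stats server_x -monitor nfs-std (or nfsOps-std) to see the access rate and response time of the NFS exports at the VNX Data Mover level.

If you suspect a network problem, run server_tcpdump to capture packets on the Data Mover network port that the NFS client connects to, and then analyze whether packets are being dropped or whether there are other problems on the network.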

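leelijb

60 messages

April 1, 2013, 05:00

A question for both experts: when evaluating or testing the performance of a VNX system (for example on the NAS or the Block side), which tools or methods are generally used? Are there tools or methods you use regularly that you could recommend?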

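XuejunS

18 messages

April 1, 2013, 19:00

The best tool for analyzing CX and VNX Block performance is the Analyzer software on the CX and VNX. This software must be purchased. Once it has been purchased and enabled, customers can watch the array's I/O performance in real time, or collect data (in NAR file format), generate the relevant charts later, and analyze them offline on a PC.

Offline analysis on a PC requires downloading and installing the following packages:

1. Java Runtime Environment (latest version)

2. Unisphere Server for Windows

3. Unisphere Client for Windows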

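Yanhong1

1.6K messages

April 1, 2013, 19:00

Thank you to the experts for their helpful answers, and to everyone for taking part. One small suggestion: if you find an expert's answer helpful, please click the thumbs-up icon in the upper right corner of the page to "like" it and acknowledge the expert's work.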

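Anonymous

5 Practitioner

274.2K messages

April 1, 2013, 22:00

On the NAS side, in the graphical interface we can use Celerra Monitor to view VNX File performance data for the past three months.

On the command line, we can run the server_stats command to view Data Mover performance counters in real time, for example CIFS and NFS access rates and the response time of individual operations.

If a problem in the user's network is suspected of degrading performance, we can run the server_tcpdump command to capture packets on a Data Mover port for analysis.

There are some other useful commands as well. For example, "server_netstat server_2 -i -p tcp" shows the TCP retransmission rate of the Data Mover, and ".server_config server_2 "printstats scsi"" shows how busy the Fibre Channel links from the Data Mover to the back-end storage are and how many I/Os are waiting, which helps determine whether the back-end storage is the performance bottleneck.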

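zhouzengchao

2 Intern

1.4K messages

April 1, 2013, 23:00

What TCP retransmission rate is acceptable? Is it the same for all applications, or is it just a rule of thumb? Could you share some causes of TCP retransmissions and possible ways to fix them?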

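Anonymous

5 Practitioner

274.2K messages

April 2, 2013, 17:00

Hello,

It is generally accepted that the TCP retransmission rate should not exceed 0.1%; above that, share access performance may degrade.

A high TCP retransmission rate is usually caused by a problem in the user's network, such as network congestion, rather than by a setting on the NAS side. I do not know the networking side well, so I hope other experts on the forum can add to this.

On the NAS side, we can enable the tcp.fastRTO parameter on the Data Mover to speed up TCP retransmission (it reduces the retransmission timeout from 1.5 seconds to 0.5 seconds) and thereby improve performance. However, to fix a high TCP retransmission rate at its root, the user's network still has to be investigated.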
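To make the 0.1% rule of thumb concrete, the rate can be computed from two counters: segments sent and segments retransmitted. The Python sketch below is an illustration only; the function names are hypothetical and the counter values are made-up inputs, not output from a Data Mover.

```python
# Rule of thumb quoted in the post: retransmissions should stay at or below 0.1%.
ACCEPTABLE_RETRANSMISSION_RATE = 0.001


def tcp_retransmission_rate(segments_sent: int, segments_retransmitted: int) -> float:
    """Return retransmitted segments as a fraction of segments sent."""
    if segments_sent <= 0:
        raise ValueError("segments_sent must be positive")
    return segments_retransmitted / segments_sent


def exceeds_rule_of_thumb(segments_sent: int, segments_retransmitted: int) -> bool:
    """True when the retransmission rate is above the 0.1% rule of thumb."""
    rate = tcp_retransmission_rate(segments_sent, segments_retransmitted)
    return rate > ACCEPTABLE_RETRANSMISSION_RATE


# Made-up sample counters: 500 and 2,000 retransmits per 1,000,000 segments.
print(exceeds_rule_of_thumb(1_000_000, 500))
print(exceeds_rule_of_thumb(1_000_000, 2_000))
```

Comparing the rate over an interval (the difference between two counter readings) is usually more informative than the lifetime totals, because a past incident can otherwise mask or exaggerate the current state.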
