UnityVSA crash/reboot?

September 28th, 2016 03:00

Hi,

I've installed UnityVSA CE on a server with an LSI MegaRAID card, on which I created one RAID 5 LUN on SSDs. In ESXi 6, I created a datastore on that LUN, assigned a VMDK to UnityVSA, created a pool from it and, finally, a block volume that I presented back to VMware. I'm on the most recent firmware (4.0.1.8404134), having upgraded from .8194551 because I had the same issue there.
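To spell out the layering, a write from a VM sitting on that Unity-backed datastore ends up taking roughly this path (my own reading of my setup, so the exact hops may be slightly off):

guest VM
  -> VMFS datastore backed by the Unity block volume
  -> ESXi software iSCSI initiator
  -> UnityVSA (pool / block LUN)
  -> UnityVSA's own VMDK (virtual SCSI controller)
  -> VMFS datastore on the R5 SSD LUN
  -> LSI MegaRAID / SSDs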

The problem is that every time I migrate a VM from any other datastore to the UnityVSA datastore, the migration either fails with a timeout (if I'm lucky) or, usually, hangs at around 31-36% migrated and then UnityVSA reboots by itself. This morning I was connected to the service account with top running and I got the impression the system was preparing for the reboot (gdb appeared in top's process list, followed by pigz, and after about a minute the VSA rebooted). Looking at vCenter, I can see that at some point the CPU topped 100%; top was showing a steady load of 32-35 and then suddenly started reporting a load in the thousands (1700-2600).

After the reboot, everything seems fine and there are no events/logs in Unisphere showing what happened. I'm sure there are logs somewhere, but I'll have to dig for them. If I restart a migration of any of the VMs, the same thing happens again. However, moving those VMs to any other datastore on my other devices works just fine.
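In the meantime, the ESXi side at least keeps its storage messages in /var/log/vmkernel.log, so I'll probably start by watching that over SSH while a migration runs. Something like this (the filter is just a rough guess at what's relevant):

# on the ESXi host, watch for SCSI aborts/resets while the migration runs
tail -f /var/log/vmkernel.log | grep -i scsi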

The server is a Xeon E3-1220 v3 with 32 GB of RAM. As seen from vCenter, the server is not overloaded in any way. I tested with only UnityVSA running on it and got the same result.

Has anybody had this kind of issue? What can I check to start understanding the problem?

Thank you.

ehfortin

September 29th, 2016 20:00

I tried to get UnityVSA running a couple of months ago and ran into the same problem.

I tried again tonight and, as before, any time a datastore shared out by Unity gets disk I/O, UnityVSA restarts.

The WebUI logs do not report any problems. ESXi shows no issues or appreciable latency (<5 ms) with the storage housing UnityVSA or the storage the Unity datastores reside on, but it does show repeated disconnections and reconnections to the storage served by Unity.
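To pin down what exactly is dropping, I'll probably also check the iSCSI session and path state from the ESXi shell the next time it happens, roughly along these lines (standard esxcli views, nothing Unity-specific):

# list the active iSCSI sessions against the Unity targets
esxcli iscsi session list
# check the path state of the devices backed by Unity
esxcli storage core path list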

Looks like I'll have to get familiar with the service console to see if I can figure out what's happening.


September 30th, 2016 11:00

Yesterday I created an NFS share on the same pool and presented it to VMware as another datastore. I can move my VMs there without issue, so the problem seems to be specific to block volumes. I actually had the same issue with a block VVol, but I never tried NFS VVols so I can't comment on those.

A few months ago (on the original release of the VSA) I was able to use a block volume, but that was on slower disks. I may try creating a volume on SAS/NL-SAS disks to see if the problem is related to very fast IOPS. Right now, directly on the disks that back the VMDK allocated to the VSA, I can nearly saturate a 10 Gbps link, so that's a lot of potential IOPS when doing a vMotion through the VSA. That could explain why the CPU is pegged at 100% not long before it decides to reboot.
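Another way to test the fast-IOPS theory without swapping disks might be to cap the load artificially. Something like this from a Linux VM sitting on the Unity datastore should generate the same kind of sequential write as the migration, but rate-limited (the fio parameters are only illustrative):

# 16 GB sequential write capped at ~100 MB/s; raise the cap step by step until the VSA falls over
fio --name=seqwrite --rw=write --bs=1M --size=16g --direct=1 --ioengine=libaio --rate=100m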

I'll have a look at the service console as well. Right now, the VSA is unusable except as a Unisphere GUI simulator or over NFS.

ehfortin

October 1st, 2016 10:00

Interesting that you only see it with iSCSI. I've only tried iSCSI so far.

I spent last night working with it, trying to figure out what's happening.

The VSA logs show a SCSI device reset, and then it starts doing a core dump:

2016-10-01T00:33:44+00:00 self kernel: [ 1764.453775] sd 2:0:1:0: [sdb] task abort on host 2, ffff880108029540

2016-10-01T00:33:44+00:00 self kernel: [ 1764.453861] sd 2:0:1:0: [sdb] Failed to get completion for aborted cmd ffff880108029540

2016-10-01T00:33:44+00:00 self kernel: [ 1764.453865] sd 2:0:1:0: [sdb] SCSI device reset on scsi2:1

2016-10-01T00:33:44+00:00 self kernel: [ 1764.454060] sd 2:0:3:0: [sdd] SCSI device reset on scsi2:3

2016-10-01T00:33:48+00:00 self kernel: [ 1768.396861] SIGNAL: From: csx_tim_thr[0]! [18871] To: csx_tim_thr[0]! [18871] Sig: 5

2016-10-01T00:33:48+00:00 self kernel: [ 1768.419103] SIGNAL: From: csx_tim_thr[0]! [18871] To: csx_tim_thr[0]! [18871] Sig: 37

2016-10-01T00:33:48+00:00 self kernel: [ 1768.419235] SIGNAL: From: csx_tim_thr[0]! [18871] To: csx_tim_thr[0]! [18871] Sig: 6

2016-10-01T00:33:48+00:00 self kernel: [ 1768.436484] DUMP (565): Opening dumpfile /usr/bin/perl /EMC/C4Core/tools/save_dump.pl safe 18867

2016-10-01T00:33:49+00:00 self kernel: [ 1768.591366] DUMP (2241): Writing out program headers for dump file

On the ESXi side, I see a bunch of "Pool 0: Blocking due to no free buffers" entries and then:

2016-10-01T00:31:12.092Z cpu0:36305)PVSCSI: 2628: scsi0:3: ABORT ctx=0xb1

2016-10-01T00:31:12.092Z cpu0:36305)PVSCSI: 2628: scsi0:3: ABORT ctx=0x97

2016-10-01T00:31:12.092Z cpu0:36305)VSCSI: 2590: handle 8195(vscsi0:3):Reset request on FSS handle 1050565 (0 outstanding commands) from (vmm0:UnityVSA01)

2016-10-01T00:31:12.092Z cpu3:32897)VSCSI: 2868: handle 8195(vscsi0:3):Reset [Retries: 0/0] from (vmm0:UnityVSA01)

2016-10-01T00:31:12.092Z cpu3:32897)VSCSI: 2661: handle 8195(vscsi0:3):Completing reset (0 outstanding commands)

2016-10-01T00:31:16.526Z cpu2:33130)NMP: nmp_ResetDeviceLogThrottling:3349: last error status from device mpx.vmhba32:C0:T0:L0 repeated 1 times

2016-10-01T00:31:16.528Z cpu1:37288)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338: Path "vmhba33:C0:T0:L1" Peer SP is hung.

2016-10-01T00:31:16.528Z cpu2:37289)ScsiScan: 836: Path vmhba33:C0:T0:L0 supports REPORT LUNS 0x11

2016-10-01T00:31:16.528Z cpu3:37029)WARNING: VMW_SATP_LIB_CX: satp_lib_cx_otherSPIsHung:338: Path "vmhba33:C0:T0:L0" Peer SP is hung.

I'm not too familiar with parsing these logs, but it looks like something happens on the ESXi side before Unity has problems, which leads me to believe it's something hardware- or config-related outside of Unity. The strange part, though, is that I don't seem to have any problems with other VMs accessing that storage.

What hardware are you running?

I'm using an HP ML10v2, and so far have tried with:

IBM M5015/LSI 9260-8i: it has a dead battery and performance outside of Unity is horrible, so the problem there definitely isn't Unity-specific.

The onboard B120i in RAID mode: performance is very good with other VMs using it directly; I just want to get it shared via iSCSI. Maybe this controller is too weak for Unity, but it's strange that it works fine for other VMs.


October 1st, 2016 14:00

I'm seeing similar results. When doing a vMotion, the host has connectivity issues with the iSCSI datastore served by the VSA. I've installed Rockstor as another VM to do some testing in parallel and make sure the problem isn't the hardware, and everything works fine there: I can do the vMotion to it and get really solid performance (over 650-700 MB/sec). That seems to indicate that the host, the 10 Gbps NIC, the LSI 9271-8i RAID card and the Intel S3700 SSDs are all working fine, with no interruptions and no connectivity issues.

I'm running this on an HP ML310e Gen8 v2. I don't think the problem is coming from your B120i controller; it isn't the most efficient controller, but it seems to work reliably. The fact that I hit the same issue on a different RAID card points to Unity. I've tested 3-4 NAS distros and a Linux distro offering iSCSI through iSCSITGT, and everything works fine on those. They aren't all fast, but they work and never skip a beat.

I'll reinstall the first version of Unity that was offered in May, as I seem to remember having success with it. Will keep you posted.

ehfortin


October 2nd, 2016 17:00

As proposed, I reinstalled UnityVSA 4.0.0.7329527 (the first release at the time of the announcement) and everything works great. I configured a pool on the exact same server, created a block volume on it and passed it back to VMware, where I created a datastore on it. I can do vMotions to this datastore, and if I run AJA System Test 2.1 on it with a 16 GB test file, I get 404 MB/sec write and 433 MB/sec read. Not as good as directly on the underlying datastore, but that was expected. Most of the NAS distros I've tested on that hardware give worse or about the same results. At least one gives better results, but it only offers NFS (no iSCSI available on that NAS vAppliance).

Hopefully the newest release gets fixed. There is no reason 4.0.0 should run without any connectivity issues while 4.0.1 fails on the same hardware.

Does anybody else have the same kind of results?

ehfortin

October 3rd, 2016 18:00

Great news, glad it's working for you!

I was hopeful based on your feedback: I installed 4.0.0.7329527 again (on the ML10v2 / B120i) and gave it a try. Unfortunately, same result: when some decent I/O hits Unity, it core dumps and restarts.

I knew this had happened for me before with the M5015 on 4.0.0.7329527, but I wasn't sure if I had tried it on the B120i.


October 4th, 2016 07:00

Sorry to hear that it is not working for you.

What processor is installed in the ML10v2? I'm wondering if the problem could be related to some "protection" in the code that triggers a reboot when the CPU is too busy (which may be what causes the connectivity issue). The system seems to do a clean reboot, taking the time to start the debugger, create the logs and everything before restarting.

The B120i seems to be highly dependent on its driver, which suggests it is software-assisted. In the past I had issues with some driver versions for this card where everything was really, really slow; it's documented on the net. That's why I changed to a more robust RAID card with a good processor onboard. With that, the server's CPU does almost nothing except manage the reads/writes to the LUN presented by the LSI card.

Where I'm heading with this is that with the Xeon E3-1220 v3 I have the reboot problem on the newer version but not on the first release. I may be just below the threshold that triggers the reboot, whereas with a slower processor you would keep triggering it even on the first release. That's just a hypothesis, but since I don't hear anybody else with Unity complaining, I assume the code is solid.

Maybe you should go back to the LSI 9260-8i that you have. It should be a better card than the B120i and maybe you'll have more success.

What kind of high load are you talking about? Here I have an SSD RAID 5 backend that can sustain at least 500-600 MB/sec of sequential writes. Combined with a 10 Gbps network, I can see that putting some stress on the processor when sending a lot of data at once. Having the VSA in the middle increases the load, as Unity has a lot to manage as well: ESXi is handling the hardware and has plenty to do, and the VM is doing the same at the virtual level. All in all, that's just more work for the underlying hardware. It isn't less work on a bigger server, but there are more resources on an E5, depending on the number of cores and the amount of memory, so it may be less of an issue on those bigger servers.
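Just to put rough numbers on it (assuming something like a 64 KB average I/O size during the copy, which is only a guess): 600 MB/sec divided by 64 KB per I/O is roughly 9,600 IOPS that the VSA data path has to absorb, and a 10 Gbps link tops out around 1.25 GB/sec, so the backend and the network can easily feed more work than the VSA's few vCPUs can keep up with.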

ehfortin

October 10th, 2016 08:00

The ML10v2 has an i3 4150v3 CPU - only 2 cores (4 with HT).

As for disk load, I don't mean anything very high - just things like deploying the VMware vCSA. Probably less than 20-30 MB/s; I'm not sure of the exact IOPS numbers. It's all going to a single SSD configured as an R0 array on the B120i.
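Next time I'll probably pin the numbers down with esxtop while a deployment runs, something like:

esxtop    # then press 'u' for the disk device view or 'v' for the per-VM disk view
          # and watch CMDS/s plus the DAVG/KAVG/GAVG latency columns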

I did some more experimenting, and it seems I can get UnityVSA to be somewhat stable as long as I'm not running a VM that uses Unity storage on the same ESXi host that runs Unity. That kind of loopback setup used to work OK for me years ago with other software.

Unity serving a datastore back to the ESXi host it's running on, with another VM placed on that datastore, results in a core dump. Unity serving another ESXi host over 1 Gbps iSCSI seems OK, but I didn't test it for very long.

I thought the low CPU power or limited number of cores might be related, but I watched a lot of stats while waiting for Unity to core dump and never saw any contention on the ESXi side: CPU usage never approached capacity and there were no issues with VMwait or CoStop. The load average reported inside the Unity VM itself was always very high though - around 30, then it skyrockets once the core dump process starts.

So I'm still not sure exactly what's happening, but maybe it's just a case of the i3 and/or the B120i not being up to the task. Perhaps the B120i being CPU-limited causes it to start having I/O issues when I've got multiple VMs running.

I did have a lot of trouble with the B120i on the newer drivers (5.5.0-100, 92, 90) before I started trying UnityVSA. Once I downgraded to 88, though, performance was great with other VMs. Maybe there's still some underlying issue with the B120i driver.

Giving up on it for now, until I get some different hardware at least.

Thanks for all of the ideas!  If nothing else, it gave me a chance to watch a lot of logs and learn a bit.


October 11th, 2016 08:00

I got that driver problem after 5.5.0-88 as well; that's why I installed the LSI 9271-8i instead. Version 88 was working great, but I wanted 6 Gbps on all 8 ports as well as SAS compatibility. The B120i gives 2x 6 Gbps and 2x 3 Gbps ports and no SAS support.

For now I'm running on something else as well, as I can't afford to have my lab halt because of a VSA reboot. I'll keep looking at newer versions and hoping for the best. My gear is what it is and everything else works fine, so I can't justify a hardware upgrade to a Xeon E5. Anyway, most Xeon E5s are slower per thread than my E3, and since I'm basically giving the whole E3 to UnityVSA (4 cores) while the specs only ask for 2 vCPUs, I'm not sure it would actually change anything. At least I'm not paying to find out.

Have fun with your other projects

ehfortin
